MedGemma Breast Tumor Classification Fine-Tuning: Step-by-Step Guide
Disclaimer
This guide is provided for informational and educational purposes only and is not a substitute for professional medical advice, diagnosis, or treatment.
MedGemma should not be used without proper validation, adaptation, and/or meaningful modification for your specific application. Model outputs are not intended for direct clinical decision-making, diagnosis, or patient management. Even in cases where images or text closely match training data, results can be inaccurate. All outputs must be independently verified, clinically correlated, and investigated through established research and development methodologies.
---
Introduction
Artificial intelligence (AI) is transforming healthcare. But how do you teach a general-purpose AI model the specialized skills of a pathologist?
This guide starts at the prototype stage in a Jupyter notebook and walks you through fine-tuning MedGemma, a Gemma 3 variant, to classify breast cancer histopathology images.
We'll cover:
- Dataset preparation
- Model configuration
- Stability pitfalls (and fixes)
- Fine-tuning with LoRA
- Baseline vs. fine-tuned performance
---
Goal, Model, and Dataset
We aim to classify microscope images of breast tissue into eight categories:
- 4 benign types
- 4 malignant types
This mirrors a critical diagnostic task performed by pathologists.
Model
We’re using MedGemma — an open model from Google built for the medical community.
MedGemma consists of:
- Vision component (MedSigLIP) — pre-trained on de-identified medical images, including histopathology slides.
- Language component — trained on diverse medical text corpora.
A suitable variant is google/medgemma-4b-it, which responds well to structured prompts.
---
Dataset — BreakHis
We'll use the BreakHis dataset, containing thousands of microscope images from 82 patients, captured at 40X, 100X, 200X, and 400X magnifications.
License: Non-commercial research use.
---
Hardware Setup
Fine-tuning a 4B parameter model requires a powerful GPU. We used:
- NVIDIA A100 (40 GB VRAM) on Vertex AI Workbench
- Tensor Cores optimized for modern data formats
---
Crucial Lesson: `float16` vs. `bfloat16`
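Training in `float16` overflowed and produced NaN values. Switching to `bfloat16`, which keeps the 8-bit exponent (and therefore the dynamic range) of `float32` at the cost of mantissa precision, fixed the instability.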
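The range difference can be seen without a GPU. The sketch below uses only the Python standard library: the `e` format of `struct` packs IEEE half precision (`float16`), and `bfloat16` is emulated by keeping the top 16 bits of a `float32`. The helper names are ours for illustration, and the emulation truncates instead of rounding, so treat it as an approximation of what the hardware does.

```python
import struct


def fits_in_float16(x: float) -> bool:
    """Return True if x can be stored as a finite float16 value."""
    try:
        struct.pack("<e", x)  # 'e' is IEEE 754 half precision
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32.

    This truncates instead of rounding, so it is an approximation.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# 65504 is the largest finite float16; anything much larger overflows.
for value in (65504.0, 70000.0, 1e30):
    print(value, fits_in_float16(value), to_bfloat16(value))
```

Values that overflow `float16` stay finite in the emulated `bfloat16`, only with fewer significant digits. With that in mind, the model is loaded with the following arguments: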
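import torch

model_kwargs = dict(
    torch_dtype=torch.bfloat16,  # Wide exponent range prevents overflow
    device_map="auto",           # Place weights on the available GPU(s)
    attn_implementation="sdpa",  # PyTorch scaled dot-product attention
)
---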
Step-by-Step Implementation
Step 1: Install Dependencies
!pip install --upgrade --quiet transformers datasets evaluate peft trl scikit-learn
---
Step 2: Authentication (Securely)
Use Google Cloud Secret Manager for tokens in production.
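A minimal sketch of the production path, assuming the `google-cloud-secret-manager` and `huggingface_hub` packages are installed and that a secret holding the Hugging Face token already exists. The function names, project ID, and secret ID are hypothetical placeholders, not part of the original notebook.

```python
def secret_version_name(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the resource name Secret Manager expects for a secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def login_from_secret_manager(project_id: str, secret_id: str) -> None:
    """Fetch a Hugging Face token from Secret Manager and log in with it."""
    # Imported lazily so the helper above works without these packages installed.
    from google.cloud import secretmanager
    from huggingface_hub import login

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_name(project_id, secret_id)}
    )
    login(token=response.payload.data.decode("UTF-8"))


# Hypothetical identifiers; replace them with your own project and secret.
# login_from_secret_manager("my-gcp-project", "hf-token")
```

Keeping the token in a managed secret means it never appears in the notebook source or its saved outputs.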
For prototyping, use Hugging Face’s notebook login:
from huggingface_hub import notebook_login
notebook_login()
---
Step 3: Load and Filter BreakHis Dataset
We focus on:
- Fold 1
- 100X magnification
!pip install -q kagglehub
import kagglehub
import pandas as pd

# Download the dataset and read the fold assignments shipped with it
path = kagglehub.dataset_download("ambarish/breakhis")
folds = pd.read_csv(f'{path}/Folds.csv')

# Keep fold 1 at 100X, then split by the predefined train/test groups
folds_100x = folds[(folds['mag'] == 100) & (folds['fold'] == 1)]
train_df = folds_100x[folds_100x.grp == 'train']
test_df = folds_100x[folds_100x.grp == 'test']
---
Step 4: Balance the Dataset
def balance(df):
    # Undersample the larger group so benign and malignant counts match
    benign = df[df['filename'].str.contains('benign')]
    malignant = df[df['filename'].str.contains('malignant')]
    count = min(len(benign), len(malignant))
    return pd.concat([benign.sample(count, random_state=42),
                      malignant.sample(count, random_state=42)])

train_df = balance(train_df)
test_df = balance(test_df)
---
Step 5: Convert to Hugging Face Dataset
We define class labels and mapping:
CLASS_NAMES = [
    'benign_adenosis', 'benign_fibroadenoma', 'benign_phyllodes_tumor', 'benign_tubular_adenoma',
    'malignant_ductal_carcinoma', 'malignant_lobular_carcinoma', 'malignant_mucinous_carcinoma', 'malignant_papillary_carcinoma'
]

def get_label(filename):
    # Match on the second token of each class name (e.g. 'adenosis', 'ductal')
    filename = filename.lower()
    for i, cname in enumerate(CLASS_NAMES):
        if cname.split('_')[1] in filename:
            return i
    return -1  # No class matched
---
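Before training, it can be worth sanity-checking the Step 5 label mapping and looking at per-class counts, since Step 4 balances only benign against malignant and may leave the eight subtypes uneven. The sketch below repeats `CLASS_NAMES` and `get_label` so it runs on its own. The sample paths are made up in the style of the BreakHis folder names and are not real files.

```python
from collections import Counter

CLASS_NAMES = [
    "benign_adenosis", "benign_fibroadenoma", "benign_phyllodes_tumor",
    "benign_tubular_adenoma", "malignant_ductal_carcinoma",
    "malignant_lobular_carcinoma", "malignant_mucinous_carcinoma",
    "malignant_papillary_carcinoma",
]


def get_label(filename):
    """Return the class index whose subtype token appears in the path, else -1."""
    filename = filename.lower()
    for i, cname in enumerate(CLASS_NAMES):
        if cname.split("_")[1] in filename:
            return i
    return -1


# Hypothetical paths for illustration only.
sample_paths = [
    "benign/SOB/adenosis/example_a-100-001.png",
    "benign/SOB/phyllodes_tumor/example_b-100-001.png",
    "malignant/SOB/ductal_carcinoma/example_c-100-001.png",
    "malignant/SOB/ductal_carcinoma/example_c-100-002.png",
    "malignant/SOB/papillary_carcinoma/example_d-100-001.png",
    "unknown/example_e.png",
]

labels = [get_label(p) for p in sample_paths]
# Any path that maps to -1 would silently corrupt the training targets.
unmatched = [p for p, label in zip(sample_paths, labels) if label == -1]
counts = Counter(CLASS_NAMES[label] for label in labels if label != -1)

print("unmatched:", unmatched)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

On the real data frames, the same check would run over `train_df['filename']` and `test_df['filename']`. An empty `unmatched` list is the result to look for.
---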
Step 6: Prompt Engineering
We ask the model to return only the numeric class ID:
PROMPT = """Analyze this image and classify (0-7 only):
0: benign_adenosis
...
"""
def format_data(example):
    example["messages"] = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]},
        {"role": "assistant", "content": [{"type": "text", "text": str(example["label"])}]},
    ]
    return example
---
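To make the Step 6 message format concrete, the sketch below shows what `format_data` produces for one example, plus a hypothetical helper, `to_inference_messages`, that drops the assistant turn so the model has to generate the label itself at evaluation time. The `PROMPT` here is a shortened placeholder, not the guide's full prompt.

```python
# Placeholder prompt for illustration; the real prompt lists all eight classes.
PROMPT = "Analyze this image and classify (0-7 only)."


def format_data(example):
    """Attach a two-turn chat (user question, assistant answer) to the example."""
    example["messages"] = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]},
        {"role": "assistant", "content": [{"type": "text", "text": str(example["label"])}]},
    ]
    return example


def to_inference_messages(example):
    """Hypothetical helper: keep only the user turn for generation."""
    return [m for m in example["messages"] if m["role"] != "assistant"]


example = format_data({"label": 4})
print(example["messages"][1]["content"][0]["text"])
print(to_inference_messages(example))
```

During training the assistant turn is the target; during evaluation only the user turn is passed to the model, typically rendered through the processor's chat template with a generation prompt appended.
---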
Step 7: Load Model & Processor
from transformers import AutoModelForImageTextToText, AutoProcessor
MODEL_ID = "google/medgemma-4b-it"
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **model_kwargs)
processor = AutoProcessor.from_pretrained(MODEL_ID)
processor.tokenizer.padding_side = "right"
---
Step 8: Baseline Evaluation
import re
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def postprocess(text):
    # Extract the first standalone digit 0-7 from the model output; -1 if none
    m = re.search(r'\b([0-7])\b', text)
    return int(m.group(1)) if m else -1
---
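The sketch below shows how the Step 8 `postprocess` function behaves on a few raw outputs and how 8-class predictions could be collapsed to a binary benign/malignant score. It computes accuracy and F1 by hand as a dependency-free stand-in for the `evaluate` metrics. The binary mapping (IDs 0-3 benign, 4-7 malignant) follows the order of `CLASS_NAMES` and is our assumption about how the binary scores are derived. The outputs and labels are invented.

```python
import re


def postprocess(text):
    """Extract the first standalone digit 0-7; return -1 if none is found."""
    m = re.search(r"\b([0-7])\b", text)
    return int(m.group(1)) if m else -1


def to_binary(label):
    """Map IDs 0-3 to 0 (benign) and 4-7 to 1 (malignant); keep -1 as-is."""
    if label < 0:
        return -1
    return 0 if label < 4 else 1


def accuracy(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def f1_positive(preds, refs, positive=1):
    """F1 score for a single class, here the malignant class."""
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical raw model outputs and ground-truth labels.
raw_outputs = ["4", "The answer is 1.", "Class 6", "no idea", "0: benign_adenosis"]
references = [4, 2, 5, 7, 0]

preds = [postprocess(t) for t in raw_outputs]
acc_8class = accuracy(preds, references)

bin_preds = [to_binary(p) for p in preds]
bin_refs = [to_binary(r) for r in references]
acc_binary = accuracy(bin_preds, bin_refs)
f1_malignant = f1_positive(bin_preds, bin_refs)

print(preds, acc_8class, acc_binary, f1_malignant)
```

An unparseable answer maps to -1 and therefore counts as wrong in both tasks, which penalizes a baseline model that ignores the requested output format.
---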
Step 9: Fine-Tuning with LoRA
from peft import LoraConfig

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, bias="none",
                         target_modules="all-linear", task_type="CAUSAL_LM")

We use gradient checkpointing, `bfloat16`, and the `paged_adamw_8bit` optimizer (which requires the `bitsandbytes` package) for efficiency.
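To get a feel for why LoRA is cheap, the arithmetic below counts the parameters that a rank-8 adapter adds to one linear layer and the scaling factor implied by `lora_alpha=16`. The 2560 x 2560 layer shape is a hypothetical example, not a value read from the MedGemma config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16
scaling = lora_alpha / r  # the low-rank update is multiplied by alpha / r

d_in = d_out = 2560  # hypothetical layer shape for illustration
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)

print(f"scaling factor: {scaling}")
print(f"full layer: {full:,} params, LoRA adds {added:,} ({added / full:.3%})")
```

Only the added factors receive gradients and optimizer state, which is where most of the memory saving comes from.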
---
Step 10: Train
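from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="medgemma-breastcancer-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # Effective batch size of 8 per device
    gradient_checkpointing=True,    # Trade extra compute for lower memory use
    optim="paged_adamw_8bit",
    bf16=True,
)
# Pass training_args explicitly; otherwise the trainer falls back to defaults
trainer = SFTTrainer(model=model, args=training_args, peft_config=peft_config,
                     train_dataset=formatted_train, eval_dataset=formatted_eval,
                     processing_class=processor)
trainer.train()
trainer.save_model()
---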
Step 11: Final Evaluation
Reload the base model, merge the LoRA weights into it, and rerun the same evaluation used for the baseline.
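One way to do the reload and merge is sketched below, assuming the adapter was saved to the `output_dir` from Step 10 and that `torch`, `transformers`, and `peft` are installed. The function name `load_merged_model` is ours; the notebook this guide is based on may structure the step differently.

```python
def load_merged_model(base_model_id: str, adapter_dir: str):
    """Load the base model, attach the saved LoRA adapter, and merge it."""
    # Lazy imports: these need torch, transformers, and peft installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForImageTextToText, AutoProcessor

    base = AutoModelForImageTextToText.from_pretrained(
        base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    model = model.merge_and_unload()  # Fold the LoRA deltas into the base weights
    processor = AutoProcessor.from_pretrained(base_model_id)
    return model, processor


# Example call with the IDs used in this guide (needs a GPU and model access):
# model, processor = load_merged_model(
#     "google/medgemma-4b-it", "medgemma-breastcancer-finetuned"
# )
```

After merging, the model behaves like a plain `transformers` model, so the baseline evaluation code can be reused unchanged.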
---
Results
| Task    | Model      | Accuracy | F1 (weighted for 8-class, malignant for binary) |
|---------|------------|----------|-------------------------------------------------|
| 8-Class | Baseline   | 32.6%    | 0.241                                           |
| 8-Class | Fine-tuned | 87.2%    | 0.865                                           |
| Binary  | Baseline   | 59.6%    | 0.639                                           |
| Binary  | Fine-tuned | 99.0%    | 0.991                                           |
---
Key Takeaways
- `bfloat16` avoided the overflow-induced NaNs we hit when training in `float16`.
- LoRA makes fine-tuning a 4B-parameter model feasible on a single GPU.
- Balancing benign and malignant samples keeps the classifier from simply favoring the majority class.
---
Next Steps
- Migrate workflow to Cloud Run jobs for production.
---
References:
- Spanhol, F. A., et al. A dataset for breast cancer histopathological image classification. IEEE T-BME, vol. 63, no. 7, 2016.