
MedGemma Breast Tumor Classification Fine-Tuning: Step-by-Step Guide

Honghao Wang

19 Nov 2025 — 3 min read

Disclaimer

This guide is provided for informational and educational purposes only and is not a substitute for professional medical advice, diagnosis, or treatment.

MedGemma should not be used without proper validation, adaptation, and/or meaningful modification for your specific application. Model outputs are not intended for direct clinical decision-making, diagnosis, or patient management. Even in cases where images or text closely match training data, results can be inaccurate. All outputs must be independently verified, clinically correlated, and investigated through established research and development methodologies.

---

Introduction

Artificial intelligence (AI) is transforming healthcare. But how do you teach a general-purpose AI model the specialized skills of a pathologist?

This guide starts at the prototype stage in a Jupyter notebook and walks you through fine-tuning MedGemma, a Gemma 3 variant, to classify breast cancer histopathology images.

We'll cover:

Dataset preparation
Model configuration
Stability pitfalls (and fixes)
Fine-tuning with LoRA
Baseline vs. fine-tuned performance

---


Goal, Model, and Dataset

We aim to classify microscope images of breast tissue into eight categories:

4 benign types
4 malignant types

This mirrors a critical diagnostic task performed by pathologists.

Model

We’re using MedGemma — an open model from Google built for the medical community.

MedGemma consists of:

Vision component (MedSigLIP) — pre-trained on de-identified medical images, including histopathology slides.
Language component — trained on diverse medical text corpora.

A suitable variant is google/medgemma-4b-it, the instruction-tuned 4B model, which responds well to structured prompts.

---

Dataset — BreakHis

We'll use the BreakHis dataset, which contains 7,909 microscope images from 82 patients, captured at 40X, 100X, 200X, and 400X magnifications.

License: Non-commercial research use.

---

Hardware Setup

Fine-tuning a 4B parameter model requires a powerful GPU. We used:

NVIDIA A100 (40 GB VRAM) on Vertex AI Workbench
Tensor Cores with native bfloat16 support

---

Crucial Lesson: `float16` vs. `bfloat16`

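Training in `float16` produced NaN values caused by numerical overflow: `float16` cannot represent magnitudes above 65,504. Switching to `bfloat16`, which keeps the exponent range of `float32` at reduced precision, fixed the stability issues.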

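import torch

model_kwargs = dict(
    torch_dtype=torch.bfloat16,  # Wide exponent range prevents overflow
    device_map="auto",           # Place weights on the available GPU(s)
    attn_implementation="sdpa",  # PyTorch scaled dot-product attention
)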
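
To make the range difference concrete, here is a minimal standard-library sketch. It does not use PyTorch; `to_float16` and `to_bfloat16` are hypothetical helpers that emulate the two formats, and the bfloat16 emulation truncates a float32 instead of rounding, so low-order digits may differ slightly from real hardware. The value 70000.0 is an arbitrary stand-in for a large activation.

```python
import math
import struct

def to_float16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the float16 range.
        return math.copysign(math.inf, x)

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

activation = 70000.0  # hypothetical value just above the float16 maximum (65504)
print("float16 :", to_float16(activation))   # overflows to inf
print("bfloat16:", to_bfloat16(activation))  # stays finite, with coarser precision
```

Once a single value overflows to infinity, operations such as `inf - inf` produce NaN, which then spreads through the loss and gradients.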

---

Step-by-Step Implementation

Step 1: Install Dependencies

!pip install --upgrade --quiet transformers datasets evaluate peft trl scikit-learn

---

Step 2: Authentication (Securely)

Use Google Cloud Secret Manager for tokens in production.

For prototyping, use Hugging Face’s notebook login:

from huggingface_hub import notebook_login
notebook_login()

---

Step 3: Load and Filter BreakHis Dataset

We focus on:

Fold 1
100X magnification

!pip install -q kagglehub

import kagglehub
import pandas as pd

# Download BreakHis and read the fold assignments.
path = kagglehub.dataset_download("ambarish/breakhis")
folds = pd.read_csv(f"{path}/Folds.csv")

# Keep fold 1 at 100X magnification, then split by the predefined groups.
folds_100x = folds[(folds["mag"] == 100) & (folds["fold"] == 1)]
train_df = folds_100x[folds_100x["grp"] == "train"]
test_df = folds_100x[folds_100x["grp"] == "test"]

---

Step 4: Balance the Dataset

def balance(df):
    benign = df[df['filename'].str.contains('benign')]
    malignant = df[df['filename'].str.contains('malignant')]
    count = min(len(benign), len(malignant))
    return pd.concat([benign.sample(count, random_state=42),
                      malignant.sample(count, random_state=42)])

train_df = balance(train_df)
test_df  = balance(test_df)
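
As a quick sanity check, the sketch below runs the same `balance` logic on a small toy frame. The file paths are hypothetical stand-ins for `Folds.csv` entries, not real BreakHis rows.

```python
import pandas as pd

def balance(df):
    """Downsample so benign and malignant rows appear in equal numbers."""
    benign = df[df["filename"].str.contains("benign")]
    malignant = df[df["filename"].str.contains("malignant")]
    count = min(len(benign), len(malignant))
    return pd.concat([benign.sample(count, random_state=42),
                      malignant.sample(count, random_state=42)])

# Hypothetical toy data: 3 benign rows and 5 malignant rows.
toy = pd.DataFrame({"filename": (
    [f"breast/benign/SOB/adenosis/img_{i}.png" for i in range(3)]
    + [f"breast/malignant/SOB/ductal_carcinoma/img_{i}.png" for i in range(5)]
)})

balanced = balance(toy)
n_benign = int(balanced["filename"].str.contains("benign").sum())
n_malignant = int(balanced["filename"].str.contains("malignant").sum())
print(n_benign, n_malignant)
```

Note that this equalizes the benign and malignant groups only. The eight subtypes inside those groups can still be unevenly represented.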

---

Step 5: Convert to Hugging Face Dataset

We define class labels and mapping:

CLASS_NAMES = [
 'benign_adenosis','benign_fibroadenoma','benign_phyllodes_tumor','benign_tubular_adenoma',
 'malignant_ductal_carcinoma','malignant_lobular_carcinoma','malignant_mucinous_carcinoma','malignant_papillary_carcinoma'
]

def get_label(filename):
    filename = filename.lower()
    for i, cname in enumerate(CLASS_NAMES):
        if cname.split('_')[1] in filename:
            return i
    return -1
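
The matching rule can be checked on a few paths before building the dataset. The sketch below repeats the definitions so it runs on its own. The sample paths are illustrative and only assume that the subtype name appears somewhere in the file path.

```python
from collections import Counter

CLASS_NAMES = [
    "benign_adenosis", "benign_fibroadenoma", "benign_phyllodes_tumor",
    "benign_tubular_adenoma", "malignant_ductal_carcinoma",
    "malignant_lobular_carcinoma", "malignant_mucinous_carcinoma",
    "malignant_papillary_carcinoma",
]

def get_label(filename):
    """Return the class index whose subtype keyword occurs in the path, else -1."""
    filename = filename.lower()
    for i, cname in enumerate(CLASS_NAMES):
        # The second token ("adenosis", "ductal", ...) is the keyword searched for.
        if cname.split("_")[1] in filename:
            return i
    return -1

# Illustrative paths, assumed to resemble the dataset layout.
sample_paths = [
    "breast/benign/SOB/adenosis/SOB_B_A-14-22549AB-100-001.png",
    "breast/benign/SOB/tubular_adenoma/SOB_B_TA-14-3411F-100-002.png",
    "breast/malignant/SOB/ductal_carcinoma/SOB_M_DC-14-2523-100-003.png",
    "notes/readme.txt",
]

labels = [get_label(p) for p in sample_paths]
unmatched = [p for p, label in zip(sample_paths, labels) if label == -1]
print(Counter(labels))
print("unmatched:", unmatched)
```

Any path that ends up in `unmatched` would receive the label -1, so it is worth confirming that list is empty on the real data before training.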

---

Step 6: Prompt Engineering

We ask the model to return only the numeric class ID:

PROMPT = """Analyze this image and classify (0-7 only):
0: benign_adenosis
...
"""

def format_data(example):
    example["messages"] = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]},
        {"role": "assistant", "content": [{"type": "text", "text": str(example["label"])}]},
    ]
    return example
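
The prompt above is abbreviated in the source. One way to keep the prompt and the label list in sync is to generate it from `CLASS_NAMES`. This sketch assumes the elided lines follow the same `index: name` pattern as the first entry; `build_prompt` is a hypothetical helper, not part of the original notebook.

```python
CLASS_NAMES = [
    "benign_adenosis", "benign_fibroadenoma", "benign_phyllodes_tumor",
    "benign_tubular_adenoma", "malignant_ductal_carcinoma",
    "malignant_lobular_carcinoma", "malignant_mucinous_carcinoma",
    "malignant_papillary_carcinoma",
]

def build_prompt(class_names):
    """Assumed prompt layout: one instruction line, then one 'index: name' line per class."""
    lines = ["Analyze this image and classify (0-7 only):"]
    lines += [f"{i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"

PROMPT = build_prompt(CLASS_NAMES)

def format_data(example):
    """Attach a user turn (image + prompt) and an assistant turn (the label as text)."""
    example["messages"] = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]},
        {"role": "assistant", "content": [{"type": "text", "text": str(example["label"])}]},
    ]
    return example

formatted = format_data({"label": 4})
print(PROMPT)
print(formatted["messages"][1])
```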

---

Step 7: Load Model & Processor

from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "google/medgemma-4b-it"
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **model_kwargs)
processor = AutoProcessor.from_pretrained(MODEL_ID)
processor.tokenizer.padding_side = "right"
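
With the model and processor loaded, a single prediction could be generated roughly as follows. This is a sketch, not the notebook's own evaluation loop: `generate_answer` is a hypothetical helper, `max_new_tokens=8` is an arbitrary small budget for a one-digit answer, and the call pattern assumes the chat-template interface that Transformers processors expose for image-text models.

```python
def generate_answer(model, processor, image, prompt, max_new_tokens=8):
    """Return the model's text reply for one image and one prompt."""
    messages = [{
        "role": "user",
        "content": [{"type": "image", "image": image},
                    {"type": "text", "text": prompt}],
    }]
    # Tokenize the chat and preprocess the image in one call.
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device, dtype=model.dtype)
    prompt_len = inputs["input_ids"].shape[-1]
    # Greedy decoding keeps the answer deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Drop the prompt tokens and decode only the newly generated part.
    return processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

The returned string can then be passed to a parsing function that extracts the class ID.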

---

Step 8: Baseline Evaluation

import evaluate, re

accuracy_metric = evaluate.load("accuracy")
f1_metric       = evaluate.load("f1")

def postprocess(text):
    # Extract the first standalone digit 0-7; return -1 if none is found.
    m = re.search(r'\b([0-7])\b', text)
    return int(m.group(1)) if m else -1
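
The sketch below shows how parsed outputs might be scored at both the 8-class and the binary level. The binary mapping (IDs 0-3 benign, 4-7 malignant) follows the class order used in this guide, but treating it as the way the binary results were derived is an assumption. `to_binary` and `accuracy` are hypothetical helpers, and the raw outputs are made up for illustration.

```python
import re

def postprocess(text):
    """Extract the first standalone digit 0-7; return -1 if none is found."""
    m = re.search(r"\b([0-7])\b", text)
    return int(m.group(1)) if m else -1

def to_binary(label):
    """Collapse class IDs: 0-3 become 0 (benign), 4-7 become 1 (malignant)."""
    if label < 0:
        return -1  # unparseable outputs stay invalid and count as errors
    return 0 if label < 4 else 1

def accuracy(preds, refs):
    """Fraction of positions where prediction and reference agree."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

raw_outputs = ["4", "The class is 1", "unsure", "6"]  # hypothetical model replies
references = [4, 0, 2, 5]

preds = [postprocess(t) for t in raw_outputs]
acc_8class = accuracy(preds, references)
acc_binary = accuracy([to_binary(p) for p in preds],
                      [to_binary(r) for r in references])
print(preds, acc_8class, acc_binary)
```

A prediction can be wrong at the 8-class level yet right at the binary level, which is one reason binary scores tend to be higher.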

---

Step 9: Fine-Tuning with LoRA

from peft import LoraConfig

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, bias="none",
                         target_modules="all-linear", task_type="CAUSAL_LM")
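
To see why a rank of 8 is cheap, the arithmetic below counts the parameters LoRA adds to one linear layer. The 2560 x 2560 layer size is a hypothetical example, not a figure taken from the model configuration. The scaling factor shown is the standard `lora_alpha / r`.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 2560       # hypothetical layer dimensions
r, lora_alpha = 8, 16     # values from the LoraConfig above

added = lora_param_count(d_in, d_out, r)
full = d_in * d_out
scaling = lora_alpha / r  # multiplier applied to the low-rank update

print(f"trainable per layer: {added:,} vs frozen: {full:,}")
print(f"fraction trained: {added / full:.4%}, scaling: {scaling}")
```

For this example layer, the adapter trains well under 1% of the weights that full fine-tuning would update.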

We use gradient checkpointing, bfloat16, and the paged_adamw_8bit optimizer to reduce memory use.

---

Step 10: Train

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="medgemma-breastcancer-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)
trainer = SFTTrainer(model=model, args=training_args,  # pass the config defined above
                     peft_config=peft_config,
                     train_dataset=formatted_train, eval_dataset=formatted_eval,
                     processing_class=processor)
trainer.train()
trainer.save_model()
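
Step 9 mentions gradient checkpointing and the paged_adamw_8bit optimizer, which the configuration above does not show. A sketch of how those settings could be expressed as keyword arguments follows, assuming the standard training-argument names that `SFTConfig` inherits. `efficiency_kwargs` is a hypothetical name, and the paged 8-bit optimizer needs the bitsandbytes package installed.

```python
# Memory-saving settings named in Step 9, collected as keyword arguments.
efficiency_kwargs = dict(
    gradient_checkpointing=True,  # recompute activations to save memory
    optim="paged_adamw_8bit",     # 8-bit paged AdamW (requires bitsandbytes)
    bf16=True,                    # bfloat16 mixed precision
)

# Batch settings from the configuration above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8

# On a single GPU, gradients from 8 micro-batches are summed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print("effective batch size:", effective_batch_size)

# Assumed usage:
# training_args = SFTConfig(
#     output_dir="medgemma-breastcancer-finetuned",
#     num_train_epochs=5,
#     per_device_train_batch_size=per_device_train_batch_size,
#     gradient_accumulation_steps=gradient_accumulation_steps,
#     **efficiency_kwargs,
# )
```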

---

Step 11: Final Evaluation

Reload the base model, merge the LoRA weights into it, and rerun the evaluation.
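
The reload-and-merge step could look roughly like this. It is a sketch under assumptions: `load_merged_model` is a hypothetical helper, and the adapter directory is assumed to be the `output_dir` used during training. The heavy imports sit inside the function so the definition itself has no side effects.

```python
def load_merged_model(base_model_id, adapter_dir):
    """Load the base model, attach the saved LoRA adapter, and fold it into the weights."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForImageTextToText

    base = AutoModelForImageTextToText.from_pretrained(
        base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    # merge_and_unload() adds the low-rank update to the base weights
    # and returns a plain model without adapter wrappers.
    return model.merge_and_unload()

# Assumed usage:
# merged = load_merged_model("google/medgemma-4b-it",
#                            "medgemma-breastcancer-finetuned")
```

The merged model can then be evaluated with the same prompt and parsing logic as the baseline, so the two sets of numbers are directly comparable.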

---

Results

| Task | Model | Accuracy | F1 score |
|-------------|-------------|----------|-------------------------|
| 8-Class | Baseline | 32.6% | 0.241 |
| | Fine-tuned | 87.2% | 0.865 |
| Binary | Baseline | 59.6% | 0.639 |
| | Fine-tuned | 99.0% | 0.991 |

---

Key Takeaways

bfloat16 avoided the overflow-induced NaN values that float16 produced during training.
LoRA makes fine-tuning a 4B-parameter model feasible on a single GPU.
Balanced datasets are critical in medical imaging.

---

Next Steps

Migrate workflow to Cloud Run jobs for production.

---

References:

Spanhol, F. A., et al. A dataset for breast cancer histopathological image classification. IEEE T-BME, vol. 63, no. 7, 2016.
