Zero Downtime! A Thrilling Real-World Migration of 1 Billion Financial Records
A Migration Story That Stays With You Forever

Some projects leave a permanent mark.
For me, it was those long nights spent migrating over 1 billion records from our old database to a new one — without a single second of downtime.
We were moving mission‑critical financial data: payments, orders, ledgers. One mistake could cause customers to lose money, dashboards to crash, and trust to disappear overnight.
This journey was more than just a technical challenge — it was a test of system design under extreme pressure, requiring deep knowledge of databases, hard trade‑offs, and precise decision-making.
---
Why We Had to Migrate
Our legacy database had served us well, but scale had changed the game.
- Queries that used to take milliseconds now took seconds.
- Batch processes (such as settlement) ran for hours.
- We had already scaled vertically (stronger servers) and horizontally (read replicas) — still insufficient.
- Schema rigidity made feature development feel like open-heart surgery.
After crossing 1 billion records, performance was unacceptable.
Downtime was not an option.
Migration was unavoidable.
---
Step 1 — Bulk Migrate the Cold Data
We started with cold data — historical transactions that were no longer updated.
Why this is tricky:
- Exporting billions of rows consumes massive buffers, triggers disk spills, and risks crashes.
- Inserts update indexes, slowing everything down.
- Foreign key constraints require validation against other tables.
Our solution:
- Split tables into chunks by primary key ranges (e.g., IDs 1–5M, then 5M–10M).
- Disable secondary indexes and constraints during load.
- Run multiple parallel worker processes.
- After each chunk, run checksums to ensure exact data matches.
Lesson learned:
For massive migrations, avoid a single big move. Use chunks + parallel workers + idempotency.
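
To make that concrete, here is a minimal sketch of one chunked-copy worker, assuming a Postgres-compatible source and target. The `transactions` table, its columns, the DSNs, the 5M chunk size, and the application-level digest are all illustrative stand-ins, not our production code:

```python
# Minimal sketch of one chunked-copy worker with per-chunk verification.
# Table, columns, DSNs, and chunk size are illustrative stand-ins; the
# checksum here is a simple application-level digest.
import hashlib
from concurrent.futures import ThreadPoolExecutor

import psycopg2  # assuming both sides speak the Postgres wire protocol

OLD_DSN = "postgresql://user:pass@old-db/payments"   # hypothetical
NEW_DSN = "postgresql://user:pass@new-db/payments"   # hypothetical
CHUNK = 5_000_000
COLUMNS = "id, amount, currency, created_at"

def digest(rows):
    """Deterministic digest over a sorted textual form of the rows."""
    h = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        h.update(row.encode())
    return h.hexdigest()

def copy_chunk(lo, hi):
    """Copy rows with lo <= id < hi; safe to re-run thanks to the idempotent upsert."""
    with psycopg2.connect(OLD_DSN) as src, psycopg2.connect(NEW_DSN) as dst:
        with src.cursor() as rc, dst.cursor() as wc:
            rc.execute(
                f"SELECT {COLUMNS} FROM transactions WHERE id >= %s AND id < %s",
                (lo, hi))
            rows = rc.fetchall()
            wc.executemany(
                f"INSERT INTO transactions ({COLUMNS}) VALUES (%s, %s, %s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                rows)
            # Verify before moving on: read the range back and compare digests.
            wc.execute(
                f"SELECT {COLUMNS} FROM transactions WHERE id >= %s AND id < %s",
                (lo, hi))
            if digest(rows) != digest(wc.fetchall()):
                raise RuntimeError(f"checksum mismatch for chunk {lo}-{hi}")

if __name__ == "__main__":
    max_id = 1_000_000_000                      # upper bound of the cold data
    ranges = [(lo, lo + CHUNK) for lo in range(0, max_id + 1, CHUNK)]
    with ThreadPoolExecutor(max_workers=8) as pool:  # parallel workers
        for _ in pool.map(lambda r: copy_chunk(*r), ranges):
            pass
```

Because every chunk is an idempotent upsert followed by a verified digest, a crashed worker can simply be restarted on the same range.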
---
Step 2 — Dual Writes to Handle Real-Time Traffic
While migrating old data, new transactions kept pouring in every second.
Our approach:
- Application code performed dual writes to both old and new databases.
- Failed writes to the new DB went into a Kafka retry queue.
- Consumers retried until successful.
- Each write had a unique ID to ensure idempotency.
Why it works:
Relational databases write to a WAL (Write-Ahead Log) before committing, so a committed write to the old database is durable and can serve as the source of truth. Once that write succeeded, we kept retrying against the new database until both sides matched.
> Lesson: Dual writing is a practical, low-cost alternative to full distributed transactions. A retry queue saves you from partial failures.
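
A stripped-down version of that write path might look like the sketch below. The `payments` table, connection handling, and the `payments.migration.retries` topic are hypothetical; the essential parts are the per-write UUID and the fact that a failed write to the new DB is never lost, only deferred:

```python
# Sketch of the dual-write path. Table, DSN handling, and topic name are
# hypothetical; the key ideas are the per-write UUID and the Kafka hand-off
# when the new database rejects the write.
import json
import uuid

from kafka import KafkaProducer  # assuming kafka-python on the application side

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def record_payment(old_conn, new_conn, amount, currency):
    write_id = str(uuid.uuid4())  # unique ID makes every retry idempotent

    # 1. The old database stays the source of truth; this write must succeed.
    with old_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO payments (write_id, amount, currency) VALUES (%s, %s, %s)",
            (write_id, amount, currency))
    old_conn.commit()

    # 2. Best-effort write to the new database; on failure, defer to the retry topic.
    try:
        with new_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO payments (write_id, amount, currency) "
                "VALUES (%s, %s, %s) ON CONFLICT (write_id) DO NOTHING",  # replay-safe
                (write_id, amount, currency))
        new_conn.commit()
    except Exception:
        new_conn.rollback()
        producer.send("payments.migration.retries",
                      {"write_id": write_id, "amount": amount, "currency": currency})

    return write_id
```

A separate consumer drains the retry topic and replays the same `INSERT ... ON CONFLICT DO NOTHING` until it succeeds, so a replay can never double-apply a payment.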
---
Step 3 — Shadow Reads for Silent Validation
With both databases in sync, the question remained: Can we trust the new one?
Shadow reads strategy:
- Customers read from the old DB.
- The same queries execute silently on the new DB.
- Results are compared and differences logged.
Findings:
- Time zone mismatches (`TIMESTAMP WITHOUT TIME ZONE` vs `TIMESTAMP WITH TIME ZONE`).
- NULLs turning into default values.
- Different sorting due to UTF‑8 vs Latin1 collations.
> Lesson: Shadow traffic is essential; staging won’t surface real-world quirks.
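
The read path can stay almost untouched: a thin wrapper like the sketch below (connection handling simplified, names illustrative) serves users from the old database while replaying the same query against the new one off the request path.

```python
# Sketch of a shadow-read wrapper: users are always served from the old
# database; the new database gets the same query in the background and any
# mismatch is logged. Connection handling is simplified for brevity.
import logging
import threading

log = logging.getLogger("shadow_reads")

def shadow_read(old_conn, new_conn, query, params):
    # 1. Serve the caller from the old database, exactly as before.
    with old_conn.cursor() as cur:
        cur.execute(query, params)
        primary_rows = cur.fetchall()

    # 2. Replay the same query against the new database in the background.
    def compare():
        try:
            with new_conn.cursor() as cur:
                cur.execute(query, params)
                shadow_rows = cur.fetchall()
            if shadow_rows != primary_rows:
                log.warning("shadow mismatch: query=%s params=%s (%d vs %d rows)",
                            query, params, len(primary_rows), len(shadow_rows))
        except Exception:
            log.exception("shadow read failed: query=%s", query)

    threading.Thread(target=compare, daemon=True).start()
    return primary_rows
```

In production the shadow connection would come from a pool and only a sampled fraction of traffic would be mirrored, but the structure stays the same.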
---
Step 4 — The Switch (Cutover Night)
Data was synced, shadow reads passed — time for cutover.
Risks:
- Cold buffer pools cause slow queries.
- Unwarmed indexes hurt performance.
- Background tasks may spike disk I/O.
Our plan:
- Pre‑warm caches with synthetic queries.
- Schedule cutover during lowest traffic (4:30 AM).
- Enable feature flags to redirect reads to the new DB.
- Keep dual writes active for safety.
First 10 minutes monitoring Grafana:
- Latency — stable.
- Error rate — zero.
- Business metrics — all green.
> Lesson: Cutover requires pre‑warming, monitoring, and rollback readiness — not just flipping a switch.
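
Two small pieces carried most of that plan: a rollout flag on the read path and a cache pre-warmer. The sketch below is illustrative only; the flag store, connection pools, and hot-query list are placeholders for whatever your stack provides.

```python
# Sketch of the two cutover helpers: a rollout flag on the read path and a
# cache pre-warmer. Flag store, pools, and hot-query list are placeholders.
import random

FLAGS = {"reads_from_new_db": 1.0}  # 0.0 = all reads on the old DB, 1.0 = all on the new

def read_connection(old_pool, new_pool):
    """Pick the database that serves this read, honoring the rollout percentage."""
    if random.random() < FLAGS["reads_from_new_db"]:
        return new_pool.getconn()
    return old_pool.getconn()

def warm_caches(conn, hot_queries):
    """Replay known-hot queries so buffer pools and indexes are loaded before cutover."""
    with conn.cursor() as cur:
        for query, params in hot_queries:
            cur.execute(query, params)
            cur.fetchall()  # results are discarded; the goal is to pull pages into memory
```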
---
Step 5 — Observability: The True Lifeline
Success came not from complex SQL, but from observability.
We monitored:
- Replication lag.
- Deadlocks.
- Cache hit rates.
- Shadow read mismatches.
- Business KPIs — orders, revenue flow.
> Lesson: Migration is a monitoring problem disguised as a data problem.
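
Most of these checks were ordinary dashboards, but the probe we leaned on hardest was a simple drift check between the two databases, roughly like the sketch below (table name, interval, and threshold are illustrative).

```python
# Sketch of a drift probe: compare the newest write visible in each database
# and raise an alert when the gap grows. Table name and threshold are
# illustrative placeholders.
import logging
import time

log = logging.getLogger("migration_monitor")
MAX_LAG_ROWS = 1_000  # hypothetical alert threshold

def max_id(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(id), 0) FROM payments")
        return cur.fetchone()[0]

def watch(old_conn, new_conn, interval_s=10):
    while True:
        lag = max_id(old_conn) - max_id(new_conn)
        log.info("dual-write lag: %d rows", lag)
        if lag > MAX_LAG_ROWS:
            log.error("lag above threshold, page the on-call")  # hook alerting in here
        time.sleep(interval_s)
```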
---
Key Trade-offs We Made
Big Bang vs Phased Rollout
- Big Bang: fast, irreversible.
- Phased: slower, reversible.
- ✅ We chose phased.
ETL vs Dual Write
- ETL: simpler, but cannot capture real-time writes.
- Dual Write: more complex, but keeps live traffic safe.
- ✅ We chose dual write.
Index Creation Timing
- During load: slows everything down.
- After load: faster overall.
- ✅ We delayed index creation.
---
The Human Side of Migration
- Business teams: “Can’t you just finish over the weekend?”
- DBAs: lost sleep over possible data corruption.
- Developers: watched dashboards like a patient’s vital signs.
When success came, there were no cheers, only relieved silence.
---
Final Lessons for Zero-Downtime Migrations
- Idempotence is mandatory.
- Disable indexes/constraints during bulk loads.
- Use dual writes with unique IDs.
- Perform shadow reads under real traffic.
- Warm caches before cutover.
- Monitor internals, not just application metrics.
- Always have a rollback plan.
---
Conclusion
Migrating 1 billion records taught us:
It’s not just a database problem — it’s a system design problem.
You migrate batch by batch, WAL entry by WAL entry, checksum by checksum.
Treat migration like building a distributed system, and zero downtime becomes a matter of design, not luck.
---