# 🛠 Large-Scale System Refactoring: Lessons from a 24-Month Journey
## 📚 Table of Contents
1. **Definition of “Refactoring”**
2. **Refactoring Decision: When to Take Action?**
3. **Scope: Migration vs. Rebuild**
4. **Refactoring Strategy: Strangler Fig Pattern**
5. **Risk Control: Ensuring Zero Failure**
6. **Data Migration: The Hardest Challenge**
7. **Principles of Architecture Design**
8. **Review & Reflection**
---
## 📖 Introduction
In the previous article, [“Are Microservices a Tumor?”](https://mp.weixin.qq.com/s?__biz=MzI2NDU4OTExOQ==&mid=2247692823&idx=1&sn=7c54847bf978f65b774a26469cb4d083&scene=21#wechat_redirect), we discussed how architecture design and evolution must align with business needs over time.
This piece is a deep-dive, real-world case study of a **24-month refactoring project** — showing how to decide **when to migrate** and **when to rebuild** entirely.
### 📈 Key Outcomes
Over 24 months, the team achieved:
- **Availability**: ↑ from <99% to **99.99%**
- **Latency**: ↓ core path response times by 50%
- **Repo consolidation**: 200+ → 1 monorepo, codebase reduced 3M → 300K LOC
- **Cost savings**:
  - Recommendation platform –58%
  - Redis –70%
  - Per 100M PV cost –95%
Guiding principle: **Conway’s Law** — org communication structure determines system architecture. This case focuses purely on technical architecture, assuming an **efficient, unconstrained organization**.
---
## 1️⃣ Definition of “Refactoring”
Refactoring is common in internet companies, yet misunderstood.
We can define it from **two perspectives**:
- **Micro (code level)**:
  > Structured changes to improve internal code structure **without changing external behavior**.

  Principle: the same input must yield the same output before and after refactoring.
- **Macro (business-system level)**:
  > An engineering process improving architecture, code, and data models without altering external behavior, aiming to enhance non-functional metrics (maintainability, scalability, performance).

### 🎯 Invariants
- **Transaction systems**: Preserve flow integrity & idempotency.
- **Recommendation systems**: Preserve process integrity and KPI stability — invariants here are looser.
Principle:
> The **strength of the invariance constraints** tracks the strength of the original system's data-consistency requirements.
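
The code-level invariant (same input, same output) can be checked mechanically. Below is a minimal sketch, assuming two hypothetical functions, `legacy_discount` and `refactored_discount`, that stand in for the code before and after a refactoring; neither comes from the project described here.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def find_behavior_diffs(
    old: Callable[[T], R],
    new: Callable[[T], R],
    inputs: Iterable[T],
) -> List[Tuple[T, R, R]]:
    """Return (input, old_output, new_output) for every input where outputs differ."""
    diffs = []
    for x in inputs:
        before, after = old(x), new(x)
        if before != after:
            diffs.append((x, before, after))
    return diffs


# Hypothetical legacy implementation: its observable behavior is the invariant.
def legacy_discount(amount_cents: int) -> int:
    if amount_cents >= 10000:
        return amount_cents - amount_cents * 10 // 100
    else:
        return amount_cents


# Hypothetical refactored implementation: restructured, same external behavior.
def refactored_discount(amount_cents: int) -> int:
    rate = 10 if amount_cents >= 10000 else 0
    return amount_cents - amount_cents * rate // 100
```

An empty result from `find_behavior_diffs` over a representative input set is evidence (not proof) that the invariant holds. For systems with looser invariants, the equality check would be replaced by a tolerance on business metrics.
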
---
### 💡 Why Refactor?
Two main triggers:
1. **Non-functional issues** impacting functionality.
2. **Business transformation** requiring new core capabilities.
**Case**: In late 2021, Tencent News was shifting to **personalized, engine-driven community content**. The legacy portal-era system couldn’t support it — technical debt was critical.
---
### ⚠️ The Situation Back Then
- Availability < 99%
- 200+ repos, 3M+ LOC
- 2000+ physical services
- High costs, slow development, poor scalability
Core pain points:
1. **Performance bottlenecks**: latency grew from 500 ms to 1,200 ms, and resources were poorly reused.
2. **Scalability issues**: tight, monolith-style coupling blocked innovation.
3. **High maintenance costs**: legacy code, staff departures, and logic that was hard to understand.
---
### 🎯 Refactoring Goals
- <800ms core chain latency
- Flexible strategy configs
- Availability > 99.9%
- Capacity for 3–5 years’ growth
---
## 2️⃣ Refactoring Decision: When to Take Action?
**Guidelines:**
1. 📊 **Technical Debt Interest**: If more than 30% of engineering time is spent on legacy issues, refactor.
2. 🎯 **Business Goal Feasibility**: If the current system cannot meet core requirements, or can only do so at prohibitive cost, refactor.
3. 👥 **Team Morale**: If the state of the system is hurting talent retention, refactor.
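
The first guideline is simple arithmetic. A minimal sketch, assuming time is tracked in hours and that the function names here are illustrative only:

```python
def debt_interest_ratio(legacy_hours: float, total_hours: float) -> float:
    """Share of engineering time spent servicing legacy issues."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return legacy_hours / total_hours


def should_refactor(
    legacy_hours: float, total_hours: float, threshold: float = 0.30
) -> bool:
    """Apply the 30% guideline: refactor once the ratio exceeds the threshold."""
    return debt_interest_ratio(legacy_hours, total_hours) > threshold
```

The ratio is only one of the three signals above; it is meant to start the conversation, not to decide it.
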
**ROI Example**: Migration could cut ops costs by 70%, save millions annually, and boost metrics, which makes a compelling case to management.
---
## 3️⃣ Scope: Migration vs. Rebuild
### 🔄 Migration
**Keep** business logic, **change** technical architecture:
- Ideal if logic is mature/stable
- Lower risk
- Faster wins
**Example**:
Unified recall services and configurable strategies → improved performance, minimal business disruption.
---
### 🏗️ Rebuild
**Redesign** business logic:
- Needed when the existing logic is flawed
- Needed when tech debt is too severe for migration to be cost-effective
- Required for new capabilities
**Example**:
- Exposure deduplication (only count *real* exposures)
- Index platform overhaul
- Plugin recommendation upgrade for acquisition/reactivation goals
---
### 🎯 Hybrid Approach
Tencent News Recommendation System:
**Migration-first, partial rebuild** to control risk and solve long-standing pain points.
---
## 4️⃣ Refactoring Strategy: Strangler Fig Pattern
Chosen for:
- Business continuity
- Controlled risk
- Continuous value delivery
**Execution Phases**:
1. Proxy Layer (API Gateway)
2. Modular decomposition
3. Gradual traffic cutover (canary releases)
4. Decommission legacy modules after stability
**Data Validation**: Dual write + async comparison before traffic switch.
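
The proxy layer and gradual cutover can be sketched as follows. This is an illustrative sketch, not the team's gateway: `StranglerRouter`, the module names, and the hash-bucket scheme are assumptions chosen to show per-module canary percentages with a stable user-to-bucket mapping.

```python
import hashlib
from typing import Callable, Dict

Handler = Callable[[str], str]


class StranglerRouter:
    """Proxy-layer sketch: routes each module's traffic to legacy or new code."""

    def __init__(self, legacy: Handler, modern: Handler) -> None:
        self._legacy = legacy
        self._modern = modern
        self._rollout_percent: Dict[str, int] = {}  # module name -> 0..100

    def set_rollout(self, module: str, percent: int) -> None:
        """Set the share of traffic sent to the new system; 0 acts as a rollback."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout_percent[module] = percent

    @staticmethod
    def _bucket(user_id: str) -> int:
        # Stable hash, so a given user always lands in the same bucket (0..99).
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100

    def handle(self, module: str, user_id: str) -> str:
        # Modules without an explicit rollout stay entirely on the legacy path.
        percent = self._rollout_percent.get(module, 0)
        use_modern = self._bucket(user_id) < percent
        return (self._modern if use_modern else self._legacy)(user_id)
```

Raising the percentage step by step corresponds to phase 3; setting it back to 0 is the rollback path; a module that has sat at 100 long enough is a candidate for phase 4.
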
Pros:
- Risk control
- Easy rollback
Cons:
- Higher resource cost maintaining dual systems
---
## 5️⃣ Risk Control: Ensuring Zero Failure
### 🧪 Testing Layers
- Unit tests (target: ≥80% coverage)
- Integration tests with traffic replay
- Online A/B testing
**Diff Platform** for invariance checks.
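
The article does not describe the Diff Platform's internals. As a rough sketch of the idea, assuming recorded requests are replayed against both systems and that fields such as `trace_id` and `served_at` (hypothetical names) legitimately differ between runs:

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence

Request = Dict[str, Any]
Response = Dict[str, Any]
Service = Callable[[Request], Response]


def normalize(response: Response, ignored_fields: Sequence[str]) -> Response:
    """Drop fields that are expected to differ between two runs."""
    return {k: v for k, v in response.items() if k not in ignored_fields}


def replay_and_diff(
    recorded_requests: Iterable[Request],
    legacy: Service,
    candidate: Service,
    ignored_fields: Sequence[str] = ("trace_id", "served_at"),
) -> List[Dict[str, Any]]:
    """Replay each recorded request against both systems and collect mismatches."""
    mismatches = []
    for request in recorded_requests:
        expected = normalize(legacy(request), ignored_fields)
        actual = normalize(candidate(request), ignored_fields)
        if expected != actual:
            mismatches.append(
                {"request": request, "expected": expected, "actual": actual}
            )
    return mismatches
```

Exact equality suits strict invariants; for recommendation results, a real comparison would more likely tolerate bounded differences.
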
---
### 📈 Monitoring
Aggressive thresholds for latency, errors, CTR, dwell time, resource usage.
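
A threshold check of this kind might look like the sketch below. The metric names and limits are placeholders: only the 800 ms latency figure echoes the stated refactoring goal, and the error-rate and CTR values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool  # True for latency/errors, False for CTR/dwell time


# Placeholder values; real limits depend on the system's own baselines.
EXAMPLE_THRESHOLDS = [
    Threshold("latency_p99_ms", 800.0, higher_is_worse=True),
    Threshold("error_rate", 0.001, higher_is_worse=True),
    Threshold("ctr", 0.05, higher_is_worse=False),
]


def find_breaches(
    thresholds: List[Threshold], observed: Dict[str, float]
) -> List[str]:
    """Return one alert message per breached or missing metric."""
    alerts = []
    for t in thresholds:
        value = observed.get(t.metric)
        if value is None:
            # Missing data is treated as an alert rather than silently passing.
            alerts.append(f"{t.metric}: no data")
            continue
        breached = value > t.limit if t.higher_is_worse else value < t.limit
        if breached:
            alerts.append(f"{t.metric}: {value} breaches limit {t.limit}")
    return alerts
```

Business metrics such as CTR are checked alongside technical ones because a refactoring can be technically healthy and still degrade recommendations.
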
---
### 🔙 Rollback Plans
- One-click gateway traffic rollback
- DB backward compatibility
**Coordination**: “Refactor War Room” — all stakeholders synced biweekly.
---
## 6️⃣ Data Migration: The Hardest Challenge
Constraints:
- Large volume
- Zero downtime
- Data format changes
**Five-Step Online Migration**:
1. Full + incremental sync
2. Dual writes
3. Data validation
4. Gradual read switch
5. Write switch
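
The five steps can be sketched with two in-memory dictionaries standing in for the old and new stores. This is a simplified, single-process illustration: `MigratingStore` and `transform` are hypothetical names, incremental sync is not modeled separately, and the read switch is a single flag rather than a gradual percentage.

```python
from typing import Callable, Dict, List, Optional


class MigratingStore:
    """Sketch of an online migration between two key-value stores."""

    def __init__(
        self,
        old: Dict[str, str],
        new: Dict[str, str],
        transform: Callable[[str], str],
    ) -> None:
        self._old = old
        self._new = new
        self._transform = transform  # models the data format change
        self._write_old = True
        self._write_new = False
        self._read_new = False

    def backfill(self) -> None:
        """Step 1: copy existing data into the new store in the new format."""
        for key, value in self._old.items():
            self._new[key] = self._transform(value)

    def enable_dual_writes(self) -> None:
        """Step 2: every write now goes to both stores."""
        self._write_new = True

    def validate(self) -> List[str]:
        """Step 3: return keys whose new-store value does not match the old one."""
        keys = set(self._old) | set(self._new)
        return sorted(
            k for k in keys
            if k not in self._old
            or self._new.get(k) != self._transform(self._old[k])
        )

    def switch_reads(self) -> None:
        """Step 4: serve reads from the new store."""
        self._read_new = True

    def switch_writes(self) -> None:
        """Step 5: stop writing to the old store."""
        self._write_new = True
        self._write_old = False

    def put(self, key: str, value: str) -> None:
        if self._write_old:
            self._old[key] = value
        if self._write_new:
            self._new[key] = self._transform(value)

    def get(self, key: str) -> Optional[str]:
        return (self._new if self._read_new else self._old).get(key)
```

Keeping the old store written until the final step is what makes rollback possible: until `switch_writes()` is called, reads can be pointed back at the old store without data loss.
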
**Pitfalls**:
- Incomplete validation → KPI drop
- Underestimating performance at scale
---
## 7️⃣ Principles of Architecture Design
1. **High Cohesion, Low Coupling**
   - Interface-level coupling only
   - Service separation: recall, rank, features, strategies
2. **Extensibility by Design**
   - Strategy pattern for pluggable logic
3. **CAP Trade-offs**
   - Prefer AP for recommendation systems
   - Eventual consistency with async sync + caching
4. **Performance vs. Cost Balance**
   - Two-stage ranking for latency goals
   - Tiered caching based on data heat
5. **Observability**
   - Monitoring, logs, tracing (OpenTelemetry)
   - Unified “Diagnosis” platform for issue resolution
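
As an illustration of the strategy pattern in principle 2, here is a minimal sketch of pluggable recall strategies behind one interface. The class names and the in-memory data are hypothetical; they only show how strategies can be registered and enabled by configuration without touching the calling service.

```python
from typing import Dict, List, Protocol


class RecallStrategy(Protocol):
    def recall(self, user_id: str, limit: int) -> List[str]: ...


class HotItemsRecall:
    """Returns the same globally popular items for every user."""

    def __init__(self, hot_items: List[str]) -> None:
        self._hot_items = hot_items

    def recall(self, user_id: str, limit: int) -> List[str]:
        return self._hot_items[:limit]


class FollowedTopicsRecall:
    """Returns items from topics the user follows."""

    def __init__(self, items_by_user: Dict[str, List[str]]) -> None:
        self._items_by_user = items_by_user

    def recall(self, user_id: str, limit: int) -> List[str]:
        return self._items_by_user.get(user_id, [])[:limit]


class RecallService:
    """Depends only on the RecallStrategy interface, not on concrete classes."""

    def __init__(self) -> None:
        self._strategies: Dict[str, RecallStrategy] = {}

    def register(self, name: str, strategy: RecallStrategy) -> None:
        self._strategies[name] = strategy

    def recall(self, enabled: List[str], user_id: str, limit: int) -> List[str]:
        """Merge results from the enabled strategies in order, dropping duplicates."""
        merged: List[str] = []
        seen = set()
        for name in enabled:
            for item in self._strategies[name].recall(user_id, limit):
                if item not in seen:
                    seen.add(item)
                    merged.append(item)
        return merged[:limit]
```

Because the list of enabled strategy names is plain data, it can come from configuration, which is one way to read the "flexible strategy configs" goal stated earlier.
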
---
## 8️⃣ Review & Reflection
### 📊 Results
- **Availability**: 99.99%+, zero major incidents
- **Performance**: ↓ latency by 50%
- **Iteration speed**: doubled
- **Costs**: –58% platform, –70% Redis, –95% index per-100M PV
- **Governance**: Repo consolidation, LOC –90%, cluster count –80%, UT coverage ↑ to 60%
---
### 💭 Lessons Learned
1. Involve business teams early
2. Allow more time
3. Document & retain knowledge continuously
4. Tackle deep technical debt decisively
---
## 📝 Final Reminders
- Don’t over-rebuild to “show off” tech: base decisions on ROI & risk.
- Progressive refactoring is safer than big-bang rewrites.
- Align architecture lifespan (3–5 years) with business growth plans.
- Prevent problems proactively — **treat disease before it manifests**.
---
> *"The skillful warrior achieves no fame for wisdom nor merit for bravery"*
> *"Treading on frost foretells hard ice to come"*
---