From 3 Million Lines to 300,000: Tencent News Recommendation Architecture Overhaul Review

# 🛠 Large-Scale System Refactoring: Lessons from a 24-Month Journey

## 📚 Table of Contents
1. **Definition of “Refactoring”**
2. **Refactoring Decision: When to Take Action?**
3. **Scope: Migration vs. Rebuild**
4. **Refactoring Strategy: Strangler Fig Pattern**
5. **Risk Control: Ensuring Zero Failure**
6. **Data Migration: The Hardest Challenge**
7. **Principles of Architecture Design**
8. **Review & Reflection**

---

## 📖 Introduction

In the previous article, [“Are Microservices a Tumor?”](https://mp.weixin.qq.com/s?__biz=MzI2NDU4OTExOQ==&mid=2247692823&idx=1&sn=7c54847bf978f65b774a26469cb4d083&scene=21#wechat_redirect), we discussed how architecture design and evolution must align with business needs over time.

This piece is a deep-dive, real-world case study of a **24-month refactoring project** — showing how to decide **when to migrate** and **when to rebuild** entirely.

### 📈 Key Outcomes
Over 24 months, the team achieved:
- **Availability**: ↑ from <99% to **99.99%**
- **Latency**: ↓ core path response times by 50%
- **Repo consolidation**: 200+ → 1 monorepo, codebase reduced 3M → 300K LOC
- **Cost savings**:  
  - Recommendation platform –58%  
  - Redis –70%  
  - Index cost per 100M PV –95%

Guiding principle: **Conway’s Law**, which says that an organization's communication structure shapes its system architecture. This case study focuses purely on technical architecture and assumes an **efficient, unconstrained organization**.

---

## 1️⃣ Definition of “Refactoring”

Refactoring is common in internet companies, yet misunderstood.  
We can define it from **two perspectives**:

- **Micro (Code-level)**:  
  > Structured changes to improve internal code structure **without changing external behavior**.  
  Principle: Same input must yield same output before and after refactoring.

- **Macro (Business System-level)**:  
  > An engineering process improving architecture, code, and data models without altering external behavior, aiming to enhance non-functional metrics (maintainability, scalability, performance).

### 🎯 Invariants
- **Transaction systems**: Preserve flow integrity & idempotency.
- **Recommendation systems**: Preserve process integrity and KPI stability — invariants here are looser.

Principle:  
> The **strength of the invariance constraints** should match the strength of the original system's data consistency requirements.

---


### 💡 Why Refactor?
Two main triggers:
1. **Non-functional issues** impacting functionality.
2. **Business transformation** requiring new core capabilities.

**Case**: In late 2021, Tencent News was shifting to **personalized, engine-driven community content**. The legacy portal-era system couldn’t support it — technical debt was critical.

---

### ⚠️ The Situation Back Then
- Availability < 99%
- 200+ repos, 3M+ LOC
- 2000+ physical services
- High costs, slow development, poor scalability

Core pain points:
1. **Performance bottlenecks**: latency grew from 500ms to 1200ms, and resources were poorly reused.
2. **Scalability issues**: tight monolithic coupling blocked innovation.
3. **High maintenance costs**: legacy code, staff departures, and logic that was hard to interpret.

---

### 🎯 Refactoring Goals
- Core path latency < 800ms
- Flexible strategy configs
- Availability > 99.9%
- Capacity for 3–5 years’ growth

---

## 2️⃣ Refactoring Decision: When to Take Action?

**Guidelines:**
1. 📊 **Technical Debt Interest**: Refactor if more than 30% of the team's time goes to legacy issues.
2. 🎯 **Business Goal Feasibility**: Refactor if core requirements cannot be met, or can only be met at prohibitive cost.
3. 👥 **Team Morale**: Refactor if the state of the system is hurting talent retention.

**ROI Example**: Migration could cut ops costs by 70%, save millions annually, and boost metrics, which made a compelling case to management.

---

## 3️⃣ Scope: Migration vs. Rebuild

### 🔄 Migration
**Keep** business logic, **change** technical architecture:
- Ideal if logic is mature/stable
- Lower risk
- Faster wins

**Example**:  
Unified recall services and configurable strategies → improved performance, minimal business disruption.

---

### 🏗️ Rebuild
**Redesign** business logic:
- Needed when the existing logic is flawed
- Needed when tech debt is too severe for migration to pay off
- Required for new capabilities

**Example**:
- Exposure deduplication (only count *real* exposures)
- Index platform overhaul
- Plugin recommendation upgrade for acquisition/reactivation goals
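
One way to read the exposure deduplication item above is that the filter should only exclude items the client confirmed were shown, not everything the server returned. The sketch below illustrates that reading. The class and method names are hypothetical, and the in-memory sets are an assumption standing in for whatever exposure store the real system uses.

```python
from collections import defaultdict


class ExposureDeduplicator:
    """Filters candidates against items the client confirmed as shown.

    Hypothetical sketch: in-memory sets stand in for a real exposure store.
    """

    def __init__(self) -> None:
        self._delivered: dict[str, set[str]] = defaultdict(set)
        self._exposed: dict[str, set[str]] = defaultdict(set)

    def record_delivered(self, user_id: str, item_ids: list[str]) -> None:
        # Items returned by the server; the user may never scroll to them.
        self._delivered[user_id].update(item_ids)

    def record_exposed(self, user_id: str, item_id: str) -> None:
        # Client-side report that the item was actually rendered.
        self._exposed[user_id].add(item_id)

    def unseen_deliveries(self, user_id: str) -> list[str]:
        # Delivered but never shown: these remain eligible for recommendation.
        delivered = self._delivered.get(user_id, set())
        exposed = self._exposed.get(user_id, set())
        return sorted(delivered - exposed)

    def filter(self, user_id: str, candidates: list[str]) -> list[str]:
        # Drop only real exposures, preserving candidate order.
        seen = self._exposed.get(user_id, set())
        return [item for item in candidates if item not in seen]
```

Filtering on delivered items instead would shrink the candidate pool faster than users actually consume content, which is the flaw this reading of the rebuild addresses.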

---

### 🎯 Hybrid Approach
Tencent News Recommendation System:  
**Migration-first, partial rebuild** to control risk and solve long-standing pain points.

---

## 4️⃣ Refactoring Strategy: Strangler Fig Pattern

Chosen for:
- Business continuity
- Controlled risk
- Continuous value delivery

**Execution Phases**:
1. Proxy Layer (API Gateway)
2. Modular decomposition
3. Gradual traffic cutover (canary releases)
4. Decommission legacy modules after stability
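
The proxy layer and gradual cutover phases above can be sketched as a small routing table that sends a configurable share of each module's traffic to the new system. This is a minimal illustration, not the team's gateway: `StranglerRouter`, `bucket_for`, and the 10,000-bucket granularity are all assumptions made for the example.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: 0.01% of traffic per bucket


def bucket_for(key: str) -> int:
    """Map a key to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


class StranglerRouter:
    """Decides per module whether a request goes to 'legacy' or 'new'."""

    def __init__(self) -> None:
        # module name -> share of traffic (0.0 to 1.0) sent to the new system
        self._rollout: dict[str, float] = {}

    def set_rollout(self, module: str, fraction: float) -> None:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0.0 and 1.0")
        self._rollout[module] = fraction

    def rollback(self, module: str) -> None:
        # Send all traffic for this module back to the legacy system.
        self._rollout[module] = 0.0

    def route(self, module: str, user_id: str) -> str:
        threshold = round(self._rollout.get(module, 0.0) * BUCKETS)
        # Hashing module and user together keeps each user's routing stable
        # for a module while decorrelating rollouts across modules.
        in_canary = bucket_for(f"{module}:{user_id}") < threshold
        return "new" if in_canary else "legacy"
```

Because a user's bucket never changes, raising the fraction only adds users to the new path. Nobody flips back and forth during a ramp-up, and `rollback` returns everyone to the legacy path in one call.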

**Data Validation**: Dual write + async comparison before traffic switch.
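
A rough sketch of dual write with asynchronous comparison follows. It assumes plain dicts in place of the real stores and an explicit `run_comparisons()` call in place of a background worker. The class name and the 99.9% default match rate are illustrative, not taken from the project.

```python
from collections import deque
from typing import Any


class DualWriter:
    """Writes to both stores and compares them later, off the request path.

    Hypothetical sketch: plain dicts stand in for the legacy and new stores.
    """

    def __init__(self, legacy: dict[str, Any], new: dict[str, Any]) -> None:
        self.legacy = legacy
        self.new = new
        self._pending: deque[str] = deque()  # keys waiting to be compared
        self.compared = 0
        self.mismatches: list[str] = []

    def write(self, key: str, value: Any) -> None:
        self.legacy[key] = value  # legacy stays the source of truth
        try:
            self.new[key] = value
        except Exception:
            # A failure in the new store must not fail the request;
            # the later comparison will surface the missing key.
            pass
        self._pending.append(key)

    def run_comparisons(self) -> None:
        # In a real deployment this would run in a background worker.
        while self._pending:
            key = self._pending.popleft()
            self.compared += 1
            if self.legacy.get(key) != self.new.get(key):
                self.mismatches.append(key)

    def match_rate(self) -> float:
        if self.compared == 0:
            return 0.0  # no evidence yet, so do not claim a match
        return 1.0 - len(self.mismatches) / self.compared

    def ready_to_switch(self, min_match_rate: float = 0.999) -> bool:
        return self.compared > 0 and self.match_rate() >= min_match_rate
```

Keeping the comparison off the write path means validation adds no latency to user requests, at the price of learning about mismatches slightly later.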

Pros:
- Risk control
- Easy rollback

Cons:
- Higher resource cost maintaining dual systems

---

## 5️⃣ Risk Control: Ensuring Zero Failure

### 🧪 Testing Layers
- Unit tests (≥80% coverage)
- Integration tests with traffic replay
- Online A/B testing

**Diff Platform** for invariance checks.
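
The article does not describe the Diff Platform's internals, so the following is only a sketch of the idea behind it: a strict field-by-field diff for systems with hard invariants, and a looser top-k overlap check for recommendation results, where exact equality is not expected. The function names and the 0.8 overlap threshold are assumptions.

```python
_MISSING = object()  # distinguishes "key absent" from "value is None"


def strict_diff(old: dict, new: dict) -> list[str]:
    """Return the fields whose values differ (transaction-style invariant)."""
    keys = old.keys() | new.keys()
    return sorted(
        k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING)
    )


def top_k_overlap(old_items: list[str], new_items: list[str], k: int) -> float:
    """Share of the legacy top-k that also appears in the new top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    old_top, new_top = set(old_items[:k]), set(new_items[:k])
    if not old_top:
        return 1.0  # nothing to preserve
    return len(old_top & new_top) / len(old_top)


def recommendation_equivalent(
    old_items: list[str],
    new_items: list[str],
    k: int = 10,
    min_overlap: float = 0.8,
) -> bool:
    """Looser invariant: order may change, but most of the top-k survives."""
    return top_k_overlap(old_items, new_items, k) >= min_overlap
```

Replaying the same recorded request against both systems and feeding the two responses into checks like these gives a pass or fail signal per request, which can then be aggregated before any traffic is switched.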

---

### 📈 Monitoring
Aggressive thresholds for latency, errors, CTR, dwell time, resource usage.
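
As an illustration of threshold-based guarding, the sketch below compares canary metrics against a baseline and reports which ones degraded beyond an allowed relative change. The metric names, the `Threshold` type, and the numeric limits are hypothetical; the article does not give the team's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_degradation: float  # allowed relative change in the bad direction
    higher_is_better: bool  # True for CTR or dwell time, False for latency


def breaches(
    baseline: dict[str, float],
    canary: dict[str, float],
    thresholds: list[Threshold],
) -> list[str]:
    """Return the metrics whose degradation exceeds the allowed limit."""
    breached = []
    for t in thresholds:
        base, current = baseline[t.metric], canary[t.metric]
        if base == 0:
            continue  # relative change is undefined; handle separately
        change = (current - base) / base
        degradation = -change if t.higher_is_better else change
        if degradation > t.max_degradation:
            breached.append(t.metric)
    return breached
```

A non-empty result is the kind of signal that could trigger the gateway traffic rollback automatically instead of waiting for a human to notice a dashboard.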

---

### 🔙 Rollback Plans
- One-click gateway traffic rollback
- DB backward compatibility

**Coordination**: “Refactor War Room” — all stakeholders synced biweekly.

---

## 6️⃣ Data Migration: The Hardest Challenge

Constraints:
- Large volume
- Zero downtime
- Data format changes

**Five-Step Online Migration**:
1. Full + incremental sync  
2. Dual writes  
3. Data validation  
4. Gradual read switch  
5. Write switch
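
The five steps above can be modeled as a phase that decides where each read and write goes. This is a simplified sketch under stated assumptions: dicts stand in for the two stores, `backfill()` is a crude substitute for full plus incremental sync, and the final phase stops writing to the old store (a real rollout might keep dual-writing longer to preserve a rollback path). All names are hypothetical.

```python
from enum import IntEnum
from typing import Any, Optional


class Phase(IntEnum):
    SYNC = 1          # full + incremental sync; traffic uses the old store
    DUAL_WRITE = 2    # writes go to both stores
    VALIDATE = 3      # still dual-writing while the stores are compared
    READ_SWITCH = 4   # a growing share of reads moves to the new store
    WRITE_SWITCH = 5  # the new store takes all reads and writes


class MigratingStore:
    def __init__(self, old: dict[str, Any], new: dict[str, Any]) -> None:
        self.old = old
        self.new = new
        self.phase = Phase.SYNC
        self.read_fraction = 0.0  # share of reads served by the new store

    def backfill(self) -> None:
        # Rerunning this picks up changes made since the previous run.
        self.new.update(self.old)

    def mismatched_keys(self) -> list[str]:
        keys = self.old.keys() | self.new.keys()
        return sorted(k for k in keys if self.old.get(k) != self.new.get(k))

    def advance(self) -> Phase:
        if self.phase == Phase.WRITE_SWITCH:
            raise RuntimeError("migration already complete")
        if self.phase == Phase.VALIDATE and self.mismatched_keys():
            raise RuntimeError("stores differ; fix data before switching reads")
        self.phase = Phase(self.phase + 1)
        return self.phase

    def write(self, key: str, value: Any) -> None:
        if self.phase < Phase.WRITE_SWITCH:
            self.old[key] = value
        if self.phase >= Phase.DUAL_WRITE:
            self.new[key] = value

    def read(self, key: str, bucket: float) -> Optional[Any]:
        # bucket is a stable value in [0, 1), e.g. derived from a user id hash.
        if self.phase == Phase.WRITE_SWITCH:
            use_new = True
        elif self.phase == Phase.READ_SWITCH:
            use_new = bucket < self.read_fraction
        else:
            use_new = False
        return (self.new if use_new else self.old).get(key)
```

Refusing to leave the validation phase while mismatches remain encodes the first pitfall listed below directly into the process: reads cannot move until the data agrees.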

**Pitfalls**:
- Incomplete validation → KPI drop
- Underestimating performance at scale

---

## 7️⃣ Principles of Architecture Design

1. **High Cohesion, Low Coupling**
   - Interface-level coupling only
   - Service separation: recall, rank, features, strategies

2. **Extensibility by Design**
   - Strategy pattern for pluggable logic

3. **CAP Trade-offs**
   - Prefer AP for recommendation systems
   - Eventual consistency with async sync + caching

4. **Performance vs. Cost Balance**
   - Two-stage ranking for latency goals
   - Tiered caching based on data heat

5. **Observability**
   - Monitoring, logs, tracing (OpenTelemetry)
   - Unified “Diagnosis” platform for issue resolution
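
To make the "strategy pattern for pluggable logic" principle concrete, here is a minimal sketch in which recall strategies register themselves by name and a config list decides which ones run. The registry, the decorator, the two example strategies, and the config shape are all invented for illustration.

```python
from typing import Callable

RecallFn = Callable[[str, int], list[str]]

# name -> recall function; filled in by the @strategy decorator
REGISTRY: dict[str, RecallFn] = {}


def strategy(name: str) -> Callable[[RecallFn], RecallFn]:
    """Register a recall function under a config-addressable name."""
    def decorator(fn: RecallFn) -> RecallFn:
        if name in REGISTRY:
            raise ValueError(f"strategy {name!r} already registered")
        REGISTRY[name] = fn
        return fn
    return decorator


@strategy("hot")
def hot_items(user_id: str, limit: int) -> list[str]:
    # Placeholder data; a real strategy would query a hot-content index.
    return ["h1", "h2", "h3"][:limit]


@strategy("followed")
def followed_authors(user_id: str, limit: int) -> list[str]:
    # Placeholder data; a real strategy would look up followed authors.
    return [f"{user_id}-f{i}" for i in range(limit)]


def run_recall(config: list[dict], user_id: str) -> list[str]:
    """Run the configured strategies in order and merge without duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for entry in config:
        recall = REGISTRY[entry["name"]]
        for item in recall(user_id, entry["limit"]):
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

With this shape, adding a strategy means writing one decorated function, and changing the mix means editing config such as `[{"name": "hot", "limit": 2}, {"name": "followed", "limit": 1}]` without touching the pipeline code.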

---

## 8️⃣ Review & Reflection

### 📊 Results
- **Availability**: 99.99%+, zero major incidents
- **Performance**: ↓ latency by 50%
- **Iteration speed**: doubled
- **Costs**: –58% platform, –70% Redis, –95% index per-100M PV
- **Governance**: Repo consolidation, LOC –90%, cluster count –80%, UT coverage ↑ to 60%

---

### 💭 Lessons Learned
1. Involve business teams early
2. Allow more time
3. Document & retain knowledge continuously
4. Tackle deep technical debt decisively

---

## 📝 Final Reminders
- Don’t over-rebuild to “show off” tech: base decisions on ROI & risk.
- Progressive refactoring is safer than big-bang rewrites.
- Align architecture lifespan (3–5 years) with business growth plans.
- Prevent problems proactively — **treat disease before it manifests**.

---

> *"The skillful warrior achieves no fame for wisdom nor merit for bravery"*  
> *"Treading on frost foretells hard ice to come"*
