Kafka Optimization

How Tencent Music Cut Kafka Costs by Over 50% with AutoMQ

Honghao Wang

23 Oct 2025 — 4 min read

AutoMQ Case Study: Tencent Music’s Cloud-Native Kafka Upgrade

Date: 2025-10-23 14:52 Beijing

---

Editor’s Note

Tencent Music Entertainment Group, a leader in China’s online music and audio market, operates numerous popular mobile audio apps that generate massive volumes of user behavior and business data daily. This data fuels:

Precise recommendations
User growth strategies
Monetization initiatives

At the heart of this lies a powerful, stable, and efficient Kafka streaming system.

The Challenge

As Tencent Music’s business grew, operational complexity and cost in traditional self-managed Kafka environments became bottlenecks. To support surging data streams while controlling costs, the operations team explored next-generation Kafka solutions.

The Solution: AutoMQ, a cloud-native Kafka architecture, reduced cluster costs by over 50%, enabled second-level partition migration for rapid scaling, and cut operational overhead dramatically. This mirrored earlier success in decoupling compute and storage for their data warehouses.

---

Background

Tencent Music serves a broad, diverse user base in China, delivering music streaming and social entertainment experiences.

---

Technical Architecture

For massive user volumes, fast and reliable data flow, processing, and analytics are essential. Kafka forms the backbone of Tencent Music’s data infrastructure, achieving:

Upstream/downstream decoupling
Real-time, on-demand data consumption
Independent evolution of producers and consumers

Architecture Flow

From source collection to ingestion → Kafka streaming (via AutoMQ) → real-time processing → storage → business applications.

---

Key Pipeline Stages

1. Data Source

Observability Data: logs, metrics, traces
Analytical Data: metadata (songs, artists, copyright), user actions (playback, comments)

2. Data Ingestion

Unified access via Tencent Music’s internal Data Channel platform:

Preprocessing: segmentation, field filtering, security checks, dispatch
Collection mechanisms:
App SDKs for event tracking
Standard Kafka Producers for business services
VM Agents for system logs/metrics

3. Stream System (Kafka via AutoMQ)

Centralized AutoMQ clusters manage massive, multi-business data flows with high throughput and low latency.

4. Real-Time Computation

Using Flink for:

Aggregation & filtering
Real-time monitoring/alerting
Pre-cleaning data before storage

5. Data Storage

OLAP DBs for BI & analytics
Elasticsearch for fast search/troubleshooting

6. Data Applications

Observability: monitoring, diagnostics, event analysis
Business Analytics: recommendations, user insights, data modeling

---

Kafka Challenges

Escalating Costs

Issues with Kafka’s integrated architecture led to:

High idle resource reservations (30%–40% or more) for traffic peaks
Expensive local disk storage costs
Extra CPU usage from multi-replica synchronization

Operational Burden

Scaling up/down = lengthy “major surgery” (~1 day, heavy manual steps)
Hotspot mitigation requires manual write strategy adjustments
High risk of service impact during operations

---

Why AutoMQ

Key Benefits:

Rapid Elasticity — Stateless Brokers + object storage → partition migration in seconds
Lower Costs — Independent scaling of compute & storage, object storage replaces local disks
Kubernetes-Native — Fully stateless architecture for optimal scheduling & failover
Iceberg Support — Table Topic streams directly to Iceberg tables, cutting ETL overhead
Seamless Migration — Full Kafka protocol compatibility, zero app code changes

---

Evaluation & Migration Process

Phase 1 — Load Validation

June 2025: High-throughput test (large data, low QPS) → validated sustained performance with object storage
July 2025: High-QPS test (small messages, frequent connections) → validated metadata handling and small I/O aggregation

Outcome: AutoMQ met/exceeded production performance expectations → green light for migration.

---

Phase 2 — Production Migration

Three Steps:

Switch Producers — Roll out endpoint changes → new data flows into AutoMQ
Drain Old Cluster — Consumers process all historical data in old cluster
Switch Consumers — Move consumers to AutoMQ; start from earliest offset

---

Go-Live Results

6 AutoMQ clusters deployed
Peak write throughput: 1.6 GiB/s
Peak QPS: ~480K
>50% cost reduction (compute + storage)
Second-level scaling → +1 GiB/s capacity in tens of seconds
Dramatic improvement in operational agility & risk reduction

---

Future Outlook

Planned next steps:

Full migration of all Kafka clusters to AutoMQ
Implement Iceberg ingestion via Table Topic → streamlined data lake pipeline
Standardize AutoMQ as core infrastructure across Tencent Music
Migrate Kafka to Kubernetes → full cloud-native adoption

---

References:

Read Original Article

Open in WeChat

---

If you'd like, I can extend this into a detailed "Migration Playbook" with risk mitigation strategies and rollback plans for large-scale Kafka-to-AutoMQ transitions. Would you like me to prepare that next?

Top Game Development Frameworks

Impressive Architecture Designs in XXL-Job

The Excel That Gave Office Workers Headaches Got an AI Makeover… And I’m Hooked

Chinese AI Models Stun Silicon Valley: Airbnb Co‑Founder CEO Praises “Better, Faster, Cheaper” — Even Turns Down ChatGPT Collaboration