Bilibili Message System Architecture Upgrade
Key Questions to Address
- High Data Volume vs. Performance Degradation
  - Technically, larger datasets increase the likelihood of performance degradation.
  - In the messaging business, however, heavy-data users are typically high‑impact creators (UPs). The business side does not accept technical downgrades for these users.
  - Challenge: How to ensure robust performance for high‑impact users without compromising quality.
- 10× Traffic Surge Preparedness
  - If message traffic grows tenfold, what strategies can ensure the service remains stable and responsive?
---
1. Current System Overview
Business Perspective


The server-side messaging system is split into four core business lines:
- Customer Service
- System Notifications
- Interaction Notifications
- Private Messages
Private Messages — Four Subtypes
- User One‑to‑One Chat
- B2C Bulk Private Messaging
- Group Chat
- Support Group Assistant
> ⚠ These four subtypes are not yet technically decoupled.
> One‑to‑one chats and bulk private messaging are already close to system capacity limits.

Observation: One‑to‑one messages have a send‑to‑delivery conversion rate (PV/UV) below 10%, which leaves considerable room for optimization.
---
Technical Perspective — Core Concepts
Private Messaging Domain Concepts
- Conversation List: Sorted by chat partner, contains account info, latest message, unread count.
- Conversation Detail: State with a single person (receiver UID, sender UID, unread count, etc.).
- Conversation History: Timeline of content for both peers (one‑to‑one) or shared group.
- Inbox: KV mapping — maps send sequence to content ID.
- Message Content: Raw text + ID + state.
- Timeline Model: Time‑axis abstraction for sync, indexing, and storage.
- Read Diffusion: Group chat model. A message is written once and is visible to many readers.
- Write Diffusion: One‑to‑one chat model. Each message updates both the sender's and the receiver's side; in groups it is used only for ephemeral push notifications.
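The difference between the two diffusion models can be sketched with in-memory structures. This is an illustration only: the class `MessageStore` and its method names are hypothetical, and plain dictionaries stand in for the real inbox and timeline storage.

```python
from collections import defaultdict


class MessageStore:
    """Toy model of write diffusion (1:1 chat) and read diffusion (group chat)."""

    def __init__(self):
        self.content = {}                         # content ID -> raw text
        self.inbox = defaultdict(list)            # user ID -> content IDs (write diffusion)
        self.group_timeline = defaultdict(list)   # group ID -> content IDs (read diffusion)
        self._next_id = 0

    def _save(self, text):
        self._next_id += 1
        self.content[self._next_id] = text
        return self._next_id

    def send_direct(self, sender, receiver, text):
        # Write diffusion: one message, one inbox write per participant.
        cid = self._save(text)
        self.inbox[sender].append(cid)
        self.inbox[receiver].append(cid)
        return cid

    def send_group(self, group, text):
        # Read diffusion: a single write to the shared group timeline.
        cid = self._save(text)
        self.group_timeline[group].append(cid)
        return cid

    def read_group(self, group, after_index=0):
        # Every member reads the same timeline from their own cursor.
        return [self.content[c] for c in self.group_timeline[group][after_index:]]
```

With write diffusion the write cost grows with the number of recipients, which is acceptable for two participants; with read diffusion the write cost is constant and each reader pays at read time.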
---
Core Concept Relationship Map

---
2. Current Issues
Slow Conversation Queries
Root Causes & Risks:
- Cache expiry forces queries to hit MySQL, causing latency spikes.
- MySQL limitations:
  - Sharded by `uid % 1000 / 100` → skewed distribution for high‑volume accounts.
  - Single table contains up to 32 M sessions; largest user: 200 M sessions.
  - Hot accounts cause performance imbalance.
  - MySQL buffer_pool limits → cold data on disk → multi-second queries.
  - Nine non‑clustered indexes already in place → more indexes hurt write performance.
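To see why a UID-based shard key skews under hot accounts, the following sketch reads the expression `uid % 1000 / 100` literally with integer division, which is an assumption about the original scheme. The helper names `shard_of` and `shard_load` are hypothetical.

```python
from collections import Counter


def shard_of(uid: int) -> int:
    """Shard index for a user, assuming integer division (yields 0..9)."""
    return (uid % 1000) // 100


def shard_load(session_counts: dict[int, int]) -> Counter:
    """Total sessions per shard, given a mapping of uid -> session count."""
    load = Counter()
    for uid, sessions in session_counts.items():
        # All sessions of one user land on the same shard, so a single
        # heavy account dominates whichever shard it maps to.
        load[shard_of(uid)] += sessions
    return load
```

Because the key depends only on the owner's UID, adding more shards does not spread a single account's sessions; it only moves the hot spot.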

---
Private Message Content — Table & Write Limits
- ID scheme: timestamp in `msgkey` → quarter sharding; monthly sub‑tables.
- Projected 2025 total volume: ~10 B rows
- Monthly tables already at hundreds of millions.
- Peak write QPS limits mean event traffic cannot be fully absorbed.

---
Service-Side Code Coupling
- Four private message types share core send/delivery logic and storage.
- Tight coupling → resource competition, unpredictable degradation.
---
3. Upgrade Path
Optimization Objective:
Private messaging is data-intensive and mixes read-heavy workloads (unread counts) with write-heavy workloads (session updates). It therefore needs domain decoupling and room to scale.
---
Target Architecture

---
Domain Layers — Four-Tier Model
- Access Layer: toC BFF + gateway
- Business Layer: Complex query support
- Platform Layer: IM‑style, real‑time, sequential delivery
- Reach Layer: Long connections + push
---
Client-Side Cache Degradation
Design Principle:
Always display critical summaries — avoid blank message pages even under extreme failure modes.
---
BFF Architecture Upgrade
- Abstracted services: Single Chat, Group Chat, System Notification, Interaction Notification, Message Settings
- Benefits:
  - Reduced coupling
  - Cache structure redesign
  - MySQL index optimization
  - Query latency reduced roughly tenfold


---
Server-Side Availability Upgrade
Business Layer:
- Cold/Hot Separation: Redis → Taishan → MySQL
- Read/Write Separation: 95% of queries served by MySQL replicas
Platform Layer:
- Snowflake ID timeline model
- Diffusion handling based on chat type
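A Snowflake-style ID suits a timeline model because IDs sort by creation time, so an ID can double as a sync cursor. The sketch below uses the classic 41/10/12-bit Snowflake layout; whether the platform layer uses the same bit widths is an assumption, and `EPOCH_MS` is a hypothetical custom epoch.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch
WORKER_BITS = 10
SEQ_BITS = 12
SEQ_MASK = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    """Generates time-ordered 64-bit IDs: timestamp | worker | sequence."""

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now = max(self._clock(), self._last_ms)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & SEQ_MASK
                if self._seq == 0:
                    # Sequence exhausted in this millisecond: borrow the next
                    # one (a production generator would typically wait).
                    now += 1
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                | (self._worker_id << SEQ_BITS)
                | self._seq
            )


def timestamp_ms(snowflake_id):
    """Recover the millisecond timestamp embedded in an ID."""
    return (snowflake_id >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS
```

Since the timestamp occupies the high bits, comparing two IDs from the same worker compares their creation order, which is what a timeline cursor needs.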
---
Single Chat Sessions
Proactive Cache Pre-Warming
- Capture homepage unread count events
- Asynchronously pre-build session cache for high-traffic accounts
- Taishan offline UID load + T+1 updates
- Hotspot monitoring → auto-trigger pre-warm
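The hotspot-triggered pre-warm in the list above could look roughly like the sliding-window detector below. The class name, the threshold, and the window length are all hypothetical; the source does not state how hot accounts are detected.

```python
from collections import defaultdict, deque


class HotspotDetector:
    """Flags a UID for cache pre-warming once it is hit often enough."""

    def __init__(self, threshold, window_s):
        self._threshold = threshold
        self._window_s = window_s
        self._hits = defaultdict(deque)  # uid -> timestamps of recent hits
        self._warmed = set()             # uids already sent for pre-warming

    def record(self, uid, now):
        """Record one access; return True if a pre-warm should be triggered."""
        hits = self._hits[uid]
        hits.append(now)
        # Drop hits that have fallen out of the sliding window.
        while hits and hits[0] <= now - self._window_s:
            hits.popleft()
        if len(hits) >= self._threshold and uid not in self._warmed:
            self._warmed.add(uid)
            return True
        return False
```

In a real deployment the `True` result would enqueue an asynchronous job that builds the session cache, so the request path itself never waits on the pre-warm.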
Taishan + MySQL Dual Persistence
- Redis: 24 h
- Taishan: 600 entries/user (~20 pages)
- MySQL: cold fallback
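The three tiers above can be sketched as a sequential read-through with backfill. Dictionaries stand in for Redis, Taishan, and MySQL; TTL handling and hedging are left out; and the assumption that session lists are stored newest first is mine, not the source's.

```python
TAISHAN_CAP = 600  # per-user entry limit stated for Taishan


class TieredSessionStore:
    """Read path: Redis (hot) -> Taishan (warm, capped) -> MySQL (cold)."""

    def __init__(self):
        self.redis = {}    # uid -> sessions, stand-in for the 24 h cache
        self.taishan = {}  # uid -> up to TAISHAN_CAP sessions
        self.mysql = {}    # uid -> full session list, newest first

    def read(self, uid):
        """Return (tier that served the read, sessions)."""
        if uid in self.redis:
            return "redis", self.redis[uid]
        if uid in self.taishan:
            sessions = self.taishan[uid]
            self.redis[uid] = sessions  # backfill the hot tier
            return "taishan", sessions
        # Cold fallback: load from MySQL and backfill both upper tiers,
        # keeping only the most recent TAISHAN_CAP entries.
        recent = self.mysql.get(uid, [])[:TAISHAN_CAP]
        self.taishan[uid] = recent
        self.redis[uid] = recent
        return "mysql", recent
```

At 600 entries for about 20 pages, each page holds roughly 30 sessions, so only users who scroll beyond page 20 would need the full MySQL history.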


---
Hedged Back-to-Source Reads for Latency Optimization
- On a Redis miss, query Taishan and hedge with a MySQL query
- Avoids long-tail delays
- Taishan is loaded passively on the first miss
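A hedged read can be sketched with `asyncio`: start the primary lookup, and if it has not answered within a short delay, start the backup and take whichever finishes first. The function name `hedged_read` and the `hedge_delay` parameter are hypothetical, and the source does not say whether the two queries start together or after a delay.

```python
import asyncio


async def hedged_read(primary, fallback, hedge_delay):
    """Run primary(); if it is slow, also run fallback() and return the first result.

    primary and fallback are zero-argument coroutine functions, for example
    a Taishan lookup and a MySQL lookup for the same key.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay)
    if done:
        # Primary answered within the hedge delay; no backup needed.
        return first.result()

    # Primary is slow: fire the backup and race the two.
    second = asyncio.create_task(fallback())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not hold resources
    return done.pop().result()
```

The hedge delay trades extra backend load for tail latency: a delay near the primary's typical response time means only slow requests trigger the second query. This sketch does not handle the case where the first task to finish raises an exception.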

---
Consistency Guarantees
- Legacy service compatibility: binlog sync to new Redis/Taishan
- Challenges:
  - Avoid duplicate consumption
  - Preserve chronological order in binlogs
  - Ensure low-latency writes
- Solution: Redis Lua CAS using `mtime` versioning
Performance:
- <1 s latency, 250 K records/s consumption capacity
- DTS delay <700 ms meets SLA
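The `mtime`-based compare-and-set can be illustrated with an in-memory stand-in. In Redis the check and the write would run inside one Lua script so that they are atomic; here a plain dictionary plays that role, and the class and method names are hypothetical.

```python
class VersionedCache:
    """Applies a binlog event only if its mtime is newer than the stored one."""

    def __init__(self):
        self._rows = {}  # key -> (mtime, payload)

    def cas_apply(self, key, mtime, payload):
        """Return True if the event was applied, False if it was stale or a duplicate."""
        current = self._rows.get(key)
        if current is not None and current[0] >= mtime:
            # Same mtime: duplicate delivery. Older mtime: out-of-order event.
            return False
        self._rows[key] = (mtime, payload)
        return True

    def get(self, key):
        row = self._rows.get(key)
        return None if row is None else row[1]
```

Because the write is conditional on the version, replaying an event or receiving events out of order leaves the cache in the same final state, which addresses the duplicate-consumption and ordering challenges without a separate deduplication store.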

---
Inbox Upgrade
New Model: Redis + Taishan
- Redis: hot
- Taishan: full dataset + RANDOM read mode

---
Message Content Upgrade
- Separate single chat tables
- Async MySQL writes
- Monthly sharding; 100 tables per DB
- Routing: parse `msgkey` timestamp → monthly shard → `msgkey % 100` table
> Expected reduction in single-table size: 900 M → 9 M rows (one monthly table split across 100 tables)
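The routing rule can be sketched as follows. The bit layout of `msgkey` is not given in the source, so this assumes a Snowflake-like key whose high bits hold a Unix millisecond timestamp shifted left by 22 bits; `TIMESTAMP_SHIFT` and `route` are hypothetical names.

```python
from datetime import datetime, timezone

TIMESTAMP_SHIFT = 22   # hypothetical: low 22 bits are worker + sequence
TABLES_PER_DB = 100    # stated in the routing rule


def route(msgkey: int) -> tuple[str, int]:
    """Return (monthly shard as YYYYMM, table index 0..99) for a message key."""
    ts_ms = msgkey >> TIMESTAMP_SHIFT
    month = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y%m")
    table = msgkey % TABLES_PER_DB
    return month, table
```

Splitting each monthly shard over 100 tables is what gives the stated reduction: 900 M rows / 100 tables = 9 M rows per table, assuming keys spread evenly across the modulus.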


---
Batch Private Messaging Optimization
- Daily channel
- High-priority channel:
  - More topic partitions
  - Scale consumption PODs
  - Larger in-POD channel count
  - Cache expansion
- Average send speed increased from 3.5 K to 30 K users/sec (about 8.6×)
---
4. Conclusion
Lessons Learned:
- Upgrades require iterative optimization
- Set achievable yet challenging goals
- Identify gaps between current vs. ideal state
- Prioritize smooth migration & coexistence of old/new systems
- Continuously monitor metrics & convergence progress
Final Thought: Private messaging architecture is a long-term design challenge — balancing performance, consistency, scalability, and business flexibility.