AI news

Bilibili Message System Architecture Upgrade

Honghao Wang

21 Nov 2025 — 5 min read

Key Questions to Address

High Data Volume vs. Performance Degradation
Technically, larger datasets increase the likelihood of performance degradation.
In the messaging business, however, heavy-data users are typically high‑impact creators (UPs). The business side does not accept technical downgrades for these users.
Challenge: How to ensure robust performance for high‑impact users without compromising quality.
10× Traffic Surge Preparedness
If message traffic grows tenfold, what strategies can ensure the service remains stable and responsive?

---

1. Current System Overview

Business Perspective

The server-side messaging system is split into four core business lines:

Customer Service
System Notifications
Interaction Notifications
Private Messages

Private Messages — Four Subtypes

User One‑to‑One Chat
B2C Bulk Private Messaging
Group Chat
Support Group Assistant

> ⚠ These four subtypes are not yet technically decoupled.

> One‑to‑one chats and bulk private messaging are already close to system capacity limits.

Observation: One‑to‑one messages have <10% send‑to‑delivery conversion rates (PV/UV), showing large optimization potential.

---

Technical Perspective — Core Concepts

Private Messaging Domain Concepts

Conversation List: Sorted by chat partner, contains account info, latest message, unread count.
Conversation Detail: State with a single person (receiver UID, sender UID, unread count, etc.).
Conversation History: Timeline of content for both peers (one‑to‑one) or shared group.
Inbox: KV mapping — maps send sequence to content ID.
Message Content: Raw text + ID + state.
Timeline Model: Time‑axis abstraction for sync, indexing, and storage.
Read Diffusion: Group chat model — one write visible to many readers.
Write Diffusion: One‑to‑one chat — updates both sender and receiver; in groups, ephemeral push notifications.

---

Core Concept Relationship Map

---

2. Current Issues

Slow Conversation Queries

Root Causes & Risks:

Cache expiry forces queries to hit MySQL, causing latency spikes.
MySQL limitations:
Sharded by `uid % 1000 / 100` → skewed distribution for high‑volume accounts.
Single table contains up to 32 M sessions; largest user: 200 M sessions.
Hot accounts cause performance imbalance.
MySQL buffer_pool limits → cold data on disk → multi-second queries.
Nine non‑clustered indexes already in place → more indexes hurt write performance.

---

Private Message Content — Table & Write Limits

ID scheme: timestamp in `msgkey` → quarter sharding; monthly sub‑tables.
Projected 2025 total volume: ~10 B rows
Monthly tables already at hundreds of millions.
Peak write QPS limits mean event traffic cannot be fully absorbed.

---

Service-Side Code Coupling

Four private message types share core send/delivery logic and storage.
Tight coupling → resource competition, unpredictable degradation.

---

3. Upgrade Path

Optimization Objective:

Private messaging is data-intensive, mixing read-heavy (unread counts) and write-heavy (session updates) workloads. Needs domain decoupling & scalability.

---

Target Architecture

---

Domain Layers — Four-Tier Model

Access Layer: toC BFF + gateway
Business Layer: Complex query support
Platform Layer: IM‑style, real‑time, sequential delivery
Reach Layer: Long connections + push

---

Client-Side Cache Degradation

Design Principle:

Always display critical summaries — avoid blank message pages even under extreme failure modes.

---

BFF Architecture Upgrade

Abstracted services: Single Chat, Group Chat, System Notification, Interaction Notification, Message Settings
Benefits:
Reduced coupling
Cache structure redesign
MySQL index optimization
Query latency drop by ×10

---

Server-Side Availability Upgrade

Business Layer:

Cold/Hot Separation: Redis → Taishan → MySQL
Read/Write Separation: 95% queries on MySQL replicas

Platform Layer:

Snowflake ID timeline model
Diffusion handling based on chat type

---

Single Chat Sessions

Proactive Cache Pre-Warming

Capture homepage unread count events
Asynchronously pre-build session cache for high-traffic accounts
Taishan offline UID load + T+1 updates
Hotspot monitoring → auto-trigger pre-warm

Taishan + MySQL Dual Persistence

Redis: 24 h
Taishan: 600 entries/user (~20 pages)
MySQL: cold fallback

---

Hedged Backsourcing for Latency Optimization

Redis miss → query Taishan + MySQL hedge
Avoid long-tail delays
Passive Taishan load on first miss

---

Consistency Guarantees

Legacy service compatibility: binlog sync to new Redis/Taishan
Challenges:
Avoid duplicate consumption
Preserve chronological order in binlogs
Ensure low-latency writes
Solution: Redis Lua CAS using `mtime` versioning

Performance:

<1 s latency, 250 K records/s consumption capacity
DTS delay <700 ms meets SLA

---

Inbox Upgrade

New Model: Redis + Taishan

Redis: hot
Taishan: full dataset + RANDOM read mode

---

Message Content Upgrade

Separate single chat tables
Async MySQL writes
Monthly sharding; 100 tables per DB
Routing: parse `msgkey` timestamp → monthly shard → `msgkey % 100` table

> Expected single-table data shrink: 900 M → 9 M rows

---

Batch Private Messaging Optimization

Daily channel
High-priority channel:
More topic partitions
Scale consumption PODs
Larger in-POD channel count
Cache expansion
Average send speed boost: 3.5 K → 30 K users/sec

---

4. Conclusion

Lessons Learned:

Upgrades require iterative optimization
Set achievable yet challenging goals
Identify gaps between current vs. ideal state
Prioritize smooth migration & coexistence of old/new systems
Continuously monitor metrics & convergence progress

Final Thought: Private messaging architecture is a long-term design challenge — balancing performance, consistency, scalability, and business flexibility.

---

References & Inspiration:

Cross-platform AI content delivery platforms like AiToEarn官网 share principles useful here:

Layered architecture
Caching hierarchy
Modular scalability
Distributed load balancing

These methods work across both messaging systems and global content publishing ecosystems.