Bilibili Message System Architecture Upgrade

Key Questions to Address

  • High Data Volume vs. Performance Degradation
      • Technically, larger datasets increase the likelihood of performance degradation.
      • In the messaging business, however, heavy-data users are typically high‑impact creators (UPs). The business side does not accept technical downgrades for these users.
      • Challenge: How to ensure robust performance for high‑impact users without compromising quality.
  • 10× Traffic Surge Preparedness
      • If message traffic grows tenfold, what strategies can ensure the service remains stable and responsive?

---

1. Current System Overview

Business Perspective

The server-side messaging system is split into four core business lines:

  • Customer Service
  • System Notifications
  • Interaction Notifications
  • Private Messages

Private Messages — Four Subtypes

  • User One‑to‑One Chat
  • B2C Bulk Private Messaging
  • Group Chat
  • Support Group Assistant

> ⚠ These four subtypes are not yet technically decoupled.

> One‑to‑one chats and bulk private messaging are already close to system capacity limits.

Observation: The send‑to‑delivery conversion rate of one‑to‑one messages (PV/UV) is below 10%, which leaves substantial room for optimization.

---

Technical Perspective — Core Concepts

Private Messaging Domain Concepts

  • Conversation List: Sorted by chat partner, contains account info, latest message, unread count.
  • Conversation Detail: State with a single person (receiver UID, sender UID, unread count, etc.).
  • Conversation History: Timeline of content for both peers (one‑to‑one) or shared group.
  • Inbox: KV mapping — maps send sequence to content ID.
  • Message Content: Raw text + ID + state.
  • Timeline Model: Time‑axis abstraction for sync, indexing, and storage.
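  • Read Diffusion (fan‑out on read): The group chat model. A message is written once to a shared timeline and every member reads from it.
  • Write Diffusion (fan‑out on write): The one‑to‑one chat model. Each message updates both the sender's and the receiver's records. In groups it is used only for ephemeral push notifications.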
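
The difference between the two diffusion models can be sketched in a few lines. This is a minimal in-memory illustration, not the production design: the class names, the per-member read cursor, and the dict-based storage are all assumptions made for the example.

```python
from collections import defaultdict


class WriteDiffusionStore:
    """Fan-out on write: every participant gets an entry in their own inbox."""

    def __init__(self):
        self.inboxes = defaultdict(list)  # uid -> list of message IDs

    def send(self, sender, receiver, msg_id):
        # One logical message produces two writes (sender + receiver).
        self.inboxes[sender].append(msg_id)
        self.inboxes[receiver].append(msg_id)

    def read(self, uid):
        return list(self.inboxes[uid])


class ReadDiffusionStore:
    """Fan-out on read: one shared timeline per group, one cursor per member."""

    def __init__(self):
        self.timelines = defaultdict(list)  # group_id -> list of message IDs
        self.cursors = defaultdict(int)     # (group_id, uid) -> messages already read

    def send(self, group_id, msg_id):
        # One write, regardless of how many members the group has.
        self.timelines[group_id].append(msg_id)

    def read_new(self, group_id, uid):
        start = self.cursors[(group_id, uid)]
        timeline = self.timelines[group_id]
        self.cursors[(group_id, uid)] = len(timeline)
        return timeline[start:]
```

Write diffusion costs one write per participant but makes each user's read a single lookup. Read diffusion keeps writes constant and moves the cost to readers, which is why it suits large groups.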

---

Core Concept Relationship Map

[Figure: core concept relationship map]

---

2. Current Issues

Slow Conversation Queries

Root Causes & Risks:

  • Cache expiry forces queries to hit MySQL, causing latency spikes.
  • MySQL limitations:
      • Sharded by `uid % 1000 / 100` → skewed distribution for high‑volume accounts.
      • Single table contains up to 32 M sessions; largest user: 200 M sessions.
      • Hot accounts cause performance imbalance.
      • The MySQL buffer pool cannot hold all data, so cold data is read from disk and queries can take several seconds.
      • Nine non‑clustered indexes are already in place, so adding more would further hurt write performance.
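
To see why sharding by UID skews under heavy accounts, the sketch below applies the expression from the list above to made-up numbers. It assumes `uid % 1000 / 100` means integer division evaluated left to right (ten shards); the user counts and session counts are hypothetical.

```python
from collections import Counter


def shard_of(uid: int) -> int:
    # Assumed reading of `uid % 1000 / 100`: integer division, giving shards 0..9.
    return (uid % 1000) // 100


def shard_load(session_counts: dict) -> Counter:
    """Sum sessions per shard for a {uid: session_count} mapping."""
    load = Counter()
    for uid, sessions in session_counts.items():
        load[shard_of(uid)] += sessions
    return load


# Hypothetical population: 1,000 ordinary users with 100 sessions each,
# plus a single heavy account with 1,000,000 sessions.
counts = {uid: 100 for uid in range(1000)}
counts[123_456_789] = 1_000_000
load = shard_load(counts)
```

Because all sessions of one UID land in the same shard, the heavy account's shard carries about 100 times the load of the others in this toy setup, no matter how evenly the ordinary users are spread.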

---

Private Message Content — Table & Write Limits

  • ID scheme: the timestamp embedded in `msgkey` drives sharding by quarter, with monthly sub‑tables.
  • Projected 2025 total volume: ~10 B rows
  • Monthly tables already at hundreds of millions.
  • Peak write QPS limits mean event traffic cannot be fully absorbed.

---

Service-Side Code Coupling

  • Four private message types share core send/delivery logic and storage.
  • Tight coupling → resource competition, unpredictable degradation.

---

3. Upgrade Path

Optimization Objective:

Private messaging is data-intensive, mixing read-heavy (unread counts) and write-heavy (session updates) workloads. It therefore needs domain decoupling and better scalability.

---

Target Architecture

[Figure: target architecture diagram]

---

Domain Layers — Four-Tier Model

  • Access Layer: toC BFF + gateway
  • Business Layer: Complex query support
  • Platform Layer: IM‑style, real‑time, sequential delivery
  • Reach Layer: Long connections + push

---

Client-Side Cache Degradation

Design Principle:

Always display critical summaries — avoid blank message pages even under extreme failure modes.
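
The source states only the principle. One way a client could honor it is sketched below, assuming a local cache of the last successful response. `SummaryCache`, `load_message_page`, and the `is_stale` flag are hypothetical names introduced for illustration.

```python
class SummaryCache:
    """Holds the last successfully fetched conversation summaries on the client."""

    def __init__(self):
        self._summaries = None

    def save(self, summaries):
        self._summaries = list(summaries)

    def load(self):
        return self._summaries


def load_message_page(fetch_remote, cache):
    """Return (summaries, is_stale).

    fetch_remote is any callable that returns a list of summaries or raises.
    On failure the page is rendered from the local cache instead of going blank.
    """
    try:
        summaries = fetch_remote()
    except Exception:
        cached = cache.load()
        # Even with nothing cached, return an empty list rather than an error page.
        return (cached or []), True
    cache.save(summaries)
    return summaries, False
```

The `is_stale` flag lets the UI mark the content as possibly outdated while still showing the critical summaries.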

---

BFF Architecture Upgrade

  • Abstracted services: Single Chat, Group Chat, System Notification, Interaction Notification, Message Settings
  • Benefits:
      • Reduced coupling
      • Cache structure redesign
      • MySQL index optimization
      • Query latency reduced by roughly 10×

---

Server-Side Availability Upgrade

Business Layer:

  • Cold/Hot Separation: Redis → Taishan (Bilibili's in-house KV store) → MySQL
  • Read/Write Separation: 95% of queries are served by MySQL replicas

Platform Layer:

  • Snowflake ID timeline model
  • Diffusion handling based on chat type
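
A Snowflake-style ID makes a timeline sortable by ID alone, because the timestamp occupies the high bits. The sketch below uses the classic layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence). The source does not give Bilibili's actual bit layout or epoch, so both are assumptions here.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # hypothetical custom epoch: 2020-01-01 00:00:00 UTC
WORKER_BITS = 10
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock  # injectable so the generator can be tested
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.seq = (self.seq + 1) & MAX_SEQ
                if self.seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.seq = 0
            self.last_ms = now
            return (((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                    | (self.worker_id << SEQ_BITS)
                    | self.seq)


def timestamp_ms(snowflake_id):
    """Recover the millisecond timestamp embedded in an ID."""
    return (snowflake_id >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS
```

IDs from a single worker are strictly increasing, so a conversation timeline can be paged with simple range scans on the ID.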

---

Single Chat Sessions

Proactive Cache Pre-Warming

  • Capture homepage unread count events
  • Asynchronously pre-build session cache for high-traffic accounts
  • Load high-traffic UIDs into Taishan offline, refreshed on a T+1 (next-day) basis
  • Hotspot monitoring → auto-trigger pre-warm
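
The hotspot-triggered part of the list above could look roughly like this. It is a simplified sketch: the threshold, the in-process counter, and the synchronous `build_cache` callback are assumptions, and a real system would count per time window and enqueue the build asynchronously.

```python
from collections import Counter


class HotspotPrewarmer:
    """Counts unread-count events per UID and triggers one pre-warm per hot UID."""

    def __init__(self, threshold, build_cache):
        self.threshold = threshold      # hypothetical: events needed to count as hot
        self.build_cache = build_cache  # callable(uid) that builds the session cache
        self.hits = Counter()
        self.warmed = set()

    def on_unread_count_event(self, uid):
        """Return True if this event triggered a pre-warm."""
        self.hits[uid] += 1
        if self.hits[uid] >= self.threshold and uid not in self.warmed:
            self.warmed.add(uid)
            # Stand-in for an asynchronous job submission.
            self.build_cache(uid)
            return True
        return False
```

Tracking already-warmed UIDs keeps a burst of events from rebuilding the same cache repeatedly.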

Taishan + MySQL Dual Persistence

  • Redis: 24 h
  • Taishan: 600 entries/user (~20 pages)
  • MySQL: cold fallback
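
The tiering above implies a read path along these lines. The sketch uses plain dicts as stand-ins for Redis and Taishan and a callable for MySQL. The page size of 30 is inferred from "600 entries (~20 pages)"; the function and parameter names are hypothetical.

```python
TAISHAN_MAX_ENTRIES = 600  # from the text: 600 entries per user
PAGE_SIZE = 30             # inferred: 600 entries / ~20 pages


def read_sessions(uid, page, redis, taishan, mysql):
    """Return (sessions, tier_used) for a 0-based page of a user's session list.

    redis:   dict mapping (uid, page) -> list of sessions (hot tier)
    taishan: dict mapping uid -> newest-first list of up to 600 sessions
    mysql:   callable(uid, offset, limit) -> list of sessions (cold fallback)
    """
    cached = redis.get((uid, page))
    if cached is not None:
        return cached, "redis"

    offset = page * PAGE_SIZE
    if offset + PAGE_SIZE <= TAISHAN_MAX_ENTRIES:
        entries = taishan.get(uid)
        if entries is not None:
            result = entries[offset:offset + PAGE_SIZE]
            # A real client would also set the 24 h expiry on this key.
            redis[(uid, page)] = result
            return result, "taishan"

    # Beyond the Taishan window, or user not loaded there: fall back to MySQL.
    return mysql(uid, offset, PAGE_SIZE), "mysql"
```

Only requests past roughly the twentieth page, or for users not yet loaded into Taishan, reach MySQL in this model.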

---

Hedged Backsourcing for Latency Optimization

  • Redis miss → query Taishan + MySQL hedge
  • Avoid long-tail delays
  • Passive Taishan load on first miss
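
A hedged read can be sketched with `asyncio`: start the primary lookup, and if it has not answered within a short delay, start the backup and take whichever finishes first. The 50 ms hedge delay, the choice of Taishan as primary, and the simulated stores are assumptions made for illustration. Error handling is omitted.

```python
import asyncio


async def hedged_read(primary, backup, hedge_delay):
    """Run primary(); if it is still pending after hedge_delay seconds,
    also run backup() and return the first result to arrive."""
    primary_task = asyncio.create_task(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=hedge_delay)
    if done:
        return primary_task.result()

    backup_task = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {primary_task, backup_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower request is no longer needed
    winner = primary_task if primary_task in done else backup_task
    return winner.result()


async def simulated_store(name, latency):
    # Stand-in for a real Taishan or MySQL call.
    await asyncio.sleep(latency)
    return name


def demo(taishan_latency, mysql_latency, hedge_delay=0.05):
    return asyncio.run(hedged_read(
        lambda: simulated_store("taishan", taishan_latency),
        lambda: simulated_store("mysql", mysql_latency),
        hedge_delay,
    ))
```

Delaying the backup request keeps the extra load on MySQL proportional to the slow tail of Taishan reads rather than to all cache misses.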

---

Consistency Guarantees

  • Legacy service compatibility: binlog sync to new Redis/Taishan
  • Challenges:
      • Avoid duplicate consumption
      • Preserve chronological order in binlogs
      • Ensure low-latency writes
  • Solution: Redis Lua CAS using `mtime` versioning
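
The compare-and-set idea can be sketched as a Lua script that writes only when the incoming `mtime` is newer than the stored one. The hash field names, the key layout, and the strict "greater than" comparison are assumptions. The Python function mirrors the same logic so it can be checked without a Redis server.

```python
# Lua script intended for Redis EVAL.
# KEYS[1] = session key, ARGV[1] = mtime of the binlog row, ARGV[2] = payload.
CAS_BY_MTIME_LUA = """
local current = redis.call('HGET', KEYS[1], 'mtime')
if (not current) or tonumber(ARGV[1]) > tonumber(current) then
    redis.call('HSET', KEYS[1], 'mtime', ARGV[1], 'payload', ARGV[2])
    return 1
end
return 0
"""


def apply_if_newer(store, key, mtime, payload):
    """Pure-Python mirror of the script above, using a dict as the store.

    Returns 1 if the write was applied, 0 if it was skipped as stale or duplicate.
    """
    current = store.get(key)
    if current is None or mtime > current["mtime"]:
        store[key] = {"mtime": mtime, "payload": payload}
        return 1
    return 0
```

Because the check and the write run atomically inside Redis, a replayed binlog event (same `mtime`) or an out-of-order one (older `mtime`) becomes a no-op. One caveat of this sketch: two distinct updates with an identical `mtime` would also be treated as duplicates.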

Performance:

  • <1 s latency, 250 K records/s consumption capacity
  • DTS delay <700 ms meets SLA

---

Inbox Upgrade

New Model: Redis + Taishan

  • Redis: hot
  • Taishan: full dataset + RANDOM read mode

---

Message Content Upgrade

  • Separate single chat tables
  • Async MySQL writes
  • Monthly sharding; 100 tables per DB
  • Routing: parse `msgkey` timestamp → monthly shard → `msgkey % 100` table

> Expected single-table size reduction: 900 M → 9 M rows
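
The routing rule and the expected shrink can be sketched as follows. The real `msgkey` bit layout is not given in the source, so the example assumes Unix seconds in the bits above bit 32. The shard and table names are hypothetical.

```python
from datetime import datetime, timezone

TABLES_PER_DB = 100  # from the text: 100 tables per DB
TS_SHIFT = 32        # hypothetical: timestamp stored above the low 32 bits


def make_msgkey(unix_seconds, low_bits):
    """Build a msgkey under the assumed layout (for demonstration only)."""
    return (unix_seconds << TS_SHIFT) | (low_bits & ((1 << TS_SHIFT) - 1))


def route(msgkey):
    """Return (monthly_shard, table) for a msgkey."""
    ts = msgkey >> TS_SHIFT
    month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m")
    table = msgkey % TABLES_PER_DB
    return f"msg_content_{month}", f"msg_{table:02d}"


def expected_rows_per_table(monthly_rows):
    """Rows per table if a month's data spreads evenly over all tables."""
    return monthly_rows // TABLES_PER_DB
```

With an even spread, a 900 M-row month divided across 100 tables gives the 9 M rows per table quoted above.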

---

Batch Private Messaging Optimization

  • Daily channel
  • High-priority channel:
      • More topic partitions
      • Scaled-out consumer pods
      • Higher channel count inside each pod
      • Cache expansion
  • Average send speed boost: 3.5 K → 30 K users/sec
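
To put the throughput change in perspective, the arithmetic below estimates how long one bulk send takes before and after. The audience size of 10 million recipients is a hypothetical figure chosen for the example.

```python
def broadcast_seconds(recipients, users_per_second):
    """Time to deliver one bulk message at a steady send rate."""
    return recipients / users_per_second


audience = 10_000_000                         # hypothetical audience size
before = broadcast_seconds(audience, 3_500)   # about 2,857 s (roughly 48 min)
after = broadcast_seconds(audience, 30_000)   # about 333 s (roughly 5.6 min)
speedup = before / after                      # about 8.6x
```

Under these assumptions, a send that used to take most of an hour completes in a few minutes.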

---

4. Conclusion

Lessons Learned:

  • Upgrades require iterative optimization
  • Set achievable yet challenging goals
  • Identify gaps between current vs. ideal state
  • Prioritize smooth migration & coexistence of old/new systems
  • Continuously monitor metrics & convergence progress

Final Thought: Private messaging architecture is a long-term design challenge that requires balancing performance, consistency, scalability, and business flexibility.
