Beyond Size and Popularity: In-Depth Analysis of JIMDB’s “Hot Key” Active Management Solution

Beyond Size and Popularity: In-Depth Analysis of JIMDB’s “Hot Key” Active Management Solution
# JD Retail — November 11, 2025 — Beijing

![image](https://blog.aitoearn.ai/content/images/2025/11/img_001-313.jpg)

## High-Performance Distributed Cache: A New Paradigm

![image](https://blog.aitoearn.ai/content/images/2025/11/img_002-295.jpg)

**JIMDB** is a distributed cache service built on Redis, designed for **high performance**, **high availability**, and **online scalability**.  
It supports diverse JD Group business units and excels in **high-throughput** and **low-latency** scenarios.  
This article explores how JIMDB optimizes and manages *big-hot keys*.

---

## **The Challenge: Big Keys & Hot Keys**

High-performance caches like JIMDB/Redis are plagued by:

- **Big Keys** — large data values that consume memory excessively
- **Hot Keys** — high-access-frequency keys that overload shards

### Why It's a Problem
These issues can cause:

- Single-node resource exhaustion
- Increased latency
- Cascading failures

Traditional **static threshold** detection and reactive handling are no longer sufficient in high-QPS, complex environments.

---

## **The Innovation: Impact-Based Definition**

JIMDB redefines "problem keys" using the **Big-Hot Key** concept:

> **Big-Hot Key**: evaluated based on **real-time resource impact** — CPU usage + network bandwidth consumed — rather than static attributes.

This is a **paradigm shift**:  
From *"attribute-based"* → To *"impact-based"* detection.

---

## **Solution Framework**

JIMDB’s **multi-layer proactive governance** includes:

- **Server-side auto-caching**
- **Command request circuit breaking**
- **Blacklist mechanisms**
- **Intelligent client-side caching** with consistency checks

**Detection Engine**: A server-side module that **instantly identifies** risk operations.

---

## **Proven Results**

- In production: Server-side caching of serialized results reduced CPU usage **from 100% to 30%**
- Throughput boosted from **~6,700 OPS → ~12,000 OPS** (+~80%)  
- Dramatically cut stability risks

---

## Chapter 1 — Redefining Bottlenecks

### 1.1 Traditional Definitions

**Big Key**
- **Definition**: Value length > 1 MB (Strings), or > 5,000 elements (collections) / >100 MB memory
- **Risks**:
  - Memory imbalance / OOM
  - Event loop blocking
  - Increased persistence/replication delays

**Hot Key**
- **Definition**: QPS > 10,000
- **Risks**:
  - Single-node bottlenecks
  - Cache breakdown / avalanche when expired

---

### 1.2 Why Big-Hot Keys?

**Big-Hot Key** ≠ Big Key + Hot Key  
It focuses on **operation cost**: large data + frequent access → resource exhaustion.

#### Key Differences:
- **Context-dependent**
- Bound to **specific command + parameters**
- Detects "hidden bottlenecks" missed by Big/Hot Key definitions

**Example**:
- List with 8,000 elements, QPS=4,000 — not big/hot by thresholds but `LRANGE key 0 -1` saturates CPU.

---

**Comparative Table:**

| Type           | Feature                  | Criteria                         | Risk                         |
|----------------|--------------------------|-----------------------------------|------------------------------|
| Big Key        | Large volume              | Static size threshold             | Memory/slow ops              |
| Hot Key        | High frequency            | Static QPS threshold              | Shard bottlenecks            |
| Big-Hot Key    | Size + frequency impact   | Dynamic, real-time impact analysis| CPU/network exhaustion       |

---

## Chapter 2 — JIMDB Detection Engine

### 2.1 Three-Level Waterfall Detection

1. **Bandwidth bottleneck** — Instant traffic over threshold  
2. **Collection size bottleneck** — Elements > 100,000 or data > 1MB  
3. **CPU bottleneck** — Actual QPS > predicted sustainable QPS (ML model)

Stop checks once matched → minimal overhead.

---

### 2.2 Strategy Details

#### Strategy 1 — Bandwidth Bottleneck Detection
- Threshold: 70% of instance output limit (`config get output-limit`)
- Hot Key Index ≈ bytes/sec from command

Example: **140 MB/s** outbound → ~250 OPS per second  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_004-265.jpg)

---

#### Strategy 2 — Collection Size Risk
- Element count limit: 100,000
- Single-return size limit: 1 MB
- **Super Hot Key Index**: `INT_MAX` (high-priority warning)

---

#### Strategy 3 — CPU Prediction (ML Model)
- Combines **polynomial regression** & **linear interpolation**
- Predicts per-command sustainable QPS given element count + length
- **Hot Key Index**: grows exponentially beyond limit

![image](https://blog.aitoearn.ai/content/images/2025/11/img_009-178.jpg)

---

## Chapter 3 — Governance Suite

### 3.1 Server-Side Auto Mitigation
- **Auto-cache** serialized responses for repeated reads
- **Circuit breaker**: reject all risky key requests (disabled by default)

---

### 3.2 Client-Side Collaboration
- SDK caches value + version
- Server version check → only full value on change
- Strong consistency + reduced bandwidth

---

### 3.3 Manual Operations Toolkit
- **Manual blacklist**: reject requests without deleting data
- **High-availability management port**: control even if worker thread blocked
- **One-click emergency plans**: UI with instance status, hot key logs, quick actions

---

### 3.4 Cluster Profiling
Defines business attributes & disaster policies:
- Core level (priority)
- Persistence strategy
- Deployment mode
- Delay-read acceptance
- Hot key blockage policy

---

## Chapter 4 — Quantitative Performance Tests

### Environment:
- Physical machine, 8 cores / 16GB RAM

### Scenario 1 — CPU Hot Key
- Before: CPU 100%, OPS ~6,700, frequent alerts
- After: CPU ~30%, OPS ~12,000, stable

### Scenario 2 — Bandwidth Hot Key
- Before: Outbound traffic > limit, CPU near 100%, degraded service
- After: Circuit breaker instantly reduced load, recovery to normal

**Improvement Table**:

| Metric          | Before | After  | Change     |
|-----------------|--------|--------|------------|
| CPU Usage       | 100%   | ~30%   | ↓ 70%      |
| OPS             | 6,700  | 12,000 | ↑ ~79%     |
| Stability       | Unstable | Stable | High gain |

---

## Chapter 5 — Deployment & Roadmap

### Current Deployment
- Rolled out to thousands of instances
- >3,000 hot key detections prevented risk

### Upgrade Plan
- Server upgrades independent of client; minimal downtime (~10 min/shard)

---

## Chapter 6 — Industry Context

### Mainstream Approaches:
- More read replicas
- Proxy layers
- Client local caches
- Key sharding

**Issues**: high cost, complexity, inconsistency

---

**JIMDB Advantage**:  
Native integration at **platform level** → consistent, reliable, automated governance.

Benefits:
- Lower hardware & ops cost
- Faster development
- Higher system resilience

---

## Chapter 7 — Conclusion

Key takeaways:

1. **Impact-based definition** improves detection precision
2. **Intelligent detection + multi-layer governance** form effective risk control
3. **Wide deployment** validates stability & performance gains

---

### Recommended Reading
- [Generative Recommendations — Evolution & Challenges](https://mp.weixin.qq.com/s?__biz=MzU1MzE2NzIzMg==&mid=2247501334&idx=1&sn=9cfe399e7594ea833227c6a0994b5264&scene=21#wechat_redirect)
- [JoyCode SWE-bench Leaderboard Report](https://mp.weixin.qq.com/s?__biz=MzU1MzE2NzIzMg==&mid=2247501325&idx=1&sn=8de68e5fac3fb1f76e61049470add00e&scene=21#wechat_redirect)
- [Chart Rendering in ROMA](https://mp.weixin.qq.com/s?__biz=MzU1MzE2NzIzMg==&mid=2247501308&idx=1&sn=4ab58b99a43a866d50c766f9ed2df28c&scene=21#wechat_redirect)
- [Scenario-Based Insurance Claims](https://mp.weixin.qq.com/s?__biz=MzU1MzE2NzIzMg==&mid=2247501283&idx=1&sn=e191d994b673fa228f660b36b3bc75ff&scene=21#wechat_redirect)

---

### **Join Us!**
- **Position**: Distributed KV Storage R&D Engineer
- **Location**: Beijing Yizhuang
- **Requirements**:
  1. CS major, Linux + C/C++
  2. Concurrent/network programming
  3. Large-scale storage/DB experience preferred
  4. Familiar with Redis, Valkey, RocksDB, LevelDB, TiKV, HBase, NebulaGraph source code
  5. Strong engineering discipline

**Contact**: wangqiwang1@jd.com

[Read Original](2247501343) | [Open in WeChat](https://wechat2rss.bestblogs.dev/link-proxy/?k=5ed302d9&r=1&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzU1MzE2NzIzMg%3D%3D%26mid%3D2247501343%26idx%3D1%26sn%3Dbbb3f81dfff1b0c40272b5f73cae628d)

---

Read more

How AI Startups Can Effectively Analyze Competitors — Avoid the Feature List Trap and Redefine Your Battleground

How AI Startups Can Effectively Analyze Competitors — Avoid the Feature List Trap and Redefine Your Battleground

Competitive Analysis Is Not “Feature Comparison” — It’s Strategic Positioning This guide explains how AI startup teams can escape the trap of feature lists. Using concepts from user perception, product pacing, and capital narratives, we’ll build a cognitive framework for understanding competitors — and help you identify your differentiated battlefield

By Honghao Wang