Mega Sale Crash! A Forgotten Optimization Almost Wiped Out the Entire Department

A Sultry Night in Hangzhou

That sultry, oppressive night in Hangzhou is something I can still remember vividly.

In 2021, I had just arrived in Hangzhou — a city famous for intense competition — and joined my dream internet tech giant. As a newcomer to e-commerce, I dove headfirst into developing promotion campaigns and event pages.

That night, we were running final checks for an S-level “Member Flash Sale” event, scheduled to go live exactly at midnight.

The war room was brightly lit, every eye fixed on the dashboard, waiting for the GMV curve to shoot up like a rocket once the event launched.

What we got instead was an avalanche.

---

The Avalanche Hits

Just after midnight, alert messages flooded in like air raid sirens:

[Severe] promotion-marketing cluster – Application availability < 10%  
[Severe] promotion-marketing cluster – HSF thread pool active threads > 95%  
[Critical] promotion-marketing cluster – CPU Load > 8.0  

I opened our internal monitoring system, SkyEye — the entire promotion-marketing cluster, hundreds of machines, was spiking in CPU and load in perfect synchrony, as if infected by a virus.

Impact:

  • Promotion hub service paralyzed
  • Premium member entry points downgraded
  • Event effectively vanished at launch

An S-level campaign, meticulously planned, died in the first second.

---

Act I: Futile Struggle

Troubleshooting felt like searching for a black switch in a pitch-dark room — except this time, we couldn’t even find the door.

Attempts to Recover

  • Log inspection
    • Found multiple NPE entries
    • Originated from an unrelated peripheral JAR
    • Eliminated as root cause
  • Deadlock suspicion
    • HSF thread pool exhaustion looked like a mass walkout
    • Used `jstack` to check snapshots — no deadlocks
    • Eliminated as well
  • Server reboots
    • Restarted the high-load machines
    • Worked for about 2 minutes, then CPU spiked again
  • Scale-out
    • Added 20 new servers
    • All quickly succumbed to the same high load and GC frenzy

Elapsed time: 18 minutes since incident began.

The tension became suffocating. Our leader's gaze felt like a surgical scalpel.

> “Feels like we’re about to be carried out…” — whispered by a junior colleague.

---

Act II: Into the JVM’s “Body”

With standard methods failing, the only option was deep JVM surgery.

Step-by-step Diagnosis

  • Preserved one faulty server as a “crime scene”
  • Dumped heap memory and thread stack

Findings

  • Old Gen usage consistently high
  • Poor CMS GC performance — frequent, long Full GC cycles
  • Explained CPU spikes
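Findings like these come from standard JDK tooling. A sketch of the capture commands we would run against a preserved instance (`<pid>` is a placeholder for the Java process ID, e.g. from `jps`):

```shell
# Thread stacks — the raw material for the later state analysis
jstack <pid> > HSF_JStack.log

# Full heap dump for offline analysis (note: -dump:live itself triggers a Full GC)
jmap -dump:live,format=b,file=heap.hprof <pid>

# Quick histogram of the biggest object types, no dump needed
jmap -histo <pid> | head -n 20

# GC utilization sampled every second; watch the O (Old Gen %) and FGC/FGCT
# (Full GC count / total time) columns climb
jstat -gcutil <pid> 1000
```

The heap dump is the expensive step; on a box already in GC trouble it can take minutes, which is exactly why one faulty server was pulled from rotation first.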

JVM-level diagnosis under live fire demands:

  • Clear reasoning
  • Decisive action
  • Bridging application logic with runtime internals

---


Correlating Clues

Heap analysis showed a huge number of `char[]` arrays, all tied to the “Marriott activity configuration” — a giant object stuck in memory.

Thread Stack Analysis

# Count waiting threads
$ grep 'TIMED_WAITING' HSF_JStack.log | wc -l
336
# Count runnable threads
$ grep 'RUNNABLE' HSF_JStack.log | wc -l
246
  • 336 waiting threads
  • 246 running threads — probable problem zone
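The same census can be taken in-process instead of by grepping a stack dump. A minimal sketch using the standard JMX `ThreadMXBean` (class and method names here are my own; only the JDK API calls are real):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCensus {
    // Count live JVM threads grouped by Thread.State (RUNNABLE, TIMED_WAITING, ...)
    public static Map<Thread.State, Integer> census() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip expensive lock/synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        census().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Wired to a periodic reporter, a census like this would have flagged the abnormal RUNNABLE count minutes earlier than a manual `jstack`.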

Filtering RUNNABLE stacks → Found repeated:

at com.alibaba.fastjson.JSON.toJSONString(...)

Hypothesis:

A massive object repeatedly serialized by FastJSON → consuming huge CPU and exhausting threads → cluster collapse.

---

Act III: One Good Line of Code

Following stack clues, I found the smoking gun in `XxxxxCacheManager.java`:

// ... partial code omitted
public void updateActivityXxxCache(Long sellerId, List xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // Spread reads across 20 hash keys to avoid hot-spotting a single key
        for (int index = 0; index < XXX_CACHE_PARTITION_NUMBER; index++) {
            // Fatal: serialization inside the loop
            tairCache.put(String.format(ACTIVITY_PLAY_KEY, sellerId, index),
                          JSON.toJSONString(xxxDOList), // serialized 20 times!
                          EXPIRE_TIME);
        }
    } catch (Exception e) {
        log.warn("update cache exception occur", e);
    }
}

---

Chain Reaction

  • Midnight launch → empty cache → the 20-key hash design triggered on every write
  • A 1–2 MB object serialized 20× per write, all in the same thread
  • CPU meat-grinder effect
  • Tair LDB rate-limited under the 20× write amplification
  • Latency explosions → thread pool exhaustion → service avalanche

---

Act IV: Truth and Reflection

Rolling back the change that introduced the loop serialization restored cluster stability by 00:30.

Total downtime: ~30 minutes.

---

Old A’s Reverse Side Three Rules

  • Capacity-first mindset
  • Optimizations without capacity assessment = recklessness
  • Measure code block execution time
  • Fine-grained APM could’ve doubled efficiency in pinpointing the bottleneck
  • Technical debt’s surprise attack
  • Legacy Tair LDB + neglect = hidden cockroach waiting to strike
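Absent a full APM, even a crude timer around suspect blocks pays for itself. A minimal sketch — the 50 ms budget and the string-building workload are arbitrary stand-ins for the real serialization call:

```java
public class BlockTimer {
    // Run a block and return its wall-clock duration in milliseconds
    public static long timeMillis(Runnable block) {
        long start = System.nanoTime();
        block.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Stand-in for heavyweight work such as JSON serialization
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) sb.append(i);
        });
        System.out.println("block took " + elapsed + " ms");
        if (elapsed > 50) {
            System.out.println("WARN: block exceeded 50 ms budget");
        }
    }
}
```

In production this belongs behind a sampling switch and a metrics sink, not `System.out`, but even the crude form would have pointed straight at the 20× serialization loop.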

---

Walking empty Hangzhou streets at 1 a.m., I realized:

> All grand systems are built from individual lines of code.

> The devil hides in the smallest of them.

Takeaway: P3 incidents often stem from a single misplaced `for` loop.

---

Modern Development Parallel

Incidents like this underline why code discipline, integrated performance tooling, and cross-platform visibility are critical.


---

Summary:

A midnight promotion launch became a full-cluster crash due to loop-based heavy serialization. Quick identification, rollback, and a post-mortem informed lasting operational lessons.
