Mega Sale Crash! A Forgotten Optimization Almost Wiped Out the Entire Department

A Sultry Night in Hangzhou

That sultry, oppressive night in Hangzhou is something I can still remember vividly.

In 2021, I had just arrived in Hangzhou — a city famous for intense competition — and joined my dream internet tech giant. As a newcomer to e-commerce, I dove headfirst into developing promotion campaigns and event pages.

That night, we were running final checks for an S-level “Member Flash Sale” event, scheduled to go live exactly at midnight.

The war room was brightly lit, every eye fixed on the dashboard, waiting for the GMV curve to shoot up like a rocket once the event launched.

What we got instead was an avalanche.

---

The Avalanche Hits

Just after midnight, alert messages flooded in like air raid sirens:

[Severe] promotion-marketing cluster – Application availability < 10%  
[Severe] promotion-marketing cluster – HSF thread pool active threads > 95%  
[Critical] promotion-marketing cluster – CPU Load > 8.0  

I opened our internal monitoring system, SkyEye — the entire promotion-marketing cluster, hundreds of machines, was spiking in CPU and load in perfect synchrony, as if infected by a virus.

Impact:

  • Promotion hub service paralyzed
  • Premium member entry points downgraded
  • Event effectively vanished at launch

An S-level campaign, meticulously planned, died in the first second.

---

Act I: Futile Struggle

Troubleshooting felt like searching for a black switch in a pitch-dark room — except this time, we couldn’t even find the door.

Attempts to Recover

  • Log inspection
    • Found multiple NPE entries
    • Originated from an unrelated peripheral JAR
    • Eliminated as root cause
  • Deadlock suspicion
    • HSF thread pool exhaustion looked like a mass walkout
    • Used `jstack` to check snapshots — no deadlocks
    • Eliminated as well
  • Server reboots
    • Restarted the high-load machines
    • Worked for about 2 minutes, then CPU spiked again
  • Scale-out
    • Added 20 new servers
    • All quickly succumbed to the same high load and GC frenzy

Elapsed time: 18 minutes since incident began.

The tension became suffocating. Our leader's gaze felt like a surgical scalpel.

> “Feels like we’re about to be carried out…” — whispered by a junior colleague.

---

Act II: Into the JVM’s “Body”

With standard methods failing, the only option was deep JVM surgery.

Step-by-step Diagnosis

  • Preserved one faulty server as a “crime scene”
  • Dumped heap memory and thread stack

Findings

  • Old Gen usage consistently high
  • Poor CMS GC performance — frequent, long Full GC cycles
  • Explained CPU spikes
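Findings like these come from standard JDK tooling. A sketch of the capture commands we would run against a preserved instance (`<pid>` is a placeholder for the Java process ID, e.g. from `jps`):

```shell
# Thread stacks — the raw material for the later state analysis
jstack <pid> > HSF_JStack.log

# Full heap dump for offline analysis (note: -dump:live itself triggers a Full GC)
jmap -dump:live,format=b,file=heap.hprof <pid>

# Quick histogram of the biggest object types, no dump needed
jmap -histo <pid> | head -n 20

# GC utilization sampled every second; watch the O (Old Gen %) and FGC/FGCT
# (Full GC count / total time) columns climb
jstat -gcutil <pid> 1000
```

The heap dump is the expensive step; on a box already in GC trouble it can take minutes, which is exactly why one faulty server was pulled from rotation first.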

JVM-level diagnosis under live fire demands:

  • Clear reasoning
  • Decisive action
  • Bridging application logic with runtime internals

---


Correlating Clues

Heap analysis showed a huge number of `char[]` arrays, all tied to the “Marriott activity configuration” — a giant object stuck in memory.

Thread Stack Analysis

# Count waiting threads
$ grep 'TIMED_WAITING' HSF_JStack.log | wc -l
336
# Count runnable threads
$ grep 'RUNNABLE' HSF_JStack.log | wc -l
246
  • 336 waiting threads
  • 246 running threads — probable problem zone
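The same census can be taken in-process instead of by grepping a stack dump. A minimal sketch using the standard JMX `ThreadMXBean` (class and method names here are my own; only the JDK API calls are real):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCensus {
    // Count live JVM threads grouped by Thread.State (RUNNABLE, TIMED_WAITING, ...)
    public static Map<Thread.State, Integer> census() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip expensive lock/synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        census().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Wired to a periodic reporter, a census like this would have flagged the abnormal RUNNABLE count minutes earlier than a manual `jstack`.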

Filtering RUNNABLE stacks → Found repeated:

at com.alibaba.fastjson.JSON.toJSONString(...)

Hypothesis:

A massive object repeatedly serialized by FastJSON → consuming huge CPU and exhausting threads → cluster collapse.

---

Act III: One Good Line of Code

Following stack clues, I found the smoking gun in `XxxxxCacheManager.java`:

// ... partial code omitted
public void updateActivityXxxCache(Long sellerId, List xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // Spread reads across 20 hash keys to avoid hot-spotting a single key
        for (int index = 0; index < XXX_CACHE_PARTITION_NUMBER; index++) {
            // Fatal: serialization inside the loop
            tairCache.put(String.format(ACTIVITY_PLAY_KEY, sellerId, index),
                          JSON.toJSONString(xxxDOList), // serialized 20 times!
                          EXPIRE_TIME);
        }
    } catch (Exception e) {
        log.warn("update cache exception occur", e);
    }
}

---

Chain Reaction

  • Midnight launch → empty cache → the 20-key hash design triggered on every write
  • A 1–2 MB object serialized 20× per write, all in the same thread
  • CPU meat-grinder effect
  • Tair LDB rate-limited under the 20× write amplification
  • Latency explosions → thread pool exhaustion → service avalanche

---

Act IV: Truth and Reflection

Rolling back the change that introduced the loop serialization restored cluster stability by 00:30.

Total downtime: ~30 minutes.

---

Old A’s Reverse Side Three Rules

  • Capacity-first mindset
  • Optimizations without capacity assessment = recklessness
  • Measure code block execution time
  • Fine-grained APM could’ve doubled efficiency in pinpointing the bottleneck
  • Technical debt’s surprise attack
  • Legacy Tair LDB + neglect = hidden cockroach waiting to strike
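Absent a full APM, even a crude timer around suspect blocks pays for itself. A minimal sketch — the 50 ms budget and the string-building workload are arbitrary stand-ins for the real serialization call:

```java
public class BlockTimer {
    // Run a block and return its wall-clock duration in milliseconds
    public static long timeMillis(Runnable block) {
        long start = System.nanoTime();
        block.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Stand-in for heavyweight work such as JSON serialization
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) sb.append(i);
        });
        System.out.println("block took " + elapsed + " ms");
        if (elapsed > 50) {
            System.out.println("WARN: block exceeded 50 ms budget");
        }
    }
}
```

In production this belongs behind a sampling switch and a metrics sink, not `System.out`, but even the crude form would have pointed straight at the 20× serialization loop.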

---

Walking empty Hangzhou streets at 1 a.m., I realized:

> All grand systems are built from individual lines of code.

> The devil hides in the smallest of them.

Takeaway: P3 incidents often stem from a single misplaced `for` loop.

---

Modern Development Parallel

Incidents like this underline why code discipline, integrated performance tooling, and cross-platform visibility are critical.


---

Summary:

A midnight promotion launch became a full-cluster crash due to loop-based heavy serialization. Quick identification, rollback, and a post-mortem informed lasting operational lessons.
