
How Singapore Ride-Hailing Giant Grab Uses AutoMQ to Reshape Its Kafka Streaming Platform

Honghao Wang

03 Nov 2025

AutoMQ Integration at Grab’s Coban Team

Date: 2025-11-03 14:59 (Beijing)

---

Overview

The Coban team is Grab’s real-time data streaming platform group.

They maintain an ecosystem around Kafka, serving multiple business units by collecting, storing, and processing data streams.

Key capabilities:

Entry point to Grab’s data lake.
Real-time event processing and analysis.
High throughput: several terabytes per hour.
Low latency and high availability support for critical services.

Figure 1: Grab’s data streaming platform

In addition to stability and performance, cost efficiency is a major concern.

This case study explains how AutoMQ helped Coban improve efficiency and reduce costs.

---

Pain Points Before AutoMQ

The team identified four major challenges:

Scaling compute resources was difficult: Spikes in resource usage during partition migrations reduced flexibility.
Storage scaling was tied to compute scaling: Disk storage could not be expanded without a full cluster scale-out or per-node disk upgrades.
Over-provisioning for peak demand: Resources were wasted during off-peak periods.
High-risk partition rebalancing: Maintenance-related rebalancing caused prolonged latency increases.

---

Requirements for Improvement

To overcome these issues, Coban needed a solution with:

Strong elasticity: Adjust compute resources dynamically for peak/off-peak without downtime.
Separation of storage and compute: Scale independently.
High Kafka compatibility: Seamless integration, minimal code changes.
Fast & stable partition migration: Handle surges without performance drops.
Low latency: Maintain responsiveness for latency-sensitive workloads.

Figure 2: New data streaming platform wish list

---


Solution: AutoMQ

The team selected AutoMQ, a cloud-native Kafka solution offering high elasticity and performance.

Figure 3: New data flow architecture with AutoMQ

Architecture Highlights

100% Apache Kafka® compatibility
Shared storage: EBS WAL + S3
EBS WAL → low-latency writes (<10ms)
S3 → reliable, scalable, cost-efficient storage
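
One way to picture the WAL + S3 split is a toy model in which a write is acknowledged once it reaches a small write-ahead log, and the log's contents are later uploaded in bulk to an object store. The sketch below is only an illustration of that idea, not AutoMQ's implementation: the class name, the flush threshold, and the object key format are all hypothetical, and in-memory Python containers stand in for EBS and S3.

```python
from dataclasses import dataclass, field


@dataclass
class ToySharedStorageLog:
    """Toy model: a small WAL for fast acks, an object store for bulk data."""

    flush_threshold_bytes: int = 16  # hypothetical; tiny so the demo flushes quickly
    wal: list = field(default_factory=list)           # stands in for the EBS WAL
    object_store: dict = field(default_factory=dict)  # stands in for S3
    next_offset: int = 0

    def append(self, record: bytes) -> int:
        """Write the record to the WAL and return its offset (the 'ack')."""
        offset = self.next_offset
        self.wal.append((offset, record))
        self.next_offset += 1
        # Once enough bytes have accumulated, move them to the object store.
        if sum(len(r) for _, r in self.wal) >= self.flush_threshold_bytes:
            self._flush()
        return offset

    def _flush(self) -> None:
        """Upload the WAL contents as one object, then free the WAL space."""
        first, last = self.wal[0][0], self.wal[-1][0]
        self.object_store[f"segment-{first}-{last}"] = list(self.wal)
        self.wal.clear()

    def read(self, offset: int) -> bytes:
        """Serve from the WAL if the record is still there, else from the object store."""
        for off, rec in self.wal:
            if off == offset:
                return rec
        for batch in self.object_store.values():
            for off, rec in batch:
                if off == offset:
                    return rec
        raise KeyError(offset)
```

Because the WAL is emptied after each upload, it can stay small and fixed in size while retained data grows in the object store, which is the property the architecture relies on.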

---

Why AutoMQ?

1. Rapid & Efficient Cluster Scaling

Kafka's replica-based design required complex data migrations when scaling → performance fluctuations.
AutoMQ stores data in shared storage → partitions are reassigned in seconds, with no data copied between Brokers.

2. On-Demand S3 Shared Storage

Elastic retention without local disk or Broker upgrades.

3. Fast Partition Reassignment

Minimal metadata sync needed.
Cloud-native design removes multi-replica overhead at Broker layer.
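
The difference between copy-based and metadata-only reassignment can be sketched with two small functions. This is a simplified model under stated assumptions, not a description of Kafka or AutoMQ internals: the function names, the partition name, and the example sizes and throughput are all hypothetical.

```python
def copy_based_reassign_seconds(partition_bytes: float, copy_bytes_per_sec: float) -> float:
    """Estimate reassignment time when partition data must be copied to the new Broker."""
    return partition_bytes / copy_bytes_per_sec


def reassign_shared_storage(owners: dict, partition: str, new_broker: str) -> int:
    """Reassign by updating ownership metadata only; returns the bytes copied (always 0)."""
    owners[partition] = new_broker
    return 0


# Hypothetical example: a 500 GB partition copied at a throttled 50 MB/s.
seconds = copy_based_reassign_seconds(500e9, 50e6)  # 10000.0 seconds, roughly 2.8 hours

# With shared storage, only the owner entry changes, whatever the partition size.
owners = {"orders-0": "broker-1"}
copied = reassign_shared_storage(owners, "orders-0", "broker-2")
```

In the copy-based model the time grows linearly with partition size, while in the shared-storage model it is independent of it, which is why reassignment can complete in seconds.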

4. Low Latency

Uses a fixed-size EBS volume (10 GB) + Direct I/O → single-digit-millisecond writes.

5. Full Kafka Compatibility

Passes all official Kafka test cases.

---

Evaluation & Deployment

Testing Stages

Performance
Benchmarks under multiple configurations.
Reliability
Simulated failovers and infrastructure failures.
Cost-effectiveness
Exceptional performance across all metrics.

Integration Steps

Extended Strimzi Kafka Operator features to support AutoMQ-specific WAL volume tasks.
Learned AutoMQ tooling and metrics for S3 and WAL monitoring.

---

Results

After AutoMQ deployment:

Throughput: Per-CPU-core throughput ↑ 3×; total throughput is among the largest internally.
Cost-effectiveness: ↑ 3×.
Partition reassignment time: 6 hours → <1 minute.
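
As a quick check on the scale of the reassignment improvement, the reported figures can be converted into a lower bound on the speedup. The variable names below are illustrative only; the inputs are the 6 hours and the "under 1 minute" stated above.

```python
before_seconds = 6 * 60 * 60    # 6 hours before AutoMQ
after_seconds_upper_bound = 60  # "under 1 minute", taken as an upper bound

# Since the new time is below 60 s, the real speedup is at least this large.
min_speedup = before_seconds / after_seconds_upper_bound  # 360.0
```

So the reported change corresponds to a speedup of more than 360×.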

Figure 4: Performance before AutoMQ scaling

Figure 5: Performance after AutoMQ scaling

---

Stability Improvements

Fast reallocation → minimal producer/consumer delay during scaling.
Independent storage expansion without compute waste.
Eliminated high I/O and network spikes during migrations.

---

Future Plans

Enable Self-Balancing (AutoMQ’s Cruise Control-like feature): Periodic automated partition rebalancing.
Optimize cost-effectiveness: Use auto scaling and spot instances.
Leverage S3 WAL to reduce cross-AZ traffic.
Explore Table Topic + Iceberg tables in S3.
