How Singapore Ride-Hailing Giant Grab Uses AutoMQ to Reshape Its Kafka Streaming Platform

AutoMQ Integration at Grab’s Coban Team

Date: 2025-11-03 14:59 (Beijing)

---

Overview

The Coban team is Grab’s real-time data streaming platform group.

They maintain an ecosystem around Kafka, serving multiple business units by collecting, storing, and processing data streams.

Key capabilities:

  • Entry point to Grab’s data lake.
  • Real-time event processing and analysis.
  • High throughput: several terabytes per hour.
  • Low latency and high availability support for critical services.

Figure 1: Grab’s data streaming platform

In addition to stability and performance, cost efficiency is a major concern.

This case study explains how AutoMQ helped Coban improve efficiency and reduce costs.

---

Pain Points Before AutoMQ

The team identified four major challenges:

  • Scaling compute resources was difficult: spikes in resource usage during partition migrations reduced flexibility.
  • Storage scaling was tied to compute scaling: disk storage could not be expanded without a full cluster scale-out or per-node disk upgrades.
  • Over-provisioning for peak demand: resources were wasted during off-peak periods.
  • High-risk partition rebalancing: maintenance-related rebalancing caused prolonged latency increases.

---

Requirements for Improvement

To overcome these issues, Coban needed a solution with:

  • Strong elasticity: Adjust compute resources dynamically for peak/off-peak without downtime.
  • Separation of storage and compute: Scale independently.
  • High Kafka compatibility: Seamless integration, minimal code changes.
  • Fast & stable partition migration: Handle surges without performance drops.
  • Low latency: Maintain responsiveness for latency-sensitive workloads.

Figure 2: New data streaming platform wish list

---

Solution: AutoMQ

The team selected AutoMQ, a cloud-native Kafka solution offering high elasticity and performance.

Figure 3: New data flow architecture with AutoMQ

Architecture Highlights

  • 100% Apache Kafka® compatibility
  • Shared storage: EBS WAL + S3
  • EBS WAL: low-latency writes (<10 ms)
  • S3: reliable, scalable, cost-efficient storage
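
The division of labor between the WAL and S3 can be pictured with a toy model. The sketch below is not AutoMQ code: the class name, the `wal_capacity` parameter, and the flush-when-full policy are all assumptions made for illustration. It only shows the general idea of acknowledging a write once it is in a small write-ahead log, then moving batches to object storage and recycling the log.

```python
from dataclasses import dataclass, field


@dataclass
class ToySharedStorageLog:
    """Toy model of a WAL-plus-object-store write path (illustrative only)."""

    wal_capacity: int = 4                             # hypothetical: records the WAL holds before flushing
    wal: list = field(default_factory=list)           # stands in for the small EBS volume
    object_store: list = field(default_factory=list)  # stands in for batches uploaded to S3

    @property
    def next_offset(self) -> int:
        # Offsets are dense: everything already uploaded plus everything still in the WAL.
        return sum(len(batch) for batch in self.object_store) + len(self.wal)

    def append(self, record: bytes) -> int:
        """Write the record to the WAL and return its offset (the 'ack')."""
        offset = self.next_offset
        self.wal.append((offset, record))
        if len(self.wal) >= self.wal_capacity:
            self.flush()
        return offset

    def flush(self) -> None:
        """Upload buffered records as one batch, then recycle the WAL space."""
        if self.wal:
            self.object_store.append(list(self.wal))
            self.wal.clear()

    def read(self, offset: int) -> bytes:
        """Look up a record in uploaded batches first, then in the WAL."""
        for batch in self.object_store + [self.wal]:
            for off, record in batch:
                if off == offset:
                    return record
        raise KeyError(offset)


log = ToySharedStorageLog(wal_capacity=2)
offsets = [log.append(b"a"), log.append(b"b"), log.append(b"c")]
print(offsets, len(log.object_store), len(log.wal))
```

In this model the WAL never grows past a fixed bound, which is one way to read why a small, fixed-size volume can sit in front of much larger object storage.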

---

Why AutoMQ?

1. Rapid & Efficient Cluster Scaling

  • With standard Kafka replication, scaling required complex data migrations that caused performance fluctuations.
  • AutoMQ keeps data in shared storage, so partitions are reassigned in seconds without copying data between Brokers.

2. On-Demand S3 Shared Storage

  • Elastic retention without local disk or Broker upgrades.

3. Fast Partition Reassignment

  • Minimal metadata sync needed.
  • Cloud-native design removes multi-replica overhead at Broker layer.

4. Low Latency

  • Uses a fixed-size EBS volume (10 GB) with Direct I/O for single-digit-millisecond writes.

5. Full Kafka Compatibility

  • Passes all official Kafka test cases.
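
In practice, Kafka compatibility means existing clients should need little more than a new broker address. The sketch below illustrates that idea with a plain config dict. The hostnames are hypothetical, and the helper `producer_config` is invented for this example. The keys are standard Kafka producer settings, and a dict of this shape can be handed to a Kafka client library such as confluent-kafka.

```python
def producer_config(bootstrap_servers: str) -> dict:
    """Build a producer config; only the broker address is cluster-specific."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "acks": "all",
        "enable.idempotence": True,
        "compression.type": "lz4",
    }


# Hypothetical endpoints, for illustration only.
kafka_conf = producer_config("kafka-broker.example.internal:9092")
automq_conf = producer_config("automq-broker.example.internal:9092")

# Every setting except the endpoint carries over unchanged.
changed_keys = {k for k in kafka_conf if kafka_conf[k] != automq_conf[k]}
print(changed_keys)  # {'bootstrap.servers'}
```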

---

Evaluation & Deployment

Testing Stages

  • Performance: benchmarks under multiple configurations.
  • Reliability: simulated failovers and infrastructure failures.
  • Cost-effectiveness.

Result: exceptional performance across all metrics.

Integration Steps

  • Extended Strimzi Kafka Operator features to support AutoMQ-specific WAL volume tasks.
  • Learned AutoMQ tooling and metrics for S3 and WAL monitoring.

---

Results

After AutoMQ deployment:

  • Throughput: single-core CPU throughput increased; total throughput is among the largest internally.
  • Cost-effectiveness: improved.
  • Partition reassignment: from 6 hours to under 1 minute.

Figure 4: Performance before AutoMQ scaling

Figure 5: Performance after AutoMQ scaling

---

Stability Improvements

  • Fast reallocation → minimal producer/consumer delay during scaling.
  • Independent storage expansion without compute waste.
  • Eliminated high I/O and network spikes during migrations.
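
A back-of-the-envelope calculation shows why copy-based reassignment is slow and heavy on I/O. The workload numbers below (2 TiB of partition data, a 100 MiB/s replication throttle) are hypothetical and are not Grab's figures. Only the "6 hours" and "under 1 minute" values come from this article.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def copy_based_reassignment_hours(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Hours needed when reassignment has to copy partition data between brokers."""
    return bytes_to_move / throttle_bytes_per_sec / 3600


# Hypothetical workload: 2 TiB to move, replication throttled to 100 MiB/s.
hours = copy_based_reassignment_hours(2048 * GIB, 100 * MIB)
print(f"copy-based reassignment: {hours:.2f} h")  # copy-based reassignment: 5.83 h

# Figures quoted in this article: 6 hours before, under 1 minute after.
min_speedup = (6 * 60) / 1  # minutes before / minutes after
print(f"speedup: at least {min_speedup:.0f}x")  # speedup: at least 360x
```

With shared storage the bytes-to-move term effectively drops out, because only partition ownership metadata changes hands. That is consistent with the absence of I/O and network spikes noted above.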

---

Future Plans

  • Enable Self-Balancing (AutoMQ’s Cruise Control-like feature): periodic automated partition rebalancing.
  • Optimize cost-effectiveness: use auto scaling and spot instances.
  • Leverage S3 WAL to reduce cross-AZ traffic.
  • Explore Table Topic + Iceberg tables in S3.
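
To make the idea of periodic automated rebalancing concrete, here is a minimal greedy sketch. It is not AutoMQ's Self-Balancing algorithm: the function name, the `tolerance` parameter, and the traffic numbers are assumptions for illustration. Each step moves one partition from the busiest broker to the least busy one, which is only practical when a move is cheap, as it is when no data has to be copied.

```python
def rebalance(assignment: dict[str, dict[str, float]], tolerance: float = 0.1) -> list[tuple[str, str, str]]:
    """Greedily move partitions from the busiest to the least busy broker.

    `assignment` maps broker -> {partition: traffic}. It is modified in place.
    Returns the list of (partition, source, destination) moves.
    """
    moves = []
    while True:
        load = {broker: sum(parts.values()) for broker, parts in assignment.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        mean = sum(load.values()) / len(load)
        if gap <= tolerance * mean:
            break  # balanced enough
        # Only moves with 0 < traffic < gap narrow the difference between src and dst.
        candidates = {p: t for p, t in assignment[src].items() if 0 < t < gap}
        if not candidates:
            break  # no single move would help
        # Prefer the partition closest to half the gap, which evens out the pair best.
        part = min(candidates, key=lambda p: abs(candidates[p] - gap / 2))
        assignment[dst][part] = assignment[src].pop(part)
        moves.append((part, src, dst))
    return moves


# Hypothetical traffic per partition, in MB/s.
cluster = {
    "broker-1": {"orders-0": 40.0, "orders-1": 30.0, "rides-0": 20.0},
    "broker-2": {"rides-1": 10.0},
    "broker-3": {},
}
moves = rebalance(cluster)
final_load = {broker: sum(parts.values()) for broker, parts in cluster.items()}
print(moves)
print(final_load)
```

The loop stops either when the spread between brokers falls within the tolerance or when no single partition move would narrow it further, so it always terminates.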
