How Singapore Ride-Hailing Giant Grab Uses AutoMQ to Reshape Its Kafka Streaming Platform
AutoMQ Integration at Grab’s Coban Team
Date: 2025-11-03 14:59 (Beijing)
---
Overview


The Coban team is Grab’s real-time data streaming platform group.
They maintain an ecosystem around Kafka, serving multiple business units by collecting, storing, and processing data streams.
Key capabilities:
- Entry point to Grab’s data lake.
- Real-time event processing and analysis.
- High throughput: several terabytes per hour.
- Low latency and high availability support for critical services.

Figure 1: Grab’s data streaming platform
In addition to stability and performance, cost efficiency is a major concern.
This case study explains how AutoMQ helped Coban improve efficiency and reduce costs.

---
Pain Points Before AutoMQ
The team identified four major challenges:
- Scaling compute resources was difficult: spikes in resource usage during partition migrations reduced flexibility.
- Storage scaling was tied to compute scaling: disk storage could not be expanded without a full cluster scale-out or per-node disk upgrades.
- Over-provisioning for peak demand: resources were wasted during off-peak periods.
- High-risk partition rebalancing: maintenance-related rebalancing caused prolonged latency increases.
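
The cost of sizing a cluster for peak demand can be made concrete with a little arithmetic. The sketch below is illustrative only: the hourly load profile and the 100 MB/s per-broker capacity are hypothetical numbers, not Grab's figures, and `brokers_needed` and `provisioning_report` are names invented for this example.

```python
import math


def brokers_needed(throughput_mb_s: float, per_broker_mb_s: float) -> int:
    """Smallest broker count that can carry the given throughput (at least 1)."""
    return max(1, math.ceil(throughput_mb_s / per_broker_mb_s))


def provisioning_report(hourly_load_mb_s, per_broker_mb_s):
    """Compare a fixed, peak-sized cluster with hour-by-hour elastic sizing."""
    peak_brokers = brokers_needed(max(hourly_load_mb_s), per_broker_mb_s)
    fixed_broker_hours = peak_brokers * len(hourly_load_mb_s)
    elastic_broker_hours = sum(
        brokers_needed(load, per_broker_mb_s) for load in hourly_load_mb_s
    )
    return {
        "peak_brokers": peak_brokers,
        "fixed_broker_hours": fixed_broker_hours,
        "elastic_broker_hours": elastic_broker_hours,
        "wasted_broker_hours": fixed_broker_hours - elastic_broker_hours,
    }


# Hypothetical 24-hour load profile in MB/s (assumed, not measured).
load = [200] * 8 + [600] * 4 + [1000] * 4 + [600] * 4 + [200] * 4
report = provisioning_report(load, per_broker_mb_s=100)
print(report)
```

Under these assumed numbers, the gap between `fixed_broker_hours` and `elastic_broker_hours` is the capacity that sits idle when a cluster cannot shrink during off-peak hours.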
---
Requirements for Improvement
To overcome these issues, Coban needed a solution with:
- Strong elasticity: Adjust compute resources dynamically for peak/off-peak without downtime.
- Separation of storage and compute: Scale independently.
- High Kafka compatibility: Seamless integration, minimal code changes.
- Fast & stable partition migration: Handle surges without performance drops.
- Low latency: Maintain responsiveness for latency-sensitive workloads.

Figure 2: New data streaming platform wish list
---
Solution: AutoMQ
The team selected AutoMQ, a cloud-native Kafka solution offering high elasticity and performance.

Figure 3: New data flow architecture with AutoMQ
Architecture Highlights
- 100% Apache Kafka® compatibility
- Shared storage: EBS WAL (low-latency writes, <10 ms) + S3 (reliable, scalable, cost-efficient storage)
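
The split between a small write-ahead log and object storage can be pictured with a toy model. This is a sketch under stated assumptions, not AutoMQ's implementation: the class `ToySharedStorageLog`, its flush-when-full policy, and the in-memory lists standing in for the EBS WAL and S3 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToySharedStorageLog:
    """Toy model: acknowledge a write once it is in a small WAL, and move
    WAL contents to object storage when the WAL fills up."""

    wal_capacity_bytes: int
    wal: list = field(default_factory=list)           # stands in for the EBS WAL
    object_store: list = field(default_factory=list)  # stands in for S3 objects
    wal_bytes: int = 0
    flushed_records: int = 0

    def append(self, record: bytes) -> int:
        """Write a record to the WAL and return its offset (the 'ack')."""
        if len(record) > self.wal_capacity_bytes:
            raise ValueError("record larger than the WAL")
        if self.wal_bytes + len(record) > self.wal_capacity_bytes:
            self.flush()  # make room by uploading the current WAL contents
        self.wal.append(record)
        self.wal_bytes += len(record)
        return self.flushed_records + len(self.wal) - 1

    def flush(self) -> None:
        """Upload the WAL contents as one object, then reuse the WAL space."""
        if self.wal:
            self.object_store.append(tuple(self.wal))
            self.flushed_records += len(self.wal)
            self.wal.clear()
            self.wal_bytes = 0

    def read_all(self) -> list:
        """Read uploaded records first, then records still held in the WAL."""
        uploaded = [r for obj in self.object_store for r in obj]
        return uploaded + list(self.wal)


log = ToySharedStorageLog(wal_capacity_bytes=10)
offsets = [log.append(b"aaaa"), log.append(b"bbbb"), log.append(b"cccc")]
print(offsets, len(log.object_store), log.read_all())
```

The point of the model is that the WAL can stay small and fixed in size because it only buffers recent writes, while retention grows in the object store.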

---
Why AutoMQ?
1. Rapid & Efficient Cluster Scaling
- Traditional Kafka replication required complex data migrations when scaling, which caused performance fluctuations.
- AutoMQ stores data in shared storage → partitions reassigned in seconds without data copies between Brokers.
2. On-Demand S3 Shared Storage
- Elastic retention without local disk or Broker upgrades.
3. Fast Partition Reassignment
- Minimal metadata sync needed.
- Cloud-native design removes multi-replica overhead at Broker layer.
4. Low Latency
- Uses fixed-size EBS (10 GB) + Direct I/O ➜ single-digit millisecond writes.
5. Full Kafka Compatibility
- Passes all official Kafka test cases.
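
A back-of-the-envelope comparison shows why avoiding data copies matters for reassignment time. The sketch below is illustrative: the 500 GB partition size, 50 MB/s copy bandwidth, and 2-second metadata update are hypothetical values chosen for the example, and both function names are invented here.

```python
def copy_based_reassignment_seconds(partition_bytes: int,
                                    copy_bandwidth_bytes_per_s: int) -> float:
    """Time to move a partition when its data must be copied to the new broker."""
    return partition_bytes / copy_bandwidth_bytes_per_s


def metadata_only_reassignment_seconds(metadata_update_s: float) -> float:
    """With shared storage only ownership changes, so the time does not
    depend on how much data the partition holds."""
    return metadata_update_s


# Hypothetical inputs (assumptions, not measurements from the case study).
partition_bytes = 500 * 10**9   # 500 GB of retained data
copy_bandwidth = 50 * 10**6     # 50 MB/s spare replication bandwidth
metadata_update_s = 2.0         # assumed time to switch partition ownership

copy_s = copy_based_reassignment_seconds(partition_bytes, copy_bandwidth)
meta_s = metadata_only_reassignment_seconds(metadata_update_s)
print(f"copy-based: {copy_s / 3600:.2f} h, metadata-only: {meta_s:.0f} s")
```

In this model the copy-based time grows linearly with partition size, while the metadata-only time stays constant, which is the property the shared-storage design relies on.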

---
Evaluation & Deployment
Testing Stages
- Performance: benchmarks under multiple configurations.
- Reliability: simulated failovers and infrastructure failures.
- Cost-effectiveness.
AutoMQ performed exceptionally across all of these metrics.
Integration Steps
- Extended Strimzi Kafka Operator features to support AutoMQ-specific WAL volume tasks.
- Learned AutoMQ tooling and metrics for S3 and WAL monitoring.

---
Results
After AutoMQ deployment:
- Throughput: per-CPU-core throughput increased 3×; total throughput is among the largest within Grab.
- Cost-effectiveness: improved 3×.
- Partition reassignment: reduced from 6 hours to under 1 minute.

Figure 4: Performance before AutoMQ scaling

Figure 5: Performance after AutoMQ scaling
---
Stability Improvements
- Fast reallocation → minimal producer/consumer delay during scaling.
- Independent storage expansion without compute waste.
- Eliminated high I/O and network spikes during migrations.

---
Future Plans
- Enable Self-Balancing (AutoMQ’s Cruise Control-like feature): periodic automated partition rebalancing.
- Optimize cost-effectiveness: use auto scaling and spot instances.
- Leverage S3 WAL to reduce cross-AZ traffic.
- Explore Table Topic + Iceberg tables in S3.
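
To give a feel for what periodic automated rebalancing involves, here is a minimal greedy sketch. It is not AutoMQ's Self-Balancing algorithm: the function `rebalance`, the single "load" number per partition, and the move-from-busiest-to-idlest policy are assumptions made for illustration.

```python
def rebalance(assignment, load, max_moves=100):
    """Greedy sketch: repeatedly move one partition from the most loaded
    broker to the least loaded one while that narrows the gap between them.

    assignment: dict of broker name -> set of partition names
    load:       dict of partition name -> positive load figure
    Returns (new_assignment, moves), where moves is a list of
    (partition, source_broker, destination_broker) tuples.
    """
    assignment = {b: set(ps) for b, ps in assignment.items()}  # work on a copy
    moves = []
    for _ in range(max_moves):
        totals = {b: sum(load[p] for p in ps) for b, ps in assignment.items()}
        src = max(totals, key=totals.get)
        dst = min(totals, key=totals.get)
        gap = totals[src] - totals[dst]
        # Moving load L changes the gap to |gap - 2L|, so only 0 < L < gap helps.
        candidates = [p for p in assignment[src] if 0 < load[p] < gap]
        if not candidates:
            break
        # Prefer the partition that leaves the smallest gap; break ties by name.
        best = min(candidates, key=lambda p: (abs(gap - 2 * load[p]), p))
        assignment[src].remove(best)
        assignment[dst].add(best)
        moves.append((best, src, dst))
    return assignment, moves


# Hypothetical cluster: all partitions start on one broker.
start = {"b1": {"t-0", "t-1", "t-2", "t-3"}, "b2": set()}
loads = {"t-0": 40, "t-1": 30, "t-2": 20, "t-3": 10}
balanced, plan = rebalance(start, loads)
print(plan)
```

A policy like this is only practical to run periodically when each move is cheap, which is why it pairs naturally with reassignments that do not copy data.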
