Baidu’s Big Data Cost Governance Practices

Introduction

This article presents cost governance practices for big data within Baidu MEG (Mobile Ecosystem Group), set against the backdrop of rapid business growth and the imperative to reduce costs and increase efficiency. It addresses:

  • Key challenges currently faced
  • Optimization strategies for computing and storage
  • Governance achievements
  • Future directions

The goal is to offer the industry actionable governance experience.

---

01 Background

With Baidu’s business expansion, offline data volumes and their associated costs are rising sharply. Rigorous big data governance is critical to sustaining growth while controlling expenditure.

We examined the situation from three perspectives:

Resource Status

  • Multiple offline business types across product lines
  • Hundreds to thousands of AFS (Append-only File Storage) accounts and EMR (E-MapReduce) queues
  • No unified governance plan or standards

Management Status

  • Uneven resource utilization
  • Inconsistent storage/computing queue management
  • Lack of comprehensive procedures
  • Inefficient computing tasks, challenging maintenance

Cost Status

  • Tens of millions of offline computing cores
  • Thousands of petabytes of storage
  • Annual costs in the billions of RMB
  • Without optimization, costs will continue to climb

Summary

Key issues include:

  • Data fragmentation
  • Resource waste
  • Escalating costs

Solution: Create unified governance standards and a big data resource management platform to deliver:

  • Storage, computing, task, and cost views per product line
  • Engine-level optimization for storage and computing

△ Current state of data cost governance

---

02 Data Cost Governance Practice Plan

2.1 Overall Governance Framework

We devised a framework comprising:

  • Data asset health measurement
  • Platform capabilities
  • Engine empowerment

Focus: Computing + Storage governance for cost reduction and efficiency.


△ Overall framework for data cost governance

---

2.1.1 Data Asset Health Measurement

We introduced a unified health score metric to assess computing and storage resources.

Data Collection

From an offline data acquisition service, gathering:

  • Computing queues
  • Tasks
  • Storage accounts
  • Data tables

Governance Items

  • Computing: Uneven utilization, long-running resource-heavy tasks, data skew, invalid tasks.
  • Storage: Cold data (untouched for 1, 2, or N years), anomalous lifecycles, low inode utilization, unclaimed directories.

Health Scores

  • Computing Health = Weighted average of utilization, balance, and governance-item scores.
  • Storage Health = Weighted combination of usage, peak, cold data ratio, and governance factors (an illustrative calculation follows below).

Benefit: Consistent, standardized measurement guides governance across product lines.
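
As an illustration, the weighted scoring might be computed as in the sketch below; the sub-scores, weights, and field names are assumptions for demonstration, since the article does not disclose the actual formula.

```python
# Illustrative only: the real weights and sub-scores are not published,
# so the values below are assumptions.
def health_score(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-item sub-scores (each in [0, 100])."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * sub_scores[k] for k in weights)

# Hypothetical computing-health inputs: utilization, load balance, and an
# aggregate score over the governance items listed above.
computing = health_score(
    sub_scores={"utilization": 72.0, "balance": 85.0, "governance": 60.0},
    weights={"utilization": 0.5, "balance": 0.2, "governance": 0.3},
)
print(f"computing health = {computing:.1f}")  # 71.0
```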

---

2.1.2 Platform Capabilities

Our Big Data Resource Management Platform aggregates all offline computation and storage data into:

  • Computation View: Queue usage, tasks, lifecycle management, optimization.
  • Storage View: Account usage, directory cleanup, migration, cold data mining.
  • Cost View: Unified visual of total offline resource costs.

---

2.1.3 Engine Empowerment

Many users lack awareness of task tuning. We therefore leverage engine capabilities in two areas:

  • Computation: Machine learning-driven intelligent parameter tuning for optimal configurations.
  • Storage: Analyze datasets for smart compression without impacting performance.

---

03 Computation & Storage Cost Optimization

3.1 Computation Governance

Three key optimizations:

3.1.1 Management Control

  • Manage >1,000 EMR queues and >10,000 Hadoop/Spark tasks
  • Apply 30+ real-time control policies (a sketch follows the figure below), e.g.:
      • Concurrency limits
      • Basic parameter tuning
      • Zombie task control
  • Coverage: Millions of cores, 200,000+ control triggers per day.

△ Task Management and Control Workflow
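
To make the control flow concrete, here is a minimal sketch of how a policy chain might evaluate one task. The task fields, thresholds, and actions are illustrative assumptions, not the platform’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical task snapshot; the field names are illustrative.
@dataclass
class TaskSnapshot:
    task_id: str
    running_containers: int
    progress: float                 # 0.0 - 1.0; stalled progress hints at a zombie
    last_progress_change: datetime
    requested_cores: int

ZOMBIE_STALL = timedelta(hours=6)   # assumed stall threshold
MAX_CONCURRENCY = 500               # assumed per-queue concurrency cap

def control_actions(task: TaskSnapshot, now: datetime) -> list[str]:
    """Return the control actions a policy chain might trigger for one task."""
    actions = []
    if task.progress < 1.0 and now - task.last_progress_change > ZOMBIE_STALL:
        actions.append("kill: zombie task (no progress for 6h)")
    if task.running_containers > MAX_CONCURRENCY:
        actions.append("throttle: concurrency above queue cap")
    return actions
```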

---

3.1.2 Hybrid Scheduling

Findings:

  • Hadoop: High CPU, low memory
  • Spark: Lower CPU, high memory
  • Low average utilization with time-varying peaks
  • Queue fragmentation

Solution:

Hadoop + Spark hybrid scheduling:

  • Submission: Rank tasks by priority and submission time
  • Scheduling: Apply strategy chains to select the optimal queue
  • Execution: Balanced utilization, better use of low-traffic queues

△ Task Hybrid Scheduling Workflow
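
The scheduling idea can be sketched as follows, assuming simplified task and queue records; the priority convention and the 80% utilization filter are illustrative stand-ins for the real strategy chains.

```python
from dataclasses import dataclass

@dataclass
class PendingTask:
    name: str
    priority: int      # smaller = more urgent (assumed convention)
    submit_ts: float   # submission time, epoch seconds

def schedule(tasks, queues):
    """Rank by priority then submission time; a simple strategy chain
    filters overloaded queues and picks the least-utilized one."""
    for task in sorted(tasks, key=lambda t: (t.priority, t.submit_ts)):
        candidates = [q for q in queues if q["utilization"] < 0.8] or queues
        queue = min(candidates, key=lambda q: q["utilization"])
        yield task.name, queue["name"]

tasks = [PendingTask("spark-etl", 1, 100.0), PendingTask("hadoop-report", 0, 200.0)]
queues = [{"name": "q1", "utilization": 0.9}, {"name": "q2", "utilization": 0.4}]
for name, q in schedule(tasks, queues):
    print(name, "->", q)   # hadoop-report -> q2, then spark-etl -> q2
```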

---

3.1.3 Intelligent Parameter Tuning

Challenges:

  • Poor configuration awareness → wasted resources
  • Spark’s engine-level optimizers overlook information specific to recurring jobs

Solution:

  • Model-trained tuning for `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` (a sketch follows the figures below)
  • Iterative parameter recommendation and feedback cycles
  • HBO (History-Based Optimization):
      • Join/aggregation tuning
      • Shuffle partition adjustment
      • Serialization optimization

△ Task Parameter Configuration Case


△ Basic Parameter Intelligent Tuning Flow


△ HBO Intelligent Tuning Flow
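
For intuition, one recommendation step in the feedback cycle might resemble the sketch below. The three Spark parameter names are real, but the history metrics and heuristics are assumptions; the actual model-trained tuner is more sophisticated.

```python
# Minimal sketch of a recommend-and-feedback step, assuming the history
# store exposes per-run peak usage; the heuristics are illustrative.
def recommend(history: dict) -> dict:
    """Derive next-run Spark parameters from the previous run's metrics."""
    peak_mem_gb = history["peak_executor_memory_gb"]
    avg_core_util = history["avg_core_utilization"]
    return {
        # Add ~20% headroom over the observed peak memory.
        "spark.executor.memory": f"{int(peak_mem_gb * 1.2) or 1}g",
        # Shrink cores when they sat mostly idle last run.
        "spark.executor.cores": 2 if avg_core_util < 0.5 else 4,
        "spark.executor.instances": history["suggested_instances"],
    }

last_run = {"peak_executor_memory_gb": 6.2, "avg_core_utilization": 0.35,
            "suggested_instances": 40}
for k, v in recommend(last_run).items():
    print(f"--conf {k}={v}")
```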

---

3.2 Storage Governance

Current Issues:

  • Numerous unclaimed accounts without quotas
  • Continuous data growth with redundant historical data
  • Weak security controls

---

3.2.1 Lifecycle Management

Five layers:

  • Access Layer: Standards for usage, quota, cold data handling
  • Service Layer: Cold data detection, cost analysis
  • Storage Layer: Ownership metadata in MySQL, usage in Table system
  • Execution Layer: Daily cleanup, compression, monitoring
  • User Layer: Dashboards, APIs, governance tools

△ Storage Lifecycle Management Process
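
As a rough sketch of the service-layer cold data detection, the following buckets directories by idle time; the metadata fields (path, last access timestamp) are assumptions about what the AFS scan exposes.

```python
import time

YEAR = 365 * 24 * 3600  # seconds in a year

def classify(entries, now=None):
    """Bucket directories by how long they have gone without access."""
    now = now or time.time()
    buckets = {"hot": [], "cold_1y": [], "cold_2y_plus": []}
    for e in entries:
        idle = now - e["last_access_ts"]
        if idle >= 2 * YEAR:
            buckets["cold_2y_plus"].append(e["path"])
        elif idle >= YEAR:
            buckets["cold_1y"].append(e["path"])
        else:
            buckets["hot"].append(e["path"])
    return buckets
```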

---

3.2.2 Basic Governance

  • Parse quota + cold data
  • Monitor trends, detect anomalies
  • User-configured cleanup/compression/monitoring
  • Scheduled execution via backend services

△ AFS Storage Basic Governance Process
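
A user-configured rule and its scheduled execution might look like the sketch below; the rule schema and the injected helpers (list_old_files, delete_path, compress_path) are hypothetical.

```python
# Hypothetical cleanup/compression rules a user might configure.
RULES = [
    {"path": "/user/app/logs", "action": "delete", "retain_days": 30},
    {"path": "/user/app/archive", "action": "compress", "older_than_days": 90},
]

def run_daily(rules, list_old_files, delete_path, compress_path):
    """Apply each rule to files older than its threshold (run on a daily schedule)."""
    for rule in rules:
        days = rule.get("retain_days") or rule["older_than_days"]
        for f in list_old_files(rule["path"], days):
            (delete_path if rule["action"] == "delete" else compress_path)(f)
```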

---

3.2.3 Intelligent Compression

  • Data Warehouse Tables: Sort optimization, ZSTD compression, page size/level tuning
  • Non-DW Tables: Hot/warm/cold classification, format optimization, scheduled compression

△ Intelligent Compression Process
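
For warehouse tables, the sort-then-compress idea can be sketched with PySpark writing Parquet in ZSTD; the paths are placeholders, and the zstd level option name depends on the Parquet version in use.

```python
from pyspark.sql import SparkSession

# Sketch of warehouse-table recompression, assuming Parquet + ZSTD.
spark = (SparkSession.builder
         .appName("zstd-recompress")
         .config("spark.sql.parquet.compression.codec", "zstd")
         .getOrCreate())

df = spark.read.parquet("/warehouse/events/raw")   # placeholder input path

# Sorting within partitions clusters similar rows together, which
# typically improves the compression ratio at no read-time cost.
(df.sortWithinPartitions("user_id", "event_time")
   .write.mode("overwrite")
   .option("parquet.compression.codec.zstd.level", "6")  # version-dependent option
   .parquet("/warehouse/events/zstd"))               # placeholder output path
```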

---

04 Governance Achievements

Data Development & Cost Optimization

  • Efficiency: Resource delivery from weeks → daily cycles
  • Computing Cost: +30% EMR core utilization → tens of millions of RMB saved annually
  • Storage Efficiency: Managed thousands of accounts, cleaned unused data
  • Storage Cost: +20% utilization → hundreds of PB freed

Governance Assets

  • Standardized development processes
  • Resource usage & cost dashboards
  • Task lifecycle overviews
  • Detailed governance item dashboards

---

05 Future Plans

We will:

  • Continue refining standardized, intelligent governance
  • Integrate lessons into processes and standards
  • Explore AI-powered automation for broader applications

Data-heavy enterprises can adopt similar AI-plus-governance frameworks to achieve efficiency, traceability, and sustainable growth.
