# Benchmarking GPT OSS on Google Cloud C4 VMs with Intel® Xeon® 6 (Granite Rapids)

Intel and Hugging Face partnered to demonstrate **real-world performance gains** from upgrading to **Google’s new C4 Virtual Machine (VM)**, powered by **Intel® Xeon® 6 processors (codenamed Granite Rapids – GNR)**.  
They benchmarked the **OpenAI GPT OSS Mixture-of-Experts (MoE) Large Language Model** to measure text generation improvements.

**Key outcome:**  
- **1.7× Total Cost of Ownership (TCO) gain** over earlier-generation Google C3 VM instances  
- **1.4×–1.7× higher throughput per vCPU** (and, at comparable per-vCPU pricing, per dollar)  
- **Lower hourly cost** compared to C3 VMs

---

## 1. Introduction

### What is GPT OSS?
GPT OSS is an open-source **Mixture of Experts (MoE)** model from OpenAI.  
An MoE uses multiple specialized “expert” sub-networks plus a *gating network* that decides which experts to run per input token.

**Advantages:**
- **Efficient scaling** of parameters without a linear increase in compute cost
- **Specialization** across data distributions
- **CPU inference viability** — only a small set of experts runs per token
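
As a minimal sketch of the gating step described above (toy dimensions, not GPT OSS's actual router), a linear gate scores each expert for each token and only the top-k experts are selected per token:

```python
import torch
import torch.nn.functional as F

# Toy router: 8 experts, keep the 2 best-scoring experts per token.
num_experts, top_k, hidden = 8, 2, 16
gate = torch.nn.Linear(hidden, num_experts)

tokens = torch.randn(5, hidden)            # a batch of 5 token embeddings
scores = F.softmax(gate(tokens), dim=-1)   # (5, num_experts) routing probabilities
weights, expert_ids = torch.topk(scores, k=top_k, dim=-1)
# expert_ids[i] lists the experts that will process token i;
# weights[i] are the coefficients used to mix their outputs.
print(expert_ids)
```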

---

### Intel & Hugging Face Optimization
Intel and Hugging Face added an execution optimization ([PR #40304](https://github.com/huggingface/transformers/pull/40304)) that:
- Eliminates wasteful computation (experts no longer process all tokens)
- Processes only **assigned tokens per expert**
- Reduces FLOPs and improves performance efficiency
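
A hedged sketch of the idea (illustrative only, using a top-1 router for brevity; the actual change in PR #40304 lives inside the Transformers GPT OSS MoE layer): each expert is applied only to the tokens routed to it, instead of every expert running over the whole batch and the unused outputs being discarded.

```python
import torch

def sparse_moe_forward(tokens, expert_ids, experts):
    # tokens: (num_tokens, hidden); expert_ids: (num_tokens,) index of the chosen expert
    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        mask = expert_ids == e                 # tokens assigned to expert e
        if mask.any():
            out[mask] = expert(tokens[mask])   # run expert e only on its own tokens
    return out

hidden = 16
experts = torch.nn.ModuleList([torch.nn.Linear(hidden, hidden) for _ in range(4)])
tokens = torch.randn(10, hidden)
expert_ids = torch.randint(0, len(experts), (10,))
print(sparse_moe_forward(tokens, expert_ids, experts).shape)  # torch.Size([10, 16])
```

With E experts and one expert per token, this does roughly 1/E of the expert FLOPs of the dense "every expert sees every token" formulation.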

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-52.png)

---

This type of framework-level improvement makes **cloud AI workloads** more cost-effective.  
Creators and developers can leverage such gains in their workflows via platforms like [AiToEarn](https://aitoearn.ai/), which support:
- AI-powered creation  
- Global multi-platform publishing  
- Monetization tracking

---

## 2. Benchmark Scope & Hardware

We benchmarked GPT OSS with **controlled, repeatable** text generation tests designed to isolate:
1. Architectural differences between **C4 (Xeon 6 – GNR)** and **C3 (Xeon 4th Gen – SPR)**
2. Execution efficiency of **MoE inference**

**Focus metrics:**
- **Steady-state decoding latency** (per token)
- **Throughput scaling** with larger batch sizes

A fixed sequence length, a static KV cache, and the **SDPA attention** backend keep the runs deterministic and comparable.

---

### Configuration Summary
- **Model:** [unsloth/gpt-oss-120b-BF16](https://huggingface.co/unsloth/gpt-oss-120b-BF16)  
- **Precision:** bfloat16  
- **Input length:** 1024 tokens (left-padded)  
- **Output length:** 1024 tokens  
- **Batch sizes tested:** 1, 2, 4, 8, 16, 32, 64  
- **Features:**
  - Static KV cache
  - SDPA attention backend
- **Metric:** Throughput in total generated tokens/sec

---

### Hardware Specs

| Instance | Machine type       | Architecture                           | vCPUs |
|----------|--------------------|----------------------------------------|-------|
| **C3**   | `c3-standard-176`  | 4th Gen Intel® Xeon® (SPR)              | 176   |
| **C4**   | `c4-standard-144`  | Intel® Xeon® 6 (Granite Rapids – GNR)   | 144   |

---

## 3. Creating the VMs

### 3.1 C3 VM Setup
1. Go to [Google Cloud Console](https://console.cloud.google.com/)  
2. Create a VM → choose **C3**, machine type: `c3-standard-176`  
3. Set **CPU platform** and enable **all-core turbo**  
   [![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-47.png)](https://huggingface.co/datasets/Intel/blog/resolve/main/gpt-oss-on-intel-xeon/spr.png)
4. Configure OS & Storage:  
   [![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-45.png)](https://huggingface.co/datasets/Intel/blog/resolve/main/gpt-oss-on-intel-xeon/spr-os.png)  
5. Leave other settings default and click **Create**

---

### 3.2 C4 VM Setup
1. Go to [Google Cloud Console](https://console.cloud.google.com/)  
2. Create a VM → choose **C4**, machine type: `c4-standard-144`  
3. Set **CPU platform** and enable **all-core turbo**  
   [![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-38.png)](https://huggingface.co/datasets/Intel/blog/resolve/main/gpt-oss-on-intel-xeon/gnr.png)
4. Configure OS & Storage same as C3  
5. Click **Create**

---

## 4. Environment Setup

SSH into the VM and install Docker, then:

```bash
git clone https://github.com/huggingface/transformers.git
cd transformers/
git checkout 26b65fb5168f324277b85c558ef8209bfceae1fe
cd docker/transformers-intel-cpu/

# The image tag below is arbitrary -- pick any name and reuse it in `docker run`.
sudo docker build . -t transformers-intel-cpu

sudo docker run -it --rm --privileged \
    -v /home/:/workspace \
    transformers-intel-cpu /bin/bash
```


---

### Inside the container:
1. **Install the exact Transformers version**

   ```bash
   pip install git+https://github.com/huggingface/transformers.git@26b65fb5168f324277b85c558ef8209bfceae1fe
   ```


2. **Install PyTorch CPU build**

   ```bash
   pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
   ```
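
Optionally, a quick sanity check (an illustrative addition, not part of the original instructions) confirms that the CPU wheel is active before benchmarking:

```python
import torch

print("torch version:", torch.__version__)           # expected: 2.8.0+cpu
print("CUDA available:", torch.cuda.is_available())  # expected: False for the CPU wheel
print("intra-op threads:", torch.get_num_threads())  # threads PyTorch will use per op
```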


---

## 5. Benchmark Procedure

For each **batch size**:
1. Create fixed-length left-padded input (1024 tokens)  
2. Warm-up run  
3. Set `max_new_tokens=1024`, measure **total latency**  
4. **Throughput formula**:  
   \[
   \text{Throughput} = \frac{\text{OutputTokens} \times \text{BatchSize}}{\text{TotalLatency}}
   \]

Run:

```bash
numactl -l python benchmark.py
```


---

### Example Benchmark Script

```python
import os
import time

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

INPUT_TOKENS = 1024
OUTPUT_TOKENS = 1024


def get_inputs(tokenizer, batch_size):
    # Build a batch of fixed-length, left-padded prompts from a small subset of The Pile.
    dataset = load_dataset("ola13/small-the_pile", split="train")
    tokenizer.padding_side = "left"
    selected_texts = []
    for sample in dataset:
        input_ids = tokenizer(sample["text"], return_tensors="pt").input_ids
        # Start the batch with a sample of at least INPUT_TOKENS tokens,
        # then fill the remaining slots with the following samples.
        if len(selected_texts) == 0 and input_ids.shape[-1] >= INPUT_TOKENS:
            selected_texts.append(sample["text"])
        elif len(selected_texts) > 0:
            selected_texts.append(sample["text"])
        if len(selected_texts) == batch_size:
            break

    # Pad/truncate every prompt to exactly INPUT_TOKENS tokens.
    return tokenizer(
        selected_texts,
        max_length=INPUT_TOKENS,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )


def run_generate(model, inputs, generation_config):
    # One warm-up pass, then a timed pass; returns the total generation latency in seconds.
    inputs["generation_config"] = generation_config
    model.generate(**inputs)  # warm up
    pre = time.time()
    model.generate(**inputs)
    latency = time.time() - pre
    return latency
```
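
The listing above only defines the helpers. A minimal driver appended to it could look like the sketch below; this is an assumption rather than the original script, with the model ID, bfloat16 precision, SDPA attention, static KV cache, output length, and batch sizes taken from the configuration summary in Section 2 and throughput computed with the formula above.

```python
from transformers import GenerationConfig

MODEL_ID = "unsloth/gpt-oss-120b-BF16"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # ensure left padding has a pad token

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,    # bfloat16 weights
        attn_implementation="sdpa",    # SDPA attention backend
    )
    model.eval()

    generation_config = GenerationConfig(
        max_new_tokens=OUTPUT_TOKENS,
        do_sample=False,
        cache_implementation="static",  # static KV cache
    )

    for batch_size in [1, 2, 4, 8, 16, 32, 64]:
        inputs = get_inputs(tokenizer, batch_size)
        latency = run_generate(model, inputs, generation_config)
        # Throughput = OutputTokens × BatchSize / TotalLatency
        throughput = OUTPUT_TOKENS * batch_size / latency
        print(f"batch_size={batch_size}: {throughput:.2f} tokens/s, latency {latency:.1f}s")
```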


---

## 6. Results

### 6.1 Normalized Throughput per vCPU
C4 (Xeon 6) shows **1.4×–1.7×** throughput/vCPU over C3.

Formula:

\[
\text{Normalized Throughput} = \frac{\text{Throughput}}{\text{vCPU count}}
\]

Normalized comparison (C4 relative to C3):

\[
\frac{\text{Throughput}_\mathrm{C4} / \text{vCPUs}_\mathrm{C4}}
     {\text{Throughput}_\mathrm{C3} / \text{vCPUs}_\mathrm{C3}}
\]

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-39.png)
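
For concreteness, a tiny helper computes that ratio; the throughput arguments below are placeholders to be replaced with measured values, while the vCPU counts come from the hardware table above.

```python
def per_vcpu_gain(throughput_c4, throughput_c3, vcpus_c4=144, vcpus_c3=176):
    """Per-vCPU throughput of C4 (Xeon 6) relative to C3 (4th Gen Xeon)."""
    return (throughput_c4 / vcpus_c4) / (throughput_c3 / vcpus_c3)

# Placeholder throughputs in total generated tokens/s -- substitute benchmark results.
print(f"{per_vcpu_gain(throughput_c4=120.0, throughput_c3=100.0):.2f}x")
```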

---

### 6.2 Cost & TCO Impact
At batch size 64:
- **C4 delivers 1.7× per-vCPU throughput** vs. C3
- Equal per-vCPU pricing ⇒ **1.7× TCO advantage** for C4

**Cost equation:**

\[
\frac{\text{TCO}_\mathrm{C3}}{\text{TCO}_\mathrm{C4}} \approx 1.7
\]

![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-33.png)

---

## 7. Conclusion

**Google Cloud C4 VMs + Intel® Xeon® 6 (GNR)**:
- Higher throughput
- Lower latency
- Better cost efficiency for large-scale MoE inference

With Intel & Hugging Face optimizations, GPT OSS MoE models can achieve **high efficiency on modern general-purpose CPUs**.

Creators and enterprises can integrate this compute efficiency into **global AI content workflows** via [AiToEarn](https://aitoearn.ai)—an open-source monetization platform enabling:
- AI content generation
- Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
- Analytics and [AI model ranking](https://rank.aitoearn.ai)

This bridges **high-performance AI inference** with **scalable monetization** opportunities.
