Cloudflare Large-Scale Data Measurement: Insights from Interns

# Cloudflare’s 2026 Intern Program and Insights from Large-Scale Data

Cloudflare has announced an ambitious plan to hire [**1,111 interns**](https://blog.cloudflare.com/cloudflare-1111-intern-program/) in 2026 — roughly **25% of its full-time workforce**. This creates:

- Countless opportunities to **design**, **build**, and **ship production code**.
- Rare chances to **measure** aspects of the Internet that are typically hard to observe and even harder to understand.

While Cloudflare’s immense [data resources](https://radar.cloudflare.com/) are valuable, **measurement is never easy** — even here. Big datasets mean **more noise to sift through** and require careful elimination of alternative explanations.

In 2022, **Ram Sundara Raman** joined Cloudflare as a PhD student intern. Now an Assistant Professor at the University of California, Santa Cruz, he returns to share his experience working with data at Cloudflare scale.

---

## For Prospective Interns

When applying for data and measurement projects, ask yourself:  
> **“If, how, or why would my idea matter to Cloudflare?”**

Cloudflare welcomes ideas that connect **research with real-world customer impact**.

---

## Leveraging Tools for Research Dissemination

Projects often benefit from platforms that extend reach beyond a single channel.  
[**AiToEarn**](https://aitoearn.ai/) is one such open-source ecosystem for:

- **AI-driven content generation**
- **Cross-platform publishing**
- **Analytics & AI model rankings** ([View rankings](https://rank.aitoearn.ai))

It supports platforms including Douyin, Kwai, WeChat, Bilibili, Rednote (Xiaohongshu), Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter).

---

# Insights from Large-Scale Data: A Small Miracle

### Background

Before his Cloudflare internship in 2022, Ram worked on **network security and privacy** at the University of Michigan, focusing on **active measurements** like:

- Detection of [HTTPS interception](https://dl.acm.org/doi/10.1145/3419394.3423665)  
- Identification of [connection tampering](https://dl.acm.org/doi/10.1145/3372297.3417883)

These attacks, often executed by **network middleboxes**, can undermine security and block regional access to services. One example is the HTTPS interception man-in-the-middle attack in Kazakhstan in 2019.

### Challenges in Detection

Detecting this kind of interference is difficult for several reasons:

- Varied **geographic and temporal patterns**
- No technical means to **notify affected users**
- Lack of transparency from third parties

Large-scale, real-world datasets are essential for addressing these challenges, but access to such data is rare.

---

## Existing Work: Censored Planet

Ram helped develop [**Censored Planet**](https://censoredplanet.org/) — an active censorship measurement observatory across **200+ countries**.

Limitations:

- Measures only the **2,000 most popular websites**
- Constrained by **time, resources, and visibility**

---

## Why Passive Data Is Harder Than You Think

**Key Finding:** Even with Cloudflare’s massive data, detecting middlebox interference **at scale is extremely challenging** ([Research paper](https://research.cloudflare.com/publications/SundaraRaman2023/), [SIGCOMM’23](https://www.sigcomm.org/)).

### Active vs Passive Measurement

Active Probing:
- Tailored measurement requests
- Precise targeting
- Easier control of variables

Passive Observation:
- Uses existing traffic data flowing to Cloudflare
- No control over variables or ground truth
- Must rely on **sampling, accurate extraction, and interpretation**

---

## Core Constraints Faced in the Internship

1. **Only naturally occurring incoming traffic**, with no external datasets or custom probes.
2. **No ability to choose measurement points**.
3. A dataset spread across **millions of users and varied connection paths**.
4. **Noisy data** and **sampling biases** that had to be handled.

---

## Traps & Tripwires in Passive Data Analysis

### 1. Scale
- Cloudflare handles about 45M HTTP requests/second across 285 data centers.
- Network Error Logging (NEL) data was mostly excluded due to bias.
- [**iptables rules**](https://blog.cloudflare.com/tcp-resets-timeouts/#first-sample-connections) were used to sample 1 in 10,000 connections.
- Only the first 10 inbound packets of each sampled connection were logged.
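
The sampling scheme above can be approximated in a few lines. This is a minimal Python sketch, not Cloudflare's implementation (which, per the linked post, relies on iptables rules). The names `is_sampled` and `PacketLogger`, and the choice of hash-based selection, are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

SAMPLE_RATE = 10_000  # keep roughly 1 in 10,000 connections
MAX_PACKETS = 10      # log only the first 10 inbound packets


def is_sampled(conn_key: tuple, rate: int = SAMPLE_RATE) -> bool:
    """Deterministically decide whether a connection belongs to the sample.

    Hashing the connection key (hypothetically a 4-tuple of addresses and
    ports) means every packet of a connection gets the same decision.
    """
    digest = hashlib.sha256(repr(conn_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


class PacketLogger:
    """Keeps the first `max_packets` inbound packets of sampled connections."""

    def __init__(self, rate: int = SAMPLE_RATE, max_packets: int = MAX_PACKETS):
        self.rate = rate
        self.max_packets = max_packets
        self.logs = defaultdict(list)

    def observe(self, conn_key: tuple, packet) -> bool:
        """Record one inbound packet; return True if it was stored."""
        if not is_sampled(conn_key, self.rate):
            return False
        log = self.logs[conn_key]
        if len(log) >= self.max_packets:
            return False  # connection already has its first N packets
        log.append(packet)
        return True
```

Sampling whole connections rather than individual packets matters here: a signature is a sequence of packets, so a per-packet sample would rarely capture a complete sequence.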

### 2. Noisy Data
Sources of misinterpretation:
- Millisecond timestamp resolution issues
- Denial-of-service traffic mimicking interference
- Protocol quirks like [**Happy Eyeballs**](https://datatracker.ietf.org/doc/html/rfc6555)

**Solution:** Iteratively refine the tampering signatures and corroborate each match with independent evidence (e.g., inconsistent IP TTL fields).
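
To make the corroboration idea concrete, here is a minimal Python sketch. It assumes a signature is the sequence of TCP flags seen before a connection ends abruptly, and that a reset injected by a middlebox tends to arrive with a different IP TTL than the client's own packets. The signature names, the `Packet` type, and the tolerance value are hypothetical; they are not the signatures from the research.

```python
from dataclasses import dataclass
from statistics import mode
from typing import List, Optional, Tuple


@dataclass
class Packet:
    flags: str  # simplified TCP flag label, e.g. "SYN", "ACK", "PSH", "RST"
    ttl: int    # IP TTL as observed at the server


# Hypothetical signatures: inbound flag sequences ending in a reset.
SIGNATURES = {
    ("SYN", "RST"): "syn_then_rst",
    ("SYN", "ACK", "RST"): "ack_then_rst",
    ("SYN", "ACK", "PSH", "RST"): "psh_then_rst",
}


def match_signature(packets: List[Packet]) -> Optional[str]:
    """Return the signature name for this flag sequence, if any."""
    return SIGNATURES.get(tuple(p.flags for p in packets))


def ttl_inconsistent(packets: List[Packet], tolerance: int = 1) -> bool:
    """True if any RST's TTL differs from the most common non-RST TTL."""
    baseline = [p.ttl for p in packets if p.flags != "RST"]
    resets = [p.ttl for p in packets if p.flags == "RST"]
    if not baseline or not resets:
        return False
    reference = mode(baseline)
    return any(abs(ttl - reference) > tolerance for ttl in resets)


def classify(packets: List[Packet]) -> Tuple[Optional[str], bool]:
    """Return (signature name, corroborated by TTL inconsistency)."""
    signature = match_signature(packets)
    if signature is None:
        return None, False
    return signature, ttl_inconsistent(packets)
```

A match without TTL corroboration is not discarded in this sketch; it is only reported as weaker evidence, since a client can legitimately reset its own connection.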

### 3. Lack of Ground Truth
- No active experiments to confirm anomalies.
- Relied on prior censorship research signals ([censorbib.nymity.ch](https://censorbib.nymity.ch/)).

---

## Understanding the Limits

Even a provider as large as Cloudflare faces limits:
- It can identify affected connections, but **not the source of tampering**.
- It can sometimes, but not always, detect which domains are blocked.
- It sees only the activity that is actually affected, not what *could* be affected.

**Conclusion:** **Global view ≠ Easy observation.** Massive data still requires domain expertise and careful interpretation.

---

## Research Outcomes from the Internship

- Created **19 tampering signatures**
- Identified patterns across **hundreds of networks**
- Tracked spikes during events — e.g., protests in Iran (late 2022)  

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-130.png)  
*Figure 1: Increase in match rates for 19 tampering signatures.*
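
A match rate of the kind plotted in Figure 1 is, at its simplest, the fraction of sampled connections in a time bucket that match a signature. The Python sketch below shows that ratio and a naive spike check against the median. How the published figures were actually computed is not described here, so the bucket granularity, the `factor` threshold, and the function names are all assumptions.

```python
from collections import Counter
from statistics import median
from typing import Dict, Iterable, List, Tuple


def match_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the per-day fraction of sampled connections that matched.

    `records` yields (day, matched) pairs, one per sampled connection.
    """
    total, matched = Counter(), Counter()
    for day, is_match in records:
        total[day] += 1
        matched[day] += int(is_match)
    return {day: matched[day] / total[day] for day in total}


def spikes(rates: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return days whose match rate exceeds `factor` times the median rate."""
    baseline = median(rates.values())
    return sorted(day for day, rate in rates.items() if rate > factor * baseline)
```

A median baseline is used because a short-lived event, such as the late-2022 spike mentioned above, shifts the mean far more than it shifts the median.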

**Live results:** [**Cloudflare Radar**](https://radar.cloudflare.com/security/network-layer#tcp-resets-and-timeouts)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-120.png)  
*Figure 2: Data shared on Cloudflare Radar.*

---

## Looking Ahead

**Proposed approach:** Combine **passive & active probing** for a fuller picture of tampering.

Ongoing efforts:  
- [UCSC RANDLab](https://randlab.engineering.ucsc.edu/)  
- [Censored Planet](https://censoredplanet.org/)

---

## Internship Opportunities

Those interested in projects like this can [**apply here**](https://www.cloudflare.com/en-gb/careers/jobs/?department=Early+Talent).

---

