How Protective ReRoute (PRR) Improves Network Reliability

Honghao Wang

15 Nov 2025

Cloud Infrastructure Reliability and Protective ReRoute (PRR)

Cloud infrastructure reliability is fundamental, yet even the most advanced global networks can encounter a critical challenge: slow or failed recovery from routing outages.

In massive, planet-scale networks like Google’s, router failures or complex hidden conditions can prevent traditional routing protocols from restoring service quickly, or sometimes at all. These short but costly outages, caused by slow convergence or convergence failure, severely impact:

Real-time applications with low tolerance for packet loss
Large-scale AI/ML training workloads, where even brief network disruption can waste millions in compute resources

---

Google’s Paradigm Shift: Protective ReRoute

To address this, Google developed Protective ReRoute (PRR) — a host-based mechanism that moves rapid failure recovery from the centralized network core to distributed endpoints.

Key stats:

Deployed in production for 5+ years
Recovers from up to 84%¹ of inter–data center outages caused by slow convergence

Google Cloud customers with workloads highly sensitive to packet loss can also enable PRR.

---

Why Traditional In-Network Recovery Falls Short

Routing protocols detect and fix network failures by recalculating affected paths (reconvergence).

At Google’s scale:

Reconvergence can take seconds to minutes
A larger network topology increases the probability of complex failure scenarios
For distributed AI training, even seconds of packet loss can cause job failure

---

PRR: A Host-Based Solution

Core idea: Endpoints detect failures and redirect traffic to healthy paths within a small multiple of the round-trip time (RTT), without waiting for the network to reconverge.

How it works:

The host detects packet loss or high latency on the current route.
The host modifies packet headers to signal the network to use an alternate, pre-established path.
Recovery completes far faster than global reconvergence.
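
The steps above can be sketched from the host's point of view. This is a minimal Python model, not the Linux implementation: the `Connection` class, its `timeout_threshold` parameter, and the method names are hypothetical, and a timeout event stands in for whatever loss signal the real host stack uses.

```python
import random

FLOW_LABEL_BITS = 20  # the IPv6 flow label is a 20-bit header field


class Connection:
    """Toy model of one connection that repaths after repeated timeouts."""

    def __init__(self, rng, timeout_threshold=1):
        self.rng = rng
        self.timeout_threshold = timeout_threshold  # hypothetical tuning knob
        self.flow_label = rng.getrandbits(FLOW_LABEL_BITS)
        self.consecutive_timeouts = 0
        self.repaths = 0

    def on_ack(self):
        # Forward progress: the current path looks healthy.
        self.consecutive_timeouts = 0

    def on_timeout(self):
        # Step 1: the host observes a loss signal on the current route.
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= self.timeout_threshold:
            self._repath()

    def _repath(self):
        # Step 2: pick a different label so the network can choose another path.
        old = self.flow_label
        while self.flow_label == old:
            self.flow_label = self.rng.getrandbits(FLOW_LABEL_BITS)
        self.consecutive_timeouts = 0
        self.repaths += 1
```

With `timeout_threshold=1`, a single call to `on_timeout()` changes `flow_label`. Whether the new label actually reaches a healthy path depends on how the network maps labels to paths, which this sketch does not model.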

---

Reliability Model Shift

Traditional model:

System reliability drops as network diameter (number of stages) increases
Slow reconvergence in serial stages degrades stability

PRR model:

Treats network as parallel system of paths functioning as a single stage
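Reliability grows exponentially with available paths: if paths fail independently, the chance that all of them are down at once shrinks exponentially as paths are added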
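
The contrast between the two models can be illustrated with textbook series/parallel availability arithmetic. The sketch below assumes stages and paths fail independently, and the 0.99 figures are made-up illustrative values, not measurements from Google's network.

```python
import math


def serial_availability(stage_availabilities):
    """Every stage must work, so per-stage availabilities multiply."""
    return math.prod(stage_availabilities)


def parallel_availability(path_availability, num_paths):
    """Service survives if at least one of num_paths independent paths works."""
    return 1.0 - (1.0 - path_availability) ** num_paths


if __name__ == "__main__":
    # Five serial stages, each 99% available: worse than any single stage.
    print(serial_availability([0.99] * 5))
    # Three parallel paths, each 99% available: failure needs all three down.
    print(parallel_availability(0.99, 3))
```

Under these assumptions, adding serial stages can only lower availability, while each additional parallel path multiplies the probability of total failure by the per-path failure rate.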

---

PRR Functional Components

End-to-end failure detection: hosts continuously monitor path health. On Linux, PRR uses the TCP retransmission timeout (RTO) and detects failure in multiples of the RTT.
Packet-header modification at the host: upon detection, the host modifies the IPv6 flow label (Linux kernel 4.20+). Google's SDN layer protects IPv4 traffic and non-Linux hosts via overlay headers.
PRR-aware forwarding: routers recognize the header change and switch to an alternate path.
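
To show why a changed flow label can move traffic onto a different path, here is a toy model of hash-based multipath forwarding. It is an assumption for illustration, not Google's forwarding logic: `pick_path` and `first_healthy_label` are hypothetical helpers, SHA-256 stands in for a router's hardware hash, and labels are tried sequentially (rather than randomly) only to keep the example deterministic.

```python
import hashlib

FLOW_LABEL_SPACE = 1 << 20  # 20-bit IPv6 flow label


def pick_path(src, dst, flow_label, num_paths):
    """Map a flow's header fields to one of num_paths equal-cost paths."""
    key = f"{src}|{dst}|{flow_label}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def first_healthy_label(src, dst, start_label, num_paths, failed_paths,
                        max_tries=64):
    """Try successive flow labels until the hash lands on a healthy path.

    Returns (label, attempts_before_success), or (None, max_tries) if no
    healthy path was found within max_tries labels.
    """
    for attempt in range(max_tries):
        label = (start_label + attempt) % FLOW_LABEL_SPACE
        if pick_path(src, dst, label, num_paths) not in failed_paths:
            return label, attempt
    return None, max_tries
```

Because the path is a function of the header fields, the host never needs to know the topology: it only changes the label and observes whether the connection recovers.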

---

Proof of Impact

Reduced network downtime due to slow convergence by up to 84%
Host-initiated recovery finishes in single-digit RTT multiples — much faster than network reconvergence

---

Key Use Cases for Ultra-Reliable Networking

AI/ML training and inference: large distributed workloads benefit from uninterrupted data transfer.
Data integrity and storage: minimizes the risk of corruption or loss from packet drops.
Real-time applications: online gaming, conferencing, and voice cannot tolerate delay.
Frequent short-lived connections: reduces disconnect risk in fast, repeated network calls.

---

Activating Protective ReRoute in Google Cloud

The host-side PRR mechanism is open source and has been integrated into the Linux kernel since version 4.20. In Google Cloud, PRR is available in two modes:

Hypervisor mode: automatic protection without OS changes; recovery in single-digit seconds for moderate fan-out traffic.
Guest mode: manual enablement for maximum speed; recommended for critical AI/ML and latency-sensitive apps.

---

Steps to Enable Guest Mode PRR

See the Google Cloud documentation for the full procedure.

Prerequisites:

VM runs Linux kernel 4.20+
Application uses TCP
IPv6 traffic preferred; IPv4 needs gVNIC driver
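
The kernel-version prerequisite above can be checked from inside a VM with a few lines of Python. This is only a convenience sketch: `parse_kernel_release`, `kernel_meets_minimum`, and `current_kernel_ok` are hypothetical helper names, the check covers the version requirement alone, and it does not enable PRR or verify the TCP, IPv6, or gVNIC conditions.

```python
import platform
import re

MIN_KERNEL = (4, 20)  # minimum version named in the prerequisites


def parse_kernel_release(release):
    """Extract (major, minor) from a release string like '5.15.0-1052-gcp'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_meets_minimum(release, minimum=MIN_KERNEL):
    """True if the release string is at or above the minimum version."""
    return parse_kernel_release(release) >= minimum


def current_kernel_ok():
    """Check the running kernel; only meaningful on a Linux guest."""
    return kernel_meets_minimum(platform.release())
```

Tuple comparison handles the ordering correctly, so `4.19` is rejected while `4.20`, `5.x`, and `6.x` releases pass.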

---

Recommended Actions

Cloud customers: Enable guest-mode PRR for packet-loss-sensitive workloads.
Network architects: Design with path diversity, shift from serial to parallel reliability.
Open-source community: Advocate and contribute host-level reliability features across OSes.

---

¹ Research source
