How Protective ReRoute (PRR) Improves Network Reliability
Cloud Infrastructure Reliability and Protective ReRoute (PRR)
Cloud infrastructure reliability is fundamental, yet even the most advanced global networks can encounter a critical challenge: slow or failed recovery from routing outages.
In massive, planet-scale networks like Google’s, router failures or complex hidden conditions can prevent traditional routing protocols from restoring service quickly, or sometimes at all. The resulting outages, caused by slow convergence or convergence failure, are short but costly and severely impact:
- Real-time applications with low tolerance for packet loss
- Large-scale AI/ML training workloads, where even brief network disruption can waste millions in compute resources
---
Google’s Paradigm Shift: Protective ReRoute
To address this, Google developed Protective ReRoute (PRR) — a host-based mechanism that moves rapid failure recovery from the centralized network core to distributed endpoints.
Key stats:
- Deployed in production for 5+ years
- Recovers from up to 84%¹ of inter–data center outages caused by slow convergence
Google Cloud customers with workloads highly sensitive to packet loss can also enable PRR.
---
Why Traditional In-Network Recovery Falls Short
Routing protocols detect and fix network failures by recalculating affected paths (reconvergence).
At Google’s scale:
- Reconvergence can take seconds to minutes
- Large network topology increases probability of complex failure scenarios
- For distributed AI training, even seconds of packet loss can cause job failure
---
PRR: A Host-Based Solution
Core idea: Endpoints detect failures and redirect traffic to healthy paths within a few round-trip times, without waiting for the network to reconverge.
How it works:
- Host detects packet loss or latency on current route.
- Host tweaks packet headers to signal network to use alternate pre-established path.
- Recovery occurs far faster than global reconvergence.
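
The host-side loop described above can be sketched as a small simulation. This is a minimal illustration, not Google's or the Linux kernel's implementation: `send_with_repath`, the `path_is_healthy` callback, and the 0.2-second starting timeout are all hypothetical names and values chosen for the example. It assumes failure is detected by a retransmission timeout and that the host reacts by drawing a fresh random flow label.

```python
from __future__ import annotations

import random
from typing import Callable

FLOW_LABEL_BITS = 20  # the IPv6 flow label is a 20-bit header field


def send_with_repath(
    path_is_healthy: Callable[[int], bool],
    initial_rto: float = 0.2,   # assumed starting timeout, in seconds
    max_attempts: int = 8,
    rng: random.Random | None = None,
) -> tuple[int, float, int]:
    """Simulate host-side repathing.

    Returns (flow_label, seconds_waited, repaths) once a label maps to a
    healthy path. Raises TimeoutError if every attempted label fails.
    """
    rng = rng or random.Random()
    flow_label = rng.getrandbits(FLOW_LABEL_BITS)
    rto = initial_rto
    waited = 0.0
    for repaths in range(max_attempts):
        if path_is_healthy(flow_label):
            return flow_label, waited, repaths
        # No acknowledgement arrived, so the retransmission timer fires.
        waited += rto
        rto *= 2  # exponential backoff on repeated timeouts
        # Draw a new label so the network can hash the flow onto another path.
        flow_label = rng.getrandbits(FLOW_LABEL_BITS)
    raise TimeoutError("no healthy path found within max_attempts")
```

In this model, recovery time depends only on how many labels the host must try before one lands on a working path, which is why it can finish long before the network itself reconverges.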
---
Reliability Model Shift
Traditional model:
- System reliability drops as network diameter (number of stages) increases
- Slow reconvergence in serial stages degrades stability
PRR model:
- Treats network as parallel system of paths functioning as a single stage
- The chance that every path fails at once shrinks exponentially with the number of independent paths, so reliability rises quickly with path diversity
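
To make the serial-versus-parallel contrast concrete, here is a small numeric sketch using the standard textbook formulas for series and parallel systems. It assumes stages and paths fail independently, which real networks only approximate, and the function names and availability figures are illustrative rather than taken from Google's measurements.

```python
def serial_availability(stage_availability: float, stages: int) -> float:
    """Every stage must work, so availabilities multiply and shrink per stage."""
    return stage_availability ** stages


def parallel_availability(path_availability: float, paths: int) -> float:
    """Any one path suffices, so the system fails only if all paths fail."""
    return 1.0 - (1.0 - path_availability) ** paths
```

For example, ten serial stages at 0.999 each give `serial_availability(0.999, 10)` of roughly 0.990, lower than any single stage. Three independent paths at 0.99 each give `parallel_availability(0.99, 3)` of roughly 0.999999, higher than any single path.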

---
PRR Functional Components
- End-to-end failure detection
  - Continuous host monitoring of path health
  - On Linux: uses TCP retransmission timeout (RTO)
  - Detects failure in multiples of RTT
- Packet-header modification at the host
  - Upon detection, modifies IPv6 flow label (Linux kernel 4.20+)
  - Google SDN protects IPv4 and non-Linux via overlay headers
- PRR-aware forwarding
  - Routers recognize header change and switch to alternate path
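
The forwarding side can be pictured as multipath hashing that includes the flow label in its input. The sketch below is an assumption-laden illustration: `pick_next_hop` is a hypothetical function, the addresses and hop names are made up, and SHA-256 stands in for whatever hash real routers compute in hardware.

```python
from __future__ import annotations

import hashlib


def pick_next_hop(
    src: str,
    dst: str,
    sport: int,
    dport: int,
    flow_label: int,
    next_hops: list[str],
) -> str:
    """Choose a next hop by hashing the flow identity plus the flow label."""
    key = f"{src}|{dst}|{sport}|{dport}|{flow_label}".encode()
    digest = hashlib.sha256(key).digest()
    # Same inputs always give the same index, so a flow stays on one path
    # until the host changes its flow label.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]
```

Because the label is part of the hash input, a connection keeps its addresses and ports while a new label can still move it to a different next hop. A new label is not guaranteed to pick a different hop, which is why a host may need more than one attempt.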
---
Proof of Impact
- Reduced network downtime due to slow convergence by up to 84%
- Host-initiated recovery finishes in single-digit RTT multiples — much faster than network reconvergence
---
Key Use Cases for Ultra-Reliable Networking
- AI/ML training and inference
  - Large distributed workloads benefit from uninterrupted data transfer.
- Data integrity and storage
  - Minimizes risk of corruption or loss from packet drops.
- Real-time applications
  - Online gaming, conferencing, voice cannot tolerate delay.
- Frequent short-lived connections
  - Reduces disconnect risk in fast, repeated network calls.
---
Activating Protective ReRoute in Google Cloud
PRR is open-source, integrated into Linux kernel 4.20+, and available in two modes:
- Hypervisor mode
  - Automatic protection without OS changes
  - Recovery in single-digit seconds for moderate fan-out traffic
- Guest mode
  - Manual enablement for maximum speed
  - Recommended for critical AI/ML and latency-sensitive apps
---
Steps to Enable Guest Mode PRR
See the Google Cloud documentation for the full procedure.
Prerequisites:
- VM runs Linux kernel 4.20+
- Application uses TCP
- IPv6 traffic preferred; IPv4 needs gVNIC driver
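
The kernel-version prerequisite can be checked with a few lines of Python. This is a sketch that covers only the "4.20+" requirement listed above; the helper name `kernel_meets_minimum` is hypothetical, and passing the check does not show that PRR is enabled or that the other prerequisites hold.

```python
import re

MIN_KERNEL = (4, 20)  # minimum version stated in the prerequisites


def kernel_meets_minimum(release: str, minimum: tuple = MIN_KERNEL) -> bool:
    """Return True if a release string such as '5.15.0-91-generic' is new enough."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Compare as integer tuples so that 4.9 sorts below 4.20.
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

On a running VM, the release string can be obtained from `platform.release()` in the standard library and passed to this function. Comparing integer tuples instead of strings avoids treating "4.9" as newer than "4.20".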
---
Recommended Actions
- Cloud customers: Enable guest-mode PRR for packet-loss-sensitive workloads.
- Network architects: Design with path diversity, shift from serial to parallel reliability.
- Open-source community: Advocate and contribute host-level reliability features across OSes.
---