How Protective ReRoute (PRR) Improves Network Reliability

Cloud Infrastructure Reliability and Protective ReRoute (PRR)

Cloud infrastructure reliability is fundamental, yet even the most advanced global networks can encounter a critical challenge: slow or failed recovery from routing outages.

In massive, planet-scale networks like Google’s, router failures or complex hidden conditions can prevent traditional routing protocols from restoring service quickly, or sometimes at all. These short but costly outages, known as slow convergence or convergence failure, severely impact:

  • Real-time applications with low tolerance for packet loss
  • Large-scale AI/ML training workloads, where even brief network disruption can waste millions in compute resources

---

Google’s Paradigm Shift: Protective ReRoute

To address this, Google developed Protective ReRoute (PRR) — a host-based mechanism that moves rapid failure recovery from the centralized network core to distributed endpoints.

Key stats:

  • Deployed in production for 5+ years
  • Recovers from up to 84%¹ of inter–data center outages caused by slow convergence

Google Cloud customers with workloads highly sensitive to packet loss can also enable PRR.

---

Why Traditional In-Network Recovery Falls Short

Routing protocols detect and fix network failures by recalculating affected paths (reconvergence).

At Google’s scale:

  • Reconvergence can take seconds to minutes
  • Large network topology increases probability of complex failure scenarios
  • For distributed AI training, even seconds of packet loss can cause job failure

---

PRR: A Host-Based Solution

Core idea: Endpoints detect failures and redirect traffic to healthy paths within a few round-trip times (RTTs), without waiting for the network to reconverge.

How it works:

  • The host detects packet loss or high latency on the current route.
  • The host modifies packet headers to signal the network to use an alternate, pre-established path.
  • Recovery completes far faster than global reconvergence.
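
The steps above can be sketched as a small simulation. This is an illustrative model, not Google's implementation: `ecmp_path` is a hypothetical stand-in for a router's hash-based path selection, the path count and the set of failed paths are made-up inputs, and a real host would change the header in the kernel after detecting loss rather than in application code.

```python
import random
import zlib


def ecmp_path(flow_label: int, num_paths: int) -> int:
    """Hypothetical router hash: map a 20-bit flow label to a path index."""
    digest = zlib.crc32(flow_label.to_bytes(3, "big"))
    return digest % num_paths


def repath_until_healthy(failed_paths, num_paths, rng, max_attempts=64):
    """Return (number of repaths, final flow label) once a healthy path is hit.

    Each loop iteration models one loss-detection event on the host:
    if the current path is failed, the host draws a new random flow label,
    which makes the (simulated) network hash the flow onto another path.
    """
    flow_label = rng.getrandbits(20)
    for repaths in range(max_attempts):
        if ecmp_path(flow_label, num_paths) not in failed_paths:
            return repaths, flow_label
        flow_label = rng.getrandbits(20)  # signal the network to pick again
    raise RuntimeError("no healthy path found within max_attempts")


rng = random.Random(0)
repaths, label = repath_until_healthy({0, 1}, num_paths=8, rng=rng)
print(f"recovered after {repaths} repath(s); now on path {ecmp_path(label, 8)}")
```

The host never learns which path failed or why. It only needs to notice that traffic is not getting through and ask the network for a different path.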

---

Reliability Model Shift

Traditional model:

  • System reliability drops as network diameter (number of stages) increases
  • Slow reconvergence in serial stages degrades stability

PRR model:

  • Treats the network as a parallel system of paths that function as a single stage
  • Reliability improves exponentially with the number of available paths, because the chance that every path fails at once shrinks exponentially
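
The contrast between the two models can be made concrete with a short calculation. The sketch below assumes independent failures and identical availability for every stage or path, which are simplifications, and the function names and numbers are illustrative only.

```python
def serial_availability(stage_availability: float, stages: int) -> float:
    """Serial model: traffic gets through only if every stage works."""
    return stage_availability ** stages


def parallel_availability(path_availability: float, paths: int) -> float:
    """Parallel model: traffic gets through if at least one path works."""
    return 1.0 - (1.0 - path_availability) ** paths


# Serial availability falls as stages are added.
for stages in (1, 3, 5):
    print("serial", stages, round(serial_availability(0.99, stages), 4))

# Parallel availability rises as paths are added.
for paths in (1, 2, 3):
    print("parallel", paths, round(parallel_availability(0.9, paths), 4))
```

Under these assumptions, five serial stages at 0.99 each yield about 0.951, while three parallel paths at only 0.9 each yield 0.999.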

---

PRR Functional Components

  • End-to-end failure detection
      • Continuous host monitoring of path health
      • On Linux: uses TCP retransmission timeout (RTO)
      • Detects failure in multiples of RTT
  • Packet-header modification at the host
      • Upon detection, modifies IPv6 flow label (Linux kernel 4.20+)
      • Google SDN protects IPv4 and non-Linux via overlay headers
  • PRR-aware forwarding
      • Routers recognize header change and switch to alternate path
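
A rough feel for RTO-driven detection and recovery can be had from two small formulas. This is a sketch under stated assumptions, not a measurement: it assumes the standard TCP behavior of doubling the RTO after each timeout, assumes one repath per timeout, and assumes each repath lands on a uniformly random path independent of the previous one. The function names and input values are hypothetical.

```python
def time_of_nth_timeout(initial_rto_s: float, n: int) -> float:
    """Elapsed seconds when the n-th consecutive RTO fires with doubling backoff.

    The timeouts last RTO, 2*RTO, 4*RTO, ... so the sum is RTO * (2**n - 1).
    """
    return initial_rto_s * (2 ** n - 1)


def prob_still_on_failed_path(failed_fraction: float, repaths: int) -> float:
    """Chance that every one of `repaths` independent draws hit a failed path."""
    return failed_fraction ** repaths


rto = 0.010  # hypothetical 10 ms initial RTO
for n in range(1, 5):
    t_ms = time_of_nth_timeout(rto, n) * 1000
    p = prob_still_on_failed_path(0.5, n)
    print(f"after timeout {n}: {t_ms:.0f} ms elapsed, P(still failed) = {p:.4f}")
```

With these inputs the third timeout fires at about 70 ms, and if half of all paths were failed, the chance of still being on a failed path after four repaths would be 0.5^4 = 6.25%.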

---

Proof of Impact

  • Reduced network downtime due to slow convergence by up to 84%
  • Host-initiated recovery finishes in single-digit RTT multiples — much faster than network reconvergence

---

Key Use Cases for Ultra-Reliable Networking

  • AI/ML training and inference
      • Large distributed workloads benefit from uninterrupted data transfer.
  • Data integrity and storage
      • Minimizes risk of corruption or loss from packet drops.
  • Real-time applications
      • Online gaming, conferencing, voice cannot tolerate delay.
  • Frequent short-lived connections
      • Reduces disconnect risk in fast, repeated network calls.

---

Activating Protective ReRoute in Google Cloud

PRR is open-source, integrated into Linux kernel 4.20+, and available in two modes:

  • Hypervisor mode
      • Automatic protection without OS changes
      • Recovery in single-digit seconds for moderate fan-out traffic
  • Guest mode
      • Manual enablement for maximum speed
      • Recommended for critical AI/ML and latency-sensitive apps

---

Steps to Enable Guest Mode PRR

See full documentation here.

Prerequisites:

  • VM runs Linux kernel 4.20+
  • Application uses TCP
  • IPv6 traffic preferred; IPv4 needs gVNIC driver
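
The kernel prerequisite above can be checked with a short script before changing anything. This is a minimal sketch, not the official enablement procedure: the helper names are hypothetical, and while `net.ipv6.auto_flowlabels` is a standard Linux sysctl that controls automatic IPv6 flow labels, whether guest mode requires a particular value for it is an assumption to confirm against the full documentation. The script only reports the value and changes nothing.

```python
import platform
import re
from pathlib import Path

MIN_KERNEL = (4, 20)  # minimum version stated in the prerequisites


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '5.15.0-1048-gcp'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_meets_minimum(release: str) -> bool:
    """True if the release string is at or above MIN_KERNEL."""
    return parse_kernel_release(release) >= MIN_KERNEL


def read_auto_flowlabels(path: str = "/proc/sys/net/ipv6/auto_flowlabels"):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None
```

On a VM, calling `kernel_meets_minimum(platform.release())` reports whether the running kernel meets the stated minimum, and `read_auto_flowlabels()` shows the current flow-label setting for comparison with the documentation.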

---

  • Cloud customers: Enable guest-mode PRR for packet-loss-sensitive workloads.
  • Network architects: Design with path diversity, shift from serial to parallel reliability.
  • Open-source community: Advocate and contribute host-level reliability features across OSes.

---

¹ Research source
