Ultra-Long Sequence Parallelism: Ulysses + Ring-Attention Principles and Implementation

00 — Preface

Long-Sequence Training Challenges

Training on ultra-long sequences is a critical aspect of large model development. In real-world inference — especially within Agent pipelines — a model’s ability to generalize to long contexts directly affects its reliability.

However, long-sequence scenarios place heavy demands on training resources due to the O(N²) complexity of Attention. As sequence length grows, GPU memory usage skyrockets, making efficient training difficult on GPUs with limited memory.

---

Sequence Parallelism Overview

Sequence Parallelism (SP) makes it possible to train on long sequences by spreading them across multiple GPUs/nodes, so no single device has to hold the entire sequence in memory.

> Definition: Sequence parallelism splits an input sequence into smaller sub-sequences processed in parallel on different GPUs, reducing memory load.
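
Conceptually, the split itself is just a slice along the sequence dimension; the hard part is computing Attention across the shards. Below is a minimal sketch of the splitting step, assuming an already-initialized `torch.distributed` process group (the helper name is illustrative, not part of any library):

```python
import torch
import torch.distributed as dist

def shard_sequence(input_ids: torch.Tensor, dim: int = 1) -> torch.Tensor:
    """Give each rank one contiguous slice of the sequence dimension."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    # chunk() assumes the sequence length is divisible by world_size
    # (real implementations pad first; see the padding notes further below).
    return input_ids.chunk(world_size, dim=dim)[rank]

# With 8-way SP, a [batch, 65536] batch becomes [batch, 8192] per GPU.
```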

Common Approaches

  • Ulysses (DeepSpeed)
  • Ring-Attention (Flash-Attn derivative)
  • Megatron-SP/CP (Megatron-LM optimizations)

In this guide, we focus on Ulysses and Ring-Attention, which integrate well with the Transformers ecosystem.

---

Memory Reduction Example (Qwen2.5‑3B)

| SP Size | SP Strategy        | GPU Memory | Training Time |
|---------|--------------------|------------|---------------|
| 8       | ulysses=2 + ring=4 | 17.92 GiB  | 1:07:20       |
| 4       | ulysses=2 + ring=2 | 27.78 GiB  | 37:48         |
| 2       | ulysses=2          | 48.50 GiB  | 24:16         |
| 1       | No SP              | 75.35 GiB  | 19:41         |

Note: More splits reduce memory but increase communication overhead and training time.

---

01 — Ulysses

Core Idea

Ulysses shards the sequence across GPUs, then performs an all-to-all exchange of activations right before Attention so that each GPU holds the full sequence but only a subset of the Attention heads; after Attention, a second all-to-all restores the original sequence-sharded layout (a sketch of the exchange follows the key points below).


Key Points:

  • N = sequence length
  • d = hidden size (= number of heads × head dim)
  • Q/K/V cover the full sequence on every GPU, but each GPU only holds its assigned heads.
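
A minimal sketch of the head/sequence exchange, assuming `torch.distributed` is initialized and the shapes are divisible by the group size (the helper is illustrative, not the DeepSpeed API):

```python
import torch
import torch.distributed as dist

def seq_all_to_all(x: torch.Tensor, scatter_dim: int, gather_dim: int, group=None) -> torch.Tensor:
    """All-to-all that shards `scatter_dim` across ranks while gathering `gather_dim`."""
    world_size = dist.get_world_size(group)
    # Send one chunk of scatter_dim to every rank...
    inputs = [t.contiguous() for t in x.chunk(world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(inputs[0]) for _ in inputs]
    dist.all_to_all(outputs, inputs, group=group)
    # ...and stitch the received chunks back together along gather_dim.
    return torch.cat(outputs, dim=gather_dim)

# Before Attention: q has shape [batch, N/P, heads, head_dim] (sequence sharded)
#     q = seq_all_to_all(q, scatter_dim=2, gather_dim=1)   # -> [batch, N, heads/P, head_dim]
# After Attention: the reverse exchange restores the sequence-sharded layout
#     out = seq_all_to_all(out, scatter_dim=1, gather_dim=2)
```

Each rank then runs ordinary Attention over the full sequence for its subset of heads, which is why flash-attn and SDPA work unmodified.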

Advantages:

  • Works with GQA, MHA, `flash-attn`, SDPA, `padding_free`
  • Compatible with forward optimization techniques

Limitations:

  • The parallelism degree is capped by the number of Attention heads; with GQA, the small number of KV heads limits how far it scales.

---

02 — Ring-Attention

Concept Origins

Ring-Attention builds on Flash-Attention, which tiles Q/K/V and computes the softmax blockwise to:

  • Maximize SRAM use
  • Reduce memory footprint
  • Process incrementally

Ring-Attention Idea:

Split the sequence into blocks across GPUs, compute Attention for the local block, then circulate K/V among the GPUs in a ring so that every Q block eventually attends to every K/V block (a sketch of the ring pass follows).

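A rough sketch of the ring pass, assuming a `torch.distributed` group and a `local_attn` callable that returns the partial output plus its log-sum-exp for one K/V block (names are illustrative; this is not the actual kernel):

```python
import torch
import torch.distributed as dist

def ring_pass_kv(q, k, v, local_attn):
    """Keep the local Q shard fixed, rotate K/V around the ring, and collect
    one partial (output, lse) pair per K/V block."""
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    partials = []
    for step in range(world):
        partials.append(local_attn(q, k, v))  # attend against the current K/V block
        if step == world - 1:
            break                             # last block: no need to pass K/V on
        # Rotate K/V to the next rank (real implementations double-buffer and
        # overlap this communication with compute).
        k_next, v_next = torch.empty_like(k), torch.empty_like(v)
        reqs = dist.batch_isend_irecv([
            dist.P2POp(dist.isend, k, send_to), dist.P2POp(dist.irecv, k_next, recv_from),
            dist.P2POp(dist.isend, v, send_to), dist.P2POp(dist.irecv, v_next, recv_from),
        ])
        for req in reqs:
            req.wait()
        k, v = k_next, v_next
    # The partial results are merged with the LSE trick described in the next section.
    return partials
```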

---

Numerical Stability in Softmax

  • Avoid direct exponentiation to prevent overflow/underflow
  • Use LogSumExp (LSE) together with PyTorch’s `softplus` for stable incremental updates (a minimal sketch of the merge follows)
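
A minimal sketch of that merge, combining two partial Attention results computed over disjoint K/V blocks (shapes are simplified; `lse` is assumed broadcastable against `out`):

```python
import torch
import torch.nn.functional as F

def merge_attention_blocks(out, lse, block_out, block_lse):
    """Fold a new partial result into the running one without ever
    materializing un-normalized exponentials."""
    # Stable log-sum-exp update: new_lse = lse + log(1 + exp(block_lse - lse))
    new_lse = lse + F.softplus(block_lse - lse)
    # Rescale both partial outputs to the common normalizer and add them.
    out = torch.exp(lse - new_lse) * out + torch.exp(block_lse - new_lse) * block_out
    return out, new_lse
```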

---

Zigzag Ring-Attention Optimization

Standard Ring-Attention has uneven load under causal masking: with a contiguous split, ranks holding later blocks must attend to far more K/V than ranks holding early blocks.

Zigzag slicing cuts the sequence into 2 × world_size chunks and pairs an early chunk with a late one on each rank (e.g., 0/7, 1/6, 2/5, 3/4) to balance computation.


This:

  • Balances the workload across ranks
  • Further reduces compute, since under the causal mask some block pairs only need part of the K/V or Q to be processed (see the sketch below).
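
An illustrative zigzag partition, assuming the sequence length is divisible by 2 × world_size (the helper name is hypothetical):

```python
import torch

def zigzag_split(x: torch.Tensor, rank: int, world_size: int, dim: int = 1) -> torch.Tensor:
    """Cut the sequence into 2*world_size chunks and give rank r chunks r and 2*world_size-1-r."""
    chunks = x.chunk(2 * world_size, dim=dim)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=dim)

# With world_size=4 the pairs are (0,7), (1,6), (2,5), (3,4): each rank owns one
# "cheap" early chunk and one "expensive" late chunk under the causal mask.
```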

---

03 — Combining Ulysses + Ring-Attention

Comparison:

  • Ulysses: low communication volume (all-to-all), but parallelism is capped by the head count and the all-to-all favors high-bandwidth (e.g., intra-node) interconnects
  • Ring-Attention: point-to-point ring communication with no head-count constraint, but higher total communication volume

Strategy:

  • Use Ulysses first (lower communication cost)
  • Add Ring-Attention when:
      • the required number of sequence splits exceeds what Ulysses alone supports, or
      • there are not enough Attention heads (e.g., GQA with few KV heads)

SWIFT Implementation

  • Works for pure text, multimodal, SFT, DPO, GRPO
  • Auto-detects split strategy
  • Supports odd GPU counts

Command:

--sequence_parallel_size N

Split Flow Example

Splitting into 4 parts (Ulysses world_size=2, Ring world_size=2) with 4 attention heads: after the Ulysses all-to-all each GPU holds 2 heads, and Ring-Attention then circulates K/V within each ring group of 2.

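As a rough sketch, a sequence-parallel rank can be thought of as a pair of coordinates, one in the Ulysses group and one in the Ring group (the exact grouping used by SWIFT may differ):

```python
ULYSSES_SIZE, RING_SIZE = 2, 2
SP_SIZE = ULYSSES_SIZE * RING_SIZE   # 4-way sequence parallelism
NUM_HEADS = 4

for sp_rank in range(SP_SIZE):
    ulysses_rank = sp_rank % ULYSSES_SIZE    # heads are exchanged inside this group
    ring_rank = sp_rank // ULYSSES_SIZE      # K/V blocks circulate inside this group
    heads_per_rank = NUM_HEADS // ULYSSES_SIZE
    print(f"sp_rank={sp_rank}: ulysses_rank={ulysses_rank}, "
          f"ring_rank={ring_rank}, heads={heads_per_rank}")
```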

---

Multimodal & `padding_free` Adaptations

  • The split is applied in a forward hook on the LLM backbone, after the ViT outputs have been merged into the text embeddings
  • This keeps a single code path for text-only and multimodal inputs
  • Each sequence is padded to a length divisible by the SP size before splitting (see the sketch after this list)
  • The padded positions receive special handling in Q/K/V during Attention
  • The loss computation is rewritten for GRPO/DPO, since those objectives require quantities gathered over the full sequence
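
A minimal sketch of the pre-split padding step (the pad token id, tensor layout, and helper name are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def pad_to_divisible(input_ids: torch.Tensor, sp_size: int, pad_id: int = 0) -> torch.Tensor:
    """Right-pad the (packed) sequence so its length is divisible by sp_size."""
    remainder = input_ids.shape[-1] % sp_size
    if remainder:
        input_ids = F.pad(input_ids, (0, sp_size - remainder), value=pad_id)
    return input_ids

# The padded positions then need the special Attention handling noted above.
```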

---

04 — GPU Memory Optimization Results

Example: 3B model, 8×A100 GPUs, roughly 9,000 tokens per sequence in `test.jsonl`:

NPROC_PER_NODE=8 \
swift sft \
    --model Qwen/Qwen2.5-3B-Instruct \
    --dataset 'test.jsonl' \
    --train_type lora \
    --torch_dtype bfloat16 \
    --per_device_train_batch_size 4 \
    --target_modules all-linear \
    --gradient_accumulation_steps 8 \
    --save_total_limit 2 \
    --save_only_model true \
    --save_steps 50 \
    --max_length 65536 \
    --warmup_ratio 0.05 \
    --attn_impl flash_attn \
    --sequence_parallel_size 8 \
    --logging_steps 1 \
    --use_logits_to_keep false \
    --padding_free true

With the 8-way split, per-GPU memory drops from roughly 80 GiB to under 20 GiB.

---

05 — Outlook

Future optimizations:

  • Faster backward pass without full recompute of `flash_attn_forward`
  • Reduce P2P communication volume and overlap communication with computation via asynchronous execution

---


For teams working with long-sequence models and multi-platform deployment, AiToEarn provides:

  • AI content generation
  • Simultaneous publishing to Douyin, Kwai, Bilibili, YouTube, X (Twitter), etc.
  • Analytics + AI model rankings
  • Open-source integrations: GitHub
  • Blog: AiToEarn blog

By combining GPU-efficient parallelism methods like Ulysses/Ring-Attn with multi-channel publishing, creators can link model outputs directly to audience engagement and monetization.
