Ultra-Long Sequence Parallelism: Ulysses + Ring-Attention Principles and Implementation

00 — Preface

Long-Sequence Training Challenges

Training on ultra-long sequences is a critical aspect of large model development. In real-world inference — especially within Agent pipelines — a model’s ability to generalize to long contexts directly affects its reliability.

However, long-sequence scenarios place heavy demands on training resources due to the O(N²) complexity of Attention. As sequence length grows, GPU memory usage skyrockets, making efficient training difficult on GPUs with limited memory.

---

Sequence Parallelism Overview

Sequence Parallelism (SP) makes it possible to train on long sequences by spreading them across multiple GPUs/nodes, so no single device has to hold the entire sequence in memory.

> Definition: Sequence parallelism splits an input sequence into smaller sub-sequences processed in parallel on different GPUs, reducing memory load.
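
Conceptually, the split itself is just a slice along the sequence dimension; the hard part is computing Attention across the shards. Below is a minimal sketch of the splitting step, assuming an already-initialized `torch.distributed` process group (the helper name is illustrative, not part of any library):

```python
import torch
import torch.distributed as dist

def shard_sequence(input_ids: torch.Tensor, dim: int = 1) -> torch.Tensor:
    """Give each rank one contiguous slice of the sequence dimension."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    # chunk() assumes the sequence length is divisible by world_size
    # (real implementations pad first; see the padding notes further below).
    return input_ids.chunk(world_size, dim=dim)[rank]

# With 8-way SP, a [batch, 65536] batch becomes [batch, 8192] per GPU.
```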

Common Approaches

  • Ulysses (DeepSpeed)
  • Ring-Attention (Flash-Attn derivative)
  • Megatron-SP/CP (Megatron-LM optimizations)

In this guide, we focus on Ulysses and Ring-Attention, which integrate well with the Transformers ecosystem.

---

Memory Reduction Example (Qwen2.5‑3B)

| SP Size | SP Strategy        | GPU Memory | Training Time |
|---------|--------------------|------------|---------------|
| 8       | ulysses=2 + ring=4 | 17.92 GiB  | 1:07:20       |
| 4       | ulysses=2 + ring=2 | 27.78 GiB  | 37:48         |
| 2       | ulysses=2          | 48.50 GiB  | 24:16         |
| 1       | No SP              | 75.35 GiB  | 19:41         |

Note: More splits reduce memory but increase communication overhead and training time.

---

01 — Ulysses

Core Idea

Ulysses shards the sequence across GPUs, then performs an all-to-all exchange of activations right before Attention so that each GPU holds the full sequence but only a subset of the Attention heads; after Attention, a second all-to-all restores the original sequence-sharded layout (a sketch of the exchange follows the key points below).


Key Points:

  • N = sequence length
  • d = hidden size (= number of heads × head dim)
  • Q/K/V cover the full sequence on every GPU, but each GPU only holds its assigned heads.
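
A minimal sketch of the head/sequence exchange, assuming `torch.distributed` is initialized and the shapes are divisible by the group size (the helper is illustrative, not the DeepSpeed API):

```python
import torch
import torch.distributed as dist

def seq_all_to_all(x: torch.Tensor, scatter_dim: int, gather_dim: int, group=None) -> torch.Tensor:
    """All-to-all that shards `scatter_dim` across ranks while gathering `gather_dim`."""
    world_size = dist.get_world_size(group)
    # Send one chunk of scatter_dim to every rank...
    inputs = [t.contiguous() for t in x.chunk(world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(inputs[0]) for _ in inputs]
    dist.all_to_all(outputs, inputs, group=group)
    # ...and stitch the received chunks back together along gather_dim.
    return torch.cat(outputs, dim=gather_dim)

# Before Attention: q has shape [batch, N/P, heads, head_dim] (sequence sharded)
#     q = seq_all_to_all(q, scatter_dim=2, gather_dim=1)   # -> [batch, N, heads/P, head_dim]
# After Attention: the reverse exchange restores the sequence-sharded layout
#     out = seq_all_to_all(out, scatter_dim=1, gather_dim=2)
```

Each rank then runs ordinary Attention over the full sequence for its subset of heads, which is why flash-attn and SDPA work unmodified.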

Advantages:

  • Works with GQA, MHA, `flash-attn`, SDPA, `padding_free`
  • Compatible with forward optimization techniques

Limitations:

  • The parallelism degree is capped by the number of Attention heads; with GQA, the small number of KV heads limits how far it scales.

---

02 — Ring-Attention

Concept Origins

Ring-Attention builds on Flash-Attention, which tiles Q/K/V and computes the softmax blockwise to:

  • Maximize SRAM use
  • Reduce memory footprint
  • Process incrementally

Ring-Attention Idea:

Split the sequence into blocks across GPUs, compute Attention for the local block, then circulate K/V among the GPUs in a ring so that every Q block eventually attends to every K/V block (a sketch of the ring pass follows).

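A rough sketch of the ring pass, assuming a `torch.distributed` group and a `local_attn` callable that returns the partial output plus its log-sum-exp for one K/V block (names are illustrative; this is not the actual kernel):

```python
import torch
import torch.distributed as dist

def ring_pass_kv(q, k, v, local_attn):
    """Keep the local Q shard fixed, rotate K/V around the ring, and collect
    one partial (output, lse) pair per K/V block."""
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    partials = []
    for step in range(world):
        partials.append(local_attn(q, k, v))  # attend against the current K/V block
        if step == world - 1:
            break                             # last block: no need to pass K/V on
        # Rotate K/V to the next rank (real implementations double-buffer and
        # overlap this communication with compute).
        k_next, v_next = torch.empty_like(k), torch.empty_like(v)
        reqs = dist.batch_isend_irecv([
            dist.P2POp(dist.isend, k, send_to), dist.P2POp(dist.irecv, k_next, recv_from),
            dist.P2POp(dist.isend, v, send_to), dist.P2POp(dist.irecv, v_next, recv_from),
        ])
        for req in reqs:
            req.wait()
        k, v = k_next, v_next
    # The partial results are merged with the LSE trick described in the next section.
    return partials
```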

---

Numerical Stability in Softmax

  • Avoid direct exponentiation to prevent overflow/underflow
  • Use LogSumExp (LSE) together with PyTorch’s `softplus` for stable incremental updates (a minimal sketch of the merge follows)
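
A minimal sketch of that merge, combining two partial Attention results computed over disjoint K/V blocks (shapes are simplified; `lse` is assumed broadcastable against `out`):

```python
import torch
import torch.nn.functional as F

def merge_attention_blocks(out, lse, block_out, block_lse):
    """Fold a new partial result into the running one without ever
    materializing un-normalized exponentials."""
    # Stable log-sum-exp update: new_lse = lse + log(1 + exp(block_lse - lse))
    new_lse = lse + F.softplus(block_lse - lse)
    # Rescale both partial outputs to the common normalizer and add them.
    out = torch.exp(lse - new_lse) * out + torch.exp(block_lse - new_lse) * block_out
    return out, new_lse
```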

---

Zigzag Ring-Attention Optimization

Standard Ring-Attention has uneven load under causal masking: with a contiguous split, ranks holding later blocks must attend to far more K/V than ranks holding early blocks.

Zigzag slicing cuts the sequence into 2 × world_size chunks and pairs an early chunk with a late one on each rank (e.g., 0/7, 1/6, 2/5, 3/4) to balance computation.


This:

  • Balances the workload across ranks
  • Further reduces compute, since under the causal mask some block pairs only need part of the K/V or Q to be processed (see the sketch below).
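
An illustrative zigzag partition, assuming the sequence length is divisible by 2 × world_size (the helper name is hypothetical):

```python
import torch

def zigzag_split(x: torch.Tensor, rank: int, world_size: int, dim: int = 1) -> torch.Tensor:
    """Cut the sequence into 2*world_size chunks and give rank r chunks r and 2*world_size-1-r."""
    chunks = x.chunk(2 * world_size, dim=dim)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=dim)

# With world_size=4 the pairs are (0,7), (1,6), (2,5), (3,4): each rank owns one
# "cheap" early chunk and one "expensive" late chunk under the causal mask.
```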

---

03 — Combining Ulysses + Ring-Attention

Comparison:

  • Ulysses: low communication volume (all-to-all), but parallelism is capped by the head count and the all-to-all favors high-bandwidth (e.g., intra-node) interconnects
  • Ring-Attention: point-to-point ring communication with no head-count constraint, but higher total communication volume

Strategy:

  • Use Ulysses first (lower communication cost)
  • Add Ring-Attention when:
      • the required number of sequence splits exceeds what Ulysses alone supports, or
      • there are not enough Attention heads (e.g., GQA with few KV heads)

SWIFT Implementation

  • Works for pure text, multimodal, SFT, DPO, GRPO
  • Auto-detects split strategy
  • Supports odd GPU counts

Command:

--sequence_parallel_size N

Split Flow Example

Splitting into 4 parts (Ulysses world_size=2, Ring world_size=2) with 4 attention heads: after the Ulysses all-to-all each GPU holds 2 heads, and Ring-Attention then circulates K/V within each ring group of 2.

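As a rough sketch, a sequence-parallel rank can be thought of as a pair of coordinates, one in the Ulysses group and one in the Ring group (the exact grouping used by SWIFT may differ):

```python
ULYSSES_SIZE, RING_SIZE = 2, 2
SP_SIZE = ULYSSES_SIZE * RING_SIZE   # 4-way sequence parallelism
NUM_HEADS = 4

for sp_rank in range(SP_SIZE):
    ulysses_rank = sp_rank % ULYSSES_SIZE    # heads are exchanged inside this group
    ring_rank = sp_rank // ULYSSES_SIZE      # K/V blocks circulate inside this group
    heads_per_rank = NUM_HEADS // ULYSSES_SIZE
    print(f"sp_rank={sp_rank}: ulysses_rank={ulysses_rank}, "
          f"ring_rank={ring_rank}, heads={heads_per_rank}")
```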

---

Multimodal & `padding_free` Adaptations

  • The split is applied in a forward hook on the LLM backbone, after the ViT outputs have been merged into the text embeddings
  • This keeps a single code path for text-only and multimodal inputs
  • Each sequence is padded to a length divisible by the SP size before splitting (see the sketch after this list)
  • The padded positions receive special handling in Q/K/V during Attention
  • The loss computation is rewritten for GRPO/DPO, since those objectives require quantities gathered over the full sequence
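
A minimal sketch of the pre-split padding step (the pad token id, tensor layout, and helper name are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def pad_to_divisible(input_ids: torch.Tensor, sp_size: int, pad_id: int = 0) -> torch.Tensor:
    """Right-pad the (packed) sequence so its length is divisible by sp_size."""
    remainder = input_ids.shape[-1] % sp_size
    if remainder:
        input_ids = F.pad(input_ids, (0, sp_size - remainder), value=pad_id)
    return input_ids

# The padded positions then need the special Attention handling noted above.
```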

---

04 — GPU Memory Optimization Results

Example: 3B model, 8×A100 GPUs, roughly 9,000 tokens per sequence in `test.jsonl`:

NPROC_PER_NODE=8 \
swift sft \
    --model Qwen/Qwen2.5-3B-Instruct \
    --dataset 'test.jsonl' \
    --train_type lora \
    --torch_dtype bfloat16 \
    --per_device_train_batch_size 4 \
    --target_modules all-linear \
    --gradient_accumulation_steps 8 \
    --save_total_limit 2 \
    --save_only_model true \
    --save_steps 50 \
    --max_length 65536 \
    --warmup_ratio 0.05 \
    --attn_impl flash_attn \
    --sequence_parallel_size 8 \
    --logging_steps 1 \
    --use_logits_to_keep false \
    --padding_free true

With the 8-way split, per-GPU memory drops from roughly 80 GiB to under 20 GiB.

---

05 — Outlook

Future optimizations:

  • Faster backward pass without full recompute of `flash_attn_forward`
  • Reduce P2P communication volume and overlap communication with computation via asynchronous execution

---


For teams working with long-sequence models and multi-platform deployment, AiToEarn provides:

  • AI content generation
  • Simultaneous publishing to Douyin, Kwai, Bilibili, YouTube, X (Twitter), etc.
  • Analytics + AI model rankings
  • Open-source integrations: GitHub
  • Blog: AiToEarn blog

By combining GPU-efficient parallelism methods like Ulysses/Ring-Attn with multi-channel publishing, creators can link model outputs directly to audience engagement and monetization.
