EntropyLong: Efficient Long-Context Training via Uncertainty Prediction


EntropyLong is a long-text data construction method based on predictive uncertainty.

It identifies locations of missing information by measuring a model’s prediction entropy, retrieves relevant distant context, and verifies whether this context reduces uncertainty — ensuring training data contains genuine long-range dependencies.

Unlike heuristic methods, EntropyLong is grounded in information theory, guaranteeing measurable information gain for each constructed dependency.


> Paper: arXiv:2510.02330 (https://arxiv.org/abs/2510.02330)

---

1. Background: Challenges in Long-Text Training

Large language models now support context windows of millions of tokens, but naturally occurring high-quality long documents are scarce.

Naively concatenating short documents does not create genuine long-range dependencies, so models trained this way fail to exploit long contexts effectively.

Existing Synthetic Approaches

  • Query-driven – retrieve semantically related documents to form coherent sequences.
  • Contrast-driven – interleave relevant docs with distractors for better discrimination.
  • Task-driven – synthesize data targeting specific long-text capabilities.
  • Self-generated – have LLMs create their own long-context content.

Limitations:

All these methods assume certain constructions are “good” without verifying actual utility for the model.

---

2. EntropyLong: Data Construction from Model Uncertainty

2.1 Core Insight – Prediction Entropy Marks Missing Information

A position with high prediction entropy is one where the model is uncertain, typically because key context is missing.

If supplying distant information reduces that entropy, a genuine long-range dependency exists.

The entropy at position \(t\) is

\[
H_t = -\sum_{v \in V} p(v \mid x_{<t}) \, \log p(v \mid x_{<t}),
\]

where \(V\) is the vocabulary and \(x_{<t}\) denotes the preceding tokens.
  • High-entropy points → anchor positions for adding context.
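As a concrete illustration, here is a minimal PyTorch sketch of this per-position entropy, computed from the next-token logits of any causal LM; the function name and tensor shapes are our own choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Per-position predictive entropy H_t = -sum_v p(v|x_<t) log p(v|x_<t).

    logits: [seq_len, vocab_size] next-token logits from a causal LM.
    Returns a [seq_len] tensor of entropies (in nats).
    """
    log_p = F.log_softmax(logits, dim=-1)      # log p(v | x_<t)
    return -(log_p.exp() * log_p).sum(dim=-1)  # -sum_v p log p

# Usage with a Hugging Face-style causal LM (names illustrative):
#   logits = model(input_ids).logits[0]   # [seq_len, vocab_size]
#   H = token_entropies(logits)           # H[t-1] scores the prediction of token t
```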

---

2.2 Guiding Principle – Maximize Measured Information Gain

We define Contextual Information Gain as the reduction in entropy after adding candidate context:

\[
\mathrm{IG}(c;\, t) = H_t - H_t^{(c)},
\]

where \(H_t^{(c)}\) is the entropy at position \(t\) after candidate context \(c\) is prepended.

Practical implications:

  • Empirical Validation – Only accept contexts that measurably reduce entropy.
  • Information-Theoretic Selection – Pick contexts that maximize validated gain.
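A minimal sketch of measuring this gain, reusing token_entropies from the earlier snippet and assuming a Hugging Face-style causal LM; the offset bookkeeping (logits at index i predict token i+1) is the only subtle part.

```python
import torch

@torch.no_grad()
def information_gain(model, doc_ids, ctx_ids, anchor):
    """IG(c; t): entropy drop at `anchor` when `ctx_ids` is prepended to `doc_ids`.

    doc_ids, ctx_ids: 1-D token-id tensors; anchor: index into doc_ids (>= 1,
    since logits at index i predict token i+1, hence the `anchor - 1` offsets).
    """
    h_base = token_entropies(model(doc_ids.unsqueeze(0)).logits[0])[anchor - 1]
    joined = torch.cat([ctx_ids, doc_ids]).unsqueeze(0)
    shifted = anchor - 1 + ctx_ids.numel()   # anchor moves right by |c| tokens
    h_cond = token_entropies(model(joined).logits[0])[shifted]
    return (h_base - h_cond).item()          # positive => context reduced uncertainty
```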

---

2.3 Testable Hypotheses

  • Necessity of Validation – Data with validated contexts will outperform semantic-only retrieval sets for fine-grained dependencies.
  • Optimal Thresholds – Ideal thresholds for both high-entropy detection and context acceptance balance quality and quantity of training data.

---

3. Methodology: Four Stages


Stage 1 – High-Entropy Position Selection

  • Compute entropy for each document position.
  • Use an adaptive threshold \(\tau = \bar{H} + \alpha\,\sigma_H\), where \(\bar{H}\) and \(\sigma_H\) are the mean and standard deviation of the document's entropies and \(\alpha\) controls selectivity (e.g., 1.5).
  • Mark positions with \(H_t > \tau\) as high-uncertainty anchors.
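A short sketch of this selection rule (function name ours):

```python
import torch

def select_anchors(H: torch.Tensor, alpha: float = 1.5) -> torch.Tensor:
    """Return indices t with H[t] > mean(H) + alpha * std(H)."""
    tau = H.mean() + alpha * H.std()            # adaptive, per-document threshold
    return (H > tau).nonzero(as_tuple=True)[0]  # high-uncertainty anchor positions
```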

---

Stage 2 – Context Retrieval by Information Theory

  • For each high-entropy position \(p\), extract a text window of \(w\) tokens before and after it.
  • Use this window as a dense-vector search query over a large corpus.
  • Rank by cosine similarity; keep top \(k\) candidates.
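A minimal retrieval sketch using plain cosine similarity; embed() stands in for whatever dense encoder is used (the post does not pin one down), and a production system would use an ANN index such as FAISS rather than a brute-force matrix product.

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 8):
    """Rank corpus passages by cosine similarity to the anchor-window embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(C @ q))[:k]  # indices of the top-k candidate contexts

# Usage (embed() and window_around() are hypothetical helpers):
#   query_vec = embed(window_around(p, w))
#   candidates = retrieve_top_k(query_vec, corpus_vecs, k=8)
```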

---

Stage 3 – Entropy Reduction Verification

  • Prepend each candidate context to original doc.
  • Recompute entropy at the anchor positions to obtain \(H_t^{(c)}\).
  • Keep only contexts whose measured reduction meets the threshold: \(H_t - H_t^{(c)} \geq \delta\).
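Combining the pieces above into a verification filter; the max-over-anchors aggregation and the default \(\delta\) are our assumptions (0.4 is the drop threshold that Section 5.2 reports as validated).

```python
def verify_contexts(model, doc_ids, anchors, candidates, delta: float = 0.4):
    """Keep only candidate contexts with a measured entropy drop of at least delta."""
    verified = []
    for ctx_ids in candidates:
        gains = [information_gain(model, doc_ids, ctx_ids, a) for a in anchors]
        if max(gains) >= delta:  # context must demonstrably reduce uncertainty
            verified.append(ctx_ids)
    return verified
```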

---

Stage 4 – Strategic Concatenation

  • Randomly shuffle verified contexts linked to different high-entropy points.
  • Concatenate to form final training sequence.
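A final assembly sketch; the shuffle is the policy the post describes, while placing the original document last is our assumption about the layout.

```python
import random
import torch

def build_training_sequence(doc_ids, verified_contexts, seed: int = 0):
    """Shuffle verified contexts and concatenate them ahead of the original document."""
    rng = random.Random(seed)
    ctxs = list(verified_contexts)
    rng.shuffle(ctxs)  # decorrelate context position from anchor order
    return torch.cat(ctxs + [doc_ids])
```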

---

4. Experiments & Results

4.1 Setup

  • Base Model: Meta-Llama-3-8B
  • Context: Expanded to 128K tokens via RoPE adjustment.
  • Training: 1,000 steps with a batch size of 4M tokens.
  • Data: 100K documents sampled from FineWeb-Edu and Cosmopedia; retrieval corpus of over 1B documents.
  • Baselines: Quest (semantic-similarity concatenation), NExtLong (distractor-based concatenation).
  • Benchmarks: RULER, LongBench-v2.
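For readers unfamiliar with the context-extension step: extending a Llama-3-style model to 128K positions is commonly done by raising the RoPE base frequency. A hedged sketch using transformers follows; the exact rope_theta value is an assumption, as the post does not state it.

```python
from transformers import AutoConfig

# Sketch of RoPE-based context extension (values illustrative, not from the paper).
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
cfg.max_position_embeddings = 131072  # 128K-token window
cfg.rope_theta = 200_000_000.0        # enlarged RoPE base; assumed value
```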

4.2 Results

RULER

  • Avg score: EntropyLong 87.37 vs Quest 80.53, NExtLong 85.22.
  • At 128K: 81.26 vs Quest 60.11, NExtLong 77.99.

LongBench-v2 (after instruction fine-tuning)

  • Strong transfer across domains.
  • “Long” subset: 31.50 vs Quest 21.30 and NExtLong 23.10, a +8.40 gain over the best baseline.

---

5. Analysis: Why EntropyLong Works

5.1 Validation is Crucial (Hypothesis 1)

  • Full EntropyLong: 87.37
  • Without verification (NoVerify): 85.82; the drop comes from spurious correlations retained in unverified contexts.

5.2 Threshold Effects (Hypothesis 2)

High-Entropy Selection Threshold

  • Too low: too many positions selected, adding noise and lowering scores.
  • Too high: too few positions, leaving insufficient training signal.
  • Optimum: 292 selected positions.

Entropy Drop Threshold

  • Too low: weak contexts are accepted.
  • Too high: overly restrictive, discarding useful contexts.
  • Optimum: 46 verified dependencies, validating the chosen thresholds of 2.0 (high-entropy selection) and 0.4 (entropy drop).

---

5.3 Attention Pattern Analysis

  • Length extrapolation: EntropyLong's attention to the correct answer remains stable as context size grows.
  • Lost-in-the-middle mitigation: 12–44% improvement over NExtLong at mid-sequence positions.

---

6. Conclusion

EntropyLong transforms long-text data construction from heuristic to evidence-based, leveraging model uncertainty to guide context selection.

Experiments confirm:

  • Empirical validation → tangible gains.
  • Threshold optimization → strong balance of quality and quantity.
  • Superior performance → across synthetic and real-world long-text benchmarks.
