EntropyLong: Efficient Long-Context Training via Uncertainty Prediction
An Information-Theoretic Approach to Long-Text Training Data


EntropyLong is a long-text data construction method based on predictive uncertainty.
It identifies locations of missing information by measuring a model’s prediction entropy, retrieves relevant distant context, and verifies whether this context reduces uncertainty — ensuring training data contains genuine long-range dependencies.
Unlike heuristic methods, EntropyLong is grounded in information theory, guaranteeing measurable information gain for each constructed dependency.

> Paper: [arXiv:2510.02330](https://arxiv.org/abs/2510.02330)
---
1. Background: Challenges in Long-Text Training
Large language models now support context windows of up to millions of tokens, but naturally occurring, high-quality long documents are scarce.
Naively concatenating short documents does not yield true long-range relationships, so models fail to utilize long contexts effectively.
Existing Synthetic Approaches
- Query-driven – retrieve semantically related documents to form coherent sequences.
- Contrast-driven – interleave relevant docs with distractors for better discrimination.
- Task-driven – synthesize data targeting specific long-text capabilities.
- Self-generated – have LLMs create their own long-context content.
Limitations:
All of these methods assume their constructions are “good” without verifying that the added context actually helps the model.
---
2. EntropyLong: Data Construction from Model Uncertainty
2.1 Core Insight – Prediction Entropy Marks Missing Information
High entropy at a position means the model is uncertain there, typically because key information is missing from its visible context.
If distant information reduces entropy, a genuine dependency exists.
Entropy at position \(t\):

\[
H_t = -\sum_{v \in \mathcal{V}} p(v \mid x_{<t}) \log p(v \mid x_{<t})
\]

where \(\mathcal{V}\) is the vocabulary and \(x_{<t}\) the preceding tokens.
- High-entropy points → anchor positions for adding context.
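Below is a minimal sketch of how per-position entropy could be scored with a Hugging Face causal LM. The helper name `token_entropies` and the loading details are ours, not from the paper; the paper's base model, Meta-Llama-3-8B, is used for concreteness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def token_entropies(text: str) -> torch.Tensor:
    """Predictive entropy H_t (in nats) of the next-token distribution at each position."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    log_p = torch.log_softmax(model(ids).logits, dim=-1)   # (1, seq_len, vocab)
    # H_t = -sum_v p(v | x_<t) log p(v | x_<t); note that the logits at
    # index t-1 are the model's prediction for token t.
    return -(log_p.exp() * log_p).sum(dim=-1).squeeze(0)   # (seq_len,)
```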
---
2.2 Guiding Principle – Maximize Measured Information Gain
We define the Contextual Information Gain of a candidate context \(c\) at position \(t\) as the reduction in entropy after prepending \(c\):

\[
\mathrm{IG}(c, t) = H_t - H_t^{(c)}
\]

where \(H_t^{(c)}\) is the entropy at \(t\) recomputed with \(c\) in context.
Practical implications:
- Empirical Validation – Only accept contexts that measurably reduce entropy.
- Information-Theoretic Selection – Pick contexts that maximize validated gain.
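As a concrete (purely illustrative) example: if an anchor has \(H_t = 3.2\) nats in the original document and \(H_t^{(c)} = 1.1\) nats once candidate \(c\) is prepended, then \(\mathrm{IG}(c, t) = 2.1\) nats and \(c\) is a verified dependency. A candidate that is topically similar but leaves \(H_t\) unchanged yields \(\mathrm{IG} = 0\) and is rejected, no matter how relevant it looks to a retriever.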
---
2.3 Testable Hypotheses
- Necessity of Validation – Data with entropy-validated contexts will outperform data built from semantic retrieval alone on fine-grained dependency tasks.
- Optimal Thresholds – There exist thresholds for high-entropy detection and context acceptance that balance the quality and quantity of training data.
---
3. Methodology: Four Stages

Stage 1 – High-Entropy Position Selection
- Compute entropy for each document position.
- Use an adaptive threshold:

\[
\tau = \bar{H} + \alpha \, \sigma_H
\]

  where \(\bar{H}\) and \(\sigma_H\) are the mean and standard deviation of per-position entropy, and \(\alpha\) controls selectivity (e.g., \(\alpha = 1.5\)).
- Mark positions with \(H_t > \tau\) as high-uncertainty anchors.
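A small sketch of this selection rule (array and function names are ours):

```python
import numpy as np

def select_anchors(H: np.ndarray, alpha: float = 1.5) -> np.ndarray:
    """Return indices t whose entropy exceeds tau = mean(H) + alpha * std(H)."""
    tau = H.mean() + alpha * H.std()
    return np.where(H > tau)[0]

# e.g., anchors = select_anchors(token_entropies(doc).float().numpy())
```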

---
Stage 2 – Candidate Context Retrieval
- For each high-entropy position \(p\), extract a local window of \(w\) tokens on either side.
- Use this window as a dense-vector search query over a large corpus.
- Rank by cosine similarity; keep top \(k\) candidates.
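A sketch of the retrieval step. The paper does not pin down the embedding model, so `all-MiniLM-L6-v2` below is purely illustrative, and a real corpus of over 1B documents would need an approximate-nearest-neighbor index (e.g., FAISS) rather than the brute-force scoring shown here:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def retrieve_candidates(query_window: str, corpus: list[str], k: int = 8) -> list[str]:
    """Rank corpus passages by cosine similarity to the anchor's local window."""
    q = encoder.encode([query_window], normalize_embeddings=True)
    docs = encoder.encode(corpus, normalize_embeddings=True)
    sims = docs @ q[0]                 # cosine similarity, since vectors are unit-norm
    return [corpus[i] for i in np.argsort(-sims)[:k]]
```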
---
Stage 3 – Entropy Reduction Verification
- Prepend each candidate context \(c\) to the original document.
- Recompute entropy at each anchor position \(t\), now conditioned on \(c\), yielding \(H_t^{(c)}\).
- Keep only contexts meeting the reduction threshold:

\[
H_t - H_t^{(c)} \geq \delta
\]
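A sketch of the verification check, reusing `token_entropies` from the Stage 1 sketch. Whether the drop must hold at every anchor or only on average is our simplification, and \(\delta = 0.4\) follows the value validated in Section 5.2:

```python
def verify_context(doc: str, anchors, candidate: str, delta: float = 0.4) -> bool:
    """Accept `candidate` only if it lowers entropy at each anchor by at least delta."""
    h_base = token_entropies(doc)
    h_aug = token_entropies(candidate + "\n\n" + doc)
    # Prepending shifts anchor indices by the candidate's token length
    # (approximate: exact alignment depends on tokenizer merge behaviour).
    offset = len(tokenizer(candidate + "\n\n", add_special_tokens=False).input_ids)
    return all(h_base[t] - h_aug[t + offset] >= delta for t in anchors)
```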
---
Stage 4 – Strategic Concatenation
- Randomly shuffle verified contexts linked to different high-entropy points.
- Concatenate them with the original document to form the final long training sequence, as sketched below.
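A sketch of the packing step; the exact packing and truncation policy is our assumption:

```python
import random

def build_sequence(doc: str, verified: list[str], max_tokens: int = 131072) -> str:
    """Shuffle verified contexts and pack them ahead of the source document."""
    random.shuffle(verified)
    budget = max_tokens - len(tokenizer(doc).input_ids)  # leave room for the doc itself
    picked, used = [], 0
    for ctx in verified:
        n = len(tokenizer(ctx).input_ids)
        if used + n > budget:
            continue  # skip contexts that would overflow the 128K budget
        picked.append(ctx)
        used += n
    return "\n\n".join(picked + [doc])
```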
---
4. Experiments & Results
4.1 Setup
- Base Model: Meta-Llama-3-8B
- Context: Expanded to 128K tokens via RoPE adjustment.
- Training: 1,000 steps with a batch size of 4M tokens.
- Data: 100K documents sampled from FineWeb-Edu and Cosmopedia; retrieval corpus of over 1B documents.
- Baselines: Quest (semantic-similarity concatenation) and NExtLong (distractor-based concatenation).
- Benchmarks: RULER, LongBench-v2.
4.2 Results
RULER

- Average score: EntropyLong 87.37 vs. Quest 80.53 and NExtLong 85.22.
- At 128K context: 81.26 vs. Quest 60.11 and NExtLong 77.99.
LongBench-v2 (after instruction fine-tuning)

- Strong transfer across domains.
- “Long” task score: 31.50 vs. Quest 21.30 and NExtLong 23.10, a gain of +8.40 over the best baseline.
---
5. Analysis: Why EntropyLong Works
5.1 Validation is Crucial (Hypothesis 1)

- Full EntropyLong: 87.37
- NoVerify ablation: 85.82; the drop comes from retaining spuriously correlated contexts that verification would have filtered out.
5.2 Threshold Effects (Hypothesis 2)
High-Entropy Selection Threshold

- Too low: too many positions selected, injecting noise and lowering scores.
- Too high: too few positions, leaving insufficient training signal.
- Optimum: 292 selected positions, balancing the two.
Entropy Drop Threshold

- Too low: accepts weakly informative contexts.
- Too high: too restrictive, discarding useful dependencies.
- Optimum: 46 verified dependencies, validating the thresholds of 2.0 (high-entropy selection) and 0.4 (entropy drop).
---
5.3 Attention Pattern Analysis

- Length scaling: EntropyLong keeps attention on the correct answer stable across context sizes.
- Lost-in-the-middle mitigation: 12–44% improvement over NExtLong at mid-sequence positions.
---
6. Conclusion
EntropyLong transforms long-text data construction from heuristic to evidence-based, leveraging model uncertainty to guide context selection.
Experiments confirm:
- Empirical validation → tangible gains.
- Threshold optimization → strong balance of quality and quantity.
- Superior performance → across synthetic and real-world long-text benchmarks.


