Effective Harnesses for Long-Running Agents
Building an efficient framework for AI agents to handle "marathon" tasks
As AI agents gain capability, they are taking on more complex, long-duration tasks that can run for hours or days. A persistent challenge is keeping an agent coherent across many sessions, because each new context window starts with no memory of the last.
---
The Core Problem: Context Window Limits
Long-running agent work happens in multiple, disconnected sessions. Every new session essentially forgets what occurred earlier.
Imagine a rotating team of engineers where each remembers nothing from the previous shift — productivity would plummet.
Because AI models have a limited context window (the maximum information they can process at once), complex projects extending beyond this limit need a method to bridge sessions.
---
Our Dual-Agent Solution with Claude Agent SDK
We developed a two-agent harness to work efficiently across multiple context windows:
- Initializer Agent – sets up the environment during the first run.
- Coding Agent – makes incremental coding progress during each subsequent session, preparing a clear handoff document for the next session.
📂 Code samples: Quickstart Guide
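For concreteness, here is a minimal sketch of how such a harness loop could be driven with the TypeScript Claude Agent SDK. The `query()` call reflects the SDK as we understand it; the prompt strings and project path are hypothetical placeholders, not the exact prompts from the post.

```typescript
// Minimal two-agent harness sketch. Assumptions: the TypeScript Claude Agent
// SDK's query() entry point as shown; all prompts and paths are hypothetical.
import { query } from "@anthropic-ai/claude-agent-sdk";
import { existsSync } from "node:fs";

const PROJECT_DIR = "./app";

const INITIALIZER_PROMPT =
  "Set up the project: write init.sh, feature_list.json, claude-progress.txt, and make an initial commit.";
const CODER_PROMPT =
  "Orient yourself from the git log and claude-progress.txt, then implement and verify exactly one feature.";

// One call = one fresh context window. The agent must re-orient itself from
// files on disk, since it remembers nothing from earlier sessions.
async function runSession(prompt: string): Promise<void> {
  for await (const message of query({ prompt, options: { cwd: PROJECT_DIR } })) {
    console.log(message.type); // stream progress for observability
  }
}

// First session: the Initializer Agent sets up the environment.
if (!existsSync(`${PROJECT_DIR}/claude-progress.txt`)) {
  await runSession(INITIALIZER_PROMPT);
}
// Every later session: the Coding Agent makes one increment of progress.
await runSession(CODER_PROMPT);
```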
---
The “Marathon Agent” Dilemma
The Claude Agent SDK supports coding, planning, and information-gathering tasks.
It includes context-compaction features that condense conversation history, in theory enabling indefinitely long continuous work.
Reality check: Compaction alone isn’t enough.
Failure patterns we see:
- Too much at once – Agents try a complete one-shot build, overflowing context and leaving fragmented output for the next session.
- Overconfidence – Agents declare “done” after partial builds without full validation.
---
The Needed Breakthrough
We realized success required:
- An initial environment set up to support every requested feature
- An incremental, feature-by-feature development cadence
---
Guiding Agents with Incremental Progress
A Clean State at the end of each session means:
- Merge-ready code
- No major bugs
- Clear structure
- Documented for handoff
---
A Two-Phase Process from Our Experiments
Phase 1: Initializer Agent
- Creates:
  - an `init.sh` startup script
  - a `claude-progress.txt` operation log
  - an initial Git commit with the file list
Phase 2: Coding Agent
- Makes one small feature update per session
- Logs the update in a structured progress file
Insight: Combining `claude-progress.txt` with the Git history lets each fresh session pick up where the last one left off without losing work context (a sketch of this handoff read follows below).
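A minimal sketch of that handoff read, using plain Node built-ins and the file names above:

```typescript
// Sketch of the session handoff read: recent Git history plus the progress
// log give a fresh context window enough orientation to continue the work.
import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

const recentCommits = execSync("git log --oneline -n 10", { encoding: "utf8" });
const progressLog = readFileSync("claude-progress.txt", "utf8");

// In the post, the agent itself runs these reads as tool calls; a harness
// could also prepend this context to the session's first prompt.
console.log(`Recent commits:\n${recentCommits}\nProgress log:\n${progressLog}`);
```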
---
Environment Management
> Use a different prompt for the first context window.
Initializer tasks include:
1. Feature List
- Detailed breakdown based on initial user request
- Example: Cloning claude.ai → 200+ features
- All start as “Failing”
- Strict rule: Only update “passes” field; never delete items
- Stored in JSON, not Markdown, to avoid accidental corruption (an example entry follows this list)
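As an illustration, one entry in such a file might look like this. Only the "passes" rule comes from the post; the other field names are assumptions.

```typescript
// Sketch of one feature_list.json entry. The "passes" rule is from the post;
// the other field names are hypothetical.
interface Feature {
  id: number;
  description: string; // the user-visible behavior
  steps: string[];     // how to verify it end-to-end
  passes: boolean;     // starts false ("Failing"); agents may only flip this field
}

const example: Feature = {
  id: 42,
  description: "User can switch between light and dark theme",
  steps: [
    "Open the settings menu",
    "Click the theme toggle",
    "Confirm the page colors change",
  ],
  passes: false,
};

console.log(JSON.stringify(example, null, 2));
```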
---
2. Incremental Progress
- One feature at a time
- Maintain clean environment
- Good habits:
- Descriptive Git commit messages
- Work summaries in progress file
- Quickly revert via Git if needed (a sketch of these habits follows below)
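A sketch of what those habits look like mechanically; the file name and commit discipline come from the post, but the helper itself is hypothetical.

```typescript
// Hypothetical helper showing the "clean state" habit: append a work summary
// to the progress file and make a descriptive commit after each feature.
import { appendFileSync } from "node:fs";
import { execSync } from "node:child_process";

function recordProgress(featureId: number, summary: string): void {
  appendFileSync(
    "claude-progress.txt",
    `${new Date().toISOString()} feature #${featureId}: ${summary}\n`,
  );
  execSync("git add -A");
  execSync(`git commit -m "feat #${featureId}: ${summary}"`);
}

recordProgress(42, "theme toggle switches between light and dark");
```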
---
3. Testing
Common failure: Marking features “complete” without end-to-end testing.
Better results come from adding:
- Browser automation (e.g., Puppeteer MCP)
- Human-like workflows

*Screenshot: Claude testing a cloned claude.ai app.*
Known limitation: the Puppeteer MCP server cannot detect native browser alerts, so functionality that depends on them often remains buggy.
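To make the workflow concrete, here is a sketch of such an end-to-end check written against plain Puppeteer rather than the MCP server; the URL and selectors are hypothetical stand-ins for the cloned chat app.

```typescript
// End-to-end smoke test sketch: drive the app the way a human would.
// Hypothetical URL and selectors; plain Puppeteer instead of the MCP server.
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("http://localhost:3000"); // dev server started by init.sh

await page.type("#message-input", "Hello");         // type a chat message
await page.click("#send-button");                   // send it
await page.waitForSelector(".assistant-message", {  // wait for a reply to render
  timeout: 15_000,
});

await page.screenshot({ path: "feature-check.png" }); // evidence for the log
await browser.close();
```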
---
Session Orientation Steps
At start of session:
- Run `pwd` – confirm current working directory
- Read Git logs & `claude-progress.txt` – understand recent work
- Read `feature_list.json` – pick top-priority unfinished feature
- Run `init.sh` – start development server
- Test core functionality via Puppeteer MCP
- Begin new feature development
Benefit: saves tokens by avoiding re-learning the app setup in every session.
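One way to bake these steps into the harness is a fixed per-session prompt; the wording below is hypothetical, but the checklist mirrors the list above.

```typescript
// Hypothetical per-session prompt encoding the orientation checklist.
const SESSION_PROMPT = `
Before writing any code:
1. Run pwd to confirm the working directory.
2. Read the git log and claude-progress.txt to understand recent work.
3. Read feature_list.json and pick the highest-priority feature with "passes": false.
4. Run init.sh to start the development server.
5. Smoke-test core flows with the Puppeteer tools.
Then implement exactly one feature, verify it end-to-end, set "passes" to true,
commit with a descriptive message, and append a summary to claude-progress.txt.
`;

export { SESSION_PROMPT };
```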
---
Example Session Flow
[Assistant] I’ll start by getting familiar with the environment and the current project status.
[Tool Use]
[Tool Use]
[Tool Use]
[Assistant] Let me check the git log to see recent changes.
[Tool Use]
[Assistant] Now I’ll check if there is an init.sh script to restart the server.
[Assistant] Great! Now I’ll navigate to the application to verify some core features are working.
[Assistant] ... core functionalities are running well. Main chat, theme switching, conversation loading, error handling operational.
---
Failure Modes & Solutions
| Failure Mode | Initializer Agent Mitigation | Coding Agent Mitigation |
|--------------|------------------------------|--------------------------|
| Declares the project "complete" too early | Creates the feature list file | Sets "passes" only after self-verification |
| Buggy environment / unclear progress | Creates the repo and progress doc | Reads progress file and Git logs, runs basic tests before coding |
| Marks individual features complete prematurely | Creates the feature list | Verifies end-to-end before setting "passes" |
| Doesn't know how to run the app | Writes `init.sh` | Reads and runs `init.sh` at session start |
---
Summary of Solutions
- Feature List File (JSON) – structured, end-to-end feature descriptions
- Progress Documentation – Git commits + logs for review
- Strict Self-Verification – Only mark features complete after thorough testing
- Startup Script (`init.sh`) – Reliable dev server launch
---
Future Outlook
Open questions remain:
- Is a single general-purpose agent optimal?
- Or should we develop specialist agents for testing, QA, code cleaning?
This framework currently suits full-stack web dev; future work may extend to research or financial modeling.
---
Acknowledgements
Thanks to the Anthropic teams, particularly Code RL and Claude Code, for enabling safe, autonomous long-cycle programming with Claude.
Interested? Apply: anthropic.com/careers
---
Source: Anthropic Engineering Blog