Effective Harnesses for Long-Running Agents
Building an efficient framework for AI agents to handle "marathon" tasks
As AI agents gain capability, they are taking on more complex, long-duration tasks that can run for hours or days. A persistent challenge is keeping an agent coherent across many sessions, because each new context window starts with no memory of the last.
---
The Core Problem: Context Window Limits
Long-running agent work happens in multiple, disconnected sessions. Every new session essentially forgets what occurred earlier.
Imagine a rotating team of engineers where each remembers nothing from the previous shift — productivity would plummet.
Because AI models have a limited context window (the maximum information they can process at once), complex projects extending beyond this limit need a method to bridge sessions.
---
Our Dual-Agent Solution with Claude Agent SDK
We developed a two-agent harness to work efficiently across multiple context windows:
- Initializer Agent – sets up the environment during the first run.
- Coding Agent – makes incremental coding progress during each subsequent session, preparing a clear handoff document for the next session.
📂 Code samples: Quickstart Guide
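For concreteness, here is a minimal sketch of how such a harness loop could be driven with the TypeScript Claude Agent SDK. The `query()` call reflects the SDK as we understand it; the prompt strings and project path are hypothetical placeholders, not the exact prompts from the post.

```typescript
// Minimal two-agent harness sketch. Assumptions: the TypeScript Claude Agent
// SDK's query() entry point as shown; all prompts and paths are hypothetical.
import { query } from "@anthropic-ai/claude-agent-sdk";
import { existsSync } from "node:fs";

const PROJECT_DIR = "./app";

const INITIALIZER_PROMPT =
  "Set up the project: write init.sh, feature_list.json, claude-progress.txt, and make an initial commit.";
const CODER_PROMPT =
  "Orient yourself from the git log and claude-progress.txt, then implement and verify exactly one feature.";

// One call = one fresh context window. The agent must re-orient itself from
// files on disk, since it remembers nothing from earlier sessions.
async function runSession(prompt: string): Promise<void> {
  for await (const message of query({ prompt, options: { cwd: PROJECT_DIR } })) {
    console.log(message.type); // stream progress for observability
  }
}

// First session: the Initializer Agent sets up the environment.
if (!existsSync(`${PROJECT_DIR}/claude-progress.txt`)) {
  await runSession(INITIALIZER_PROMPT);
}
// Every later session: the Coding Agent makes one increment of progress.
await runSession(CODER_PROMPT);
```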
---
The “Marathon Agent” Dilemma
The Claude Agent SDK supports coding, planning, and information-gathering tasks.
It includes context-compaction features that condense conversation history, in theory enabling indefinitely long continuous work.
Reality check: Compaction alone isn’t enough.
Failure patterns we see:
- Too much at once – Agents try a complete one-shot build, overflowing context and leaving fragmented output for the next session.
- Overconfidence – Agents declare “done” after partial builds without full validation.
---
The Needed Breakthrough
We realized success required:
- An initial environment set up to support every requested feature
- An incremental, feature-by-feature development cadence
---
Guiding Agents with Incremental Progress
A Clean State at the end of each session means:
- Merge-ready code
- No major bugs
- Clear structure
- Documented for handoff
---
A Two-Phase Process from Our Experiments
Phase 1: Initializer Agent
- Creates:
  - an `init.sh` startup script
  - a `claude-progress.txt` operation log
  - an initial Git commit with the file list
Phase 2: Coding Agent
- Makes one small feature update per session
- Logs the update in a structured progress file
Insight: Combining `claude-progress.txt` with the Git history lets each fresh session pick up where the last one left off without losing work context (a sketch of this handoff read follows below).
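A minimal sketch of that handoff read, using plain Node built-ins and the file names above:

```typescript
// Sketch of the session handoff read: recent Git history plus the progress
// log give a fresh context window enough orientation to continue the work.
import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

const recentCommits = execSync("git log --oneline -n 10", { encoding: "utf8" });
const progressLog = readFileSync("claude-progress.txt", "utf8");

// In the post, the agent itself runs these reads as tool calls; a harness
// could also prepend this context to the session's first prompt.
console.log(`Recent commits:\n${recentCommits}\nProgress log:\n${progressLog}`);
```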
---
Environment Management
> Use a different prompt for the first context window.
Initializer tasks include:
1. Feature List
- Detailed breakdown based on initial user request
- Example: Cloning claude.ai → 200+ features
- All start as “Failing”
- Strict rule: Only update “passes” field; never delete items
- Stored in JSON, not Markdown, to avoid accidental corruption (an example entry follows this list)
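As an illustration, one entry in such a file might look like this. Only the "passes" rule comes from the post; the other field names are assumptions.

```typescript
// Sketch of one feature_list.json entry. The "passes" rule is from the post;
// the other field names are hypothetical.
interface Feature {
  id: number;
  description: string; // the user-visible behavior
  steps: string[];     // how to verify it end-to-end
  passes: boolean;     // starts false ("Failing"); agents may only flip this field
}

const example: Feature = {
  id: 42,
  description: "User can switch between light and dark theme",
  steps: [
    "Open the settings menu",
    "Click the theme toggle",
    "Confirm the page colors change",
  ],
  passes: false,
};

console.log(JSON.stringify(example, null, 2));
```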
---
2. Incremental Progress
- One feature at a time
- Maintain clean environment
- Good habits:
- Descriptive Git commit messages
- Work summaries in progress file
- Quickly revert via Git if needed (a sketch of these habits follows below)
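A sketch of what those habits look like mechanically; the file name and commit discipline come from the post, but the helper itself is hypothetical.

```typescript
// Hypothetical helper showing the "clean state" habit: append a work summary
// to the progress file and make a descriptive commit after each feature.
import { appendFileSync } from "node:fs";
import { execSync } from "node:child_process";

function recordProgress(featureId: number, summary: string): void {
  appendFileSync(
    "claude-progress.txt",
    `${new Date().toISOString()} feature #${featureId}: ${summary}\n`,
  );
  execSync("git add -A");
  execSync(`git commit -m "feat #${featureId}: ${summary}"`);
}

recordProgress(42, "theme toggle switches between light and dark");
```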
---
3. Testing
Common failure: Marking features “complete” without end-to-end testing.
Better results come from adding:
- Browser automation (e.g., Puppeteer MCP)
- Human-like workflows

*Screenshot: Claude testing a cloned claude.ai app.*
Known limitation: the Puppeteer MCP server cannot detect native browser alerts, so functionality that depends on them often remains buggy.
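To make the workflow concrete, here is a sketch of such an end-to-end check written against plain Puppeteer rather than the MCP server; the URL and selectors are hypothetical stand-ins for the cloned chat app.

```typescript
// End-to-end smoke test sketch: drive the app the way a human would.
// Hypothetical URL and selectors; plain Puppeteer instead of the MCP server.
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("http://localhost:3000"); // dev server started by init.sh

await page.type("#message-input", "Hello");         // type a chat message
await page.click("#send-button");                   // send it
await page.waitForSelector(".assistant-message", {  // wait for a reply to render
  timeout: 15_000,
});

await page.screenshot({ path: "feature-check.png" }); // evidence for the log
await browser.close();
```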
---
Session Orientation Steps
At start of session:
- Run `pwd` – confirm current working directory
- Read Git logs & `claude-progress.txt` – understand recent work
- Read `feature_list.json` – pick top-priority unfinished feature
- Run `init.sh` – start development server
- Test core functionality via Puppeteer MCP
- Begin new feature development
Benefit: saves tokens by avoiding re-learning the app setup in every session.
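One way to bake these steps into the harness is a fixed per-session prompt; the wording below is hypothetical, but the checklist mirrors the list above.

```typescript
// Hypothetical per-session prompt encoding the orientation checklist.
const SESSION_PROMPT = `
Before writing any code:
1. Run pwd to confirm the working directory.
2. Read the git log and claude-progress.txt to understand recent work.
3. Read feature_list.json and pick the highest-priority feature with "passes": false.
4. Run init.sh to start the development server.
5. Smoke-test core flows with the Puppeteer tools.
Then implement exactly one feature, verify it end-to-end, set "passes" to true,
commit with a descriptive message, and append a summary to claude-progress.txt.
`;

export { SESSION_PROMPT };
```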
---
Example Session Flow
[Assistant] I’ll start by getting familiar with the environment and the current project status.
[Tool Use]
[Tool Use]
[Tool Use]
[Assistant] Let me check the git log to see recent changes.
[Tool Use]
[Assistant] Now I’ll check if there is an init.sh script to restart the server.
[Assistant] Great! Now I’ll navigate to the application to verify some core features are working.
[Assistant] ... core functionalities are running well. Main chat, theme switching, conversation loading, error handling operational.
---
Failure Modes & Solutions
| Failure Mode | Initializer Agent Mitigation | Coding Agent Mitigation |
|--------------|------------------------------|--------------------------|
| Declares the project "complete" too early | Creates the feature list file | Sets "passes" only after self-verification |
| Buggy environment / unclear progress | Creates the repo and progress doc | Reads progress file and Git logs, runs basic tests before coding |
| Marks individual features complete prematurely | Creates the feature list | Verifies end-to-end before setting "passes" |
| Doesn't know how to run the app | Writes `init.sh` | Reads and runs `init.sh` at session start |
---
Summary of Solutions
- Feature List File (JSON) – structured, end-to-end feature descriptions
- Progress Documentation – Git commits + logs for review
- Strict Self-Verification – Only mark features complete after thorough testing
- Startup Script (`init.sh`) – Reliable dev server launch
---
Future Outlook
Open questions remain:
- Is a single general-purpose agent optimal?
- Or should we develop specialist agents for testing, QA, code cleaning?
This framework currently suits full-stack web dev; future work may extend to research or financial modeling.
---
Acknowledgements
Thanks to the Anthropic teams, particularly Code RL and Claude Code, for enabling safe, autonomous long-cycle programming with Claude.
Interested? Apply: anthropic.com/careers
---
Source: Anthropic Engineering Blog