Build an Efficient Framework for AI Agents to Handle "Marathon" Tasks

Effective Harnesses for Long-Running Agents

As AI agents gain capabilities, they are tackling longer and more complex tasks, sometimes taking hours or days to complete. A persistent challenge is keeping an agent coherent across many context windows, since each new session starts with no memory of the last.

---

The Core Problem: Context Window Limits

Long-running agent work happens in multiple, disconnected sessions. Every new session essentially forgets what occurred earlier.

Imagine a rotating team of engineers where each remembers nothing from the previous shift — productivity would plummet.

Because AI models have a limited context window (the maximum information they can process at once), complex projects extending beyond this limit need a method to bridge sessions.

---

Our Dual-Agent Solution with Claude Agent SDK

We developed a two-agent harness to work efficiently across multiple context windows:

  • Initializer Agent – sets up the environment during the first run.
  • Coding Agent – makes incremental coding progress during each subsequent session, preparing a clear handoff document for the next session.

📂 Code samples: Quickstart Guide
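
A minimal sketch of how this two-agent loop can be driven with the Python `claude-agent-sdk`; the prompts, project path, and session count are illustrative assumptions, not the published quickstart code:

```python
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query

# Illustrative prompts; a real harness would use far more detailed ones.
INITIALIZER_PROMPT = (
    "Set up the project: write init.sh, feature_list.json, and "
    "claude-progress.txt, then make an initial git commit."
)
CODING_PROMPT = (
    "Read claude-progress.txt and the git log, pick the highest-priority "
    "failing feature, implement and verify it, then update the progress file."
)

async def run_session(prompt: str) -> None:
    options = ClaudeAgentOptions(
        cwd="./project",                # every session works in the same repo
        permission_mode="acceptEdits",  # allow unattended file edits
    )
    async for message in query(prompt=prompt, options=options):
        print(message)  # stream messages for observability

async def main() -> None:
    await run_session(INITIALIZER_PROMPT)  # first context window only
    for _ in range(10):                    # each iteration gets a fresh context
        await run_session(CODING_PROMPT)

asyncio.run(main())
```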

---

The “Marathon Agent” Dilemma

The Claude Agent SDK supports programming, planning, and information-gathering tasks.

It includes context compaction, which condenses conversation history and, in theory, enables indefinitely long work.

Reality check: Compaction alone isn’t enough.

Failure patterns we see:

  • Too much at once – Agents try a complete one-shot build, overflowing context and leaving fragmented output for the next session.
  • Overconfidence – Agents declare “done” after partial builds without full validation.

---

The Needed Breakthrough

We realized success required:

  • An initial environment set up to support every requested feature
  • A step-by-step, module-by-module development cadence

---

Why This Matters Beyond Code

Tools that synchronize work across sessions have value in research, multimedia, and creator workflows.

Platforms like AiToEarn offer AI-assisted generation, multi-platform publishing, analytics, and model ranking — coordinating multi-session work much like our harness does for agents. Learn more: AiToEarn Blog | AI Model Ranking.

---

Guiding Agents with Incremental Progress

A Clean State at the end of each session means:

  • Merge-ready code
  • No major bugs
  • Clear structure
  • Documented for handoff
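
One way to make these criteria enforceable is a small end-of-session gate; a sketch, where `run_tests.sh` is a hypothetical test command standing in for whatever the project actually uses:

```python
import subprocess

def assert_clean_state(repo: str = ".") -> None:
    """Fail loudly if a session is about to end in a messy state."""
    # Merge-ready: no uncommitted changes left in the working tree.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    assert status.stdout.strip() == "", "uncommitted changes left behind"
    # No major bugs: the test suite (illustrative command) should pass.
    tests = subprocess.run(["./run_tests.sh"], cwd=repo)
    assert tests.returncode == 0, "tests failing at end of session"
```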

---

Two-Step Process from Experiments

Phase 1 — Initializer Agent

  • Create (sketched below):
      • `init.sh` startup script
      • `claude-progress.txt` operation log
      • An initial Git commit with a file list
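
For concreteness, a sketch of the artifacts this phase produces, written as Python the initializer's environment could execute; the dev-server commands inside `init.sh` are assumptions:

```python
import pathlib
import stat
import subprocess

# Contents of init.sh are illustrative; a real project would use its own
# install and dev-server commands.
INIT_SH = """#!/bin/bash
# Restart the dev server from a cold start.
npm install
npm run dev &
"""

script = pathlib.Path("init.sh")
script.write_text(INIT_SH)
script.chmod(script.stat().st_mode | stat.S_IEXEC)  # make it executable

pathlib.Path("claude-progress.txt").write_text("Session 0: environment initialized.\n")

subprocess.run(["git", "init"], check=True)
subprocess.run(["git", "add", "-A"], check=True)
subprocess.run(["git", "commit", "-m", "Initial commit: environment setup"], check=True)
```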

Phase 2 — Coding Agent

  • Make small feature updates per session
  • Log updates in a structured progress file

Insight: Combine `claude-progress.txt` + Git history so future agents can start fresh without losing work context.

---

Environment Management

Per the Claude 4 prompting guide:

> Use a different prompt for the first context window.

Initializer tasks include:

1. Feature List

  • Detailed breakdown based on the initial user request
  • Example: cloning claude.ai yields 200+ features
  • All features start as “failing”
  • Strict rule: only update the “passes” field; never delete items
  • Stored as JSON, not Markdown, to avoid accidental corruption
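
A sketch of what such a feature list and its one permitted update could look like; the field names (`id`, `description`, `passes`) are assumptions, not the exact schema:

```python
import json

# Illustrative feature entries; a real list would have hundreds.
features = [
    {"id": 1, "description": "User can send a chat message and see a reply", "passes": False},
    {"id": 2, "description": "User can switch between light and dark themes", "passes": False},
]

with open("feature_list.json", "w") as f:
    json.dump(features, f, indent=2)

def mark_passed(path: str, feature_id: int) -> None:
    """Flip only the `passes` field; never add, delete, or reword items."""
    with open(path) as f:
        items = json.load(f)
    for item in items:
        if item["id"] == feature_id:
            item["passes"] = True  # the single permitted mutation
    with open(path, "w") as f:
        json.dump(items, f, indent=2)
```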

---

2. Incremental Progress

  • One feature at a time
  • Maintain a clean environment
  • Good habits (sketched below):
      • Descriptive Git commit messages
      • Work summaries in the progress file
      • Quick reverts via Git when needed
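
These habits are mechanical enough to script; a sketch of a per-feature checkpoint, assuming the Git repo and progress file from Phase 1:

```python
import subprocess
from datetime import datetime, timezone

def record_progress(summary: str) -> None:
    """Commit the working tree and append a handoff note for the next session."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open("claude-progress.txt", "a") as f:
        f.write(f"{stamp} {summary}\n")  # work summary for the next agent
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", summary], check=True)  # descriptive message

record_progress("Implement theme switching; verified toggle end to end")
```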

---

3. Testing

Common failure: Marking features “complete” without end-to-end testing.

Better results come from adding:

  • Browser automation (e.g., Puppeteer MCP)
  • Human-like test workflows that click through the UI as a user would
[Screenshot by Claude while testing a cloned claude.ai app]

Known limitation: Puppeteer MCP cannot detect native browser alerts, so alert-related functionality often remains buggy.
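
A sketch of how such browser automation might be wired in as an MCP server; the server package and tool names are assumptions based on public MCP conventions, not confirmed details of the original harness:

```python
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query

# Launch a Puppeteer MCP server so the agent can drive a real browser.
# Package and tool names follow the public MCP servers repo and the
# mcp__<server>__<tool> naming convention; treat them as assumptions.
options = ClaudeAgentOptions(
    mcp_servers={
        "puppeteer": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-puppeteer"],
        }
    },
    allowed_tools=[
        "mcp__puppeteer__puppeteer_navigate",
        "mcp__puppeteer__puppeteer_screenshot",
        "mcp__puppeteer__puppeteer_click",
    ],
)

async def verify_in_browser() -> None:
    prompt = "Open http://localhost:3000 and click through the main chat flow like a human user."
    async for message in query(prompt=prompt, options=options):
        print(message)

asyncio.run(verify_in_browser())
```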

---

Session Orientation Steps

At the start of each session:

  • Run `pwd` – confirm current working directory
  • Read Git logs & `claude-progress.txt` – understand recent work
  • Read `feature_list.json` – pick top-priority unfinished feature
  • Run `init.sh` – start development server
  • Test core functionality via Puppeteer MCP
  • Begin new feature development

Benefit: Saves tokens — avoids re-learning app setup each time.
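
These steps can be encoded directly in the coding agent's prompt; an illustrative (not production) version:

```python
# Illustrative coding-agent prompt encoding the orientation steps; the
# wording is an assumption, not the harness's actual prompt.
ORIENTATION_PROMPT = """Before writing any code:
1. Run `pwd` to confirm the working directory.
2. Read the git log and claude-progress.txt to understand recent work.
3. Read feature_list.json and pick the highest-priority failing feature.
4. Run ./init.sh to start the development server.
5. Smoke-test the core flows through the Puppeteer tools.
Then implement exactly one feature, verify it end to end, set its
`passes` field to true, commit, and append a summary to claude-progress.txt."""
```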

---

Example Session Flow

[Assistant] I’ll start by getting familiar with the environment and the current project status.
[Tool Use] 
[Tool Use] 
[Tool Use] 
[Assistant] Let me check the git log to see recent changes.
[Tool Use] 
[Assistant] Now I’ll check if there is an init.sh script to restart the server.

[Assistant] Great! Now I’ll navigate to the application to verify some core features are working.

[Assistant] ... core functionalities are running well. Main chat, theme switching, conversation loading, error handling operational.

---

Failure Modes & Solutions

| Failure Mode | Initializer Agent | Coding Agent |
|--------------|-------------------|--------------|
| Declares project “complete” too early | Build the feature list file | Only set “passes” after self-verification |
| Buggy environment / unclear progress | Create repo + progress doc | Read progress/Git logs and run basic tests before coding |
| Marks features complete prematurely | Build the feature list | Self-verify end to end before marking “passes” |
| Doesn’t know how to run the app | Write `init.sh` | Read and run `init.sh` at session start |

---

Summary of Solutions

  • Feature List File (JSON) – Structured, end-to-end verifiable feature descriptions
  • Progress Documentation – Git commits + logs for quick review
  • Strict Self-Verification – Only mark features complete after thorough testing
  • Startup Script (`init.sh`) – Reliable dev-server launch

---

Future Outlook

Open questions remain:

  • Is a single general-purpose agent optimal?
  • Or should we develop specialist agents for testing, QA, code cleaning?

This framework currently suits full-stack web dev; future work may extend to research or financial modeling.

---

Acknowledgements

Thanks to the Anthropic teams — particularly Code RL and Claude Code — for enabling safe, autonomous long-cycle programming with Claude.

Interested? Apply: anthropic.com/careers

---

Note: Independent creators also face similar multi-session AI challenges.

Platforms like the AiToEarn official site combine open-source generation, publishing, analytics, and ranking — enabling monetization of long-term, multi-platform workflows.

---

Source: Anthropic Engineering Blog
