Claude Opus 4.5 and Why Evaluating New LLMs Is Becoming Harder
Anthropic Releases Claude Opus 4.5
The Claude Opus 4.5 announcement, which calls it “the best model in the world for coding, agents, and computer use,” marks Anthropic’s renewed push to lead the coding AI space. It comes in response to fresh competition from:
- OpenAI’s GPT‑5.1‑Codex‑Max
- Google’s Gemini 3
Both rivals launched in the past week.
---
📊 Core Specifications
- Context window: 200,000 tokens (same as Claude Sonnet 4.5); a token-budget sketch follows this list
- Output limit: 64,000 tokens (also matching Sonnet 4.5)
- Knowledge cutoff: March 2025 (Sonnet 4.5: Jan 2025, Haiku 4.5: Feb 2025)
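Both limits matter when planning long-context runs, so here is a minimal sketch of checking a prompt against them. The `count_tokens` method is a real endpoint in the `anthropic` Python SDK; the model alias and placeholder prompt are assumptions, so substitute the actual Opus 4.5 model id from the official docs.

```python
# Minimal sketch: check a prompt against the published limits using
# Anthropic's token-counting endpoint. The model alias below is an
# assumption; substitute the current Opus 4.5 model id from the docs.
import anthropic

CONTEXT_WINDOW = 200_000  # tokens, per the specs above
OUTPUT_LIMIT = 64_000     # max output tokens, per the specs above

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
count = client.messages.count_tokens(
    model="claude-opus-4-5",  # assumed alias
    messages=[{"role": "user", "content": "your long prompt here"}],
)

# Output tokens also consume context, so the usable output budget is
# whatever window remains, capped by the hard output limit.
remaining = CONTEXT_WINDOW - count.input_tokens
print(f"{count.input_tokens} input tokens; "
      f"up to {min(remaining, OUTPUT_LIMIT)} output tokens available")
```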
---
💰 Pricing Update
Claude Opus 4.5 introduces major price reductions (a quick cost comparison sketch follows the lists below):
- $5 / million tokens (input)
- $25 / million tokens (output)
Previous Opus pricing: $15 (input) / $75 (output)
Competitor Comparison:
- GPT‑5.1 family – $1.25 / $10
- Gemini 3 Pro – $2 / $12 (or $4 / $18 for >200k tokens)
Reference Models:
- Sonnet 4.5 – $3 / $15
- Haiku 4.5 – $1 / $5
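To make these numbers concrete, here is a small sketch comparing what a single request would cost at each model’s list price. The prices come straight from the lists above; the 50,000/5,000 token counts are arbitrary example values, not measurements.

```python
# Per-request cost at the list prices quoted above.
# Token counts are arbitrary example values, not measurements.

PRICES = {  # model: (input $/Mtok, output $/Mtok)
    "Claude Opus 4.5": (5.00, 25.00),
    "Claude Opus (previous)": (15.00, 75.00),
    "GPT-5.1 family": (1.25, 10.00),
    "Gemini 3 Pro (<=200k)": (2.00, 12.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 50,000-token prompt that produces a 5,000-token answer.
for model in PRICES:
    print(f"{model:24} ${request_cost(model, 50_000, 5_000):.3f}")
```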
---
🚀 Key Improvements over Opus 4.1
From official documentation:
- Effort Parameter – defaults to high; adjustable to medium or low for faster, lighter responses (a minimal API sketch follows this list).
- Enhanced Computer Use Tool – adds zoom functionality for magnifying on-screen regions during interactive sessions.
- Preserved Thinking Blocks – keeps prior reasoning steps in context by default.
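The documentation describes the effort parameter but this post doesn’t reproduce an API example, so what follows is a sketch only, assuming the setting is accepted as a request field named `effort` with values like `"medium"`. The `anthropic` SDK and its `extra_body` escape hatch are real; the field name, accepted values, and model alias are assumptions to verify against the official docs.

```python
# A sketch only, NOT the documented API: it assumes the effort setting
# is accepted as a top-level request field named "effort" with values
# "high" | "medium" | "low". Verify the real shape in Anthropic's docs.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",  # assumed alias
    max_tokens=1024,
    messages=[{"role": "user", "content": "Refactor this function..."}],
    # extra_body is the SDK's escape hatch for fields it doesn't model
    # explicitly; "medium" trades some quality for speed and cost.
    extra_body={"effort": "medium"},
)
print(response.content[0].text)
```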
---
🛠 Real-World Testing
I previewed Opus 4.5 over the weekend in Claude Code, leading to:
- sqlite-utils 4.0a1 alpha release
- Large-scale refactors: 20 commits, 39 files changed, 2,022 additions, 1,173 deletions
Interestingly, when my preview expired, Claude Sonnet 4.5 kept me working at nearly the same speed, which suggests my test tasks didn’t expose the full potential of Opus 4.5.
---
🌐 Complementary Tools for Creators
For developers and creators looking to get more out of their AI output, AiToEarn offers:
- Open-source global AI content monetization
- Simultaneous publishing to Douyin, Kwai, WeChat, Bilibili, Rednote (Xiaohongshu), Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter)
- Integrated analytics and AI model rankings
Such ecosystems streamline the creation-to-revenue pipeline and help connect cutting-edge LLM capabilities directly to cross‑platform publishing.
---
🔍 Why Breakthroughs Are Getting Harder to Spot
My favorite AI moments are clear leaps: new models enabling previously impossible tasks.
- Example: the Nano Banana Pro image-generation model delivered usable infographics, a sharp contrast with earlier models’ failures.
- Frontier LLM differences are now often subtle; single-digit benchmark improvements don’t translate cleanly into real-world utility.
Contributing Factors
- A personal shortage of “challenge tasks” that current top models still fail at.
- Once those tasks are solved, fewer clear opportunities remain to spot a breakthrough.
Tip: Maintain a log of failed tasks to retry with future models, advice from Ethan Mollick that I need to follow more diligently.
Suggestion to AI labs: Provide concrete before vs. after examples in release notes, demonstrating capabilities that were impossible for the prior model.
---
🎨 Visual Example: Pelican on a Bicycle
Here’s Opus 4.5 at the default high effort:
(Image: SVG of a pelican riding a bicycle)
With a more detailed prompt:
(Image: a more elaborate pelican-on-a-bicycle SVG)
---
📈 Benchmarks vs. Creative Reality
As AI image generation improves, benchmark scores often fail to capture creative performance gains.
Tools like AiToEarn help creators:
- Distribute AI-generated content across multiple platforms instantly
- Compare outputs and model performance via integrated analytics and model rankings
- Monetize creative work efficiently
---
Bottom Line:
Concrete real-world examples, not just metric gains, make advances in AI models tangible. Whether it’s a large coding refactor or a whimsical pelican test, pairing high-performance models with publishing and monetization platforms can maximize both workflow efficiency and audience reach.