AGI Now Has a Quantitative Benchmark! Led by Bengio, Progress at 58%

Defining AGI: Moving Beyond "Human‑Like Intelligence"

For decades, Artificial General Intelligence (AGI) has been loosely described as “human‑like intelligence”, usually in the sense of “as smart as a human”.

But how smart is that, exactly?

Recently, Turing Award laureate Yoshua Bengio, alongside the Center for AI Safety, the University of California, Berkeley, and other institutions, proposed a new, measurable definition in their work: “A Definition of AGI.”

> AGI is artificial intelligence that can match or exceed the cognitive versatility and proficiency of a well‑educated adult.

---

Core Components of the Definition

The paper’s definition focuses on two key dimensions:

  • A Clear Reference Point: Using a “well‑educated adult” as the benchmark avoids vague terms like “superhuman” and provides concrete, testable criteria.
  • Comprehensive Competence: AGI should meet benchmarks across multiple core cognitive domains. It is not enough to excel at one task; the system must show balanced ability in reasoning, memory, perception, and more.

---

From Concept to Measurement

To make AGI measurable, the team built a quantitative evaluation framework based on the Cattell–Horn–Carroll (CHC) theory — a well‑established psychological model of human intelligence.

The CHC Model's Cognitive Domains

The CHC model breaks down general intelligence into 10 interrelated cognitive domains, ranging from perception to advanced reasoning. Researchers adapted classic human test questions to suit AI, removing items dependent on human physiology (like touch) or context-specific scenarios (like driving).

Final AGI Evaluation Domains:

  • Knowledge (K) – Common sense, sciences, history, culture
  • Reading/Writing (RW) – Comprehension, expression, composition
  • Mathematics (M) – Calculation, quantitative reasoning
  • Reasoning (R) – Logic, abstract problem‑solving
  • Working Memory (WM) – Short‑term retention and processing
  • Long‑Term Memory Storage (MS) – Stable knowledge preservation
  • Long‑Term Memory Retrieval (MR) – Efficient recall from storage
  • Visual (V) – Image recognition, spatial understanding
  • Auditory (A) – Speech and sound comprehension
  • Speed (S) – Rapid execution of simple cognitive tasks

---

Scoring System

Each domain is scored out of 10 points for a total of 100.

A score of 100 represents AGI‑level ability.
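
The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the domain codes and the 0 to 10 range come from the list above, while the function name `total_score` and the example profile are hypothetical and the example values are placeholders, not measured results for any model.

```python
# Domain codes as listed in this article; each domain is worth 0-10 points.
DOMAINS = ("K", "RW", "M", "R", "WM", "MS", "MR", "V", "A", "S")
MAX_PER_DOMAIN = 10


def total_score(scores: dict) -> float:
    """Sum the ten per-domain scores into a 0-100 total (hypothetical helper)."""
    missing = set(DOMAINS) - set(scores)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    for code in DOMAINS:
        if not 0 <= scores[code] <= MAX_PER_DOMAIN:
            raise ValueError(f"{code} must be between 0 and {MAX_PER_DOMAIN}")
    return sum(scores[code] for code in DOMAINS)


# Placeholder profile for illustration only (NOT real benchmark data):
# every domain at 5 points except long-term memory storage at 0.
example = {code: 5 for code in DOMAINS}
example["MS"] = 0

print(total_score(example))  # 45
```

Because every domain carries the same 10-point weight, a model cannot compensate for a weak domain by over-performing in another: each domain is capped at 10.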

Benchmark Results

| Model        | Score (out of 100) | Relative to AGI |
|--------------|--------------------|-----------------|
| GPT‑4 (2023) | 27                 | Far below AGI   |
| GPT‑5 (2025) | 58                 | Incomplete AGI  |

In just two years, the score rose from 27 to 58, an increase of about 115%, but GPT‑5 still scores 0 in long‑term memory storage, which is a critical weakness.
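
The 115% figure is a relative increase computed from the two scores in the table. As a quick check, here is a small sketch; the helper name `relative_increase` is an assumption made for illustration.

```python
def relative_increase(old: float, new: float) -> float:
    """Percentage change from old to new (hypothetical helper)."""
    if old == 0:
        raise ValueError("relative increase is undefined for a zero baseline")
    return (new - old) / old * 100


# Scores reported in the article: GPT-4 (2023) = 27, GPT-5 (2025) = 58.
pct = relative_increase(27, 58)
print(f"{pct:.1f}%")  # 114.8%
```

Note that this measures growth relative to the GPT‑4 baseline; in absolute terms the gain is 31 points on the 100-point scale.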

---

Uneven Capabilities

Experiments reveal imbalances across domains:

  • Strongest Areas: Knowledge, Reading/Writing, Mathematics
  • Weakest Areas: Perception (Visual/Auditory), Long‑Term Memory, and certain forms of reasoning

---

Domain‑Specific Performance

Strengths: Textual & Symbolic Mastery

  • Knowledge (K): GPT‑5 scores above 8
  • Reading/Writing (RW): Above 8
  • Mathematics (M): Above 8

These strengths reflect pattern matching over massive text datasets, the kind of task that large language model training is well suited to.

---

Weaknesses: Perception & Memory

Visual (V) & Auditory (A) Performance

  • GPT‑4: No image or sound processing
  • GPT‑5: Basic cat/dog classification, simple speech‑to‑text
  • Both remain far from human‑level interpretation of complex scenes or recognition of emotions

Long‑Term Memory Deficits

  • Cannot reliably store and recall information over extended periods
  • Context window expansion only extends short‑term working memory, not true long‑term capability
  • Online search augments knowledge but introduces risks of hallucination and outdated data

---

Why External Tools Don't Count

This AGI evaluation excludes:

  • Browser / internet search assistance
  • External APIs
  • Memory extensions beyond native capacity

The focus is purely on intrinsic cognitive ability.

No model with a zero score in a core domain can be considered AGI, regardless of its overall performance.
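
Read as a rule, this means a high total is not enough on its own. The sketch below is one possible way to express that rule, under stated assumptions: the function names are hypothetical, the `target=100` default follows the article's statement that 100 represents AGI-level ability, and the example profile is a placeholder rather than real benchmark data.

```python
def zero_domains(scores: dict) -> list:
    """Return the codes of all domains scored at exactly 0."""
    return [code for code, value in scores.items() if value == 0]


def meets_agi_bar(scores: dict, target: float = 100) -> bool:
    """True only if no domain is at zero AND the total reaches the target.

    Hypothetical rule for illustration; the target of 100 is taken from
    the article's scoring description.
    """
    return not zero_domains(scores) and sum(scores.values()) >= target


# Placeholder profile (NOT real data): strong everywhere except storage.
profile = {"K": 10, "RW": 10, "M": 10, "R": 10, "WM": 10,
           "MS": 0, "MR": 10, "V": 10, "A": 10, "S": 10}

print(zero_domains(profile))   # ['MS']
print(meets_agi_bar(profile))  # False
```

With ten domains capped at 10 points each, a total of 100 already implies no zero scores; the explicit zero check matters if a lower target is ever used.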

---

Outlook

With a measurable definition now available, the question shifts from “What is AGI?” to “When will we reach it?”

Paper link: https://www.agidefinition.ai/paper.pdf

Reference: https://x.com/DanHendrycks/status/1978828377269117007
