# Large Models Chase the Stars and Seas: GPT and Gemini Win Gold at the International Astronomy Olympiad

*Using AI to Observe the Stars*  
*2025-10-13 12:18 Beijing*  

---

## Introduction

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-130.jpg)  

Artificial intelligence is advancing at a breathtaking pace. As one netizen humorously noted this morning:  
> “We have gone 0 days without an attention-grabbing breakthrough in AI.”

![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-123.jpg)  

Three months ago, [OpenAI announced its reasoning model had won a gold medal at the International Mathematical Olympiad (IMO)](https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650980674&idx=1&sn=d36f17d0bfc60753d74cf6c09494ce4a&scene=21#wechat_redirect).

Now, it appears that **large language models (LLMs)** not only excel in mathematical reasoning but also in many other domains of scientific research.

---

## Breakthrough in Astronomy & Astrophysics

Recent developments show that top-tier large models are achieving **remarkable results in international science competitions**.  

A newly published paper benchmarked LLMs using the **International Olympiad on Astronomy & Astrophysics (IOAA)** and found that **GPT-5** and **Gemini 2.5 Pro** can achieve **gold medal–level performance** in astronomy and astrophysics.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-112.jpg)  

Greg Brockman, president and co-founder of OpenAI, was so excited he even mistyped GPT’s name when sharing this work:

![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-110.jpg)  

This milestone hints that, one day, when humanity reaches for the stars, **AI large models will be part of the journey**.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-102.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-93.jpg)  

---

**Paper Title:** *Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)*  
**Paper Link:** [https://arxiv.org/abs/2510.05016](https://arxiv.org/abs/2510.05016)  

---

## Why IOAA Was Chosen

The IOAA was selected as a benchmark for three main reasons:

1. **High ecological validity:**  
   IOAA problems require complex reasoning, creative problem-solving, and multi-step derivations — skills essential for real astronomical research.

2. **Comprehensive topic coverage:**  
   The syllabus spans cosmology, spherical trigonometry, stellar astrophysics, celestial mechanics, photometry, and observational instrumentation.

3. **Integration of theory, data, and physics:**  
   IOAA combines theoretical physics, observational constraints, and real astronomical data with mathematical derivations — a richer evaluation format than IMO, IPhO, or IOI.

Traditional astronomy benchmarks like **AstroBench** and **Astro-QA** rely mainly on multiple-choice and short-answer formats, which cannot fully assess complex reasoning skills. This study addresses that gap.

---

## Evaluation Focus

The evaluation focused on two IOAA components:  
- **49 Theoretical Questions**  
  - Category I: Geometry / Spatial (celestial-sphere geometry, spherical trigonometry)  
  - Category II: Physics / Mathematics (astrophysical calculations without geometric visualization)  
- **8 Data Analysis Questions**

> **Note:** The observational round was excluded because purely digital LLMs cannot take part in hands-on telescope observation.

---

## Results: Gold Medal Performance

![image](https://blog.aitoearn.ai/content/images/2025/10/img_007-91.jpg)  

**Key findings:**
- **GPT-5** and **Gemini 2.5 Pro** achieved the highest scores, surpassing other models by 7–25 percentage points.
- GPT-5 topped 2022 (93.0%), 2023 (89.6%), and 2025 (86.8%); Gemini 2.5 Pro led in 2024 (83.0%).
- GPT-5 held up better on harder problems; the remaining score fluctuations trace to the small number of questions per difficulty level, gaps in spatial reasoning, and occasional conceptual errors in astrophysics.

---

## Theoretical Exam Insights

Main observations:
1. **Small question counts can skew percentages:** a single missed question can swing a category's score substantially.
2. **Spatial/geometric reasoning** remains a challenge.
3. **Occasional astrophysics concept errors** in easy questions.

Other strong performers:
- **OpenAI o3**: 77.5% overall  
- **Claude Opus 4.1**: 64.7%  
- **Claude Sonnet 4**: 60.6%  

Performance declines as difficulty rises, exposing gaps in complex problem-solving that these models' strong multiple-choice scores on other benchmarks had masked.

---

## Data Analysis Exam Insights

GPT-5 scored **88.5%**, higher than its own theoretical score, while every other model's data-analysis score fell **10–15%** below its theoretical score.

Strengths:
- Superior multimodal understanding
- Accurate image parsing and plotting reasoning  

Implication: Future benchmarks should include **ecologically valid multimodal data analysis tasks** to mimic real research workflows.

---

## Comparison with Human Scores

Medal criteria (a contestant's score as a percentage of the median score):  
- **Bronze:** 100–130%  
- **Silver:** 130–160%  
- **Gold:** >160%

Most LLMs reached gold level; only **Claude Sonnet 4** scored silver once (2023).  
GPT-5 (2022, 2023, 2025) and Gemini 2.5 Pro (2022, 2023) matched or exceeded top human contestant scores.
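
To make these thresholds concrete, here is a minimal sketch of the medal rule as quoted above; the exam median of 50% in the example is hypothetical, and the real IOAA rounding and tie-breaking details are omitted:

```python
def ioaa_medal(score: float, median: float) -> str:
    """Assign a medal from a contestant's score and the exam median.

    Thresholds follow the ratios quoted above: bronze at 100-130% of
    the median, silver at 130-160%, gold above 160%.
    """
    ratio = score / median
    if ratio > 1.6:
        return "gold"
    if ratio > 1.3:
        return "silver"
    if ratio >= 1.0:
        return "bronze"
    return "no medal"

# Example: a model scoring 86.8% on an exam with a hypothetical human
# median of 50% sits at 173.6% of the median, above the gold line.
print(ioaa_medal(86.8, 50.0))  # -> "gold"
```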

![image](https://blog.aitoearn.ai/content/images/2025/10/img_008-83.jpg)  

---

## Performance Snapshots

### Theoretical Exam (2022–2025)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_009-76.jpg)  

### Data Analysis Exam (2022–2025)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_010-75.jpg)  

---

## Performance by Question Category

- **Category I:** Geometry / Spatial  
- **Category II:** Physics / Mathematics  

LLMs scored **15–26% higher** in Category II than Category I.
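
For a flavor of what Category I demands, the sketch below works one representative spherical-trigonometry task. It is my own illustration, not a problem from the paper: computing the angular separation between two sky positions from their right ascension and declination via the spherical law of cosines.

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Angular separation (degrees) between two sky positions.

    Inputs are right ascension and declination in degrees. Uses the
    spherical law of cosines, a basic identity behind many
    celestial-geometry problems:
        cos(d) = sin(dec1)*sin(dec2) + cos(dec1)*cos(dec2)*cos(ra1 - ra2)
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_d = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))

# Example: separation between Sirius (RA 101.287, Dec -16.716) and
# Betelgeuse (RA 88.793, Dec +7.407), J2000 coordinates in degrees.
print(f"{angular_separation(101.287, -16.716, 88.793, 7.407):.1f} deg")
# -> roughly 27.1 deg
```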

---

## Error Analysis

### Common Error Types
- **Conceptual mistakes:** wrong approaches, formula misuse, reasoning flaws.
- **Geometry/spatial reasoning gaps:** spherical trigonometry, timekeeping systems, 3D visualization (see the timekeeping sketch after this list).
- **Data exam errors:** graphing, chart/image reading — notably in OpenAI o3 and Claude series.
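
To make the timekeeping failure mode concrete, here is a minimal sketch (my own illustration, not taken from the paper) of the kind of conversion such problems require: turning a UTC instant into Greenwich Mean Sidereal Time with the standard low-precision approximation, which captures the solar-vs-sidereal clock offset these questions hinge on.

```python
from datetime import datetime, timezone

def gmst_hours(dt: datetime) -> float:
    """Greenwich Mean Sidereal Time (hours) for a UTC datetime.

    Uses the common low-precision approximation
        GMST = 18.697374558 + 24.06570982441908 * D  (mod 24),
    where D is days elapsed since the J2000.0 epoch
    (2000-01-01 12:00 UTC). UT1-UTC corrections are ignored here.
    """
    j2000 = datetime(2000, 1, 1, 12, tzinfo=timezone.utc)
    days = (dt - j2000).total_seconds() / 86400.0
    return (18.697374558 + 24.06570982441908 * days) % 24.0

# Example: sidereal time at Greenwich for 2025-10-13 12:18 UTC.
t = datetime(2025, 10, 13, 12, 18, tzinfo=timezone.utc)
print(f"GMST = {gmst_hours(t):.3f} h")
```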

![image](https://blog.aitoearn.ai/content/images/2025/10/img_011-68.jpg)  

### Score Loss Distribution
- (a) IOAA Theoretical Exam 2022–2025  
- (b) IOAA Data Analysis Exam 2022–2025

![image](https://blog.aitoearn.ai/content/images/2025/10/img_012-57.jpg)  

---

## Tools for Sharing AI Research Results

Platforms like the [AiToEarn official site](https://aitoearn.ai/) help researchers and creators:
- Generate AI-driven content  
- Publish simultaneously across major platforms (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)  
- Track engagement and [AI model rankings](https://rank.aitoearn.ai)  

For astrophysics educators and competition trainers, this can facilitate **global sharing** of problem analyses, tutorials, and datasets.

---
