Trying Audio Transcription and the New Pelican Benchmark with Gemini 3 Pro


Gemini 3 Pro Release — Detailed Analysis & Benchmarks

Date: 18 November 2025

Google today released Gemini 3 Pro — a significant upgrade poised to compete directly with leading AI models.

---

Overview

Based on preview access through AI Studio, Gemini 3 Pro feels like Gemini 2.5 Pro brought up to the current state of the art.

Key specifications:

  • Knowledge cutoff: January 2025
  • Context length: Up to 1 million input tokens
  • Max output length: 64,000 tokens
  • Multimodal support: Text, images, audio, video

---

Benchmark Performance

According to Google's own results (see the model card), Gemini 3 Pro slightly outperforms Claude Sonnet 4.5 and GPT-5.1 on most standard benchmarks.

[Image: Google's benchmark comparison table for Gemini 3 Pro]

---

Pricing Comparison

Gemini 3 Pro is priced higher than Gemini 2.5 Pro but remains cheaper than Claude Sonnet 4.5.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.1 | $1.25 | $10.00 |
| Gemini 2.5 Pro | ≤ 200k: $1.25; > 200k: $2.50 | ≤ 200k: $10.00; > 200k: $15.00 |
| Gemini 3 Pro | ≤ 200k: $2.00; > 200k: $4.00 | ≤ 200k: $12.00; > 200k: $18.00 |
| Claude Sonnet 4.5 | ≤ 200k: $3.00; > 200k: $6.00 | ≤ 200k: $15.00; > 200k: $22.50 |
| Claude Opus 4.1 | $15.00 | $75.00 |
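The tiered rows are easy to misread, so here is a minimal Python sketch of how a single request would be billed from this table. It assumes that the tier is chosen by prompt (input) size, so prompts over 200,000 tokens are billed at the higher input and output rates. The model keys in PRICES are hypothetical labels, not official API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredPrice:
    """Prices in USD per 1M tokens. The tier is chosen by prompt (input) size."""
    input_low: float
    output_low: float
    input_high: float
    output_high: float
    threshold: int = 200_000  # assumption: prompts above this size use the high tier

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of one request."""
        high = input_tokens > self.threshold
        in_rate = self.input_high if high else self.input_low
        out_rate = self.output_high if high else self.output_low
        return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Rates from the pricing table; the dictionary keys are hypothetical labels.
# Flat-priced models repeat the same rate in both tiers.
PRICES = {
    "gpt-5.1": TieredPrice(1.25, 10.00, 1.25, 10.00),
    "gemini-2.5-pro": TieredPrice(1.25, 10.00, 2.50, 15.00),
    "gemini-3-pro": TieredPrice(2.00, 12.00, 4.00, 18.00),
    "claude-sonnet-4.5": TieredPrice(3.00, 15.00, 6.00, 22.50),
    "claude-opus-4.1": TieredPrice(15.00, 75.00, 15.00, 75.00),
}

if __name__ == "__main__":
    # Compare a hypothetical 50k-token prompt with a 2k-token response.
    for name, price in PRICES.items():
        print(f"{name:18} ${price.cost(50_000, 2_000):.4f}")
```

Keeping the threshold on the prompt size, rather than on the total token count, matches how the tiered rows are labelled. Flat-priced models fit the same structure because both tiers share one rate.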

---

Workflow Highlight — Alt Text Generation from Image

Test goal: Evaluate Gemini 3 Pro’s multimodal image interpretation.

Execution:

llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'

This demonstrates how visual inputs can be transformed into structured, accessible text.


---

Comprehensive Benchmark Comparison

Google's own reporting covers reasoning, multimodal comprehension, agentic tasks, and long-context handling.

Highlights:

  • Top performer in complex multimodal reasoning tasks (MMMU‑Pro, ScreenSpot‑Pro, CharXiv Reasoning)
  • Significant edge in math competition-level problems (MathArena Apex)
  • Strong coding performance across LiveCodeBench Pro and agent tool usage
  • Best-in-class long context retrieval (1M token tests)

(See Google's benchmark table above for the full set of metrics.)

---

Real-World Test — City Council Meeting Transcript

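Input: a recording of a multi-hour city council meeting, compressed to an AAC audio file (/tmp/HMB_compressed.m4a).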

Processing command:

llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting...'
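Why does a meeting recording consume so many input tokens? Google's audio-understanding documentation says Gemini represents each second of audio as 32 tokens. The hedged Python sketch below assumes that the same rate applies to Gemini 3 Pro. It estimates the token count for a recording and works out how much audio fits under the 200k-token pricing threshold. The three-hour duration in the example is hypothetical.

```python
# Assumption: audio is tokenized at 32 tokens per second, the rate Google
# documents for Gemini audio understanding (one minute = 1,920 tokens).
AUDIO_TOKENS_PER_SECOND = 32
TIER_THRESHOLD = 200_000  # prompt size above which Gemini 3 Pro's higher rates apply


def estimate_audio_tokens(hours: float = 0, minutes: float = 0, seconds: float = 0) -> int:
    """Estimate tokens for an audio attachment, ignoring the text prompt."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return round(total_seconds * AUDIO_TOKENS_PER_SECOND)


def low_tier_audio_limit_seconds(text_prompt_tokens: int = 0) -> float:
    """Longest recording that keeps the whole prompt at or under the threshold."""
    return (TIER_THRESHOLD - text_prompt_tokens) / AUDIO_TOKENS_PER_SECOND


if __name__ == "__main__":
    tokens = estimate_audio_tokens(hours=3)  # hypothetical three-hour meeting
    print(tokens, "tokens; higher tier:", tokens > TIER_THRESHOLD)
    print(f"low-tier limit: {low_tier_audio_limit_seconds() / 60:.0f} minutes")
```

By this estimate, about 104 minutes is the longest recording that stays in the lower tier. The 320,087 input tokens reported for this run would correspond to roughly 2.8 hours of audio, provided nearly all of them are audio tokens.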

Result: Gemini 3 Pro generated a detailed Markdown outline and transcript, complete with participants, timestamps, and key points.

Limitations noted:

  • The timestamps in the transcript did not match the video's actual timecodes.
  • Some detailed content (e.g., Spanish instructions) was summarized rather than transcribed verbatim.

Token usage & cost: 320,087 input tokens + 7,870 output tokens = $1.42.
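The $1.42 figure is consistent with the pricing table. The 320,087-token prompt exceeds 200k, so Gemini 3 Pro's higher tier applies: $4.00 per 1M input tokens and $18.00 per 1M output tokens.

$$
\underbrace{320{,}087 \times \frac{\$4.00}{10^{6}}}_{\text{input}} + \underbrace{7{,}870 \times \frac{\$18.00}{10^{6}}}_{\text{output}} = \$1.2803 + \$0.1417 = \$1.4220 \approx \$1.42
$$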

---

Creative Prompt Benchmark — The Pelican Test

Gemini 3 Pro introduces a “thinking level” setting, so the pelican prompt was run at both the low and the high level:

Low-thinking level result:

  • SVG included whimsical detail (a jaunty hat)
  • Bicycle frame correctly formed
[Image: Gemini 3 Pro's pelican SVG at the low thinking level]

High-thinking level result:

  • More anatomically accurate pelican depiction
  • Bicycle frame rendered to spec
[Image: Gemini 3 Pro's pelican SVG at the high thinking level]
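For API users, the thinking level is set per request. The Python sketch below only builds the JSON body for a generateContent call. Three details are assumptions based on the Gemini API documentation at launch and should be checked against the current docs: the field path generationConfig.thinkingConfig.thinkingLevel, its "low" and "high" values, and the endpoint URL. The prompt string is illustrative.

```python
import json

# Assumption: Gemini 3 accepts generationConfig.thinkingConfig.thinkingLevel
# with the values "low" or "high" on the generateContent REST endpoint.
API_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-3-pro-preview:generateContent"
)
THINKING_LEVELS = ("low", "high")


def build_request(prompt: str, thinking_level: str) -> dict:
    """Build a generateContent request body with the chosen thinking level."""
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {THINKING_LEVELS}")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {"thinkingConfig": {"thinkingLevel": thinking_level}},
    }


if __name__ == "__main__":
    # Print one request body per level; nothing is sent over the network.
    for level in THINKING_LEVELS:
        body = build_request("Generate an SVG of a pelican riding a bicycle", level)
        print(json.dumps(body))
```

To actually send the request, POST this JSON to API_URL with an API key in the x-goog-api-key header. The sketch deliberately stops short of the network call.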

---

Updated Pelican Benchmark Prompt (v2):

> Generate an SVG of a California brown pelican riding a bicycle... with breeding plumage, spokes, correct frame, large pouch, clear feathers, pedaling posture.

Reference photo:

[Image: reference photo of a California brown pelican]

Gemini 3 Pro (high-thinking level):

[Image: Gemini 3 Pro's SVG for the v2 prompt]

GPT‑5.1 result:

[Image: GPT-5.1's SVG for the v2 prompt]

Claude Sonnet 4.5 result:

[Image: Claude Sonnet 4.5's SVG for the v2 prompt]

---

Conclusion

Gemini 3 Pro shows:

  • Leading performance in multimodal reasoning and long-context processing
  • Competitive coding abilities
  • Flexibility via thinking-level adjustment


---

Spot Check: Results appear consistent and plausible, though timestamp accuracy for transcripts needs improvement.

Cost tracking: costs for these examples ranged from $0.0568 for the alt-text task to $1.42 for the multi-hour audio transcription.

