You Might Not Be Fully Utilizing Your GPU Resources

Episode Notes

About Mithril

Mithril’s Omnicloud platform aggregates and orchestrates multi‑cloud GPUs, CPUs, and storage — giving you a single unified platform to access all your infrastructure.

Connect with Jared Quincy Davis

Community Shoutout

🎉 Razzi Abuissa earned the Populist badge on Stack Overflow for their high-scoring answer to: How to find last merge in git? — outperforming the accepted answer.

---

Transcript

> Ryan Donovan: Are you tired of database limitations and architectures that fail when scaling? Think beyond rows and columns. MongoDB is built by developers, for developers. It’s ACID‑compliant, enterprise‑ready, and fluent in AI. Start building faster at mongodb.com/build.

[Intro Music]

---

Introduction

Ryan Donovan: Welcome to the Stack Overflow Podcast — your place for all things software and technology. I’m your host, Ryan Donovan. Today’s topic: Is the GPU shortage really about availability, or is it an efficiency problem? To help us unpack this, we’re joined by Jared Quincy Davis, CEO and founder of Mithril.

---

Jared’s Journey into AI

Jared Quincy Davis:

My path into AI began, as it did for many researchers, with a moment of inspiration. In 2015, DeepMind’s AlphaGo completely captured my imagination.

Before that, I was broadly interested in robotics, quantum computing, nuclear fusion, bio-computation, and bioinformatics. But AlphaGo convinced me that AI had the potential to generalize across domains with similar mathematical structures. The same underlying recipe could be applied far beyond Go, from solving protein folding (AlphaFold) to tackling other complex problems.

Technological progress through AI can turn zero-sum challenges into positive-sum opportunities by creating new value instead of just redistributing what exists. That’s why building better tools matters so much — and it's what I've devoted my work to.

---

GPU Shortage vs. GPU Inefficiency

Ryan Donovan: Many teams are scaling up hardware for AI, but you say this isn’t an availability problem; it’s an efficiency problem. How so?

Jared Quincy Davis:

There’s actually plenty of GPU capacity, but:

  • Defensive buying: Organizations over-provision for peak demand.
  • Idle resources: Locked-down capacity often sits unused.
  • Lost elasticity: In the early cloud days, elasticity let you scale up for an hour or down to zero instantly without wasted spend. That flexibility is largely gone in AI infrastructure.

---

Why GPU Flexibility Lags Behind CPU

Ryan Donovan: Why doesn’t GPU infrastructure adapt as flexibly as CPU infrastructure?

Jared Quincy Davis:

  • GPU workloads often require full, uninterrupted hardware access for predictable performance.
  • Virtualization exists (e.g., NVIDIA vGPU), but overhead and complexity make it less attractive for high-demand AI training.
  • GPUs are usually provisioned as whole units, not slices.

---

Jared Quincy Davis:

Large language models frequently exceed a single server’s GPU memory, forcing distributed, parallel computing across multiple nodes. Scheduling becomes a Tetris-like problem, where contiguous hardware matters and poor allocation leads to stranded capacity.
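To make the stranded-capacity point concrete, here’s a minimal Python sketch. The node counts and job sizes are made up, not drawn from Mithril’s scheduler: once churn scatters free GPUs across nodes, a job that needs a contiguous block can’t be placed even though enough GPUs are free in aggregate.

```python
# Illustrative only: a tiny 2-node cluster showing how churn strands capacity.
free = {"node0": 4, "node1": 4}  # free GPUs per node

def alloc(node: str, n: int) -> None:
    assert free[node] >= n, "not enough free GPUs on this node"
    free[node] -= n

def release(node: str, n: int) -> None:
    free[node] += n

# Fill the cluster with four 2-GPU jobs.
alloc("node0", 2); alloc("node0", 2)
alloc("node1", 2); alloc("node1", 2)

# One job finishes on each node, leaving the free GPUs scattered.
release("node0", 2)
release("node1", 2)

total_free = sum(free.values())            # 4 GPUs free cluster-wide
fits = any(f >= 4 for f in free.values())  # but no single node has 4 free
print(f"free GPUs: {total_free}, 4-GPU single-node job fits: {fits}")
# -> free GPUs: 4, 4-GPU single-node job fits: False
```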

Many providers avoid this by selling long-term, single-tenant blocks of capacity — shifting complexity to customers. This recreates a pre-cloud model instead of delivering the original promise of abstraction and elasticity.

---

Multi-Cloud & Omnicloud Strategies

Ryan Donovan: Your approach sounds like serverless, but for GPU workloads — and scheduling is at the core.

Jared Quincy Davis:

Yes. Our Omnicloud concept assumes modern users are multi-cloud:

  • AWS/GCP for certain workloads
  • AI-native clouds (ours or competitors) for GPU-heavy tasks
  • On-premises or partner clouds where viable

By routing workloads dynamically — especially preemptive workloads on spot instances — we can use underutilized GPU resources efficiently across environments.
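As a rough illustration of that routing idea, here’s a hypothetical policy sketch. The pool names, prices, and the `Pool`/`Workload` fields are invented for the example; they are not Mithril’s actual API.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    preemptible: bool        # spot/preemptible capacity?
    usd_per_gpu_hour: float  # illustrative prices
    free_gpus: int

@dataclass
class Workload:
    name: str
    gpus: int
    interruptible: bool      # can this job tolerate preemption?

def route(work: Workload, pools: list[Pool]) -> Pool | None:
    """Send interruptible work to the cheapest pool that fits (often spot);
    keep latency-sensitive work off preemptible capacity."""
    candidates = [
        p for p in pools
        if p.free_gpus >= work.gpus
        and (work.interruptible or not p.preemptible)
    ]
    return min(candidates, key=lambda p: p.usd_per_gpu_hour, default=None)

pools = [
    Pool("aws-ondemand", preemptible=False, usd_per_gpu_hour=4.00, free_gpus=16),
    Pool("gcp-spot",     preemptible=True,  usd_per_gpu_hour=1.20, free_gpus=64),
    Pool("ai-cloud",     preemptible=False, usd_per_gpu_hour=2.50, free_gpus=32),
]
print(route(Workload("batch-eval", 8, interruptible=True),  pools).name)  # gcp-spot
print(route(Workload("live-chat",  2, interruptible=False), pools).name)  # ai-cloud
```

The `interruptible` flag is doing the real work here: it is what lets cheap preemptible capacity absorb the asynchronous, cost-sensitive traffic.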

---

Workload Classes & Scheduling Design

Two workload classes we optimize for:

  • Real-time / Low-latency
      • Web agents
      • AI co-pilots
      • Live chat sessions
  • Asynchronous / Cost-sensitive
      • Deep research tasks
      • Background coding agents (e.g., Codex)
      • Indexing pipelines

Key design principles:

  • Extreme preemptability
  • Auction-based congestion control
  • SKU-aware routing by location, compliance, interconnect quality, and storage performance

Outcome: Flexible SLAs, better economics — with up to 10x–20x savings for non-critical workloads.
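A minimal sketch of the auction-based preemption idea from the list above, assuming one GPU per job and invented bid values: under congestion, capacity goes to the highest bidders, and low-bid jobs are preempted to retry later.

```python
import heapq

CAPACITY = 2  # GPUs in the contended pool (tiny, for the demo)

def admit(running: list[tuple[float, str]], bid: float, job: str) -> list[str]:
    """running is a min-heap of (bid, job), one GPU per job.
    Admit the newcomer, preempting the lowest bidders if it outbids them;
    otherwise it waits (a real system would queue and retry it)."""
    if len(running) < CAPACITY:
        heapq.heappush(running, (bid, job))
        return []
    preempted = []
    while running and len(running) >= CAPACITY and running[0][0] < bid:
        preempted.append(heapq.heappop(running)[1])
    if len(running) < CAPACITY:
        heapq.heappush(running, (bid, job))
    return preempted

running: list[tuple[float, str]] = []
print(admit(running, 0.5, "indexing"))    # [] - pool has room
print(admit(running, 0.7, "batch-eval"))  # [] - pool is now full
print(admit(running, 2.0, "live-chat"))   # ['indexing'] is preempted
```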

---

Older vs. Newer GPUs

Ryan Donovan: In resource-limited regions, older GPUs are still in use. Is that worth considering more broadly?

Jared Quincy Davis:

Yes — older GPUs can:

  • Run distilled models efficiently
  • Serve live traffic for smaller workloads
  • Produce RL rollouts during training cycles

Lifecycle economics benefit from creative reuse. Heavy training goes to newest chips; smaller inference tasks fit on older hardware, extending CapEx value.
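One way to picture that lifecycle reuse is a simple tiering rule, sketched below with placeholder tier names and memory figures: training stays on the newest chips, while inference lands on the smallest (often older) tier whose memory fits the model.

```python
# Placeholder tiers; substitute your actual fleet and measured model footprints.
GPU_TIERS = [
    ("older-16GB", 16),  # previous-generation, already-amortized cards
    ("newer-80GB", 80),  # newest chips, reserved for the biggest jobs
]

def pick_tier(model_vram_gb: float, training: bool) -> str:
    """Heavy training goes to the newest chips; inference goes to the
    smallest (often older) tier whose memory fits the model."""
    if training:
        return GPU_TIERS[-1][0]
    for name, vram_gb in GPU_TIERS:
        if model_vram_gb <= vram_gb:
            return name
    return GPU_TIERS[-1][0]

print(pick_tier(7,  training=False))  # distilled model -> older-16GB
print(pick_tier(70, training=False))  # large model     -> newer-80GB
print(pick_tier(7,  training=True))   # training        -> newer-80GB
```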

---

Deciding When to Upgrade

Upgrade when:

  • Current hardware creates performance bottlenecks
  • New GPUs bring qualitative improvements (precision formats, efficiency gains, new features)
  • Power delivery is constrained, requiring flops-per-watt optimization
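For the power-constrained case, a back-of-envelope comparison shows why flops-per-watt becomes the deciding metric. The chip numbers below are placeholders, not real specs; substitute measured throughput and power draw.

```python
def fleet_tflops(chip_tflops: float, chip_watts: float, budget_watts: float) -> float:
    """Total throughput a fixed power budget can support:
    the budget caps how many chips you can run at once."""
    return (budget_watts // chip_watts) * chip_tflops

BUDGET = 100_000.0  # watts available at the facility (hypothetical)

old_fleet = fleet_tflops(chip_tflops=300.0, chip_watts=400.0, budget_watts=BUDGET)
new_fleet = fleet_tflops(chip_tflops=900.0, chip_watts=700.0, budget_watts=BUDGET)

print(f"old fleet: {old_fleet:,.0f} TFLOPS")  # 75,000
print(f"new fleet: {new_fleet:,.0f} TFLOPS")  # 127,800
# When power, not purchase budget, is the binding constraint, the chip with
# better flops-per-watt (900/700 > 300/400) delivers more total compute.
```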

---

Future: Specialty & Multi-Model Systems

Jared Quincy Davis:

The future is compound AI systems:

  • Specialty models
  • Mini-model ensembles
  • Large reasoning models paired with smaller, high-fidelity tool callers
  • Techniques like speculative decoding — small model drafts, large model verifies

This mirrors broader AI tooling ecosystems: specialized components working together for efficiency and innovation.
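To ground the speculative-decoding item above, here is a toy sketch of the draft-then-verify loop. The two "models" are stand-in functions, and the greedy token-matching rule is a simplification of the real acceptance test.

```python
# Stand-in "models": greedy next-token functions over a fixed sentence.
TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()
DRAFT_TEXT  = "the quick brown fox leaps over the lazy dog".split()

def target(prefix: list[str]) -> str:               # the large, expensive model
    return TARGET_TEXT[len(prefix)]

def draft(prefix: list[str], k: int) -> list[str]:  # the small, cheap model
    return DRAFT_TEXT[len(prefix):len(prefix) + k]

def speculative_step(prefix: list[str], k: int = 4) -> list[str]:
    """Draft k tokens cheaply, then verify with the large model.
    In a real system the verification is one batched forward pass,
    not k separate calls as simulated here."""
    accepted: list[str] = []
    for tok in draft(prefix, k):
        expected = target(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft guessed right: a nearly free token
        else:
            accepted.append(expected)  # large model's correction; stop here
            break
    return accepted

print(speculative_step(["the", "quick"]))
# ['brown', 'fox', 'jumps'] - three tokens for one verification pass
```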

---

Closing & Contact

Ryan Donovan:

Shoutout again to Razzi Abuissa for their standout contribution on Stack Overflow.

Questions or topics for us? Email podcast@stackoverflow.com or connect with me on LinkedIn.

Jared Quincy Davis:

Find me on X @JaredQ_, LinkedIn, or at Mithril.ai.

---

Takeaway: Whether you’re provisioning GPU infrastructure or composing AI systems, the themes are clear: efficiency, specialization, and smart orchestration unlock the full potential of your tools.
