# Building Custom Voice AI Agents with EchoKit
Voice is the next frontier of conversational AI — the most natural way for humans to interact with intelligent systems.
In the past year, **frontier AI labs** such as OpenAI, xAI, Anthropic, Meta, and Google have launched real-time voice services. However, building great voice applications is hard because of **strict requirements for latency, privacy, and customization**, which make one-size-fits-all solutions impractical.
This guide shows you how to create open-source [voice AI agents](https://echokit.dev/) that run on your own computer and leverage your **custom knowledge base**, **voice style**, **actions**, and **fine-tuned AI models**.
---
## 📑 What We’ll Cover
1. [Prerequisites](#prerequisites)
2. [What It Looks Like](#what-it-looks-like)
3. [Two Voice AI Approaches](#two-voice-ai-approaches)
4. [The Voice AI Orchestrator](#the-voice-ai-orchestrator)
   - [Configuring ASR](#configuring-asr)
   - [Running and Configuring VAD](#running-and-configuring-vad)
   - [Configure an LLM](#configure-an-llm)
   - [Configure a TTS](#configure-a-tts)
   - [Configuring MCP and Actions](#configuring-mcp-and-actions)
5. [Local AI With LlamaEdge](#local-ai-with-llamaedge)
6. [Conclusion](#conclusion)
---
## Prerequisites
To follow along effectively, ensure you have:
- Access to a Linux-like system (**Mac** or **Windows WSL** work too).
- Comfort with **command-line tools (CLI)**.
- Ability to run server applications on Linux.
- Free API keys from:
  - [Groq](https://console.groq.com/keys)
  - [ElevenLabs](https://elevenlabs.io/app/sign-in?redirect=%2Fapp%2Fdevelopers%2Fapi-keys)
- *(Optional)* Ability to compile and build Rust source code.
- *(Optional)* An [EchoKit device](https://echokit.dev/echokit_diy.html) (or self-assembled equivalent).
---
## What It Looks Like
The core software is [**echokit_server**](https://github.com/second-state/echokit_server) — an open-source agent orchestrator for voice AI applications. It coordinates:
- LLMs (large language models)
- ASR (automatic speech recognition)
- TTS (text-to-speech)
- VAD (voice activity detection)
- MCP (Model Context Protocol) servers
- Search & knowledge DBs
- Vector DBs
The server offers a **WebSocket interface** for clients to stream and receive voice data.
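For example, a script can open a WebSocket connection to the server, stream a recorded question, and save the spoken reply. The sketch below is only illustrative: the endpoint path (`/ws/device-1`), the raw 16 kHz PCM framing, and the reply handling are assumptions, so check the echokit_server documentation for the actual protocol.

```python
# Minimal WebSocket voice client sketch. The endpoint path ("/ws/device-1"),
# the raw 16 kHz 16-bit PCM framing, and the reply handling are assumptions
# for illustration; consult the echokit_server docs for the real protocol.
import asyncio
import websockets

async def talk(wav_path: str, server: str = "ws://localhost:8000/ws/device-1"):
    async with websockets.connect(server) as ws:
        # Stream the recorded question in ~100 ms chunks.
        with open(wav_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
        # Collect the synthesized reply until the server goes quiet or closes.
        reply = bytearray()
        try:
            while True:
                msg = await asyncio.wait_for(ws.recv(), timeout=5.0)
                if isinstance(msg, bytes):
                    reply.extend(msg)
        except (asyncio.TimeoutError, websockets.ConnectionClosed):
            pass
        with open("reply.raw", "wb") as out:
            out.write(reply)

asyncio.run(talk("question.wav"))
```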
### Example Deployment Components
- **Hardware client (echokit_box)**: ESP32-based firmware to capture voice and play responses.
- **Web client**: JavaScript-based [reference page](https://echokit.dev/chat/) that connects to your server.
You can assemble your own EchoKit device or [buy one](https://echokit.dev/echokit_diy.html).
---
## Two Voice AI Approaches
### 1. End-to-End Voice Models
- Input: **Voice audio**
- Output: **Voice audio**
- Single-step processing → lower latency.
- Downsides:
- Limited customization.
- Hard to inject domain-specific context.
- No voice personalization.
### 2. Agent Orchestration (Multi-Model Pipeline)
Breaks the process into stages:
1. **VAD** — Detects when the user finishes speaking.
2. **ASR / STT** — Converts speech → text.
3. **LLM** — Generates text reply (may call tools/actions).
4. **TTS** — Converts text → voice.
**Advantages**:
- Choose & customize each stage.
- Inject prompts & knowledge anytime.
- Personalize voice output.
**Challenge**:
- Potentially higher latency → solved with **streaming I/O**.
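In practice, the orchestrated flow looks roughly like the sketch below. Every function name here is a placeholder rather than an echokit_server API; the point is that streaming the LLM output sentence by sentence lets TTS start speaking before the full reply is generated, which hides most of the pipeline latency.

```python
# Conceptual flow of the four-stage pipeline; every function below is a
# placeholder for illustration, not a real echokit_server API.
def handle_turn(mic_stream):
    audio = record_until_silence(mic_stream)   # VAD: detect when the user stops talking
    text = transcribe(audio)                   # ASR: speech -> text
    # LLM + TTS: stream the reply and speak each finished sentence right away,
    # so synthesis overlaps with generation instead of waiting for the full answer.
    for sentence in stream_llm_reply(text):
        play(synthesize(sentence))             # TTS: text -> voice
```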
---
## The Voice AI Orchestrator
**echokit_server** is Rust-based for **speed, safety, and efficiency**.
You can compile it yourself or download prebuilt binaries:

**x86 / AMD64**

```bash
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
```

**ARM64**

```bash
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
```

Run it in the background:

```bash
nohup ./echokit_server &
```
Config (`config.toml`):
```toml
addr = "0.0.0.0:8000"
hello_wav = "hello.wav"
```
---
## Configuring ASR
Example: **Groq Whisper ASR**
```toml
[asr]
url = "https://api.groq.com/openai/v1/audio/transcriptions"
api_key = "gsk_XYZ"
model = "whisper-large-v3"
lang = "en"
prompt = "Hello\n你好\n(noise)\n(bgm)\n(silence)\n"
```
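Before wiring it into EchoKit, you can sanity-check the key and endpoint directly. A small Python sketch, assuming a local `sample.wav` recording and the `requests` library:

```python
# Quick check of the Groq Whisper transcription endpoint configured above.
# Replace the API key and audio file with your own.
import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/audio/transcriptions",
    headers={"Authorization": "Bearer gsk_XYZ"},
    files={"file": open("sample.wav", "rb")},
    data={"model": "whisper-large-v3", "language": "en"},
)
resp.raise_for_status()
print(resp.json()["text"])
```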
---
## Running and Configuring VAD
Start the server-side **Silero VAD** (a Rust port):

```bash
VAD_LISTEN=0.0.0.0:9094 nohup target/release/silero_vad_server &
```
Then point the ASR at the VAD service by adding `vad_realtime_url` to the `[asr]` section:

```toml
[asr]
vad_realtime_url = "ws://localhost:9094/v1/audio/realtime_vad"
```
---
## Configure an LLM
Example: **Groq** with `gpt-oss-20b`:
```toml
[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_XYZ"
model = "openai/gpt-oss-20b"
history = 20
```
Add a system prompt:

```toml
[[llm.sys_prompts]]
role = "system"
content = """
You are a comedian. Engage in humorous conversation.
"""
```
---
## Configure a TTS
Example: **ElevenLabs**:
```toml
[tts]
platform = "Elevenlabs"
token = "sk_xyz"
voice = "VOICE-ID-ABCD"
```
Or use the open-source **GPT-SoVITS** streaming TTS:

```toml
[tts]
platform = "StreamGSV"
url = "http://gsv_tts.server:port/v1/audio/stream_speech"
speaker = "michael"
```
---
## Configuring MCP and Actions
Example MCP server: **ExamPrepAgent** at `port 8003`:
```toml
[[llm.mcp_server]]
server = "http://localhost:8003/mcp"
type = "http_streamable"
```
Workflow:
1. The LLM outputs a JSON tool call to `get_question`.
2. The MCP server fetches the matching entry from its database.
3. The LLM turns the question and answer into user-friendly text.
4. The TTS engine speaks the response.
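If you want to prototype your own tool server, the official MCP Python SDK can expose a streamable-HTTP endpoint in a few lines. The sketch below only imitates the `get_question` tool described above; the question bank, tool signature, and port wiring are made up for illustration and are not the real ExamPrepAgent.

```python
# Hypothetical MCP tool server sketch using the official `mcp` Python SDK.
# The get_question tool and its data are illustrative, not the real ExamPrepAgent.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ExamPrep", port=8003)

QUESTIONS = {
    "history": {
        "question": "Who wrote the Declaration of Independence?",
        "answer": "Thomas Jefferson",
    },
}

@mcp.tool()
def get_question(topic: str) -> dict:
    """Return a practice question and its answer for the given topic."""
    return QUESTIONS.get(topic, {"question": "No question found.", "answer": ""})

if __name__ == "__main__":
    # Serves the MCP endpoint over streamable HTTP (default path /mcp).
    mcp.run(transport="streamable-http")
```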
---
## Local AI with LlamaEdge
Install the WasmEdge runtime:

```bash
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
```
Download model (Gemma):
```bash
curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf
```
Download server:
```bash
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```
Run:
```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf llama-api-server.wasm -p gemma-3
```
Configure EchoKit to use the local LLM:

```toml
[llm]
llm_chat_url = "http://localhost:8080/v1/chat/completions"
api_key = "NONE"
model = "default"
history = 20
```
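The local endpoint speaks the same OpenAI-compatible protocol, so a quick check works the same way as with Groq (sketch):

```python
# Verify the local LlamaEdge endpoint before pointing EchoKit at it.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```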
LlamaEdge also supports Whisper-based ASR and Piper TTS, so the speech recognition and synthesis stages can run locally as well.
---
## Conclusion
The **EchoKit** stack enables **custom, privacy-friendly, real-time voice AI agents** that you control fully — from ASR and VAD to LLM and TTS, plus MCP tools and local AI via LlamaEdge.
You now have a blueprint to:
- **Assemble the pipeline** (VAD → ASR → LLM → TTS).
- **Customize each component** (voice, prompts, actions).
- **Run AI locally** for privacy/performance.
- **Extend the agent with MCP tools and actions**.
**Next step:** Build something! 🎙🤖🚀