# Building Custom Voice AI Agents with EchoKit
Voice is the next frontier of conversational AI — the most natural way for humans to interact with intelligent systems.
In the past year, **frontier AI labs** such as OpenAI, xAI, Anthropic, Meta, and Google have launched real-time voice services. However, building great voice applications is hard because of **strict requirements for latency, privacy, and customization**, which make one-size-fits-all solutions impractical.
This guide shows you how to create open-source [voice AI agents](https://echokit.dev/) that run on your own computer and leverage your **custom knowledge base**, **voice style**, **actions**, and **fine-tuned AI models**.
---
## 📑 What We’ll Cover
1. [Prerequisites](#prerequisites)
2. [What It Looks Like](#what-it-looks-like)
3. [Two Voice AI Approaches](#two-voice-ai-approaches)
4. [The Voice AI Orchestrator](#the-voice-ai-orchestrator)
   - [Configuring ASR](#configuring-asr)
   - [Running and Configuring VAD](#running-and-configuring-vad)
   - [Configure an LLM](#configure-an-llm)
   - [Configure a TTS](#configure-a-tts)
   - [Configuring MCP and Actions](#configuring-mcp-and-actions)
5. [Local AI With LlamaEdge](#local-ai-with-llamaedge)
6. [Conclusion](#conclusion)
---
## Prerequisites
To follow along effectively, ensure you have:
- Access to a Linux-like system (**Mac** or **Windows WSL** work too).
- Comfort with **command-line tools (CLI)**.
- Ability to run server applications on Linux.
- Free API keys from:
  - [Groq](https://console.groq.com/keys)
  - [ElevenLabs](https://elevenlabs.io/app/sign-in?redirect=%2Fapp%2Fdevelopers%2Fapi-keys)
- *(Optional)* Ability to compile and build Rust source code.
- *(Optional)* An [EchoKit device](https://echokit.dev/echokit_diy.html) (or self-assembled equivalent).
---
## What It Looks Like
The core software is [**echokit_server**](https://github.com/second-state/echokit_server) — an open-source agent orchestrator for voice AI applications. It coordinates:
- LLMs (large language models)
- ASR (automatic speech recognition)
- TTS (text-to-speech)
- VAD (voice activity detection)
- MCP (Model Context Protocol) servers
- Search & knowledge DBs
- Vector DBs
The server offers a **WebSocket interface** for clients to stream and receive voice data.
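For example, a script can open a WebSocket connection to the server, stream a recorded question, and save the spoken reply. The sketch below is only illustrative: the endpoint path (`/ws/device-1`), the raw 16 kHz PCM framing, and the reply handling are assumptions, so check the echokit_server documentation for the actual protocol.

```python
# Minimal WebSocket voice client sketch. The endpoint path ("/ws/device-1"),
# the raw 16 kHz 16-bit PCM framing, and the reply handling are assumptions
# for illustration; consult the echokit_server docs for the real protocol.
import asyncio
import websockets

async def talk(wav_path: str, server: str = "ws://localhost:8000/ws/device-1"):
    async with websockets.connect(server) as ws:
        # Stream the recorded question in ~100 ms chunks.
        with open(wav_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
        # Collect the synthesized reply until the server goes quiet or closes.
        reply = bytearray()
        try:
            while True:
                msg = await asyncio.wait_for(ws.recv(), timeout=5.0)
                if isinstance(msg, bytes):
                    reply.extend(msg)
        except (asyncio.TimeoutError, websockets.ConnectionClosed):
            pass
        with open("reply.raw", "wb") as out:
            out.write(reply)

asyncio.run(talk("question.wav"))
```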
### Example Deployment Components
- **Hardware client (echokit_box)**: ESP32-based firmware to capture voice and play responses.
- **Web client**: JavaScript-based [reference page](https://echokit.dev/chat/) that connects to your server.
You can assemble your own EchoKit device or [buy one](https://echokit.dev/echokit_diy.html).
---
## Two Voice AI Approaches
### 1. End-to-End Voice Models
- Input: **Voice audio**
- Output: **Voice audio**
- Single-step processing → lower latency.
- Downsides:
- Limited customization.
- Hard to inject domain-specific context.
- No voice personalization.
### 2. Agent Orchestration (Multi-Model Pipeline)
Breaks the process into stages:
1. **VAD** — Detects when the user finishes speaking.
2. **ASR / STT** — Converts speech → text.
3. **LLM** — Generates text reply (may call tools/actions).
4. **TTS** — Converts text → voice.
**Advantages**:
- Choose & customize each stage.
- Inject prompts & knowledge anytime.
- Personalize voice output.
**Challenge**:
- Potentially higher latency → solved with **streaming I/O**.
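In practice, the orchestrated flow looks roughly like the sketch below. Every function name here is a placeholder rather than an echokit_server API; the point is that streaming the LLM output sentence by sentence lets TTS start speaking before the full reply is generated, which hides most of the pipeline latency.

```python
# Conceptual flow of the four-stage pipeline; every function below is a
# placeholder for illustration, not a real echokit_server API.
def handle_turn(mic_stream):
    audio = record_until_silence(mic_stream)   # VAD: detect when the user stops talking
    text = transcribe(audio)                   # ASR: speech -> text
    # LLM + TTS: stream the reply and speak each finished sentence right away,
    # so synthesis overlaps with generation instead of waiting for the full answer.
    for sentence in stream_llm_reply(text):
        play(synthesize(sentence))             # TTS: text -> voice
```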
---
## The Voice AI Orchestrator
**echokit_server** is Rust-based for **speed, safety, and efficiency**.
You can compile it yourself or download prebuilt binaries:

**x86 / AMD64**

```bash
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
```

**ARM64**

```bash
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
```

Run it in the background:

```bash
nohup ./echokit_server &
```
Config (`config.toml`):
```toml
addr = "0.0.0.0:8000"
hello_wav = "hello.wav"
```
---
## Configuring ASR
Example: **Groq Whisper ASR**
```toml
[asr]
url = "https://api.groq.com/openai/v1/audio/transcriptions"
api_key = "gsk_XYZ"
model = "whisper-large-v3"
lang = "en"
prompt = "Hello\n你好\n(noise)\n(bgm)\n(silence)\n"
```
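Before wiring it into EchoKit, you can sanity-check the key and endpoint directly. A small Python sketch, assuming a local `sample.wav` recording and the `requests` library:

```python
# Quick check of the Groq Whisper transcription endpoint configured above.
# Replace the API key and audio file with your own.
import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/audio/transcriptions",
    headers={"Authorization": "Bearer gsk_XYZ"},
    files={"file": open("sample.wav", "rb")},
    data={"model": "whisper-large-v3", "language": "en"},
)
resp.raise_for_status()
print(resp.json()["text"])
```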
---
## Running and Configuring VAD
Start the server-side **Silero VAD** (a Rust port):

```bash
VAD_LISTEN=0.0.0.0:9094 nohup target/release/silero_vad_server &
```
Then point the ASR at the VAD service by adding `vad_realtime_url` to the `[asr]` section:

```toml
[asr]
vad_realtime_url = "ws://localhost:9094/v1/audio/realtime_vad"
```
---
## Configure an LLM
Example: **Groq** with `gpt-oss-20b`:
```toml
[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_XYZ"
model = "openai/gpt-oss-20b"
history = 20
```
Add a system prompt:

```toml
[[llm.sys_prompts]]
role = "system"
content = """
You are a comedian. Engage in humorous conversation.
"""
```
---
## Configure a TTS
Example: **ElevenLabs**:
```toml
[tts]
platform = "Elevenlabs"
token = "sk_xyz"
voice = "VOICE-ID-ABCD"
```
Or use the open-source **GPT-SoVITS** streaming TTS:

```toml
[tts]
platform = "StreamGSV"
url = "http://gsv_tts.server:port/v1/audio/stream_speech"
speaker = "michael"
```
---
## Configuring MCP and Actions
Example MCP server: **ExamPrepAgent** at `port 8003`:
```toml
[[llm.mcp_server]]
server = "http://localhost:8003/mcp"
type = "http_streamable"
```
Workflow:
1. The LLM outputs a JSON tool call to `get_question`.
2. The MCP server fetches the matching entry from its database.
3. The LLM turns the question and answer into user-friendly text.
4. The TTS engine speaks the response.
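If you want to prototype your own tool server, the official MCP Python SDK can expose a streamable-HTTP endpoint in a few lines. The sketch below only imitates the `get_question` tool described above; the question bank, tool signature, and port wiring are made up for illustration and are not the real ExamPrepAgent.

```python
# Hypothetical MCP tool server sketch using the official `mcp` Python SDK.
# The get_question tool and its data are illustrative, not the real ExamPrepAgent.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ExamPrep", port=8003)

QUESTIONS = {
    "history": {
        "question": "Who wrote the Declaration of Independence?",
        "answer": "Thomas Jefferson",
    },
}

@mcp.tool()
def get_question(topic: str) -> dict:
    """Return a practice question and its answer for the given topic."""
    return QUESTIONS.get(topic, {"question": "No question found.", "answer": ""})

if __name__ == "__main__":
    # Serves the MCP endpoint over streamable HTTP (default path /mcp).
    mcp.run(transport="streamable-http")
```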
---
## Local AI with LlamaEdge
Install the WasmEdge runtime:

```bash
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
```
Download model (Gemma):
```bash
curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf
```
Download server:
```bash
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```
Run:
```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf llama-api-server.wasm -p gemma-3
```
Configure EchoKit to use the local LLM:

```toml
[llm]
llm_chat_url = "http://localhost:8080/v1/chat/completions"
api_key = "NONE"
model = "default"
history = 20
```
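The local endpoint speaks the same OpenAI-compatible protocol, so a quick check works the same way as with Groq (sketch):

```python
# Verify the local LlamaEdge endpoint before pointing EchoKit at it.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```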
LlamaEdge also supports Whisper-based ASR and Piper TTS, so the speech recognition and synthesis stages can run locally as well.
---
## Conclusion
The **EchoKit** stack enables **custom, privacy-friendly, real-time voice AI agents** that you control fully — from ASR and VAD to LLM and TTS, plus MCP tools and local AI via LlamaEdge.
You now have a blueprint to:
- **Assemble the pipeline** (VAD → ASR → LLM → TTS).
- **Customize each component** (voice, prompts, actions).
- **Run AI locally** for privacy/performance.
- **Extend the agent with MCP tools and actions**.
**Next step:** Build something! 🎙🤖🚀