
# Building Custom Voice AI Agents with EchoKit

Voice is the next frontier of conversational AI — the most natural way for humans to interact with intelligent systems.

In the past year, **frontier AI labs** such as OpenAI, xAI, Anthropic, Meta, and Google have launched real-time voice services. However, building great voice applications is still challenging: strict requirements for **latency, privacy, and customization** make one-size-fits-all solutions impractical.

This guide shows you how to create open-source [voice AI agents](https://echokit.dev/) that run on your own computer and leverage your **custom knowledge base**, **voice style**, **actions**, and **fine-tuned AI models**.

---

## 📑 What We’ll Cover

1. [Prerequisites](#prerequisites)  
2. [What It Looks Like](#what-it-looks-like)  
3. [Two Voice AI Approaches](#two-voice-ai-approaches)  
4. [The Voice AI Orchestrator](#the-voice-ai-orchestrator)  
   - [Configuring ASR](#configuring-asr)  
   - [Running and Configuring VAD](#running-and-configuring-vad)  
   - [Configure an LLM](#configure-an-llm)  
   - [Configure a TTS](#configure-a-tts)  
   - [Configuring MCP and Actions](#configuring-mcp-and-actions)  
5. [Local AI with LlamaEdge](#local-ai-with-llamaedge)  
6. [Conclusion](#conclusion)  

---

## Prerequisites

To follow along effectively, ensure you have:

- Access to a Linux-like system (**Mac** or **Windows WSL** work too).
- Comfort with **command-line tools (CLI)**.
- Ability to run server applications on Linux.
- Free API keys from:
  - [Groq](https://console.groq.com/keys)
  - [ElevenLabs](https://elevenlabs.io/app/sign-in?redirect=%2Fapp%2Fdevelopers%2Fapi-keys)
- *(Optional)* Ability to compile and build Rust source code.
- *(Optional)* An [EchoKit device](https://echokit.dev/echokit_diy.html) (or self-assembled equivalent).

---

## What It Looks Like

The core software is [**echokit_server**](https://github.com/second-state/echokit_server) — an open-source agent orchestrator for voice AI applications. It coordinates:

- LLMs (large language models)  
- ASR (automatic speech recognition)  
- TTS (text-to-speech)  
- VAD (voice activity detection)  
- MCP (Model Context Protocol) servers  
- Search & knowledge DBs  
- Vector DBs

The server offers a **WebSocket interface** for clients to stream and receive voice data.

### Example Deployment Components

- **Hardware client (echokit_box)**: ESP32-based firmware to capture voice and play responses.
- **Web client**: JavaScript-based [reference page](https://echokit.dev/chat/) that connects to your server.

You can assemble your own EchoKit device or [buy one](https://echokit.dev/echokit_diy.html).

---

## Two Voice AI Approaches

### 1. End-to-End Voice Models

- Input: **Voice audio**
- Output: **Voice audio**
- Single-step processing → lower latency.
- Downsides:
  - Limited customization.
  - Hard to inject domain-specific context.
  - No voice personalization.

### 2. Agent Orchestration (Multi-Model Pipeline)

Breaks the process into stages:

1. **VAD** — Detects when the user finishes speaking.
2. **ASR / STT** — Converts speech → text.
3. **LLM** — Generates text reply (may call tools/actions).
4. **TTS** — Converts text → voice.

**Advantages**:
- Choose & customize each stage.
- Inject prompts & knowledge anytime.
- Personalize voice output.

**Challenge**:
- Potentially higher latency, mitigated with **streaming I/O** between stages.
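In echokit_server, each of these stages maps to a section of `config.toml`, which the rest of this guide fills in. A skeletal sketch (all values are placeholders):

```toml
addr = "0.0.0.0:8000"   # WebSocket address clients connect to

[asr]                   # speech -> text (VAD is hooked up here too)
url = "..."

[llm]                   # text reply generation; tools via [[llm.mcp_server]]
llm_chat_url = "..."

[tts]                   # text -> speech
platform = "..."
```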

---

## The Voice AI Orchestrator

**echokit_server** is written in Rust for **speed, safety, and efficiency**.  
You can compile it from source or download prebuilt binaries:

**x86 / AMD64**

```bash
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-x86_64-unknown-linux-gnu.tar.gz
```

**ARM64**

```bash
curl -LO https://github.com/second-state/echokit_server/releases/download/v0.1.0/echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
tar -xzf echokit_server-v0.1.0-aarch64-unknown-linux-gnu.tar.gz
```


Run:

```bash
nohup ./echokit_server &
```
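Because `nohup` writes stdout and stderr to `nohup.out` by default, you can tail that file to confirm the server started and to watch its logs:

```bash
# Follow the server logs written by nohup
tail -f nohup.out
```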


Config (`config.toml`):

```toml
addr = "0.0.0.0:8000"
hello_wav = "hello.wav"
```

---

## Configuring ASR

Example: **Groq Whisper ASR**

```toml
[asr]
url = "https://api.groq.com/openai/v1/audio/transcriptions"
api_key = "gsk_XYZ"
model = "whisper-large-v3"
lang = "en"
prompt = "Hello\n你好\n(noise)\n(bgm)\n(silence)\n"
```
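To verify your Groq key and the model name before wiring them into EchoKit, you can call the transcription endpoint directly. A minimal sketch, assuming you have a short `sample.wav` on hand and your key in `$GROQ_API_KEY` (the endpoint is OpenAI-compatible):

```bash
curl https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F model=whisper-large-v3 \
  -F file=@sample.wav
```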


---

## Running and Configuring VAD

Server-side **Silero VAD** (Rust port):

```bash
VAD_LISTEN=0.0.0.0:9094 nohup target/release/silero_vad_server &
```
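The `target/release/` path above implies the VAD server was built from source with Cargo. A minimal sketch, assuming you have the `silero_vad_server` sources checked out (see the echokit_server README for the repository location):

```bash
# Produces an optimized binary at target/release/silero_vad_server
cargo build --release
```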


Connect VAD to ASR config:

```toml
[asr]
vad_realtime_url = "ws://localhost:9094/v1/audio/realtime_vad"
```


---

## Configure an LLM

Example: **Groq** with `gpt-oss-20b`:

```toml
[llm]
llm_chat_url = "https://api.groq.com/openai/v1/chat/completions"
api_key = "gsk_XYZ"
model = "openai/gpt-oss-20b"
history = 20
```


Add a system prompt:

```toml
[[llm.sys_prompts]]
role = "system"
content = """
You are a comedian. Engage in humorous conversation.
"""
```


---

## Configure a TTS

Example: **ElevenLabs**:

```toml
[tts]
platform = "Elevenlabs"
token = "sk_xyz"
voice = "VOICE-ID-ABCD"
```
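To confirm the token and voice ID work before EchoKit uses them, you can request a short clip directly. A minimal sketch, assuming ElevenLabs' `v1/text-to-speech/{voice_id}` REST endpoint and `xi-api-key` header (check their API docs if anything has changed):

```bash
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/VOICE-ID-ABCD" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from EchoKit!"}' \
  --output hello_test.mp3
```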


Or use the open-source **GPT-SoVITS** streaming TTS server:

```toml
[tts]
platform = "StreamGSV"
url = "http://gsv_tts.server:port/v1/audio/stream_speech"
speaker = "michael"
```


---

## Configuring MCP and Actions

Example MCP server: **ExamPrepAgent**, running on port `8003`:

```toml
[[llm.mcp_server]]
server = "http://localhost:8003/mcp"
type = "http_streamable"
```


Workflow:
1. The LLM outputs a JSON tool call for `get_question` (see the sketch after this list).
2. The MCP server fetches the matching entry from its database.
3. The LLM turns the question and answer into user-friendly text.
4. TTS speaks the response.
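For reference, an MCP tool invocation is a JSON-RPC `tools/call` request. A sketch of what the orchestrator might send to ExamPrepAgent in step 1; the argument name here is hypothetical and depends on how `get_question` is actually defined:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_question",
    "arguments": { "topic": "history" }
  }
}
```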

---

## Local AI with LlamaEdge

Install the WasmEdge runtime:

```bash
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
```


Download a model (Gemma 3 4B Instruct, quantized GGUF):

```bash
curl -LO https://huggingface.co/second-state/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf
```


Download the LlamaEdge API server:

```bash
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```


Run:

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-3-4b-it-Q5_K_M.gguf llama-api-server.wasm -p gemma-3
```
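Once the server is up, you can confirm the local OpenAI-compatible endpoint responds before pointing EchoKit at it. A minimal sketch, assuming the server listens on port 8080 (the port the EchoKit config below uses) and serves the model under the `default` alias:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```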


Configure EchoKit to use the local LLM:

```toml
[llm]
llm_chat_url = "http://localhost:8080/v1/chat/completions"
api_key = "NONE"
model = "default"
history = 20
```


LlamaEdge also supports Whisper ASR & Piper TTS.

---

## Conclusion

The **EchoKit** stack enables **custom, privacy-friendly, real-time voice AI agents** that you control fully — from ASR and VAD to LLM and TTS, plus MCP tools and local AI via LlamaEdge.

You now have a blueprint to:

- **Assemble the pipeline** (VAD → ASR → LLM → TTS).
- **Customize each component** (voice, prompts, actions).
- **Run AI locally** for privacy/performance.
- **Integrate external tools and knowledge** via MCP actions.

**Next step:** Build something! 🎙🤖🚀

By Honghao Wang