Realtime Voice Agents

Realtime agents enable voice-based interactions with AI assistants using bidirectional audio streaming. The adk-realtime crate provides a unified interface for building voice-enabled agents that work with OpenAI's Realtime API and Google's Gemini Live API.

Overview

Realtime agents differ from text-based LlmAgents in several key ways:

  Feature      LlmAgent           RealtimeAgent
  ─────────────────────────────────────────────────────────────
  Input        Text               Audio/Text
  Output       Text               Audio/Text
  Connection   HTTP requests      WebSocket
  Latency      Request/response   Real-time streaming
  VAD          N/A                Server-side voice detection

Architecture

              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚              Agent Trait                β”‚
              β”‚  (name, description, run, sub_agents)   β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β”‚                       β”‚                       β”‚
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  LlmAgent   β”‚      β”‚  RealtimeAgent    β”‚   β”‚  SequentialAgent  β”‚
β”‚ (text-based)β”‚      β”‚  (voice-based)    β”‚   β”‚   (workflow)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

RealtimeAgent implements the same Agent trait as LlmAgent, sharing:

  • Instructions (static and dynamic)
  • Tool registration and execution
  • Callbacks (before_agent, after_agent, before_tool, after_tool)
  • Sub-agent handoffs
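
A simplified sketch of that shared surface (illustrative only; the real Agent trait in adk is async and richer, and the method signatures here are assumptions):

```rust
use std::sync::Arc;

// Toy stand-in for the shared Agent trait: both text and voice agents
// expose a name, a description, and an optional list of sub-agents.
trait Agent {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    fn sub_agents(&self) -> Vec<Arc<dyn Agent>> {
        Vec::new() // no handoff targets by default
    }
}

struct VoiceAssistant;

impl Agent for VoiceAssistant {
    fn name(&self) -> &str { "voice_assistant" }
    fn description(&self) -> &str { "Answers questions over audio" }
}

fn main() {
    // Any agent kind can sit behind the same trait object.
    let agent: Arc<dyn Agent> = Arc::new(VoiceAssistant);
    println!("{}: {}", agent.name(), agent.description());
}
```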

Quick Start

Installation

Add to your Cargo.toml:

[dependencies]
adk-realtime = { version = "0.2.0", features = ["openai"] }

Basic Usage

use adk_realtime::{
    RealtimeAgent, RealtimeModel, RealtimeConfig, ServerEvent,
    openai::OpenAIRealtimeModel,
};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENAI_API_KEY")?;

    // Create the realtime model
    let model: Arc<dyn RealtimeModel> = Arc::new(
        OpenAIRealtimeModel::new(&api_key, "gpt-4o-realtime-preview-2024-12-17")
    );

    // Build the realtime agent
    let agent = RealtimeAgent::builder("voice_assistant")
        .model(model.clone())
        .instruction("You are a helpful voice assistant. Be concise.")
        .voice("alloy")
        .server_vad()  // Enable voice activity detection
        .build()?;

    // Or use the low-level session API directly
    let config = RealtimeConfig::default()
        .with_instruction("You are a helpful assistant.")
        .with_voice("alloy")
        .with_modalities(vec!["text".to_string(), "audio".to_string()]);

    let session = model.connect(config).await?;

    // Send text and get response
    session.send_text("Hello!").await?;
    session.create_response().await?;

    // Process events
    while let Some(event) = session.next_event().await {
        match event? {
            ServerEvent::TextDelta { delta, .. } => print!("{}", delta),
            ServerEvent::AudioDelta { delta, .. } => {
                // Play audio (delta is base64-encoded PCM)
            }
            ServerEvent::ResponseDone { .. } => break,
            _ => {}
        }
    }

    Ok(())
}

Supported Providers

  Provider   Model                                 Feature Flag   Audio Format
  ────────────────────────────────────────────────────────────────────────────
  OpenAI     gpt-4o-realtime-preview-2024-12-17    openai         PCM16 24kHz
  OpenAI     gpt-realtime                          openai         PCM16 24kHz
  Google     gemini-2.0-flash-live-preview-04-09   gemini         PCM16 16kHz/24kHz

Note: gpt-realtime is OpenAI's latest realtime model with improved speech quality, emotion, and function calling capabilities.

RealtimeAgent Builder

The RealtimeAgentBuilder provides a fluent API for configuring agents:

let agent = RealtimeAgent::builder("assistant")
    // Required
    .model(model)

    // Instructions (same as LlmAgent)
    .instruction("You are helpful.")
    .instruction_provider(|ctx| format!("User: {}", ctx.user_name()))

    // Voice settings
    .voice("alloy")  // Options: alloy, coral, sage, shimmer, etc.

    // Voice Activity Detection
    .server_vad()  // Use defaults
    .vad(VadConfig {
        mode: VadMode::ServerVad,
        threshold: Some(0.5),
        prefix_padding_ms: Some(300),
        silence_duration_ms: Some(500),
        interrupt_response: Some(true),
        eagerness: None,
    })

    // Tools (same as LlmAgent)
    .tool(Arc::new(weather_tool))
    .tool(Arc::new(search_tool))

    // Sub-agents for handoffs
    .sub_agent(booking_agent)
    .sub_agent(support_agent)

    // Callbacks (same as LlmAgent)
    .before_agent_callback(|ctx| async { Ok(()) })
    .after_agent_callback(|ctx, event| async { Ok(()) })
    .before_tool_callback(|ctx, tool, args| async { Ok(None) })
    .after_tool_callback(|ctx, tool, result| async { Ok(result) })

    // Realtime-specific callbacks
    .on_audio(|audio_chunk| { /* play audio */ })
    .on_transcript(|text| { /* show transcript */ })

    .build()?;

Voice Activity Detection (VAD)

VAD enables natural conversation flow by detecting when the user starts and stops speaking.

let agent = RealtimeAgent::builder("assistant")
    .model(model)
    .server_vad()  // Uses sensible defaults
    .build()?;

Custom VAD Configuration

use adk_realtime::{VadConfig, VadMode};

let vad = VadConfig {
    mode: VadMode::ServerVad,
    threshold: Some(0.5),           // Speech detection sensitivity (0.0-1.0)
    prefix_padding_ms: Some(300),   // Audio to include before speech
    silence_duration_ms: Some(500), // Silence before ending turn
    interrupt_response: Some(true), // Allow interrupting assistant
    eagerness: None,                // For SemanticVad mode
};

let agent = RealtimeAgent::builder("assistant")
    .model(model)
    .vad(vad)
    .build()?;

Semantic VAD (Gemini)

For Gemini models, you can use semantic VAD, which considers the meaning of what was said (not just silence) when deciding that a turn has ended:

let vad = VadConfig {
    mode: VadMode::SemanticVad,
    eagerness: Some("high".to_string()),  // low, medium, high
    ..Default::default()
};
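
The threshold and silence_duration_ms knobs are easiest to understand with a toy detector. This std-only sketch mimics, client-side, what the server does: speech begins when frame energy crosses the threshold, and the turn ends after silence_duration_ms of sub-threshold audio (illustrative only; the actual detection runs server-side and is far more sophisticated):

```rust
/// Returns (speech_start_ms, speech_end_ms) for a stream of 10 ms
/// frames with normalized energies in 0.0..=1.0.
fn detect_turn(energies: &[f32], threshold: f32, silence_duration_ms: u32) -> Option<(u32, u32)> {
    const FRAME_MS: u32 = 10;
    let silence_frames = (silence_duration_ms / FRAME_MS) as usize;

    let mut start = None;
    let mut last_speech = 0usize;
    for (i, &e) in energies.iter().enumerate() {
        if e >= threshold {
            if start.is_none() {
                start = Some(i); // speech onset: energy crossed the threshold
            }
            last_speech = i;
        } else if let Some(s) = start {
            // End the turn once we've seen enough consecutive silence.
            if i - last_speech >= silence_frames {
                return Some((s as u32 * FRAME_MS, last_speech as u32 * FRAME_MS));
            }
        }
    }
    start.map(|s| (s as u32 * FRAME_MS, last_speech as u32 * FRAME_MS))
}

fn main() {
    // 200 ms of silence, 300 ms of speech, then 600 ms of silence.
    let mut energies = vec![0.1_f32; 20];
    energies.extend(vec![0.8; 30]);
    energies.extend(vec![0.1; 60]);

    // With threshold 0.5 and 500 ms of required silence, the turn
    // spans the speech region only.
    println!("{:?}", detect_turn(&energies, 0.5, 500)); // Some((200, 490))
}
```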

Tool Calling

Realtime agents support tool calling during voice conversations:

use adk_realtime::{config::ToolDefinition, ToolResponse};
use serde_json::json;

// Define tools
let tools = vec![
    ToolDefinition {
        name: "get_weather".to_string(),
        description: Some("Get weather for a location".to_string()),
        parameters: Some(json!({
            "type": "object",
            "properties": {
                "location": { "type": "string" }
            },
            "required": ["location"]
        })),
    },
];

let config = RealtimeConfig::default()
    .with_tools(tools)
    .with_instruction("Use tools to help the user.");

let session = model.connect(config).await?;

// Handle tool calls in the event loop
while let Some(event) = session.next_event().await {
    match event? {
        ServerEvent::FunctionCallDone { call_id, name, arguments, .. } => {
            // Execute the tool
            let result = execute_tool(&name, &arguments);

            // Send the response
            let response = ToolResponse::new(&call_id, result);
            session.send_tool_response(response).await?;
        }
        _ => {}
    }
}
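
The event loop above calls an execute_tool helper it leaves undefined. A minimal std-only dispatcher might look like this (hypothetical sketch: the registry shape and string-based results are illustrative; a real implementation would parse arguments as JSON, e.g. with serde_json):

```rust
use std::collections::HashMap;

// Each tool takes the raw JSON arguments string and returns a JSON result.
type ToolFn = fn(&str) -> String;

fn get_weather(args: &str) -> String {
    // Echo the arguments back alongside a canned forecast.
    format!("{{\"forecast\":\"sunny\",\"args\":{}}}", args)
}

fn execute_tool(registry: &HashMap<&str, ToolFn>, name: &str, arguments: &str) -> String {
    match registry.get(name) {
        Some(tool) => tool(arguments),
        // Returning an error payload (instead of panicking) lets the
        // model recover and apologize to the user.
        None => format!("{{\"error\":\"unknown tool: {}\"}}", name),
    }
}

fn main() {
    let mut registry: HashMap<&str, ToolFn> = HashMap::new();
    registry.insert("get_weather", get_weather);

    let result = execute_tool(&registry, "get_weather", r#"{"location":"Paris"}"#);
    println!("{}", result); // {"forecast":"sunny","args":{"location":"Paris"}}
}
```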

Multi-Agent Handoffs

Transfer conversations between specialized agents:

// Create sub-agents
let booking_agent = Arc::new(RealtimeAgent::builder("booking_agent")
    .model(model.clone())
    .instruction("Help with reservations.")
    .build()?);

let support_agent = Arc::new(RealtimeAgent::builder("support_agent")
    .model(model.clone())
    .instruction("Help with technical issues.")
    .build()?);

// Create main agent with sub-agents
let receptionist = RealtimeAgent::builder("receptionist")
    .model(model)
    .instruction(
        "Route customers: bookings β†’ booking_agent, issues β†’ support_agent. \
         Use transfer_to_agent tool to hand off."
    )
    .sub_agent(booking_agent)
    .sub_agent(support_agent)
    .build()?;

When the model calls transfer_to_agent, the RealtimeRunner handles the handoff automatically.
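
Conceptually, the runner resolves a handoff by looking the requested name up among the registered sub-agents. A std-only sketch of that lookup (hypothetical; the real RealtimeRunner does this internally, carrying over the full session state):

```rust
use std::collections::HashMap;

struct AgentInfo {
    instruction: &'static str,
}

// Resolve a transfer_to_agent target, or report an unknown name.
fn resolve_handoff<'a>(
    agents: &'a HashMap<&'static str, AgentInfo>,
    target: &str,
) -> Result<&'a AgentInfo, String> {
    agents
        .get(target)
        .ok_or_else(|| format!("no sub-agent named '{}'", target))
}

fn main() {
    let mut agents = HashMap::new();
    agents.insert("booking_agent", AgentInfo { instruction: "Help with reservations." });
    agents.insert("support_agent", AgentInfo { instruction: "Help with technical issues." });

    // The model emitted: transfer_to_agent({"agent_name": "booking_agent"})
    let next = resolve_handoff(&agents, "booking_agent").unwrap();
    println!("{}", next.instruction); // prints "Help with reservations."
}
```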

Audio Formats

  Format        Sample Rate   Bits   Channels   Use Case
  ──────────────────────────────────────────────────────────
  PCM16         24000 Hz      16     Mono       OpenAI (default)
  PCM16         16000 Hz      16     Mono       Gemini input
  G.711 ΞΌ-law   8000 Hz       8      Mono       Telephony
  G.711 A-law   8000 Hz       8      Mono       Telephony

use adk_realtime::{AudioFormat, AudioChunk};

// Create audio format
let format = AudioFormat::pcm16_24khz();

// Work with audio chunks
let chunk = AudioChunk::new(audio_bytes, format);
let base64 = chunk.to_base64();
let decoded = AudioChunk::from_base64(&base64, format)?;
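
The table's numbers translate directly into buffer sizes: bytes per second is sample_rate Γ— (bits / 8) Γ— channels. A quick sketch of that arithmetic, useful when sizing chunks before base64-encoding them:

```rust
// Raw (uncompressed) audio throughput for a given format.
fn bytes_per_second(sample_rate: u32, bits: u32, channels: u32) -> u32 {
    sample_rate * (bits / 8) * channels
}

fn main() {
    // PCM16 mono at 24 kHz (OpenAI default): 48,000 bytes/s,
    // so a 100 ms chunk is 4,800 bytes.
    let bps = bytes_per_second(24_000, 16, 1);
    println!("{}", bps);      // 48000
    println!("{}", bps / 10); // 4800

    // G.711 ΞΌ-law telephony: 8,000 bytes/s.
    println!("{}", bytes_per_second(8_000, 8, 1)); // 8000
}
```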

Event Types

Server Events

  Event              Description
  ─────────────────────────────────────────────
  SessionCreated     Connection established
  AudioDelta         Audio chunk (base64 PCM)
  TextDelta          Text response chunk
  TranscriptDelta    Input audio transcript
  FunctionCallDone   Tool call request
  ResponseDone       Response completed
  SpeechStarted      VAD detected speech start
  SpeechStopped      VAD detected speech end
  Error              Error occurred

Client Events

  Event            Description
  ──────────────────────────────────────────
  AudioInput       Send audio chunk
  AudioCommit      Commit audio buffer
  ItemCreate       Send text or tool response
  CreateResponse   Request a response
  CancelResponse   Cancel current response
  SessionUpdate    Update configuration

Examples

Run the included examples:

# Basic text-only session
cargo run --example realtime_basic --features realtime-openai

# Voice assistant with VAD
cargo run --example realtime_vad --features realtime-openai

# Tool calling
cargo run --example realtime_tools --features realtime-openai

# Multi-agent handoffs
cargo run --example realtime_handoff --features realtime-openai

Best Practices

  1. Use Server VAD: Let the server handle speech detection for lower latency
  2. Handle interruptions: Enable interrupt_response for natural conversations
  3. Keep instructions concise: Voice responses should be brief
  4. Test with text first: Debug your agent logic with text before adding audio
  5. Handle errors gracefully: Network issues are common with WebSocket connections
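
The last tip can be sketched as a reconnect loop with exponential backoff. The connect closure below stands in for model.connect(config), and the names and delays are illustrative (in async code the sleep would be tokio::time::sleep):

```rust
use std::time::Duration;

// Retry a connection attempt with exponential backoff, capped at 5 s.
fn connect_with_retry<F>(mut connect: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut delay = Duration::from_millis(250);
    for attempt in 1..=max_attempts {
        match connect() {
            Ok(session) => return Ok(session),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt == max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
    unreachable!("loop either returns Ok or exhausts attempts")
}

fn main() {
    // Simulate a connection that succeeds on the third try.
    let mut attempts = 0;
    let result = connect_with_retry(
        || {
            attempts += 1;
            if attempts < 3 { Err("connection reset".to_string()) } else { Ok(()) }
        },
        5,
    );
    println!("{:?} after {} attempts", result, attempts); // Ok(()) after 3 attempts
}
```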

Comparison with OpenAI Agents SDK

ADK-Rust's realtime implementation follows the OpenAI Agents SDK pattern:

  Feature            OpenAI SDK             ADK-Rust
  ──────────────────────────────────────────────────────────────────────
  Agent base class   Agent                  Agent trait
  Realtime agent     RealtimeAgent          RealtimeAgent
  Tools              Function definitions   Tool trait + ToolDefinition
  Handoffs           transfer_to_agent      sub_agents + auto-generated tool
  Callbacks          Hooks                  before_* / after_* callbacks

Previous: ← Graph Agents | Next: Model Providers β†’