Realtime Voice Agents

Realtime agents enable voice-based interaction with AI assistants using bidirectional audio streaming. The adk-realtime crate provides a unified interface for building voice-enabled agents that work with OpenAI's Realtime API and Google's Gemini Live API.

Overview

Realtime agents differ from text-based LlmAgents in several key ways:

Feature	LlmAgent	RealtimeAgent
Input	Text	Audio/Text
Output	Text	Audio/Text
Connection	HTTP requests	WebSocket
Latency	Request/response	Real-time streaming
VAD	N/A	Server-side voice activity detection

Architecture

              ┌─────────────────────────────────────────┐
              │              Agent Trait                │
              │  (name, description, run, sub_agents)   │
              └────────────────┬────────────────────────┘
                               │
       ┌───────────────────────┼───────────────────────┐
       │                       │                       │
┌──────▼──────┐      ┌─────────▼─────────┐   ┌─────────▼─────────┐
│  LlmAgent   │      │  RealtimeAgent    │   │  SequentialAgent  │
│ (text-based)│      │  (voice-based)    │   │   (workflow)      │
└─────────────┘      └───────────────────┘   └───────────────────┘

RealtimeAgent implements the same Agent trait as LlmAgent, and the two share:

Instructions (static and dynamic)
Tool registration and execution
Callbacks (before_agent, after_agent, before_tool, after_tool)
Sub-agent handoffs

Quick Start

Installation

Add to your Cargo.toml:

[dependencies]
adk-realtime = { version = "0.2.0", features = ["openai"] }

Basic Usage

use adk_realtime::{
    RealtimeAgent, RealtimeModel, RealtimeConfig, ServerEvent,
    openai::OpenAIRealtimeModel,
};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENAI_API_KEY")?;

    // Create the realtime model
    let model: Arc<dyn RealtimeModel> = Arc::new(
        OpenAIRealtimeModel::new(&api_key, "gpt-4o-realtime-preview-2024-12-17")
    );

    // Create the realtime agent
    let agent = RealtimeAgent::builder("voice_assistant")
        .model(model.clone())
        .instruction("You are a helpful voice assistant. Be concise.")
        .voice("alloy")
        .server_vad()  // Enable voice activity detection
        .build()?;

    // Or use the low-level session API directly
    let config = RealtimeConfig::default()
        .with_instruction("You are a helpful assistant.")
        .with_voice("alloy")
        .with_modalities(vec!["text".to_string(), "audio".to_string()]);

    let session = model.connect(config).await?;

    // Send text and request a response
    session.send_text("Hello!").await?;
    session.create_response().await?;

    // Process events until the response is complete
    while let Some(event) = session.next_event().await {
        match event? {
            ServerEvent::TextDelta { delta, .. } => print!("{}", delta),
            ServerEvent::AudioDelta { delta, .. } => {
                // Play the audio (delta is base64-encoded PCM)
            }
            ServerEvent::ResponseDone { .. } => break,
            _ => {}
        }
    }

    Ok(())
}
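
The AudioDelta branch above receives base64-encoded PCM. Once the base64 layer has been decoded (with whatever base64 crate you prefer), the raw bytes still have to be turned into samples before playback. The following is a minimal sketch of that step using only the standard library. The helper name pcm16_le_to_samples is hypothetical and not part of adk-realtime, and the little-endian byte order is an assumption you should confirm against your provider's documentation.

```rust
/// Convert raw PCM16 bytes into signed 16-bit samples.
/// Assumes little-endian byte order; a trailing odd byte
/// (an incomplete sample) is ignored.
fn pcm16_le_to_samples(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    // Four bytes hold two samples: 0x0001 and 0xFFFF.
    let bytes = [0x01, 0x00, 0xFF, 0xFF];
    let samples = pcm16_le_to_samples(&bytes);
    println!("{:?}", samples); // prints "[1, -1]"
}
```

The resulting Vec<i16> can be handed to an audio output library of your choice; at 24 kHz mono, 24000 samples correspond to one second of audio.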

Supported Providers

Provider	Model	Feature Flag	Audio Format
OpenAI	`gpt-4o-realtime-preview-2024-12-17`	`openai`	PCM16 24kHz
OpenAI	`gpt-realtime`	`openai`	PCM16 24kHz
Google	`gemini-2.0-flash-live-preview-04-09`	`gemini`	PCM16 16kHz/24kHz

Note: gpt-realtime is OpenAI's latest realtime model, with improved speech quality, emotion, and function calling capabilities.

RealtimeAgent Builder

RealtimeAgentBuilder provides a fluent API for configuring the agent:

let agent = RealtimeAgent::builder("assistant")
    // Required
    .model(model)

    // Instructions (same as LlmAgent)
    .instruction("You are helpful.")
    .instruction_provider(|ctx| format!("User: {}", ctx.user_name()))

    // Voice settings
    .voice("alloy")  // Options: alloy, coral, sage, shimmer, etc.

    // Voice activity detection
    .server_vad()  // Use the defaults
    .vad(VadConfig {
        mode: VadMode::ServerVad,
        threshold: Some(0.5),
        prefix_padding_ms: Some(300),
        silence_duration_ms: Some(500),
        interrupt_response: Some(true),
        eagerness: None,
    })

    // Tools (same as LlmAgent)
    .tool(Arc::new(weather_tool))
    .tool(Arc::new(search_tool))

    // Sub-agents for handoffs
    .sub_agent(booking_agent)
    .sub_agent(support_agent)

    // Callbacks (same as LlmAgent)
    .before_agent_callback(|ctx| async { Ok(()) })
    .after_agent_callback(|ctx, event| async { Ok(()) })
    .before_tool_callback(|ctx, tool, args| async { Ok(None) })
    .after_tool_callback(|ctx, tool, result| async { Ok(result) })

    // Realtime-specific callbacks
    .on_audio(|audio_chunk| { /* play the audio */ })
    .on_transcript(|text| { /* show the transcript */ })

    .build()?;

Voice Activity Detection (VAD)

VAD enables a natural conversational flow by detecting when the user starts and stops speaking.

Server VAD (Recommended)

let agent = RealtimeAgent::builder("assistant")
    .model(model)
    .server_vad()  // Uses sensible defaults
    .build()?;

Custom VAD Configuration

use adk_realtime::{VadConfig, VadMode};

let vad = VadConfig {
    mode: VadMode::ServerVad,
    threshold: Some(0.5),           // Speech detection sensitivity (0.0-1.0)
    prefix_padding_ms: Some(300),   // Audio to include before detected speech
    silence_duration_ms: Some(500), // Silence required before the turn ends
    interrupt_response: Some(true), // Allow the user to interrupt the assistant
    eagerness: None,                // Only used in SemanticVad mode
};

let agent = RealtimeAgent::builder("assistant")
    .model(model)
    .vad(vad)
    .build()?;
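
To build intuition for what threshold and silence_duration_ms control, here is a toy end-of-turn detector. It is only an illustration under stated assumptions: the provider's server-side VAD is a separate, undocumented-here implementation, and the TurnDetector type, its fields, and first_turn_end are hypothetical names that do not exist in adk-realtime. Each input value stands for one audio frame's normalized loudness in the 0.0-1.0 range.

```rust
/// Toy end-of-turn detector. NOT the provider's algorithm.
struct TurnDetector {
    threshold: f32,           // level at or above which a frame counts as speech
    silence_duration_ms: u32, // silence needed after speech to end the turn
    frame_ms: u32,            // duration represented by each pushed level
    in_speech: bool,
    silent_ms: u32,
}

impl TurnDetector {
    fn new(threshold: f32, silence_duration_ms: u32, frame_ms: u32) -> Self {
        Self { threshold, silence_duration_ms, frame_ms, in_speech: false, silent_ms: 0 }
    }

    /// Feed one frame's level; returns true when the turn ends on this frame.
    fn push_level(&mut self, level: f32) -> bool {
        if level >= self.threshold {
            // Speech frame: start (or continue) the turn and reset the silence timer.
            self.in_speech = true;
            self.silent_ms = 0;
            return false;
        }
        if !self.in_speech {
            // Silence before any speech never ends a turn.
            return false;
        }
        self.silent_ms += self.frame_ms;
        if self.silent_ms >= self.silence_duration_ms {
            self.in_speech = false;
            self.silent_ms = 0;
            return true;
        }
        false
    }
}

/// Index of the first frame on which the detector reports end of turn.
fn first_turn_end(detector: &mut TurnDetector, levels: &[f32]) -> Option<usize> {
    levels.iter().position(|&level| detector.push_level(level))
}

fn main() {
    // 100 ms frames: quiet, two speech frames, then silence.
    let levels = [0.1, 0.8, 0.9, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2];
    let mut detector = TurnDetector::new(0.5, 500, 100);
    println!("{:?}", first_turn_end(&mut detector, &levels));
}
```

With a 500 ms silence window and 100 ms frames, the turn ends on the fifth consecutive quiet frame after speech. Raising threshold makes the detector ignore quieter input, and raising silence_duration_ms makes it wait longer before ending the turn.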

Semantic VAD (Gemini)

For Gemini models, you can use Semantic VAD, which takes the meaning of the speech into account:

let vad = VadConfig {
    mode: VadMode::SemanticVad,
    eagerness: Some("high".to_string()),  // low, medium, high
    ..Default::default()
};

Tool Calling

Realtime agents support tool calling during voice conversations:

use adk_realtime::{config::ToolDefinition, ToolResponse};
use serde_json::json;

// Define tools
let tools = vec![
    ToolDefinition {
        name: "get_weather".to_string(),
        description: Some("Get weather for a location".to_string()),
        parameters: Some(json!({
            "type": "object",
            "properties": {
                "location": { "type": "string" }
            },
            "required": ["location"]
        })),
    },
];

let config = RealtimeConfig::default()
    .with_tools(tools)
    .with_instruction("Use tools to help the user.");

let session = model.connect(config).await?;

// Handle tool calls in the event loop
while let Some(event) = session.next_event().await {
    match event? {
        ServerEvent::FunctionCallDone { call_id, name, arguments, .. } => {
            // Execute the tool
            let result = execute_tool(&name, &arguments);

            // Send the response
            let response = ToolResponse::new(&call_id, result);
            session.send_tool_response(response).await?;
        }
        _ => {}
    }
}
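
The event loop above calls execute_tool without defining it. One possible shape for such a dispatcher is sketched below. Everything here is an assumption made for illustration: the signature, the use of a pre-parsed HashMap for the arguments (real code would more likely parse the arguments with serde_json), and the placeholder weather result are not taken from adk-realtime.

```rust
use std::collections::HashMap;

/// Hypothetical dispatcher: routes a tool call by name and returns either
/// the tool output or an error message that can be sent back to the model.
fn execute_tool(name: &str, args: &HashMap<String, String>) -> Result<String, String> {
    match name {
        "get_weather" => {
            let location = args
                .get("location")
                .ok_or_else(|| "missing required argument: location".to_string())?;
            // Placeholder result; a real tool would query a weather service.
            Ok(format!("Weather lookup requested for {location}"))
        }
        other => Err(format!("unknown tool: {other}")),
    }
}

fn main() {
    let mut args = HashMap::new();
    args.insert("location".to_string(), "Paris".to_string());

    match execute_tool("get_weather", &args) {
        Ok(output) => println!("{output}"),
        Err(error) => eprintln!("tool failed: {error}"),
    }
}
```

Returning an error value for unknown tools or missing arguments, rather than panicking, keeps the voice session alive and lets the model recover or apologize to the user.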

Multi-Agent Handoffs

Transfer conversations between specialized agents:

// Create sub-agents
let booking_agent = Arc::new(RealtimeAgent::builder("booking_agent")
    .model(model.clone())
    .instruction("Help with reservations.")
    .build()?);

let support_agent = Arc::new(RealtimeAgent::builder("support_agent")
    .model(model.clone())
    .instruction("Help with technical issues.")
    .build()?);

// Create main agent with sub-agents
let receptionist = RealtimeAgent::builder("receptionist")
    .model(model)
    .instruction(
        "Route customers: bookings → booking_agent, issues → support_agent. \
         Use transfer_to_agent tool to hand off."
    )
    .sub_agent(booking_agent)
    .sub_agent(support_agent)
    .build()?;

When the model calls transfer_to_agent, the RealtimeRunner handles the handoff automatically.

Audio Formats

Format	Sample Rate	Bits	Channels	Use Case
PCM16	24000 Hz	16	Mono	OpenAI (default)
PCM16	16000 Hz	16	Mono	Gemini input
G711 u-law	8000 Hz	8	Mono	Telephony
G711 A-law	8000 Hz	8	Mono	Telephony

use adk_realtime::{AudioFormat, AudioChunk};

// Create audio format
let format = AudioFormat::pcm16_24khz();

// Work with audio chunks
let chunk = AudioChunk::new(audio_bytes, format);
let base64 = chunk.to_base64();
let decoded = AudioChunk::from_base64(&base64, format)?;
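
When capturing microphone input, it helps to know how many bytes a chunk of a given duration occupies in each format from the table. The arithmetic is sample rate × bytes per sample × channels × duration. The helper below is a standalone sketch; bytes_for_duration is a hypothetical name and not part of adk-realtime.

```rust
/// Number of bytes needed to hold `duration_ms` of uncompressed audio.
fn bytes_for_duration(
    sample_rate_hz: u64,
    bits_per_sample: u64,
    channels: u64,
    duration_ms: u64,
) -> u64 {
    sample_rate_hz * (bits_per_sample / 8) * channels * duration_ms / 1000
}

fn main() {
    // 20 ms frames for each row of the format table (all mono).
    println!("PCM16 24 kHz: {} bytes", bytes_for_duration(24_000, 16, 1, 20)); // 960
    println!("PCM16 16 kHz: {} bytes", bytes_for_duration(16_000, 16, 1, 20)); // 640
    println!("G711 8 kHz:   {} bytes", bytes_for_duration(8_000, 8, 1, 20)); // 160
}
```

For example, one second of PCM16 at 24 kHz mono is 48000 bytes before base64 encoding, which inflates the payload by roughly a third.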

Event Types

Server Events

Event	Description
`SessionCreated`	Connection established
`AudioDelta`	Audio chunk (base64 PCM)
`TextDelta`	Text response chunk
`TranscriptDelta`	Input audio transcript
`FunctionCallDone`	Tool call request
`ResponseDone`	Response completed
`SpeechStarted`	VAD detected the start of speech
`SpeechStopped`	VAD detected the end of speech
`Error`	An error occurred

Client Events

Event	Description
`AudioInput`	Send an audio chunk
`AudioCommit`	Commit the audio buffer
`ItemCreate`	Send text or a tool response
`CreateResponse`	Request a response
`CancelResponse`	Cancel the current response
`SessionUpdate`	Update the configuration
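
When server VAD is not used (for example in a push-to-talk interface), the client has to mark the end of the user's turn itself. A plausible ordering of the client events from the table is: stream the audio chunks, commit the buffer, then request a response. The sketch below models that ordering with a local enum. The ClientEvent enum and manual_turn function are hypothetical stand-ins that only mirror the event names in the table; they are not the crate's actual types.

```rust
/// Local stand-in for the client events listed above (not the crate's type).
#[derive(Debug, PartialEq)]
enum ClientEvent {
    AudioInput(Vec<u8>),
    AudioCommit,
    CreateResponse,
}

/// Build the event sequence for one manually delimited user turn:
/// every audio chunk first, then a commit, then a response request.
fn manual_turn(chunks: Vec<Vec<u8>>) -> Vec<ClientEvent> {
    let mut events: Vec<ClientEvent> =
        chunks.into_iter().map(ClientEvent::AudioInput).collect();
    events.push(ClientEvent::AudioCommit);
    events.push(ClientEvent::CreateResponse);
    events
}

fn main() {
    // Two 20 ms chunks of silence at PCM16 24 kHz mono (960 bytes each).
    let events = manual_turn(vec![vec![0u8; 960], vec![0u8; 960]]);
    for event in &events {
        match event {
            ClientEvent::AudioInput(bytes) => println!("AudioInput ({} bytes)", bytes.len()),
            other => println!("{:?}", other),
        }
    }
}
```

With server VAD enabled, the commit and response request are typically triggered by the server when it detects the end of speech, so the client only needs to keep sending audio.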

Examples

Run the included examples:

# Basic text-only session
cargo run --example realtime_basic --features realtime-openai

# Voice assistant with VAD
cargo run --example realtime_vad --features realtime-openai

# Tool calling
cargo run --example realtime_tools --features realtime-openai

# Multi-agent handoffs
cargo run --example realtime_handoff --features realtime-openai

Best Practices

Use server VAD: let the server handle speech detection for lower latency
Handle interruptions: enable interrupt_response for natural conversation
Keep instructions concise: voice responses should be brief
Test with text first: debug your agent logic with text before adding audio
Handle errors gracefully: network problems are common with WebSocket connections
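
For the last point, a common way to cope with dropped WebSocket connections is to reconnect with exponential backoff. The sketch below only computes the delay schedule, using the standard library; backoff_delay is a hypothetical helper, and the base and cap values are illustrative assumptions rather than adk-realtime defaults.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// `base` doubled on each attempt, never exceeding `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    // checked_mul returns None on overflow; treat that as "use the cap".
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..8 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt, base, max));
    }
}
```

In a real client you would sleep for the computed delay (for example with tokio::time::sleep) before calling connect again, reset the attempt counter after a successful connection, and consider adding random jitter so that many clients do not reconnect at the same instant.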

Comparison with the OpenAI Agents SDK

ADK-Rust's realtime implementation follows the patterns of the OpenAI Agents SDK:

Feature	OpenAI SDK	ADK-Rust
Agent base class	`Agent`	`Agent` trait
Realtime agent	`RealtimeAgent`	`RealtimeAgent`
Tools	Function definitions	`Tool` trait + `ToolDefinition`
Handoffs	`transfer_to_agent`	`sub_agents` + auto-generated tool
Callbacks	Hooks	`before_` / `after_` callbacks
