mistral.rs Integration

Run LLMs locally with native Rust inference: no external servers, no API keys.

What is mistral.rs?

mistral.rs is a high-performance Rust inference engine that runs LLMs directly on your hardware. ADK-Rust integrates it through the adk-mistralrs crate.

Key features:

🦀 Native Rust: no Python, no external servers

🔒 Fully offline: no API keys, and no internet connection once the model has been downloaded

⚡ Hardware acceleration: CUDA, Metal, and CPU optimizations

📦 Quantization: run large models on limited hardware

🔧 LoRA adapters: support for fine-tuned models with hot-swapping

👁️ Vision models: image understanding capabilities

🎯 Multi-model: serve multiple models from a single instance

Step 1: Add dependencies

Because adk-mistralrs depends on git repositories, it cannot be published to crates.io. Add it as a git dependency:

[package]
name = "my-local-agent"
version = "0.1.0"
edition = "2024"

[dependencies]
adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust" }
adk-agent = { git = "https://github.com/zavora-ai/adk-rust" }
adk-rust = { git = "https://github.com/zavora-ai/adk-rust" }
tokio = { version = "1", features = ["full"] }
anyhow = "1.0"

For hardware acceleration, enable the feature flag that matches your platform:

# macOS with Apple Silicon
adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["metal"] }

# NVIDIA GPU (requires the CUDA Toolkit)
adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["cuda"] }

Step 2: Basic example

Load a model from HuggingFace and run it locally:

use adk_agent::LlmAgentBuilder;
use adk_mistralrs::{Llm, MistralRsConfig, MistralRsModel, ModelSource};
use adk_rust::Launcher;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load a model from HuggingFace (downloaded on the first run)
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct"))
        .build();

    println!("Loading model (this may take a while on first run)...");
    let model = MistralRsModel::new(config).await?;
    println!("Model loaded: {}", model.name());

    // Create the agent
    let agent = LlmAgentBuilder::new("local_assistant")
        .description("Local AI assistant powered by mistral.rs")
        .instruction("You are a helpful assistant running locally. Be concise.")
        .model(Arc::new(model))
        .build()?;

    // Run the interactive chat
    Launcher::new(Arc::new(agent)).run().await?;

    Ok(())
}

What happens:

On the first run, the model is downloaded from HuggingFace (~2-8 GB depending on the model)
The model is cached locally in ~/.cache/huggingface/
Subsequent runs load the model from the cache without downloading it again

Step 3: Reduce memory with quantization

Large models need a lot of RAM. Use ISQ (in-situ quantization) to reduce memory usage:

use adk_agent::LlmAgentBuilder;
use adk_mistralrs::{Llm, MistralRsConfig, MistralRsModel, ModelSource, QuantizationLevel};
use adk_rust::Launcher;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model with 4-bit quantization for reduced memory
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct"))
        .isq(QuantizationLevel::Q4_0) // 4-bit quantization
        .paged_attention(true) // Memory-efficient attention
        .build();

    println!("Loading quantized model...");
    let model = MistralRsModel::new(config).await?;
    println!("Model loaded: {}", model.name());

    let agent = LlmAgentBuilder::new("quantized_assistant")
        .instruction("You are a helpful assistant. Be concise.")
        .model(Arc::new(model))
        .build()?;

    Launcher::new(Arc::new(agent)).run().await?;

    Ok(())
}

Quantization levels:

Level	Memory reduction	Quality	Best for
`Q4_0`	~75%	Good	Limited RAM (8 GB)
`Q4_1`	~70%	Better	Balanced
`Q8_0`	~50%	High	Quality-focused
`Q8_1`	~50%	Highest	Best quality
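
To get a feel for what these percentages mean in practice, here is a rough back-of-the-envelope sketch. It is not part of adk-mistralrs: `IsqLevel` and `estimated_weight_bytes` are hypothetical names used only for illustration, the reduction figures are copied from the table above, and the baseline of 2 bytes per parameter (16-bit weights) is an assumption. Activations and the KV cache are not counted, so real usage will be higher.

```rust
/// Local stand-in for the levels in the table above.
/// This is NOT the `QuantizationLevel` type from adk-mistralrs.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum IsqLevel {
    Q4_0,
    Q4_1,
    Q8_0,
    Q8_1,
}

impl IsqLevel {
    /// Approximate memory reduction in percent, copied from the table.
    pub fn reduction_percent(self) -> u64 {
        match self {
            IsqLevel::Q4_0 => 75,
            IsqLevel::Q4_1 => 70,
            IsqLevel::Q8_0 | IsqLevel::Q8_1 => 50,
        }
    }
}

/// Estimated bytes needed for the model weights alone.
/// Assumption: unquantized weights take 2 bytes per parameter (16-bit).
pub fn estimated_weight_bytes(params: u64, level: Option<IsqLevel>) -> u64 {
    let unquantized = params * 2;
    match level {
        Some(l) => unquantized * (100 - l.reduction_percent()) / 100,
        None => unquantized,
    }
}
```

Under these assumptions, a 3.8B-parameter model needs roughly 7.6 GB for its weights unquantized and roughly 1.9 GB with `Q4_0`, which is why the 4-bit levels are the ones suggested for machines with limited RAM.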

Step 4: LoRA adapters (fine-tuned models)

Load models with LoRA adapters for specialized tasks:

use adk_agent::LlmAgentBuilder;
use adk_mistralrs::{AdapterConfig, Llm, MistralRsAdapterModel, MistralRsConfig, ModelSource};
use adk_rust::Launcher;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load base model with LoRA adapter
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("meta-llama/Llama-3.2-3B-Instruct"))
        .adapter(AdapterConfig::lora("username/my-lora-adapter"))
        .build();

    println!("Loading model with LoRA adapter...");
    let model = MistralRsAdapterModel::new(config).await?;
    println!("Model loaded: {}", model.name());
    println!("Available adapters: {:?}", model.available_adapters());

    let agent = LlmAgentBuilder::new("lora_assistant")
        .instruction("You are a helpful assistant with specialized knowledge.")
        .model(Arc::new(model))
        .build()?;

    Launcher::new(Arc::new(agent)).run().await?;

    Ok(())
}

Swap adapters at runtime:

model.swap_adapter("another-adapter").await?;

Step 5: Vision models (image understanding)

Process images with vision-language models:

use adk_mistralrs::{Llm, MistralRsConfig, MistralRsVisionModel, ModelSource};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-vision-instruct"))
        .build();

    println!("Loading vision model...");
    let model = MistralRsVisionModel::new(config).await?;
    println!("Model loaded: {}", model.name());

    // Analyze an image (image::open requires the `image` crate in [dependencies])
    let image = image::open("photo.jpg")?;
    let response = model.generate_with_image("Describe this image.", vec![image]).await?;

    Ok(())
}

Step 6: Multi-model serving

Serve multiple models from a single instance:

use adk_mistralrs::{MistralRsConfig, MistralRsMultiModel, ModelSource};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let multi = MistralRsMultiModel::new();

    // Add models
    let phi_config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct"))
        .build();
    multi.add_model("phi", phi_config).await?;

    let gemma_config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("google/gemma-2-2b-it"))
        .build();
    multi.add_model("gemma", gemma_config).await?;

    // Set default and route requests
    multi.set_default("phi").await?;
    println!("Available models: {:?}", multi.model_names().await);

    // Route to specific model
    // multi.generate_with_model(Some("gemma"), request, false).await?;

    Ok(())
}

Model sources

HuggingFace Hub (default)

ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct")

Local directory

ModelSource::local("/path/to/model")

Pre-quantized GGUF

ModelSource::gguf("/path/to/model.Q4_K_M.gguf")

Recommended models

Model	Size	RAM required	Best for
`microsoft/Phi-3.5-mini-instruct`	3.8B	8 GB	Fast, general purpose
`microsoft/Phi-3.5-vision-instruct`	4.2B	10 GB	Vision + text
`Qwen/Qwen2.5-3B-Instruct`	3B	6 GB	Multilingual, coding
`google/gemma-2-2b-it`	2B	4 GB	Lightweight
`mistralai/Mistral-7B-Instruct-v0.3`	7B	16 GB	High quality
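
The table can also be used programmatically, for example to narrow down candidates before starting a download. The following is a minimal sketch and not part of adk-mistralrs: `ModelRow`, `RECOMMENDED`, and `models_that_fit` are hypothetical names, and the RAM figures are simply the ones listed above.

```rust
/// One row of the recommended-models table (hypothetical helper type).
pub struct ModelRow {
    pub id: &'static str,
    pub ram_gb: u32,
}

/// Model IDs and RAM requirements copied from the table above.
pub static RECOMMENDED: [ModelRow; 5] = [
    ModelRow { id: "microsoft/Phi-3.5-mini-instruct", ram_gb: 8 },
    ModelRow { id: "microsoft/Phi-3.5-vision-instruct", ram_gb: 10 },
    ModelRow { id: "Qwen/Qwen2.5-3B-Instruct", ram_gb: 6 },
    ModelRow { id: "google/gemma-2-2b-it", ram_gb: 4 },
    ModelRow { id: "mistralai/Mistral-7B-Instruct-v0.3", ram_gb: 16 },
];

/// Returns the IDs of all listed models whose RAM requirement does not
/// exceed `available_ram_gb`, in table order.
pub fn models_that_fit(available_ram_gb: u32) -> Vec<&'static str> {
    RECOMMENDED
        .iter()
        .filter(|m| m.ram_gb <= available_ram_gb)
        .map(|m| m.id)
        .collect()
}
```

Any returned ID can be passed to `ModelSource::huggingface(...)` as shown in the steps above. Which of the fitting models is the best choice still depends on the task column of the table.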

Hardware acceleration

macOS (Apple Silicon)

adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["metal"] }

With the metal feature enabled, Metal acceleration is used automatically on M1/M2/M3 Macs.

NVIDIA GPU

adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["cuda"] }

Requires CUDA Toolkit 11.8 or later.

CPU only

No features required: CPU is the default.
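
The platform-to-feature mapping described in this section can be summarized in a few lines. This is only an illustrative sketch: `suggested_feature` is a hypothetical helper, not an adk-mistralrs API, and whether an NVIDIA GPU is present has to be supplied by the caller because the standard library cannot detect it.

```rust
/// Suggests which adk-mistralrs Cargo feature to enable, following the
/// mapping in this section. Returns `None` for the CPU-only default.
///
/// `os` and `arch` use the spelling of `std::env::consts::OS` and
/// `std::env::consts::ARCH` (for example "macos" and "aarch64").
pub fn suggested_feature(os: &str, arch: &str, has_nvidia_gpu: bool) -> Option<&'static str> {
    if os == "macos" && arch == "aarch64" {
        // Apple Silicon Macs use Metal.
        Some("metal")
    } else if has_nvidia_gpu {
        // NVIDIA GPUs use CUDA (the CUDA Toolkit must be installed).
        Some("cuda")
    } else {
        // No feature flag needed: CPU is the default.
        None
    }
}
```

For the machine the code is compiled for, it could be called as `suggested_feature(std::env::consts::OS, std::env::consts::ARCH, false)`.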

Run the examples

# Basic usage
cargo run --bin basic

# With quantization
cargo run --bin quantized

# LoRA adapters
cargo run --bin lora

# Multi-model setup
cargo run --bin multimodel

# Vision models
cargo run --bin vision

Troubleshooting

Out of memory

// Enable quantization
.isq(QuantizationLevel::Q4_0)
// Enable paged attention
.paged_attention(true)

Slow first load

The first run downloads the model (~2-8 GB)
Subsequent runs use the cached model

Model not found

Check that the HuggingFace model ID is correct
Make sure you have an internet connection for the first download
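
A quick shape check can catch obvious typos in a model ID before a download is attempted. The sketch below is an assumption-laden illustration, not an adk-mistralrs or HuggingFace API: `looks_like_model_id` is a hypothetical helper that only accepts the `owner/name` form used by every model ID in this document, so it will reject IDs without an owner prefix, and it cannot tell whether the repository actually exists.

```rust
/// Rough shape check for a HuggingFace model ID of the form "owner/name".
/// It only validates the format; it does not contact the Hub.
pub fn looks_like_model_id(id: &str) -> bool {
    // Exactly two segments separated by a single '/'.
    let mut parts = id.split('/');
    let (Some(owner), Some(name), None) = (parts.next(), parts.next(), parts.next()) else {
        return false;
    };

    // Each segment must be non-empty and use only letters, digits, '-', '_' or '.'.
    let segment_ok = |s: &str| {
        !s.is_empty()
            && s.chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.'))
    };

    segment_ok(owner) && segment_ok(name)
}
```

For example, `looks_like_model_id("microsoft/Phi-3.5-mini-instruct")` returns true, while an ID with a missing owner or a stray space returns false.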

Related

Model providers - Cloud LLM providers
Ollama - Alternative local model server
LlmAgent - Using models with agents
