mistral.rs Integration

Run LLMs locally with native Rust inference: no external server, no API keys.


What is mistral.rs?

mistral.rs is a high-performance Rust inference engine that runs LLMs directly on your hardware. ADK-Rust integrates it through the adk-mistralrs crate.

Key highlights:

  • 🦀 Native Rust - no Python, no external server
  • 🔒 Fully offline - no API keys or internet required
  • ⚡ Hardware acceleration - CUDA, Metal, and optimized CPU backends
  • 📦 Quantization - run large models on limited hardware
  • 🔧 LoRA adapters - hot-swappable fine-tuned models
  • 👁️ Vision models - image understanding
  • 🎯 Multi-model - serve several models from one instance

Step 1: Add Dependencies

Because adk-mistralrs depends on a git repository, it cannot be published to crates.io. Add it via git instead:

[package]
name = "my-local-agent"
version = "0.1.0"
edition = "2024"

[dependencies]
adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust" }
adk-agent = { git = "https://github.com/zavora-ai/adk-rust" }
adk-rust = { git = "https://github.com/zavora-ai/adk-rust" }
tokio = { version = "1", features = ["full"] }
anyhow = "1.0"
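
Git dependencies track the repository's default branch, so builds can drift over time. For reproducible builds you can pin a specific commit with Cargo's rev key (<commit-sha> is a placeholder for the commit you want):

adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", rev = "<commit-sha>" }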

For hardware acceleration, add feature flags:

# macOS with Apple Silicon
adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["metal"] }

# NVIDIA GPU (requires CUDA toolkit)
adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["cuda"] }

Step 2: Basic Example

Load a model from HuggingFace and run it locally:

use adk_agent::LlmAgentBuilder;
use adk_mistralrs::{Llm, MistralRsConfig, MistralRsModel, ModelSource};
use adk_rust::Launcher;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model from HuggingFace (downloads on first run)
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct"))
        .build();

    println!("Loading model (this may take a while on first run)...");
    let model = MistralRsModel::new(config).await?;
    println!("Model loaded: {}", model.name());

    // Create agent
    let agent = LlmAgentBuilder::new("local_assistant")
        .description("Local AI assistant powered by mistral.rs")
        .instruction("You are a helpful assistant running locally. Be concise.")
        .model(Arc::new(model))
        .build()?;

    // Run interactive chat
    Launcher::new(Arc::new(agent)).run().await?;

    Ok(())
}

What happens:

  1. The first run downloads the model from HuggingFace (~2-8GB depending on the model)
  2. The model is cached locally under ~/.cache/huggingface/
  3. Subsequent runs load immediately from the cache
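
If the default cache location is inconvenient (for example, on a small system disk), the standard HuggingFace cache directory can usually be redirected with the HF_HOME environment variable. Whether this applies here depends on how mistral.rs configures its hf-hub client, so treat this as an assumption:

# Assumption: downloads follow the standard hf-hub cache rules
HF_HOME=/mnt/models cargo run --bin basic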

Step 3: Reduce Memory with Quantization

Large models need a lot of RAM. Use ISQ (In-Situ Quantization) to shrink the memory footprint:

use adk_agent::LlmAgentBuilder;
use adk_mistralrs::{Llm, MistralRsConfig, MistralRsModel, ModelSource, QuantizationLevel};
use adk_rust::Launcher;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model with 4-bit quantization for reduced memory
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct"))
        .isq(QuantizationLevel::Q4_0) // 4-bit quantization
        .paged_attention(true) // Memory-efficient attention
        .build();

    println!("Loading quantized model...");
    let model = MistralRsModel::new(config).await?;
    println!("Model loaded: {}", model.name());

    let agent = LlmAgentBuilder::new("quantized_assistant")
        .instruction("You are a helpful assistant. Be concise.")
        .model(Arc::new(model))
        .build()?;

    Launcher::new(Arc::new(agent)).run().await?;

    Ok(())
}

Quantization Levels

Level | Memory Reduction | Quality | Best For
Q4_0  | ~75%             | Good    | Limited memory (8GB)
Q4_1  | ~70%             | Better  | Balanced
Q8_0  | ~50%             |         | Quality-focused
Q8_1  | ~50%             | Highest | Best quality
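
The level can also be chosen at startup from the available RAM budget. A minimal sketch, assuming the QuantizationLevel variants listed in the table:

use adk_mistralrs::QuantizationLevel;

/// Pick an ISQ level for a given RAM budget (GB), following the table above.
fn pick_isq(ram_gb: usize) -> QuantizationLevel {
    if ram_gb <= 8 {
        QuantizationLevel::Q4_0 // ~75% memory reduction
    } else if ram_gb <= 12 {
        QuantizationLevel::Q4_1 // ~70% reduction, better quality
    } else {
        QuantizationLevel::Q8_0 // ~50% reduction, quality-focused
    }
}

Pass the result straight to the builder: .isq(pick_isq(8)).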

Step 4: LoRA Adapters (Fine-Tuned Models)

Load a model with a LoRA adapter for specialized tasks:

use adk_agent::LlmAgentBuilder;
use adk_mistralrs::{AdapterConfig, Llm, MistralRsAdapterModel, MistralRsConfig, ModelSource};
use adk_rust::Launcher;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load base model with LoRA adapter
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("meta-llama/Llama-3.2-3B-Instruct"))
        .adapter(AdapterConfig::lora("username/my-lora-adapter"))
        .build();

    println!("Loading model with LoRA adapter...");
    let model = MistralRsAdapterModel::new(config).await?;
    println!("Model loaded: {}", model.name());
    println!("Available adapters: {:?}", model.available_adapters());

    let agent = LlmAgentBuilder::new("lora_assistant")
        .instruction("You are a helpful assistant with specialized knowledge.")
        .model(Arc::new(model))
        .build()?;

    Launcher::new(Arc::new(agent)).run().await?;

    Ok(())
}

Hot-Swapping Adapters at Runtime

model.swap_adapter("another-adapter").await?;
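
To guard against typos in adapter names, a small wrapper can validate the name against available_adapters() before swapping. A minimal sketch, assuming available_adapters() returns the adapter names as a Vec<String>:

use adk_mistralrs::MistralRsAdapterModel;

async fn activate(model: &MistralRsAdapterModel, name: &str) -> anyhow::Result<()> {
    // Refuse to swap to an adapter that was never loaded with the model.
    if model.available_adapters().iter().any(|a| a.as_str() == name) {
        model.swap_adapter(name).await?;
        Ok(())
    } else {
        anyhow::bail!("unknown adapter: {name}")
    }
}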

Step 5: Vision Models (Image Understanding)

Process images with a vision-language model:

use adk_mistralrs::{Llm, MistralRsConfig, MistralRsVisionModel, ModelSource};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-vision-instruct"))
        .build();

    println!("Loading vision model...");
    let model = MistralRsVisionModel::new(config).await?;
    println!("Model loaded: {}", model.name());

    // Analyze an image (requires the `image` crate in Cargo.toml)
    let image = image::open("photo.jpg")?;
    let response = model.generate_with_image("Describe this image.", vec![image]).await?;
    // `response` holds the model's textual description of the image.

    Ok(())
}
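
The image crate can also decode from raw bytes (image::load_from_memory is part of its public API), which is handy when images arrive over the network rather than from disk:

// Decode from an in-memory buffer instead of a file path.
let bytes = std::fs::read("photo.jpg")?;
let img = image::load_from_memory(&bytes)?;
let response = model.generate_with_image("What is shown here?", vec![img]).await?;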

Step 6: Multi-Model Serving

Serve multiple models from a single instance:

use adk_mistralrs::{MistralRsConfig, MistralRsMultiModel, ModelSource};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let multi = MistralRsMultiModel::new();

    // Add models
    let phi_config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct"))
        .build();
    multi.add_model("phi", phi_config).await?;

    let gemma_config = MistralRsConfig::builder()
        .model_source(ModelSource::huggingface("google/gemma-2-2b-it"))
        .build();
    multi.add_model("gemma", gemma_config).await?;

    // Set default and route requests
    multi.set_default("phi").await?;
    println!("Available models: {:?}", multi.model_names().await);

    // Route to specific model
    // multi.generate_with_model(Some("gemma"), request, false).await?;

    Ok(())
}
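
The routing decision itself is plain Rust. A minimal sketch, assuming the commented-out generate_with_model signature above, where request is whatever LLM request type your adk-rust version uses:

/// Map a task category onto one of the registered model names.
fn route(task: &str) -> &'static str {
    match task {
        "quick" => "gemma", // lightweight model for short answers
        _ => "phi",         // general-purpose default
    }
}

// let name = route("quick");
// multi.generate_with_model(Some(name), request, false).await?;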

Model Sources

HuggingFace Hub (default)

ModelSource::huggingface("microsoft/Phi-3.5-mini-instruct")

Local Directory

ModelSource::local("/path/to/model")

Pre-Quantized GGUF

ModelSource::gguf("/path/to/model.Q4_K_M.gguf")
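
The constructors compose naturally. For example, a minimal sketch that prefers an already-downloaded local copy and falls back to the Hub, assuming both constructors accept plain strings as in the snippets above:

use std::path::Path;
use adk_mistralrs::ModelSource;

fn source_for(repo_id: &str, local_dir: &str) -> ModelSource {
    // Prefer a local copy when it exists; otherwise fetch from HuggingFace.
    if Path::new(local_dir).exists() {
        ModelSource::local(local_dir)
    } else {
        ModelSource::huggingface(repo_id)
    }
}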

Recommended Models

Model                              | Size | Memory Required | Best For
microsoft/Phi-3.5-mini-instruct    | 3.8B | 8GB             | Fast, general-purpose
microsoft/Phi-3.5-vision-instruct  | 4.2B | 10GB            | Vision + text
Qwen/Qwen2.5-3B-Instruct           | 3B   | 6GB             | Multilingual, coding
google/gemma-2-2b-it               | 2B   | 4GB             | Lightweight
mistralai/Mistral-7B-Instruct-v0.3 | 7B   | 16GB            | High quality

Hardware Acceleration

macOS (Apple Silicon)

adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["metal"] }

Metal acceleration is automatic on M1/M2/M3 Macs.

NVIDIA GPU

adk-mistralrs = { git = "https://github.com/zavora-ai/adk-rust", features = ["cuda"] }

Requires CUDA toolkit 11.8+.

CPU Only

No feature flags needed - CPU is the default.


Running the Examples

# Basic usage
cargo run --bin basic

# With quantization
cargo run --bin quantized

# LoRA adapters
cargo run --bin lora

# Multi-model setup
cargo run --bin multimodel

# Vision models
cargo run --bin vision

Troubleshooting

Out of Memory

// Enable quantization
.isq(QuantizationLevel::Q4_0)
// Enable paged attention
.paged_attention(true)

Slow First Load

  • The first run downloads the model (~2-8GB)
  • Subsequent runs use the cached copy

Model Not Found

  • Check that the HuggingFace model ID is correct
  • Make sure you have network access for the initial download
