AirLLM Installation Guide: Running 70B Models on a Single 4GB GPU (Linux / Windows / macOS) - 无双技术网

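AirLLM is a lightweight inference optimization tool for large models. Its core capability is running 70B-parameter large language models (such as Llama 3 70B) on a single GPU with only 4GB of VRAM, without requiring quantization, distillation, or pruning. It can even run the 405B Llama 3.1 on 8GB of VRAM.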

Project repository: https://github.com/lyogavin/airllm

The repository currently has 21,000+ GitHub stars and is released under the Apache 2.0 license.

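Core features

Very low VRAM requirements: a 70B model needs only a 4GB GPU, and a 405B model needs only 8GB
No quantization preprocessing required: loads the original model directly for inference, with no dependence on quantization, distillation, or pruning
Model compression for speed: v2.0+ supports block-wise quantization-based compression, with up to 3x faster inference
Broad model support: Llama 3/3.1, Qwen 2.5, ChatGLM, Baichuan, Mistral, InternLM, and other mainstream open-source models
CPU inference: v2.10+ can run without a GPU
macOS support: Apple Silicon (M-series chips) can run 70B models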

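Which calling methods are supported?

AirLLM does not provide a REST API or HTTP service itself; it only exposes an inference interface as a Python library. You load a model through the AutoModel class in Python code and call inference on it.

If you need to expose an HTTP API, you can wrap inference in a REST interface with FastAPI or Flask. See the section on wrapping your own REST API with FastAPI below for a concrete implementation.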

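Supported models (full list)

AirLLM detects the model type automatically through AutoModel. The following models are officially verified as supported:

Model family	Example models
Llama 3 / 3.1	meta-llama/Meta-Llama-3-70B-Instruct, Llama 3.1 405B
Llama 2	Llama-2-70B, Platypus2-70B
Qwen	Qwen-7B, Qwen2.5-72B
ChatGLM	THUDM/chatglm3-6b-base
Baichuan	baichuan-inc/Baichuan2-7B-Base
Mistral	mistralai/Mistral-7B-Instruct-v0.1
InternLM	internlm/internlm-20b
Mixtral	Mixtral 8x7B (supported since v2.7)

As long as the model is saved in safetensors format, AutoModel can recognize and load it automatically, with no need to specify a model class.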
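
For a model already downloaded to disk, a quick pre-flight check can confirm that safetensors files are present before handing the directory to AutoModel. This is only a sketch: the helper name `has_safetensors` is hypothetical, it looks only at the top level of the directory, and AutoModel performs its own detection regardless.

```python
from pathlib import Path


def has_safetensors(model_dir: str) -> bool:
    """Return True if model_dir contains at least one *.safetensors file."""
    return any(Path(model_dir).glob("*.safetensors"))
```

A call such as has_safetensors("/models/my-model") (a placeholder path) returns False for a directory that holds only other weight formats.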

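I. Linux Installation (Ubuntu / Debian / CentOS)

Linux is the most recommended environment, with the best CUDA and PyTorch support.

1. Requirements

Python 3.8+ (3.10 recommended)
NVIDIA GPU + CUDA 11.8+ (CUDA 12.1 recommended)
PyTorch 2.0+
Sufficient disk space: a 70B model needs about 140GB, and a 405B model about 800GB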
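
The disk figures follow from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes per parameter) and decimal gigabytes; the function name is hypothetical. Because AirLLM splits the model into per-layer files on first load, it is reasonable to plan for extra headroom beyond the raw checkpoint size.

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in GB (10**9 bytes) for a dense model."""
    return num_params * bytes_per_param / 1e9


print(checkpoint_size_gb(70e9))   # 140.0
print(checkpoint_size_gb(405e9))  # 810.0
```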

2. Installation steps

# Step 1: Install CUDA (if not already installed)
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
export PATH=/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH

# Step 2: Create a Python virtual environment
python3 -m venv airllm_env
source airllm_env/bin/activate

# Step 3: Install PyTorch (CUDA 12.1 build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Step 4: Install AirLLM
pip install airllm

# Step 5 (optional): Enable compression
pip install -U bitsandbytes
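
After the steps above, it can help to confirm which packages actually landed in the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["torch", "airllm", "bitsandbytes"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

A missing bitsandbytes entry is fine if compression is not needed, since that step is optional.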

3. Run an inference example

from airllm import AutoModel

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct"
)

input_text = ["What is the capital of United States?"]
input_tokens = model.tokenizer(
    input_text, return_tensors="pt", truncation=True, max_length=128
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20
)
result = model.tokenizer.decode(output[0])
print(result)

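4. Enable compression (4bit / 8bit)

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression="4bit"  # or "8bit"
)

With compression enabled, the weights loaded for each layer shrink to roughly a quarter (4bit) or half (8bit) of their 16-bit size, inference can be up to 3x faster, and the accuracy loss is almost negligible. The reason is that AirLLM's bottleneck is mainly disk I/O (layers are loaded one at a time), so quantized weights make each layer smaller and faster to load.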
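
The disk I/O argument can be made concrete with a back-of-the-envelope estimate: if every layer is read from disk once per forward pass, load time is roughly model size divided by disk throughput. The sketch below is an assumption-laden illustration, not a benchmark: the size ratios ignore the small overhead of quantization scaling factors, the 2 GB/s throughput is a hypothetical SSD figure, and compute time and OS file caching are ignored.

```python
from typing import Optional

# Approximate on-disk size relative to 16-bit weights (assumed ratios).
SIZE_RATIO = {None: 1.0, "8bit": 0.5, "4bit": 0.25}


def load_seconds_per_pass(fp16_size_gb: float,
                          disk_gb_per_s: float,
                          compression: Optional[str] = None) -> float:
    """Estimated time to read every layer from disk once (one forward pass)."""
    return fp16_size_gb * SIZE_RATIO[compression] / disk_gb_per_s


if __name__ == "__main__":
    # 140 GB model on a hypothetical 2 GB/s disk.
    for mode in (None, "8bit", "4bit"):
        print(mode, load_seconds_per_pass(140, 2.0, mode))
```

Real speedups are smaller than the raw size ratio suggests, because compute and dequantization time do not shrink along with the files.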

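II. Windows Installation

Installation on Windows is slightly more involved; pay attention to the CUDA and build toolchain setup.

1. Requirements

Python 3.8+ (3.10 or 3.11 recommended)
NVIDIA GPU + CUDA 11.8+
Visual Studio Build Tools (needed to compile bitsandbytes)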

2. Installation steps

# Step 1: Install CUDA 12.1
Download: https://developer.nvidia.com/cuda-12-1-0-download-archive
Choose Windows → exe (local), then run the installer

# Step 2: Install Visual Studio Build Tools
Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/
Run Visual Studio Installer → select "Desktop development with C++" → install

# Step 3: Create a virtual environment
python -m venv airllm_env
airllm_env\Scripts\activate

# Step 4: Install PyTorch (CUDA 12.1 build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Step 5: Install AirLLM
pip install airllm

# Step 6 (optional): Install bitsandbytes if you want compression
pip install -U bitsandbytes

Note: if bitsandbytes fails to build on Windows, you can skip the compression feature. AirLLM's basic inference does not need it.
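
One way to keep a single script working whether or not bitsandbytes installed successfully is to choose the compression setting at runtime. The helper below is a hypothetical sketch: `pick_compression` is not part of AirLLM, and it only checks that the module can be found, not that it imports and runs correctly on your GPU.

```python
from importlib.util import find_spec
from typing import Optional


def pick_compression(preferred: str = "4bit",
                     module: str = "bitsandbytes") -> Optional[str]:
    """Return `preferred` if `module` is installed, otherwise None."""
    return preferred if find_spec(module) is not None else None


compression = pick_compression()
# Then pass it on: AutoModel.from_pretrained(model_id, compression=compression)
print(f"compression = {compression!r}")
```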

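III. macOS Installation (Apple Silicon only)

Only Apple Silicon (M1/M2/M3/M4 series chips) is supported; Intel Macs are not supported for now.

# Step 1: Create a virtual environment
python3 -m venv airllm_env
source airllm_env/bin/activate

# Step 2: Install the MLX framework (inside the virtual environment)
pip install mlx

# Step 3: Install PyTorch
pip install torch torchvision torchaudio

# Step 4: Install AirLLM
pip install airllm

On macOS, AirLLM relies on Apple's MLX framework, which runs on the Apple Silicon GPU through Metal. A 70B model can run on an M2 Ultra with 128GB of memory.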

IV. Docker Deployment (recommended for production)

# Pull the PyTorch CUDA image
docker pull pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

# Start a container (mounting the model directory)
docker run --gpus all -it --name airllm \
  -v /path/to/models:/models \
  pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime bash

# Install inside the container
pip install airllm

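V. Wrapping Your Own REST API (with FastAPI)

AirLLM itself only offers Python class calls. If you want to serve inference over HTTP, you can wrap it in a thin FastAPI layer. A complete implementation follows.

1. Install dependencies

# Install these in addition to airllm
pip install fastapi uvicorn pydantic

2. Create the API service script (app.py)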

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from airllm import AutoModel
import torch

app = FastAPI(title="AirLLM API")

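# Load the model globally (once, so requests do not reload it)
# Note: the first load splits the model into per-layer files on disk and takes a while (minutes to over ten minutes)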
MODEL_NAME = "garage-bAInd/Platypus2-70B-instruct"
print(f"Loading model: {MODEL_NAME}...")
model = AutoModel.from_pretrained(MODEL_NAME)
print("Model loaded!")

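# Request body schema
class GenerateRequest(BaseModel):
    prompt: str                 # input prompt
    max_new_tokens: int = 50    # maximum number of tokens to generate (default 50)
    temperature: float = 0.7    # sampling temperature (accepted by the schema but not passed to generate() in this example)

class GenerateResponse(BaseModel):
    generated_text: str         # generated text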

@app.get("/")
async def root():
    return {"message": "AirLLM API is running", "model": MODEL_NAME}

@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "ok", "model_loaded": model is not None}

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    """Text generation endpoint."""
    try:
        # Tokenize (prompts longer than 128 tokens are truncated)
        input_tokens = model.tokenizer(
            [request.prompt],
            return_tensors="pt",
            truncation=True,
            max_length=128
        )
        # Run inference
        with torch.no_grad():
            output = model.generate(
                input_tokens["input_ids"].cuda(),
                max_new_tokens=request.max_new_tokens,
                use_cache=True,
                return_dict_in_generate=True
            )
        # Decode
        result = model.tokenizer.decode(output.sequences[0], skip_special_tokens=True)
        return GenerateResponse(generated_text=result)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

3. Start the API service

python app.py

Once started, the service listens on http://0.0.0.0:8000 by default.

4. Test the API

# Health check
curl http://localhost:8000/health

# Text generation
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_new_tokens": 20}'

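Expected response (illustrative; the actual text depends on the model, and the decoded sequence also contains the prompt):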

{
  "generated_text": "The capital of France is Paris."
}

5. Call the API from Python

import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain quantum computing in one sentence", "max_new_tokens": 30}
)
print(response.json()["generated_text"])

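6. Production deployment recommendations

Run under Gunicorn with a Uvicorn worker: gunicorn -w 1 -k uvicorn.workers.UvicornWorker app:app (keep a single worker: each worker process loads its own copy of the model, so multiple workers duplicate the load and consume too much VRAM)
Add a request queue: use Celery or a simple asyncio queue to handle concurrent requests
Set timeouts: AirLLM's layer-by-layer inference is slow, so an API timeout of 300 seconds or more is recommended
Docker deployment: package app.py and requirements.txt into a Docker image for one-step deployment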
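
As a lightweight stand-in for the request queue suggested above, the sketch below serializes generation calls with an asyncio.Lock and runs the blocking call in a worker thread so the event loop stays responsive. It uses a lock rather than an explicit queue, and the names `run_serialized` and `_get_lock` are hypothetical, not part of AirLLM or FastAPI.

```python
import asyncio
from typing import Optional

_gpu_lock: Optional[asyncio.Lock] = None  # created lazily inside the running loop


def _get_lock() -> asyncio.Lock:
    """Create the lock on first use so it belongs to the server's event loop."""
    global _gpu_lock
    if _gpu_lock is None:
        _gpu_lock = asyncio.Lock()
    return _gpu_lock


async def run_serialized(blocking_fn, *args):
    """Run blocking_fn(*args) in a worker thread, one call at a time."""
    async with _get_lock():
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, blocking_fn, *args)
```

Inside the /generate endpoint, the model call could then be wrapped as output = await run_serialized(lambda: model.generate(...)), so concurrent requests wait their turn instead of competing for the GPU.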

VI. Try It for Free on Colab

Try Llama 3.1 405B with no installation: run on Colab

Try loading multiple models: multi-model Colab

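VII. AutoModel Configuration Parameters

model = AutoModel.from_pretrained(
    model_name_or_path,             # required: HuggingFace model ID or local path
    compression=None,               # optional: "4bit" / "8bit" / None
    layer_shards_saving_path=None,  # optional: where to save the split per-layer files
    hf_token=None,                  # optional: HuggingFace token for gated models (such as meta-llama)
    prefetching=True,               # optional: prefetch layers to speed up loading (on by default; only AirLLMLlama2 supports it)
    delete_original=False,          # optional: delete the original model after splitting to save disk space
    profiling_mode=False,           # optional: print the time spent in each stage
)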