llm-benchmark

  1. LLM benchmark
  2. Tools
    1. vLLM
    2. TensorRT-LLM

LLM benchmark

What metrics matter when deploying an LLM on local hardware?

  1. Token generation speed [tokens/s]

This is the core metric for estimating QPS (a rough conversion from token throughput to QPS is sketched after this list).

Metric                Meaning                              Impact on QPS
Decode tokens/s       Output tokens generated per second   Determines throughput (QPS)
Prefill tokens/s      Prompt (input) processing speed      Affects time to first token
KV cache VRAM usage   Cache required per token             Limits concurrency
Max batch size        Requests handled per batch           Directly determines QPS
  2. VRAM footprint

As a rule of thumb, leave roughly 20% of VRAM unallocated as headroom.

  3. Prefill latency (input-stage latency)

This determines the time to first token (TTFT).
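
Decode throughput and QPS are linked through the average output length per request. A minimal back-of-the-envelope sketch (both numbers below are assumptions for illustration, not measurements):

# Back-of-the-envelope QPS estimate from decode throughput.
# Both numbers are assumptions for illustration, not measurements.
decode_tok_per_s  = 125.0    # aggregate output tokens/s the server can sustain (assumed)
avg_output_tokens = 1024     # average tokens generated per request (assumed)

# At saturation, requests finish at roughly: aggregate decode rate / tokens per request.
est_qps = decode_tok_per_s / avg_output_tokens
print(f"Estimated QPS ≈ {est_qps:.2f} req/s")   # ~0.12 req/s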

Tools

vLLM

Use vLLM to deploy a Qwen3-8B model and measure throughput on a Tesla T4.

Qwen3-8B will not fit on the T4: 8B parameters in bf16 are already about 16 GB of weights, the card's entire VRAM, so test Qwen3-4B instead.

From the official documentation:

  1. vLLM provides an official Docker image for deployment. The image runs an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
docker pull vllm/vllm-openai:nightly-x86_64
  2. Start the server
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:nightly-x86_64 \
    --model Qwen/Qwen3-4B

Smoke test to check that the server responds:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
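
Beyond the smoke test, streaming the response and timing the first content chunk gives a rough TTFT reading. A minimal client sketch using the Python requests library against the OpenAI-compatible endpoint started above (the script is illustrative, not part of vLLM; endpoint and model name are copied from the commands above):

import json
import time

import requests

# Endpoint and model name match the vLLM server started above.
URL = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,     # server sends SSE chunks as tokens are generated
    "max_tokens": 128,
}

start = time.time()
first_token_at = None
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if first_token_at is None and delta.get("content"):
            first_token_at = time.time()    # first generated content arrived

print(f"TTFT ≈ {(first_token_at - start) * 1000:.0f} ms")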

Benchmark setup

  1. Enter the Docker container and test with vLLM's built-in tools
vllm serve Qwen/Qwen3-4B \
    --port 8000 \
    --max-model-len 10240 \
    --gpu-memory-utilization 0.95
  • --max-model-len

  • --gpu-memory-utilization

Without --max-model-len, vLLM uses the model's maximum context length (40960 tokens for Qwen3-4B), which immediately blows up a Tesla T4; a rough memory estimate is sketched below.

--gpu-memory-utilization sets the fraction of GPU memory vLLM may use; the default is 0.9.
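
To see why the full 40960-token context is tight on a 16 GB T4, here is a rough memory estimate. The Qwen3-4B architecture numbers below (layer count, KV heads, head dim) are assumptions for illustration; the real values are in the model's config.json:

# Rough VRAM estimate for Qwen3-4B on a 16 GB T4: weights + KV cache.
# The architecture numbers are assumptions for illustration; check config.json.
params          = 4e9    # ~4B parameters
bytes_per_param = 2      # bf16 weights
num_layers      = 36     # assumed
num_kv_heads    = 8      # assumed (GQA)
head_dim        = 128    # assumed
kv_dtype_bytes  = 2      # fp16/bf16 KV cache

weight_gb = params * bytes_per_param / 1e9

# Each layer stores one K and one V vector per KV head for every token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

for context_len in (10_240, 40_960):
    kv_gb = context_len * kv_bytes_per_token / 1e9
    print(f"context={context_len:>6}: weights ≈ {weight_gb:.1f} GB, "
          f"KV cache for one full-length sequence ≈ {kv_gb:.1f} GB")

With roughly 8 GB going to weights, a 16 GB T4 has barely enough left (if at all) for a full 40960-token KV cache plus activations and CUDA overhead, which matches the failure seen without --max-model-len.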

VRAM usage right after startup:

Alternatively, enter the container first and start the LLM server from inside:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    -p 8000:8000 \
    --ipc=host \
    -it --entrypoint /bin/bash \
    vllm/vllm-openai:nightly-x86_64

Benchmark run:

vllm bench serve \
    --model Qwen/Qwen3-4B \
    --host 127.0.0.1 \
    --port 8000 \
    --random-input-len 1024 \
    --random-output-len 1024  \
    --num-prompts  5

Run output:

Benchmark results:

tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     5         
Failed requests:                         0         
Benchmark duration (s):                  40.76     
Total input tokens:                      5120      
Total generated tokens:                  5120      
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         125.63    
Peak output token throughput (tok/s):    140.00    
Peak concurrent requests:                5.00      
Total Token throughput (tok/s):          251.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          934.99    
Median TTFT (ms):                        1306.70   
P99 TTFT (ms):                           1343.47   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.88     
Median TPOT (ms):                        38.53     
P99 TPOT (ms):                           39.67     
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.88     
Median ITL (ms):                         38.46     
P99 ITL (ms):                            41.42     
==================================================
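
The headline numbers above are internally consistent; a quick cross-check using only values copied from the report:

# Cross-check the reported numbers (values copied from the benchmark output above).
total_generated_tokens = 5120
benchmark_duration_s   = 40.76
num_requests           = 5
avg_output_len         = total_generated_tokens / num_requests   # 1024 tokens per request

output_tok_per_s = total_generated_tokens / benchmark_duration_s
requests_per_s   = output_tok_per_s / avg_output_len

# Small differences come from rounding in the printed report.
print(f"Output token throughput ≈ {output_tok_per_s:.2f} tok/s")  # reported: 125.63
print(f"Request throughput      ≈ {requests_per_s:.2f} req/s")    # reported: 0.12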

TensorRT-LLM

TensorRT-LLM is targeted primarily at A100 / H100 class GPUs; it is problematic on a T4.

Start the service from the official Docker image.

NVIDIA's official container registry (NGC) provides images for a wide range of deep-learning environments.

Following the official tutorial:

docker run --rm -it --ipc host --gpus all \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3

Open a bash shell inside the container:

docker exec -it <container_id> bash

Start the LLM server:

trtllm-serve "Qwen/Qwen3-8B"

Query the service:

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 1024,
        "temperature": 0
    }'

Run output:

Stress test

Reference: the TensorRT-LLM Benchmarking page in the official TensorRT-LLM docs.

The officially recommended workflow:

# Step 1: Prepare a synthetic dataset
python benchmarks/cpp/prepare_dataset.py \
    --stdout \
    --tokenizer Qwen/Qwen3-8B \
    token-norm-dist \
    --input-mean 128 \
    --output-mean 128 \
    --input-stdev 0 \
    --output-stdev 0 \
    --num-requests 3000 \
    > /tmp/synthetic_128_128.txt

# Step 2: Build the engine
trtllm-bench \
    --model Qwen/Qwen3-8B \
    build \
    --dataset /tmp/synthetic_128_128.txt 

# Step 3: Run the throughput benchmark
trtllm-bench \
    --model Qwen/Qwen3-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --engine_dir /tmp/Qwen/Qwen3-8B/tp_1_pp_1

The results:

===========================================================
= PYTORCH BACKEND
===========================================================
Model:                  Qwen/Qwen3-8B
Model Path:             None
TensorRT LLM Version:   1.2
Dtype:                  bfloat16
KV Cache Dtype:         None
Quantization:           None

===========================================================
= MACHINE DETAILS 
===========================================================
NVIDIA A100 80GB PCIe, memory 79.25 GB, 1.51 GHz

===========================================================
= REQUEST DETAILS 
===========================================================
Number of requests:             3000
Number of concurrent requests:  2235.6361
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 128.0000
===========================================================
= WORLD + RUNTIME INFORMATION 
===========================================================
TP Size:                1
PP Size:                1
EP Size:                None
Max Runtime Batch Size: 1792
Max Runtime Tokens:     3328
Scheduling Policy:      GUARANTEED_NO_EVICT
KV Memory Percentage:   90.00%
Issue Rate (req/sec):   8.9362E+13

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     42.9775
Total Output Throughput (tokens/sec):             5501.1246
Total Token Throughput (tokens/sec):              11002.2492
Total Latency (ms):                               69803.9090
Average request latency (ms):                     52018.7121
Per User Output Throughput [w/ ctx] (tps/user):   2.6867
Per GPU Output Throughput (tps/gpu):              5501.1246

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 49434.4831
[Latency] P90    : 69053.3882
[Latency] P95    : 69401.8254
[Latency] P99    : 69515.5505
[Latency] MINIMUM: 29191.0916
[Latency] MAXIMUM: 69536.4003
[Latency] AVERAGE: 52018.7121

===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /tmp/synthetic_128_128.txt
Number of Sequences:  3000

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   128.0000           128.0000           256.0000
MAX:   128.0000           128.0000           256.0000
AVG:   128.0000           128.0000           256.0000
P50:   128.0000           128.0000           256.0000
P90:   128.0000           128.0000           256.0000
P95:   128.0000           128.0000           256.0000
P99:   128.0000           128.0000           256.0000
===========================================================
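
As with the vLLM run, the headline numbers can be cross-checked from values copied out of the report:

# Cross-check the reported numbers (values copied from the trtllm-bench output above).
request_throughput = 42.9775      # req/s
tokens_per_request = 128 + 128    # input + output tokens per request
num_requests       = 3000
total_latency_s    = 69.8039      # "Total Latency (ms)" converted to seconds

print(f"Total token throughput ≈ {request_throughput * tokens_per_request:.1f} tok/s")  # reported: 11002.2
print(f"Request throughput     ≈ {num_requests / total_latency_s:.2f} req/s")           # reported: 42.98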
