LLM inference with vLLM¶
vLLM is a fast and easy-to-use library for LLM inference and serving.
It provides efficient implementations for LLM inference and claims to deliver "up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes."
Note: vLLM support on Alvis is still under development. The scripts and benchmark scores here are provided in the hope that they are helpful. Please verify your setup and results against your own requirements. Suggestions for improvements are welcome!
Performance considerations¶
To make good use of the resources, check the capacity of the different hardware available on Alvis (GPU hardware details). We tested some popular LLM models and list their inference performance below.
| Model (HuggingFace) | Node setup | avg. prompt (tokens/s) | avg. generation (tokens/s) |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 1x A100:1 | ≈ 3500 | ≈ 4200 |
| | 1x A40:1 | ≈ 2100 | ≈ 2600 |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | 1x A100:1 | ≈ 3500 | ≈ 4400 |
| | 1x A40:1 | ≈ 1900 | ≈ 2300 |
| meta-llama/Llama-3.1-70B-Instruct | 1x A40:4 | ≈ 600 | ≈ 800 |
| neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | 1x A100:2 | ≈ 800 | ≈ 1000 |
| | 1x A40:2 | ≈ 400 | ≈ 500 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 3x A100fat:4 | ≈ 800 | ≈ 900 |
| | 6x A100:4 | ≈ 700 | ≈ 800 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 1x A100fat:4 | ≈ 400 | ≈ 500 |
| | 2x A100:4 | ≈ 500 | ≈ 600 |
| | 2x A40:4 | ≈ 300 | ≈ 300 |
The setups above were verified on Alvis with the toy example below. The node setup required for a given LLM still depends on how you arrange memory placement with vLLM; see the vLLM documentation for further discussion, and the sketch after this paragraph for how a node setup maps to vLLM options.
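The "Node setup" column describes how a model is split across GPUs and nodes. As a rough sketch (the flag values below are illustrative assumptions, not a verified configuration), a 2x A100:4 placement would typically map to vLLM's parallelism and memory options as follows:

# Sketch: expressing a "2x A100:4" placement with vLLM's parallelism options.
# --tensor-parallel-size   splits each layer across the GPUs within a node
# --pipeline-parallel-size splits the layers across nodes
# --max-model-len          caps the context length to reduce KV-cache memory
# --gpu-memory-utilization fraction of GPU memory vLLM may claim (0.9 is the default)
vllm serve ${HF_MODEL} \
    --tensor-parallel-size=4 \
    --pipeline-parallel-size=2 \
    --max-model-len=10000 \
    --gpu-memory-utilization=0.90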
Example scripts¶
The vLLM server is intended for asynchronous serving of LLM models. For best performance, start vLLM as a server and let the client query the generation outputs asynchronously. Below is an example of how this can be set up as a Slurm job script: the client waits for the server to launch, and the server is terminated after the inference finishes.
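Before running the full asynchronous client, it can be useful to check that the server answers at all. A minimal sketch, assuming the server from the job scripts below is already running and that VLLM_PORT and HF_MODEL are set in your shell:

# single test request against the OpenAI-compatible completions endpoint
curl -s http://localhost:${VLLM_PORT}/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${HF_MODEL}\", \"prompt\": \"Say hello.\", \"max_tokens\": 32}"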
Example python client¶
This is an example script that uses the vLLM server to generate text asynchronously. Note that the script reads the vLLM port number and the LLM model name from the environment variables VLLM_PORT and HF_MODEL:
#!/usr/bin/env python3
"""
async_text_gen.py: example python script for text generation with vLLM
"""
import aiohttp
import asyncio
import os
from time import time

# Port of the running vLLM server, provided by the job script
vllm_port = os.environ["VLLM_PORT"]


async def post_request(session, url, data):
    """Send one completion request and return (prompt, result)."""
    try:
        async with session.post(url, json=data) as response:
            if response.status == 200:
                response_data = await response.json()
                choices = response_data.get('choices', [{}])
                text = choices[0].get('text', 'no text')
                return data['prompt'], f"[text] {text}"
            else:
                return data['prompt'], f"[error] {response.status}"
    except asyncio.TimeoutError:
        return data['prompt'], "[error] timeout"


async def main():
    url = f"http://localhost:{vllm_port}/v1/completions"
    # Create a TCPConnector with a limited number of connections
    conn = aiohttp.TCPConnector(limit=512)
    async with aiohttp.ClientSession(connector=conn) as session:
        # Queue all requests at once and let the server batch them
        tasks = [
            post_request(session, url, {
                "model": os.environ['HF_MODEL'],
                "prompt": f"Hint is {idx}, make up some random things."
            })
            for idx in range(10000)
        ]
        responses = await asyncio.gather(*tasks)
        for prompt, response in responses:
            print(f"prompt: {prompt}; response: {response}")


t0 = time()
asyncio.run(main())
print(f"Inference takes {time() - t0} seconds")
Example single-node server¶
#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --nodes 1
#SBATCH --gpus-per-node "A40:4"

export HF_MODEL=meta-llama/Llama-3.1-70B-Instruct
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.6.3

# start vllm server
export VLLM_OPT="--tensor-parallel-size=4 --max-model-len=10000"
export VLLM_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "Starting server node"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
    --port ${VLLM_PORT} ${VLLM_OPT} \
    > vllm.out 2> vllm.err &
VLLM_PID=$!
sleep 20

# wait at most 5 min for the model to start, otherwise abort
if timeout 300 bash -c "tail -f vllm.err | grep -q 'Uvicorn running on socket'"; then
    echo "Starting client"
    apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
        > client.out 2> client.err
else
    echo "vLLM doesn't seem to start, aborting"
fi
echo "Terminating VLLM" && kill -15 ${VLLM_PID}
Example multi-node server¶
#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --ntasks-per-node=1 --cpus-per-task=64 --nodes 3
#SBATCH --gpus-per-node "A40:4"

export HEAD_HOSTNAME="$(hostname)"
export HF_MODEL=neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.6.3

# start ray cluster
export RAY_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
export RAY_CMD_HEAD="ray start --block --head --port=${RAY_PORT}"
export RAY_CMD_WORKER="ray start --block --address=${HEAD_HOSTNAME}:${RAY_PORT}"
srun -J "head ray node-step-%J" \
    -N 1 --tasks-per-node=1 -w ${HEAD_HOSTNAME} \
    apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!
sleep 10
srun -J "worker ray node-step-%J" \
    -N $(( SLURM_NNODES-1 )) --tasks-per-node=1 -x ${HEAD_HOSTNAME} \
    apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
RAY_WORKER_PID=$!
sleep 10

# start vllm
export VLLM_OPT="--tensor-parallel-size=4 --pipeline-parallel-size=${SLURM_NNODES} --max-model-len=10000"
export VLLM_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "Starting server"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
    --port ${VLLM_PORT} ${VLLM_OPT} \
    > vllm.out 2> vllm.err &
VLLM_PID=$!
sleep 20

# wait at most 5 min for the model to start, otherwise abort
if timeout 300 bash -c "tail -f vllm.err | grep -q 'Uvicorn running on socket'"; then
    echo "Starting client"
    apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
        > client.out 2> client.err
else
    echo "vLLM doesn't seem to start, aborting"
fi
echo "Terminating VLLM" && kill -15 ${VLLM_PID}
echo "Terminating Ray workers" && kill -15 ${RAY_WORKER_PID}
echo "Terminating Ray head" && kill -15 ${RAY_HEAD_PID}