LLM inference with vLLM¶
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM provides efficient implementations of LLMs for inference and claims to deliver "up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes."
Performance considerations¶
To make good use of the resources, it is important to check the capacity of the different hardware available on Alvis (see GPU hardware details). We have tested some popular LLM models and list their inference performance below.
| Model (HuggingFace) | Node setup | Avg. prompt throughput (tokens/s) | Avg. generation throughput (tokens/s) |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 1x A100:1 | ≈ 3500 | ≈ 4200 |
| meta-llama/Llama-3.1-8B-Instruct | 1x A40:1 | ≈ 2100 | ≈ 2600 |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | 1x A100:1 | ≈ 3500 | ≈ 4400 |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | 1x A40:1 | ≈ 1900 | ≈ 2300 |
| meta-llama/Llama-3.1-70B-Instruct | 1x A40:4 | ≈ 600 | ≈ 800 |
| neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | 1x A100:2 | ≈ 800 | ≈ 1000 |
| neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | 1x A40:2 | ≈ 400 | ≈ 500 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 3x A100fat:4 | ≈ 800 | ≈ 900 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 6x A100:4 | ≈ 700 | ≈ 800 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 1x A100fat:4 | ≈ 400 | ≈ 500 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 2x A100:4 | ≈ 500 | ≈ 600 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 2x A40:4 | ≈ 300 | ≈ 300 |
The above setups were verified on Alvis with the toy example below. The node setup required for a given LLM still depends on how you arrange the memory placement with vLLM; see the [vLLM documentation][vLLM performance] for further discussion.
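A quick way to guess a viable node setup is to compare the model's weight memory, roughly the number of parameters times the bytes per parameter, against the memory of the GPUs, leaving headroom for the KV cache that vLLM allocates for in-flight requests. The helper below is only a back-of-the-envelope sketch (the function and the 1.2 overhead factor are illustrative assumptions, not part of vLLM):

import math

# Illustrative helper (not part of vLLM): rough GPU count from weight memory alone.
def min_gpus(params_billion, bytes_per_param, gpu_mem_gb, overhead=1.2):
    """Weights need roughly params * bytes_per_param; add some headroom for the KV cache."""
    weights_gb = params_billion * bytes_per_param   # 1e9 parameters * bytes -> GB
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

# Llama-3.1-70B in bf16 (2 bytes/param) on 48 GB A40s:
print(min_gpus(70, 2.0, 48))    # -> 4, matching the 1x A40:4 row in the table
# The 4-bit w4a16 build (about 0.5 bytes/param) needs far less memory for the weights:
print(min_gpus(70, 0.5, 48))    # -> 1; the table uses A40:2 to leave room for the KV cache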
Example scripts¶
The vLLM server is intended for asynchronous serving of LLMs. For best performance, start vLLM as a server and let the clients query the generation outputs asynchronously. Below is an example of how this can be prepared as a Slurm job script: the client waits for the server to launch, and the server is terminated after the inference finishes.
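Since the server exposes an OpenAI-compatible HTTP API, a quick sanity check or a single prompt needs nothing beyond the Python standard library. A minimal synchronous sketch (assuming API_PORT and HF_MODEL are set as in the job scripts below):

#!/usr/bin/env python3
"""single_request.py: minimal synchronous request against a running vLLM server"""
import json
import os
import urllib.request

url = f"http://localhost:{os.environ['API_PORT']}/v1/completions"
payload = {
    "model": os.environ["HF_MODEL"],
    "prompt": "Say hello to Alvis.",
    "max_tokens": 50,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])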
Example python client¶
This is an example script that uses the vLLM server to generate text asynchronously. Note that the script reads the vLLM port number and the LLM model to use from the environment variables API_PORT and HF_MODEL:
#!/usr/bin/env python3
"""
async_text_gen.py: example python script for text generation with vLLM
"""
import aiohttp
import asyncio
import os
from time import time

# port of the vLLM server, set by the job script
vllm_port = os.environ["API_PORT"]


async def post_request(session, url, data):
    try:
        async with session.post(url, json=data) as response:
            if response.status == 200:
                response_data = await response.json()
                choices = response_data.get('choices', [{}])
                text = choices[0].get('text', 'no text')
                return data['prompt'], f"[text] {text}"
            else:
                return data['prompt'], f"[error] {response.status}"
    except asyncio.TimeoutError:
        return data['prompt'], "[error] timeout"


async def main():
    url = f"http://localhost:{vllm_port}/v1/completions"
    # Create a TCPConnector with a limited number of connections
    conn = aiohttp.TCPConnector(limit=512)
    async with aiohttp.ClientSession(connector=conn) as session:
        # send all prompts concurrently and gather the responses
        tasks = [
            post_request(session, url, {
                "model": os.environ['HF_MODEL'],
                "prompt": f"Hint is {idx}, make up some random things.",
            })
            for idx in range(10000)
        ]
        responses = await asyncio.gather(*tasks)
        for prompt, response in responses:
            print(f"prompt: {prompt}; response: {response}")


t0 = time()
asyncio.run(main())
print(f"Inference took {time() - t0} seconds")
The next example chats with a vision model through the /v1/chat/completions endpoint. It reads the same API_PORT and HF_MODEL environment variables and sends each request a publicly hosted random image together with a text question:

#!/usr/bin/env python3
"""
async_viz_chat.py: example python script for chatting with vision models with vLLM
"""
import aiohttp
import asyncio
import os
from time import time

# port of the vLLM server, set by the job script
vllm_port = os.environ["API_PORT"]


async def post_request(session, url, data):
    try:
        async with session.post(url, json=data) as response:
            if response.status == 200:
                response_data = await response.json()
                choices = response_data.get('choices', [{}])
                text = choices[0].get('message').get('content')
                return data['messages'][0]['content'], f"[text] {text}"
            else:
                return data['messages'][0]['content'], f"[error] {response.status}"
    except asyncio.TimeoutError:
        return data['messages'][0]['content'], "[error] timeout"


async def main():
    url = f"http://localhost:{vllm_port}/v1/chat/completions"
    # Create a TCPConnector with a limited number of connections
    conn = aiohttp.TCPConnector(limit=512)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [
            post_request(session, url, {
                "model": os.environ['HF_MODEL'],
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "image_url",
                            "image_url": {
                                # random images
                                "url": "https://picsum.photos/200/300",
                            },
                        },
                        {
                            "type": "text",
                            "text": "What is the image?",
                        },
                    ],
                }],
            })
            for idx in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        for prompt, response in responses:
            print(f"prompt: {prompt}; response: {response}")


t0 = time()
asyncio.run(main())
print(f"Inference took {time() - t0} seconds")
Example single-node server¶
#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --nodes 1
#SBATCH --gpus-per-node "A40:4"
module purge
export HF_MODEL=meta-llama/Llama-3.1-70B-Instruct
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.7.3
# start vllm server
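# --tensor-parallel-size shards the model across the GPUs of the node;
# --max-model-len caps the context length, which bounds the KV-cache memory per request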
vllm_opts="--tensor-parallel-size=${SLURM_GPUS_ON_NODE} --max-model-len=10000"
# options for vision models
vllm_vision_opts="--max-num-seqs=16 \
    --enforce-eager \
    --limit-mm-per-prompt=image=2,video=1 \
    --allowed-local-media-path=${HOME}"
vllm_vision_opts=""  # comment out this line when serving a vision model
export API_PORT=$(find_ports)
echo "Starting server node"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
--port ${API_PORT} ${vllm_opts} ${vllm_vision_opts} \
> vllm.out 2> vllm.err &
VLLM_PID=$!
sleep 20
# wait at most 10 min for the model to start, otherwise abort
if timeout 600 bash -c "tail -f vllm.err | grep -q 'Application startup complete'"; then
echo "Starting client"
apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
> client.out 2> client.err
else
echo "vLLM doesn't seem to start, aborting"
fi
echo "Terminating VLLM" && kill -15 ${VLLM_PID}
Example multi-node server¶
#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --ntasks-per-node=1 --cpus-per-task=64 --nodes 3
#SBATCH --gpus-per-node "A40:4"
export HEAD_HOSTNAME="$(hostname)"
export HF_MODEL=neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.6.3
# start ray cluster
export RAY_PORT=$(find_ports)
export RAY_CMD_HEAD="ray start --block --head --port=${RAY_PORT}"
export RAY_CMD_WORKER="ray start --block --address=${HEAD_HOSTNAME}:${RAY_PORT}"
srun -J "head ray node-step-%J" \
-N 1 --tasks-per-node=1 -w ${HEAD_HOSTNAME} \
apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!
sleep 10
srun -J "worker ray node-step-%J" \
-N $(( SLURM_NNODES-1 )) --tasks-per-node=1 -x ${HEAD_HOSTNAME} \
apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
RAY_WORKER_PID=$!
sleep 10
# start vllm
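# tensor parallelism across the 4 GPUs of each node and pipeline parallelism across
# the 3 nodes, so tensor-parallel-size x pipeline-parallel-size covers all 12 GPUs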
vllm_opts="--tensor-parallel-size=${SLURM_GPUS_ON_NODE} --pipeline-parallel-size=${SLURM_NNODES} --max-model-len=10000"
export API_PORT=$(find_ports)
echo "Starting server"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
--port ${API_PORT} ${vllm_opts} \
> vllm.out 2> vllm.err &
VLLM_PID=$!
sleep 20
# wait at most 10 min for the model to start, otherwise abort
if timeout 600 bash -c "tail -f vllm.err | grep -q 'Uvicorn running on socket'"; then
echo "Starting client"
apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
> client.out 2> client.err
else
echo "vLLM doesn't seem to start, aborting"
fi
echo "Terminating VLLM" && kill -15 ${VLLM_PID}
echo "Terminating Ray workers" && kill -15 ${RAY_WORKER_PID}
echo "Terminating Ray head" && kill -15 ${RAY_HEAD_PID}