
LLM inference with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM provides efficient inference implementations for LLMs and claims to deliver "up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes."

Note: vLLM support on Alvis is still under development. The scripts and benchmark scores here are provided in the hope that they are helpful. Please verify your setup and results against your own requirements. Suggestions for improvements are welcome!

Performance considerations

To make good use of the resources, it is important to check the capacity of the different hardware available on Alvis (see GPU hardware details). We tested some popular LLM models and list their inference performance below.

| Model (HuggingFace)                                        | Node setup   | avg. prompt (tokens/s) | avg. generation (tokens/s) |
|------------------------------------------------------------|--------------|------------------------|----------------------------|
| meta-llama/Llama-3.1-8B-Instruct                           | 1x A100:1    | ≈ 3500                 | ≈ 4200                     |
|                                                            | 1x A40:1     | ≈ 2100                 | ≈ 2600                     |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16     | 1x A100:1    | ≈ 3500                 | ≈ 4400                     |
|                                                            | 1x A40:1     | ≈ 1900                 | ≈ 2300                     |
| meta-llama/Llama-3.1-70B-Instruct                          | 1x A40:4     | ≈ 600                  | ≈ 800                      |
| neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16    | 1x A100:2    | ≈ 800                  | ≈ 1000                     |
|                                                            | 1x A40:2     | ≈ 400                  | ≈ 500                      |
| meta-llama/Llama-3.1-405B-Instruct-FP8                     | 3x A100fat:4 | ≈ 800                  | ≈ 900                      |
|                                                            | 6x A100:4    | ≈ 700                  | ≈ 800                      |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16   | 1x A100fat:4 | ≈ 400                  | ≈ 500                      |
|                                                            | 2x A100:4    | ≈ 500                  | ≈ 600                      |
|                                                            | 2x A40:4     | ≈ 300                  | ≈ 300                      |

The setups above were verified on Alvis with the toy example below. The node setup required for a given LLM still depends on how you arrange memory placement with vLLM; see the vLLM documentation for further discussion. A rough memory estimate is sketched below.
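
As a rough rule of thumb, the model weights alone occupy about (number of parameters) × (bytes per parameter) of GPU memory, and vLLM additionally needs room for the KV cache, activations and runtime overhead. The back-of-the-envelope sketch below illustrates this; the per-GPU memory sizes are assumptions here, so check them against the GPU hardware details page.

#!/usr/bin/env python3
# Back-of-the-envelope estimate of the GPU memory needed for the model weights
# alone. Actual usage is higher: vLLM also allocates memory for the KV cache,
# activations and runtime overhead, so leave generous headroom.

GPU_MEMORY_GIB = {"A40": 48, "A100": 40, "A100fat": 80}  # assumed values, verify against the hardware docs

def weights_gib(n_params: float, bits_per_param: int) -> float:
    """GiB needed to store the weights at the given precision."""
    return n_params * bits_per_param / 8 / 2**30

# Llama-3.1-70B: bf16 (16 bits/param) vs. the w4a16 quantization (4 bits/param)
for bits in (16, 4):
    need = weights_gib(70e9, bits)
    for gpu, mem in GPU_MEMORY_GIB.items():
        print(f"{bits:>2}-bit 70B weights: {need:5.0f} GiB  ->  at least {need / mem:.1f}x {gpu}")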

Example scripts

The vLLM server is intended for asynchronous serving of LLM models. For best performance, start vLLM as a server and let the client query the generation outputs asynchronously. The examples below show how this can be arranged in a Slurm job script: the client waits for the server to launch, and the server is terminated after the inference finishes.

Example python client

This is an example script that uses the vLLM server to generate text asynchronously. Note that the script reads the vLLM server port and the model name from the environment variables VLLM_PORT and HF_MODEL:

#!/usr/bin/env python3

"""
async_text_gen.py: example python script for text generation with vLLM
"""

import aiohttp
import asyncio
import os
from time import time

vllm_port = os.environ["VLLM_PORT"]

async def post_request(session, url, data):
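    """Send one completion request and return (prompt, generated text or error message)."""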
    try:
        async with session.post(url, json=data) as response:
            if response.status == 200:
                response_data = await response.json()
                choices = response_data.get('choices', [{}])
                text = choices[0].get('text', 'no text')
                return data['prompt'], f"[text] {text}"
            else:
                return data['prompt'], f"[error] {response.status}"
    except asyncio.TimeoutError:
        return data['prompt'], "[error] timeout"


async def main():
    url = f"http://localhost:{vllm_port}/v1/completions"

    # Create a TCPConnector with a limited number of connections
    conn = aiohttp.TCPConnector(limit=512)

    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [
            post_request(session, url, {
                "model": os.environ['HF_MODEL'],
                "prompt": f"Hint is {idx}, make up some random things."
            })
            for idx in range(10000)
        ]
        responses = await asyncio.gather(*tasks)

    for prompt, response in responses:
        print(f"prompt: {prompt}; response: {response}")


t0 = time()
asyncio.run(main())
print(f"Inference taks {time() - t0} seconds")

Example single-node server

#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --nodes 1
#SBATCH --gpus-per-node "A40:4"

export HF_MODEL=meta-llama/Llama-3.1-70B-Instruct
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.6.3

# start vllm server
export VLLM_OPT="--tensor-parallel-size=4 --max-model-len=10000"
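# pick a random free port by briefly binding to port 0 and reading back the assigned port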
export VLLM_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

echo "Starting server node"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
   --port ${VLLM_PORT} ${VLLM_OPT} \
   > vllm.out 2> vllm.err &
VLLM_PID=$!
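# give the server a few seconds to create vllm.err before it is tailed below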
sleep 20

# wait at most 5 min for the model to start, otherwise abort
if timeout 300 bash -c "tail -f vllm.err | grep -q 'Uvicorn running on socket'"; then
  echo "Starting client"
  apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
    > client.out 2> client.err
else
  echo "vLLM doesn't seem to start, aborting"
fi

echo "Terminating VLLM" && kill -15 ${VLLM_PID}

Example multi-node server

#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --ntasks-per-node=1 --cpus-per-task=64 --nodes 3
#SBATCH --gpus-per-node "A40:4"

export HEAD_HOSTNAME="$(hostname)"
export HF_MODEL=neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.6.3

# start ray cluster
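# the head node (this node) coordinates the Ray cluster; the remaining nodes join as workers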
export RAY_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
export RAY_CMD_HEAD="ray start --block --head --port=${RAY_PORT}"
export RAY_CMD_WORKER="ray start --block --address=${HEAD_HOSTNAME}:${RAY_PORT}"

srun -J "head ray node-step-%J" \
  -N 1 --tasks-per-node=1 -w ${HEAD_HOSTNAME} \
  apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!
sleep 10

srun -J "worker ray node-step-%J" \
  -N $(( SLURM_NNODES-1 )) --tasks-per-node=1 -x ${HEAD_HOSTNAME} \
  apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
RAY_WORKER_PID=$!
sleep 10

# start vllm
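# tensor parallelism spans the 4 GPUs within a node, pipeline parallelism spans the nodes;
# vLLM uses tensor-parallel-size x pipeline-parallel-size GPUs in total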
export VLLM_OPT="--tensor-parallel-size=4 --pipeline-parallel-size=${SLURM_NNODES} --max-model-len=10000"
export VLLM_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

echo "Starting server"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
  --port ${VLLM_PORT} ${VLLM_OPT} \
  > vllm.out  2> vllm.err &
VLLM_PID=$!
sleep 20

# wait at most 5 min for the model to start, otherwise abort
if timeout 300 bash -c "tail -f vllm.err | grep -q 'Uvicorn running on socket'"; then
  echo "Starting client"
  apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
    > client.out 2> client.err
else
  echo "vLLM doesn't seem to start, aborting"
fi

echo "Terminating VLLM" && kill -15 ${VLLM_PID}
echo "Terminating Ray workers" && kill -15 ${RAY_WORKER_PID}
echo "Terminating Ray head" && kill -15 ${RAY_HEAD_PID}

Useful external references