LLM inference with vLLM¶
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM provides efficient implementations of LLMs for inference and claims to deliver "up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes."
Performance considerations¶
To make good use of the resources, it is important to check the capacity of the different hardware available on Alvis (see GPU hardware details). We have tested some popular LLM models and list their inference performance below.
| Model (HuggingFace) | Node setup | Avg. prompt throughput (tokens/s) | Avg. generation throughput (tokens/s) |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 1x A100:1 | ≈ 3500 | ≈ 4200 |
| meta-llama/Llama-3.1-8B-Instruct | 1x A40:1 | ≈ 2100 | ≈ 2600 |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | 1x A100:1 | ≈ 3500 | ≈ 4400 |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | 1x A40:1 | ≈ 1900 | ≈ 2300 |
| meta-llama/Llama-3.1-70B-Instruct | 1x A40:4 | ≈ 600 | ≈ 800 |
| neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | 1x A100:2 | ≈ 800 | ≈ 1000 |
| neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | 1x A40:2 | ≈ 400 | ≈ 500 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 3x A100fat:4 | ≈ 800 | ≈ 900 |
| meta-llama/Llama-3.1-405B-Instruct-FP8 | 6x A100:4 | ≈ 700 | ≈ 800 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 1x A100fat:4 | ≈ 400 | ≈ 500 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 2x A100:4 | ≈ 500 | ≈ 600 |
| neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 | 2x A40:4 | ≈ 300 | ≈ 300 |
The above setups were verified on Alvis with the toy example below. The node setup required for a given LLM still depends on how you arrange the memory placement with vLLM; see the [vLLM documentation][vLLM performance] for further discussion.
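A quick way to guess a viable node setup is to compare the model's weight memory, roughly the number of parameters times the bytes per parameter, against the memory of the GPUs, leaving headroom for the KV cache that vLLM allocates for in-flight requests. The helper below is only a back-of-the-envelope sketch (the function and the 1.2 overhead factor are illustrative assumptions, not part of vLLM):

import math

# Illustrative helper (not part of vLLM): rough GPU count from weight memory alone.
def min_gpus(params_billion, bytes_per_param, gpu_mem_gb, overhead=1.2):
    """Weights need roughly params * bytes_per_param; add some headroom for the KV cache."""
    weights_gb = params_billion * bytes_per_param   # 1e9 parameters * bytes -> GB
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

# Llama-3.1-70B in bf16 (2 bytes/param) on 48 GB A40s:
print(min_gpus(70, 2.0, 48))    # -> 4, matching the 1x A40:4 row in the table
# The 4-bit w4a16 build (about 0.5 bytes/param) needs far less memory for the weights:
print(min_gpus(70, 0.5, 48))    # -> 1; the table uses A40:2 to leave room for the KV cache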
Example scripts¶
The vLLM server is intended for asynchronous serving of LLMs. For best performance, start vLLM as a server and let the clients query the generation outputs asynchronously. Below is an example of how this can be prepared as a Slurm job script: the client waits for the server to launch, and the server is terminated after the inference finishes.
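Since the server exposes an OpenAI-compatible HTTP API, a quick sanity check or a single prompt needs nothing beyond the Python standard library. A minimal synchronous sketch (assuming API_PORT and HF_MODEL are set as in the job scripts below):

#!/usr/bin/env python3
"""single_request.py: minimal synchronous request against a running vLLM server"""
import json
import os
import urllib.request

url = f"http://localhost:{os.environ['API_PORT']}/v1/completions"
payload = {
    "model": os.environ["HF_MODEL"],
    "prompt": "Say hello to Alvis.",
    "max_tokens": 50,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])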
Example python client¶
This is an example script that uses the vLLM server to generate text asynchronously. Note that the script reads the vLLM port number and the LLM model to use from the environment variables API_PORT and HF_MODEL:
#!/usr/bin/env python3
"""
async_text_gen.py: example python script for text generation with vLLM
"""
import aiohttp
import asyncio
import os
from time import time

# port of the vLLM server, set by the job script
vllm_port = os.environ["API_PORT"]


async def post_request(session, url, data):
    try:
        async with session.post(url, json=data) as response:
            if response.status == 200:
                response_data = await response.json()
                choices = response_data.get('choices', [{}])
                text = choices[0].get('text', 'no text')
                return data['prompt'], f"[text] {text}"
            else:
                return data['prompt'], f"[error] {response.status}"
    except asyncio.TimeoutError:
        return data['prompt'], "[error] timeout"


async def main():
    url = f"http://localhost:{vllm_port}/v1/completions"
    # Create a TCPConnector with a limited number of connections
    conn = aiohttp.TCPConnector(limit=512)
    async with aiohttp.ClientSession(connector=conn) as session:
        # send all prompts concurrently and gather the responses
        tasks = [
            post_request(session, url, {
                "model": os.environ['HF_MODEL'],
                "prompt": f"Hint is {idx}, make up some random things.",
            })
            for idx in range(10000)
        ]
        responses = await asyncio.gather(*tasks)
        for prompt, response in responses:
            print(f"prompt: {prompt}; response: {response}")


t0 = time()
asyncio.run(main())
print(f"Inference took {time() - t0} seconds")
The next example chats with a vision model through the /v1/chat/completions endpoint. It reads the same API_PORT and HF_MODEL environment variables and sends each request a publicly hosted random image together with a text question:

#!/usr/bin/env python3
"""
async_viz_chat.py: example python script for chatting with vision models with vLLM
"""
import aiohttp
import asyncio
import os
from time import time

# port of the vLLM server, set by the job script
vllm_port = os.environ["API_PORT"]


async def post_request(session, url, data):
    try:
        async with session.post(url, json=data) as response:
            if response.status == 200:
                response_data = await response.json()
                choices = response_data.get('choices', [{}])
                text = choices[0].get('message').get('content')
                return data['messages'][0]['content'], f"[text] {text}"
            else:
                return data['messages'][0]['content'], f"[error] {response.status}"
    except asyncio.TimeoutError:
        return data['messages'][0]['content'], "[error] timeout"


async def main():
    url = f"http://localhost:{vllm_port}/v1/chat/completions"
    # Create a TCPConnector with a limited number of connections
    conn = aiohttp.TCPConnector(limit=512)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [
            post_request(session, url, {
                "model": os.environ['HF_MODEL'],
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "image_url",
                            "image_url": {
                                # random images
                                "url": "https://picsum.photos/200/300",
                            },
                        },
                        {
                            "type": "text",
                            "text": "What is the image?",
                        },
                    ],
                }],
            })
            for idx in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        for prompt, response in responses:
            print(f"prompt: {prompt}; response: {response}")


t0 = time()
asyncio.run(main())
print(f"Inference took {time() - t0} seconds")
Example single-node server¶
#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --nodes 1
#SBATCH --gpus-per-node "A40:4"
module purge
export HF_MODEL=meta-llama/Llama-3.1-70B-Instruct
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.7.3
# start vllm server
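# --tensor-parallel-size shards the model across the GPUs of the node;
# --max-model-len caps the context length, which bounds the KV-cache memory per request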
vllm_opts="--tensor-parallel-size=${SLURM_GPUS_ON_NODE} --max-model-len=10000"
# options for vision models
vllm_vision_opts="--max-num-seqs=16 \
    --enforce-eager \
    --limit-mm-per-prompt=image=2,video=1 \
    --allowed-local-media-path=${HOME}"
vllm_vision_opts=""  # comment out this line when serving a vision model
export API_PORT=$(find_ports)
echo "Starting server node"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
--port ${API_PORT} ${vllm_opts} ${vllm_vision_opts} \
> vllm.out 2> vllm.err &
VLLM_PID=$!
sleep 20
# wait at most 10 min for the model to start, otherwise abort
if timeout 600 bash -c "tail -f vllm.err | grep -q 'Application startup complete'"; then
echo "Starting client"
apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
> client.out 2> client.err
else
echo "vLLM doesn't seem to start, aborting"
fi
echo "Terminating VLLM" && kill -15 ${VLLM_PID}
Example multi-node server¶
#!/bin/bash -l
#SBATCH -t 1:00:00
#SBATCH --ntasks-per-node=1 --cpus-per-task=64 --nodes 3
#SBATCH --gpus-per-node "A40:4"
export HEAD_HOSTNAME="$(hostname)"
export HF_MODEL=neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
export HF_HOME=/path/to/your/model/cache
export SIF_IMAGE=/path/to/vllm.sif
# e.g. apptainer build vllm.sif docker://vllm/vllm-openai:v0.6.3
# start ray cluster
export RAY_PORT=$(find_ports)
export RAY_CMD_HEAD="ray start --block --head --port=${RAY_PORT}"
export RAY_CMD_WORKER="ray start --block --address=${HEAD_HOSTNAME}:${RAY_PORT}"
srun -J "head ray node-step-%J" \
-N 1 --tasks-per-node=1 -w ${HEAD_HOSTNAME} \
apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!
sleep 10
srun -J "worker ray node-step-%J" \
-N $(( SLURM_NNODES-1 )) --tasks-per-node=1 -x ${HEAD_HOSTNAME} \
apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
RAY_WORKER_PID=$!
sleep 10
# start vllm
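# tensor parallelism across the 4 GPUs of each node and pipeline parallelism across
# the 3 nodes, so tensor-parallel-size x pipeline-parallel-size covers all 12 GPUs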
vllm_opts="--tensor-parallel-size=${SLURM_GPUS_ON_NODE} --pipeline-parallel-size=${SLURM_NNODES} --max-model-len=10000"
export API_PORT=$(find_ports)
echo "Starting server"
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
--port ${API_PORT} ${vllm_opts} \
> vllm.out 2> vllm.err &
VLLM_PID=$!
sleep 20
# wait at most 10 min for the model to start, otherwise abort
if timeout 600 bash -c "tail -f vllm.err | grep -q 'Uvicorn running on socket'"; then
echo "Starting client"
apptainer exec ${SIF_IMAGE} python3 async_text_gen.py \
> client.out 2> client.err
else
echo "vLLM doesn't seem to start, aborting"
fi
echo "Terminating VLLM" && kill -15 ${VLLM_PID}
echo "Terminating Ray workers" && kill -15 ${RAY_WORKER_PID}
echo "Terminating Ray head" && kill -15 ${RAY_HEAD_PID}