Model Deployment: vLLM, TGI, ONNX, Quantization, GPU Optimization


Introduction





Deploying large language models for production inference requires specialized infrastructure. Unlike traditional ML models, LLMs demand gigabytes of GPU memory, specialized attention kernels, and careful batching strategies to achieve acceptable throughput. This article covers the major deployment frameworks and optimization techniques.





vLLM





vLLM is one of the most widely adopted open-source LLM serving frameworks, built around PagedAttention for efficient KV-cache memory management:






```python
# Using vLLM's OpenAI-compatible API server.
# Start the server first:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 \
#       --gpu-memory-utilization 0.95 \
#       --max-model-len 8192 \
#       --dtype bfloat16

from openai import OpenAI

# vLLM speaks the OpenAI wire protocol, so the standard client works.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-not-needed",  # vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

# Print tokens as they stream in.
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```







vLLM's PagedAttention manages the KV cache in fixed-size blocks, eliminating fragmentation so that nearly all of GPU memory can hold active sequences. It also implements continuous batching: the scheduler admits new requests into the running batch at each decoding step as earlier sequences finish, rather than waiting for a whole batch to drain.
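
For offline batch workloads, vLLM also has a direct Python API that applies the same scheduling internally. A minimal sketch (the model name and sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

# Offline batch inference: vLLM applies continuous batching and
# paged KV-cache management across all prompts internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["What is PagedAttention?", "Explain continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```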





Performance Tuning






```bash
# Key vLLM performance flags
--max-num-seqs 256             # Max sequences scheduled concurrently
--max-num-batched-tokens 8192  # Token budget per scheduler step
--enable-chunked-prefill       # Split long prefills into chunks batched with decodes
--enforce-eager                # Disable CUDA graphs (saves memory, slower decode)
```
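
These flags compose with the launch command shown earlier; for example (the values are illustrative and should be tuned per GPU and workload):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill
```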







Hugging Face TGI





Text Generation Inference (TGI) is Hugging Face's optimized serving solution:






```yaml
# docker-compose.yml for TGI
version: "3.8"
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    environment:
      - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3
      - NUM_SHARD=2
      - MAX_INPUT_TOKENS=4096
      - MAX_TOTAL_TOKENS=8192
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8080:80"
    volumes:
      - ~/.cache/huggingface:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```
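
Once the container is up, TGI's HTTP API includes health and metadata routes that are handy for verifying the deployment (route names per TGI's documented API):

```bash
docker compose up -d
curl http://localhost:8080/health   # returns 200 once the model is loaded
curl http://localhost:8080/info     # model id, shard count, dtype
```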








```python
import requests

# Call TGI's native /generate endpoint.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain quantization in ML:",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
            "top_p": 0.95,
        },
    },
)
print(response.json()["generated_text"])
```







TGI provides native support for tensor parallelism across GPUs, watermarking, and speculative decoding for faster generation.
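
For interactive use, TGI also exposes a streaming variant, /generate_stream, which returns server-sent events. A minimal sketch of consuming it with requests (field names follow TGI's documented response schema):

```python
import json
import requests

# Stream tokens from TGI's /generate_stream SSE endpoint.
with requests.post(
    "http://localhost:8080/generate_stream",
    json={
        "inputs": "Explain quantization in ML:",
        "parameters": {"max_new_tokens": 256},
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="")
```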





ONNX Runtime





ONNX Runtime enables deployment across GPU and CPU with hardware-specific optimizations:






```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Load ONNX-optimized model; ONNX Runtime falls back to CPU
# if the CUDA execution provider is unavailable.
session = ort.InferenceSession(
    "model_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

tokenizer = AutoTokenizer.from_pretrained("model-name")

# Prepare inputs as numpy arrays keyed by the graph's input names.
inputs = tokenizer("Explain model quantization.", return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# Run inference; passing None returns all graph outputs.
outputs = session.run(None, onnx_inputs)
```







ONNX models require an initial conversion step but benefit from aggressive graph optimizations and operator fusion.
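
The conversion is typically done with Hugging Face Optimum, either via optimum-cli or directly in Python. A minimal sketch (the model name is illustrative):

```python
# Export a Hugging Face model to ONNX and run it through ONNX Runtime
# with a transformers-style interface. Equivalent CLI:
#   optimum-cli export onnx --model gpt2 onnx_out/
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Explain model quantization.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```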





Quantization





Quantization reduces model size and accelerates inference by using lower-precision numbers:






```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization with bitsandbytes (NF4, as used by QLoRA)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store 4-bit, compute in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```
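
A quick sanity check on the savings (get_memory_footprint is a transformers utility; exact numbers vary by model and overhead):

```python
# ~8B params at ~0.5 bytes/param plus overhead for NF4,
# versus roughly 16 GB for the same model in bf16.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```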







| Technique   | Bit Width | Size Reduction | Speedup (vs FP32) | Quality Loss |
|-------------|-----------|----------------|-------------------|--------------|
| FP16/BF16   | 16-bit    | 2x vs FP32     | 1.5-2x            | None         |
| INT8        | 8-bit     | 4x vs FP32     | 2-3x              | Minimal      |
| INT4 (GPTQ) | 4-bit     | 8x vs FP32     | 3-4x              | Minor        |
| INT4 (AWQ)  | 4-bit     | 8x vs FP32     | 3-4x              | Minor        |
| NF4         | 4-bit     | 8x vs FP32     | 3-4x              | Minor        |
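
To make these mappings concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain NumPy (not tied to any of the libraries above):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ~ w_q * scale."""
    scale = float(np.abs(w).max()) / 127.0   # map [-max, max] onto [-127, 127]
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

w = np.random.randn(4096, 4096).astype(np.float32)
w_q, scale = quantize_int8(w)
w_hat = w_q.astype(np.float32) * scale       # dequantize

print("fp32 bytes:", w.nbytes, "int8 bytes:", w_q.nbytes)  # 4x smaller
print("max abs error:", np.abs(w - w_hat).max())
```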





GPU Optimization





Beyond framework choice, several techniques maximize GPU utilization:






```python
import torch
from transformers import AutoModelForCausalLM

# Flash Attention 2: memory-efficient exact attention kernel
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# Continuous batching: process multiple requests concurrently
# (built into vLLM and TGI)

# Prefix caching: reuse KV cache for shared prompt prefixes
# (vLLM: --enable-prefix-caching)
```
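
Prefix caching pays off when many requests share a long prefix, such as a common system prompt. A minimal sketch against a vLLM server launched with --enable-prefix-caching (prompt contents are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-not-needed")

# Both requests share the same long system prompt; with prefix caching,
# the second request reuses the cached KV blocks for that prefix and
# only its user turn needs prefilling.
system = "You are a support assistant for ACME Corp. " * 50
for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```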







Conclusion





Deploying LLMs requires selecting the right serving framework and optimization level. vLLM offers strong memory efficiency through PagedAttention and continuous batching. TGI excels at Hugging Face ecosystem integration. ONNX Runtime provides cross-platform deployment across GPU and CPU. Quantization to 8-bit or 4-bit formats cuts memory requirements by 4-8x with minimal quality loss. Match your deployment stack to your latency, throughput, and budget requirements.