Model Deployment: vLLM, TGI, ONNX, Quantization, GPU Optimization


Introduction





Deploying large language models for production inference requires specialized infrastructure. Unlike traditional ML models, LLMs demand gigabytes of GPU memory, specialized attention kernels, and careful batching strategies to achieve acceptable throughput. This article covers the major deployment frameworks and optimization techniques.





vLLM





vLLM is one of the most widely adopted open-source LLM serving frameworks, built around PagedAttention for efficient KV-cache memory management:






```python
# Using vLLM's OpenAI-compatible API server.
# Start the server first:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 \
#       --gpu-memory-utilization 0.95 \
#       --max-model-len 8192 \
#       --dtype bfloat16

from openai import OpenAI

# vLLM speaks the OpenAI wire protocol, so the standard client works.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-not-needed",  # vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

# Print tokens as they stream in.
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```







vLLM's PagedAttention manages the KV cache in fixed-size blocks, eliminating fragmentation so that nearly all of GPU memory can hold active sequences. It also implements continuous batching: the scheduler admits new requests into the running batch at each decoding step as earlier sequences finish, rather than waiting for a whole batch to drain.
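
For offline batch workloads, vLLM also has a direct Python API that applies the same scheduling internally. A minimal sketch (the model name and sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

# Offline batch inference: vLLM applies continuous batching and
# paged KV-cache management across all prompts internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["What is PagedAttention?", "Explain continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```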





Performance Tuning






```bash
# Key vLLM performance flags
--max-num-seqs 256             # Max sequences scheduled concurrently
--max-num-batched-tokens 8192  # Token budget per scheduler step
--enable-chunked-prefill       # Split long prefills into chunks batched with decodes
--enforce-eager                # Disable CUDA graphs (saves memory, slower decode)
```
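
These flags compose with the launch command shown earlier; for example (the values are illustrative and should be tuned per GPU and workload):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill
```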







Hugging Face TGI





Text Generation Inference (TGI) is Hugging Face's optimized serving solution:






```yaml
# docker-compose.yml for TGI
version: "3.8"
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    environment:
      - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3
      - NUM_SHARD=2
      - MAX_INPUT_TOKENS=4096
      - MAX_TOTAL_TOKENS=8192
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8080:80"
    volumes:
      - ~/.cache/huggingface:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
```
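
Once the container is up, TGI's HTTP API includes health and metadata routes that are handy for verifying the deployment (route names per TGI's documented API):

```bash
docker compose up -d
curl http://localhost:8080/health   # returns 200 once the model is loaded
curl http://localhost:8080/info     # model id, shard count, dtype
```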








```python
import requests

# Call TGI's native /generate endpoint.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain quantization in ML:",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
            "top_p": 0.95,
        },
    },
)
print(response.json()["generated_text"])
```







TGI provides native support for tensor parallelism across GPUs, watermarking, and speculative decoding for faster generation.
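
For interactive use, TGI also exposes a streaming variant, /generate_stream, which returns server-sent events. A minimal sketch of consuming it with requests (field names follow TGI's documented response schema):

```python
import json
import requests

# Stream tokens from TGI's /generate_stream SSE endpoint.
with requests.post(
    "http://localhost:8080/generate_stream",
    json={
        "inputs": "Explain quantization in ML:",
        "parameters": {"max_new_tokens": 256},
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="")
```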





ONNX Runtime





ONNX Runtime enables deployment across GPU and CPU with hardware-specific optimizations:






```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Load ONNX-optimized model; ONNX Runtime falls back to CPU
# if the CUDA execution provider is unavailable.
session = ort.InferenceSession(
    "model_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

tokenizer = AutoTokenizer.from_pretrained("model-name")

# Prepare inputs as numpy arrays keyed by the graph's input names.
inputs = tokenizer("Explain model quantization.", return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# Run inference; passing None returns all graph outputs.
outputs = session.run(None, onnx_inputs)
```







ONNX models require an initial conversion step but benefit from aggressive graph optimizations and operator fusion.
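
The conversion is typically done with Hugging Face Optimum, either via optimum-cli or directly in Python. A minimal sketch (the model name is illustrative):

```python
# Export a Hugging Face model to ONNX and run it through ONNX Runtime
# with a transformers-style interface. Equivalent CLI:
#   optimum-cli export onnx --model gpt2 onnx_out/
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Explain model quantization.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```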





Quantization





Quantization reduces model size and accelerates inference by using lower-precision numbers:






```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization with bitsandbytes (NF4, as used by QLoRA)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store 4-bit, compute in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```
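
A quick sanity check on the savings (get_memory_footprint is a transformers utility; exact numbers vary by model and overhead):

```python
# ~8B params at ~0.5 bytes/param plus overhead for NF4,
# versus roughly 16 GB for the same model in bf16.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```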







| Technique   | Bit Width | Size Reduction | Speedup (vs FP32) | Quality Loss |
|-------------|-----------|----------------|-------------------|--------------|
| FP16/BF16   | 16-bit    | 2x vs FP32     | 1.5-2x            | None         |
| INT8        | 8-bit     | 4x vs FP32     | 2-3x              | Minimal      |
| INT4 (GPTQ) | 4-bit     | 8x vs FP32     | 3-4x              | Minor        |
| INT4 (AWQ)  | 4-bit     | 8x vs FP32     | 3-4x              | Minor        |
| NF4         | 4-bit     | 8x vs FP32     | 3-4x              | Minor        |
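
To make these mappings concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain NumPy (not tied to any of the libraries above):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ~ w_q * scale."""
    scale = float(np.abs(w).max()) / 127.0   # map [-max, max] onto [-127, 127]
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

w = np.random.randn(4096, 4096).astype(np.float32)
w_q, scale = quantize_int8(w)
w_hat = w_q.astype(np.float32) * scale       # dequantize

print("fp32 bytes:", w.nbytes, "int8 bytes:", w_q.nbytes)  # 4x smaller
print("max abs error:", np.abs(w - w_hat).max())
```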





GPU Optimization





Beyond framework choice, several techniques maximize GPU utilization:






```python
import torch
from transformers import AutoModelForCausalLM

# Flash Attention 2: memory-efficient exact attention kernel
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# Continuous batching: process multiple requests concurrently
# (built into vLLM and TGI)

# Prefix caching: reuse KV cache for shared prompt prefixes
# (vLLM: --enable-prefix-caching)
```
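
Prefix caching pays off when many requests share a long prefix, such as a common system prompt. A minimal sketch against a vLLM server launched with --enable-prefix-caching (prompt contents are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-not-needed")

# Both requests share the same long system prompt; with prefix caching,
# the second request reuses the cached KV blocks for that prefix and
# only its user turn needs prefilling.
system = "You are a support assistant for ACME Corp. " * 50
for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```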







Conclusion





Deploying LLMs requires selecting the right serving framework and optimization level. vLLM offers strong memory efficiency through PagedAttention and continuous batching. TGI excels at Hugging Face ecosystem integration. ONNX Runtime provides cross-platform deployment across GPU and CPU. Quantization to 8-bit or 4-bit formats cuts memory requirements by 4-8x with minimal quality loss. Match your deployment stack to your latency, throughput, and budget requirements.