Introduction
Running large language models locally has become practical thanks to quantization techniques, efficient inference engines, and a thriving open-source ecosystem. Whether for privacy, cost savings, or offline availability, local LLMs offer a compelling alternative to cloud APIs for many workloads. This guide covers two of the most popular local LLM platforms: Ollama and LM Studio.
Ollama
[Ollama](https://ollama.ai) is one of the most widely used tools for running LLMs locally, known for its simplicity and command-line focus.
Installation
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download from https://ollama.ai/download
Getting Started
Ollama makes running a model a single command:
# Pull and run a model
ollama run llama3.2:3b
# List models already downloaded to this machine
ollama list
# Pull a specific model without running
ollama pull mistral:7b
Popular Models for Ollama
| Model | Size | RAM Required | Best For |
|-------|------|-------------|----------|
| Llama 3.2 3B | 2.0 GB | 4 GB | Fast responses, simple tasks |
| Llama 3.1 8B | 4.7 GB | 8 GB | General purpose Q&A |
| Mistral 7B | 4.1 GB | 8 GB | Code, reasoning, instruction following |
| Qwen2.5 7B | 4.8 GB | 8 GB | Strong multilingual, coding |
| Mixtral 8x7B | 26 GB | 48 GB | High quality, close to GPT-3.5 |
| DeepSeek-R1 7B | 4.5 GB | 8 GB | Strong reasoning, step-by-step |
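
The table can be read as a simple lookup: keep the models whose RAM requirement fits the machine. Here is a minimal sketch with the figures copied from the table above. The `MODELS` list and the `models_that_fit` helper are illustrative names, not part of Ollama.

```python
# Figures copied from the table above: (name, download size in GB, RAM required in GB).
MODELS = [
    ("Llama 3.2 3B", 2.0, 4),
    ("Llama 3.1 8B", 4.7, 8),
    ("Mistral 7B", 4.1, 8),
    ("Qwen2.5 7B", 4.8, 8),
    ("Mixtral 8x7B", 26.0, 48),
    ("DeepSeek-R1 7B", 4.5, 8),
]


def models_that_fit(ram_gb: float) -> list[str]:
    """Return the names of models whose listed RAM requirement is within ram_gb."""
    return [name for name, _size_gb, ram_required in MODELS if ram_required <= ram_gb]


print(models_that_fit(16))
```

The listed requirements are approximate, so treat the result as a starting point and leave headroom for the operating system and other applications.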
Using Ollama Programmatically
Ollama provides a REST API at `http://localhost:11434`:
import requests
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2:3b",
    "prompt": "Explain quantum computing in three sentences.",
    "stream": False
})
print(response.json()["response"])
Or use the official Python library:
import ollama
response = ollama.chat(model="llama3.2:3b", messages=[
    {"role": "user", "content": "What is the capital of France?"}
])
print(response["message"]["content"])
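The REST example above sets `"stream": False` to receive one complete response. With streaming enabled, the `/api/generate` endpoint returns newline-delimited JSON chunks, each carrying a `response` fragment and a `done` flag. The sketch below uses only the standard library. The helper names `iter_tokens` and `stream_generate` are illustrative and not part of any Ollama package.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434/api/generate"  # default address from the text above


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a stream of newline-delimited JSON chunks."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final chunk is marked done


def stream_generate(model: str, prompt: str) -> Iterator[str]:
    """POST a streaming generate request and yield fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_tokens(response)
```

With a server running, `for token in stream_generate("llama3.2:3b", "Why is the sky blue?"): print(token, end="", flush=True)` prints text as it is generated. Keeping the parsing in `iter_tokens` separate from the network call makes it easy to test without a server.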
Custom Modelfiles
Create custom models with system prompts and parameters:
FROM llama3.1:8b
# Set system prompt
SYSTEM "You are a helpful coding assistant. Provide concise code examples."
# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
Build and run:
ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
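When several variants of a custom model are needed, the Modelfile text can be generated rather than written by hand. This is a minimal sketch that emits only the `FROM`, `SYSTEM`, and `PARAMETER` instructions shown above. The `render_modelfile` helper is hypothetical, and it assumes the system prompt contains no double quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict[str, float]) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    if '"' in system_prompt:
        raise ValueError("this sketch does not escape double quotes in the system prompt")
    lines = [f"FROM {base_model}", f'SYSTEM "{system_prompt}"']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1:8b",
    "You are a helpful coding assistant. Provide concise code examples.",
    {"temperature": 0.3, "top_p": 0.9},
)
print(text)
```

Writing `text` to a file named `Modelfile` produces an input suitable for the `ollama create` command shown above.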
LM Studio
[LM Studio](https://lmstudio.ai) is a GUI-focused alternative that excels for users who prefer visual interfaces and easy model browsing.
Key Features
Setup
1. Download and install LM Studio from https://lmstudio.ai
2. Open the app and browse the model catalog
3. Download a model (start with Llama 3.2 3B or Mistral 7B)
4. Load the model and start chatting
API Server
LM Studio can serve models via an OpenAI-compatible API:
http://localhost:1234/v1/chat/completions
This means any tool that works with OpenAI's API can use your local model by changing the base URL:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # placeholder; the client requires some value
)
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
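Because the endpoint follows the OpenAI chat-completions request format, the same call can be made without the `openai` package. The following sketch uses only the standard library. The helper names are illustrative, and `local-model` is the placeholder model name from the example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio address from the text above


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def extract_reply(completion: dict) -> str:
    """Return the assistant text from a chat-completions response body."""
    return completion["choices"][0]["message"]["content"]


def chat(model: str, user_message: str) -> str:
    """Send one user message to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as response:
        return extract_reply(json.load(response))
```

With the LM Studio server running and a model loaded, `chat("local-model", "Hello!")` returns the reply as a string.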
Performance Optimization
Quantization Levels
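Quantization reduces model size at the cost of some accuracy.
Rule of thumb: model size grows in proportion to the bits stored per weight, so doubling the precision (for example, from Q4 to Q8) roughly doubles the file size while improving quality only marginally. Q4_K_M is a common sweet spot.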
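The size side of this trade-off can be estimated as parameter count times bits per weight, divided by eight. The sketch below uses rounded, approximate bits-per-weight figures for common GGUF quantization types. These figures are assumptions for illustration, and real files vary slightly by model.

```python
# Approximate bits per weight for common GGUF quantization types (assumed, rounded).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}


def estimate_size_gb(num_parameters: float, quant: str) -> float:
    """Estimate file size in GB (10^9 bytes) as parameters x bits per weight / 8."""
    return num_parameters * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{estimate_size_gb(8e9, quant):.1f} GB")
```

For an 8B model the Q4_K_M estimate lands close to the 4.7 GB listed for Llama 3.1 8B in the table above.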
GPU Acceleration
Both Ollama and LM Studio support GPU acceleration on NVIDIA (CUDA), Apple Silicon (Metal), and AMD (ROCm or Vulkan, depending on the tool and version):
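# Ollama uses Metal automatically on Apple Silicon
# For NVIDIA, install the NVIDIA driver; Ollama detects the GPU automatically
# Check whether a loaded model runs on GPU or CPU (see the PROCESSOR column)
ollama ps
# Print timing statistics such as tokens per second after each response
ollama run llama3.2:3b --verbose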
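GPU placement can also be checked from code. The sketch below assumes that Ollama's REST API exposes a `GET /api/ps` endpoint listing loaded models with `size` and `size_vram` fields. Verify this against the API documentation for your version. The helper names are illustrative.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # assumed endpoint for listing loaded models


def gpu_fraction(model_info: dict) -> float:
    """Return the share of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    total = model_info.get("size", 0)
    return model_info.get("size_vram", 0) / total if total else 0.0


def loaded_models() -> list[dict]:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(PS_URL) as response:
        return json.load(response).get("models", [])
```

A fraction of 1.0 suggests the model is fully offloaded to the GPU. Values below that indicate some layers are running on the CPU, which usually slows generation.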
Context Window
Larger context windows consume more memory. A 128K context with Q4_K_M requires approximately:
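As a rough way to reason about this, the key-value (KV) cache grows linearly with context length. The sketch below assumes a Llama 3.1 8B-style architecture (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. These values are assumptions for illustration. Weight quantization such as Q4_K_M shrinks the weights, not the cache, unless the runtime is also configured to quantize the cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for every layer and token position."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Llama 3.1 8B-style shape: 32 layers, 8 KV heads, head dimension 128.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=128 * 1024)
print(f"KV cache at 128K context: {cache / 2**30:.0f} GiB")  # prints 16 GiB
```

Under these assumptions the cache alone needs about 16 GiB at full context, several times the size of the Q4_K_M weights, which is why reducing the context length is often the quickest way to cut memory use.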
Use Cases for Local LLMs
Conclusion
Running LLMs locally is easier than ever with Ollama and LM Studio. Ollama offers command-line simplicity and a rich set of pre-built models. LM Studio provides a polished GUI and an OpenAI-compatible API. Start with a 7B model at Q4_K_M quantization on a machine with 8-16 GB of RAM, and scale up as your needs grow. Local LLMs won't replace cloud APIs for every use case, but they are an essential tool in the AI practitioner's toolkit.