# Running LLMs Locally

## Introduction
Running large language models locally has become practical thanks to quantization techniques, efficient inference engines, and a thriving open-source ecosystem. Whether for privacy, cost savings, or offline availability, local LLMs offer a compelling alternative to cloud APIs for many workloads. This guide covers the two most popular local LLM platforms — Ollama and LM Studio.
## Ollama
[Ollama](https://ollama.ai) is the most popular tool for running LLMs locally, known for its simplicity and command-line focus.
### Installation

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download from https://ollama.ai/download
```

### Getting Started
Ollama makes running a model a single command:
```bash
# Pull and run a model
ollama run llama3.2:3b

# List locally installed models
ollama list

# Pull a specific model without running it
ollama pull mistral:7b
```

### Popular Models for Ollama
| Model | Size | RAM Required | Best For |
|-------|------|-------------|----------|
| Llama 3.2 3B | 2.0 GB | 4 GB | Fast responses, simple tasks |
| Llama 3.1 8B | 4.7 GB | 8 GB | General-purpose Q&A |
| Mistral 7B | 4.1 GB | 8 GB | Code, reasoning, instruction following |
| Qwen2.5 7B | 4.8 GB | 8 GB | Strong multilingual, coding |
| Mixtral 8x7B | 26 GB | 48 GB | High quality, close to GPT-3.5 |
| DeepSeek-R1 7B | 4.5 GB | 8 GB | Strong reasoning, step-by-step |
### Using Ollama Programmatically
Ollama provides a REST API at `http://localhost:11434`:
```python
import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2:3b",
    "prompt": "Explain quantum computing in three sentences.",
    "stream": False
})
print(response.json()["response"])
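```

By default the generate endpoint streams its output; the example above disables that with `"stream": False`. The sketch below shows one way to consume the streamed form with `requests`. The chunk fields (`response`, `done`) follow Ollama's generate API, but treat this as illustrative rather than a definitive client.

```python
import json
import requests

# Stream tokens as they are generated instead of waiting for the full reply
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Explain quantum computing in three sentences."},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)              # each line is one JSON object
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):                 # final chunk marks completion
            print()
```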
Or use the official Python library:
```python
import ollama

response = ollama.chat(model="llama3.2:3b", messages=[
    {"role": "user", "content": "What is the capital of France?"}
])
print(response["message"]["content"])
### Custom Modelfiles
Create custom models with system prompts and parameters:
```
FROM llama3.1:8b

# Set system prompt
SYSTEM "You are a helpful coding assistant. Provide concise code examples."

# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
```
Build and run:
```bash
ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
```
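Once built, the custom model can be used anywhere a model name is accepted. For instance, with the Python library shown earlier (a minimal sketch using the `my-coding-assistant` name created above):

```python
import ollama

# The custom model is addressed by the name passed to `ollama create`
response = ollama.chat(
    model="my-coding-assistant",
    messages=[{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
)
print(response["message"]["content"])
```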
## LM Studio
[LM Studio](https://lmstudio.ai) is a GUI-focused alternative that suits users who prefer a visual interface and easy model browsing.
### Key Features
* **Built-in model browser**: Search and download models from Hugging Face
* **GUI chat interface**: Familiar ChatGPT-like experience
* **Local API server**: OpenAI-compatible API endpoint
* **Model configuration**: Easy sliders for context length, GPU offloading, and temperature
* **Multi-model support**: Load multiple models and switch between them
### Setup

1. Download from [lmstudio.ai](https://lmstudio.ai)
2. Open the app and browse the model catalog
3. Download a model (start with Llama 3.2 3B or Mistral 7B)
4. Load the model and start chatting
### API Server

LM Studio can serve models via an OpenAI-compatible API:

```
http://localhost:1234/v1/chat/completions
```
This means any tool that works with OpenAI's API can use your local model by changing the base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
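```

Streaming works through the same client. A minimal sketch using the standard `stream=True` flag of the OpenAI SDK, with the endpoint and model name from the example above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Stream the completion token by token
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```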
## Performance Optimization

### Quantization Levels
Quantization reduces model size at the cost of some accuracy:
* **Q4_K_M**: Best balance of quality and size (4-bit, recommended)
* **Q5_K_M**: Higher quality, larger (5-bit, use if you have RAM headroom)
* **Q8_0**: Near-full quality, roughly twice the RAM of Q4_K_M
* **Q2_K**: Minimal RAM, noticeable quality loss
Rule of thumb: file size scales roughly with bits per weight, so Q8_0 is about twice the size of Q4_K_M, while the quality gains shrink with each step up. Q4_K_M is the sweet spot for most hardware.
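To get a feel for these numbers, the back-of-the-envelope estimate below is a minimal sketch: it assumes size is approximately parameters times bits-per-weight divided by 8, and the bits-per-weight values are rough averages for llama.cpp-style quants, ignoring embeddings, metadata, and runtime overhead.

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters x bits per weight, converted to gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate average bits per weight for each quantization level (assumed values)
for name, bits in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"7B at {name}: ~{approx_model_size_gb(7, bits):.1f} GB")
```

The Q4_K_M estimate (~4.2 GB for a 7B model) lines up with the sizes in the model table above.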
### GPU Acceleration

Both Ollama and LM Studio support GPU acceleration via CUDA (NVIDIA), Metal (Apple Silicon), or ROCm/Vulkan (AMD):

```bash
# Ollama uses Metal automatically on Apple Silicon
# For NVIDIA, install CUDA and Ollama detects it

# Show detailed timing stats (tokens/sec) for a run
ollama run llama3.2:3b --verbose

# See whether a loaded model is running on GPU or CPU
ollama ps
```
### Context Window

Larger context windows consume more memory because the KV cache grows with every token of context. At Q4_K_M with a moderate context (a few thousand tokens), total memory usage is approximately:

* 7B model: ~8 GB total
* 13B model: ~14 GB total
* 70B model: ~48 GB total

Pushing the context toward its maximum (for example 128K tokens) can require many additional gigabytes for the KV cache alone, so only raise the context length as far as you actually need.
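The calculation below is a minimal sketch of KV-cache growth, assuming a standard transformer cache (keys and values per layer, 16-bit values) and illustrative architecture numbers for an 8B-class model with grouped-query attention; actual memory use depends on the model and the runtime (some runtimes also quantize the cache).

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size: keys + values for every layer, KV head, and token."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return bytes_total / 1e9

# Illustrative numbers for an 8B-class model: 32 layers, 8 KV heads, head dim 128
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(32, 8, 128, ctx):.1f} GB KV cache")
```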
## Use Cases for Local LLMs
* **Privacy-sensitive data**: Medical records, legal documents, personal information
* **Offline environments**: Air-gapped systems, travel, remote locations
* **Cost-sensitive workloads**: High-volume batch processing
* **Experimentation**: Rapid testing of different models without API costs
* **Latency-critical applications**: No network calls for inference
## Conclusion
Running LLMs locally is easier than ever with Ollama and LM Studio. Ollama offers command-line simplicity and a rich set of pre-built models. LM Studio provides a polished GUI and OpenAI-compatible API. Start with a 7B model at Q4_K_M quantization on a machine with 8-16 GB of RAM, and scale up as your needs grow. Local LLMs won't replace cloud APIs for every use case, but they are an essential tool in the AI practitioner's toolkit.