LLM Fine-Tuning Strategies and Techniques

Fine-tuning adapts a pre-trained language model to specific tasks or domains. Different fine-tuning approaches offer trade-offs between customization, cost, and performance.


Full Fine-Tuning


Full fine-tuning updates all model parameters on domain-specific data. This achieves the highest task performance but requires significant computational resources: fully fine-tuning a 7B-parameter model typically requires 4-8 GPUs with 80GB of memory each, depending on batch size, sequence length, and parallelism strategy.


Full fine-tuning is appropriate for domain adaptation (legal, medical, code) where broad knowledge transfer is needed. Training data is typically on the order of 10,000-100,000 high-quality examples. The resulting checkpoint is the same size as the original model; the extra cost appears during training, where AdamW's two optimizer states add roughly twice the full-precision weight size in additional memory.
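
A back-of-the-envelope memory estimate shows why a single GPU is not enough. The sketch below assumes mixed-precision training with AdamW (bf16 weights and gradients, fp32 master weights and optimizer states), which is a common but not universal configuration, and it ignores activation memory.

```python
# Rough training-memory estimate for full fine-tuning with AdamW under
# mixed precision. Byte counts per parameter are assumptions for a typical
# setup and exclude activations, which add a workload-dependent amount.

def full_finetune_memory_gb(n_params: float) -> dict:
    bytes_per_param = {
        "weights (bf16)": 2,
        "gradients (bf16)": 2,
        "fp32 master weights": 4,
        "AdamW momentum (fp32)": 4,
        "AdamW variance (fp32)": 4,
    }
    gb = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    gb["total (excl. activations)"] = sum(gb.values())
    return gb

if __name__ == "__main__":
    for name, size in full_finetune_memory_gb(7e9).items():
        print(f"{name:28s} {size:6.1f} GB")
    # Roughly 112 GB before activations, so a 7B model does not fit on a
    # single 80 GB GPU and is typically sharded across several.
```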


LoRA


Low-Rank Adaptation (LoRA) freezes the original model weights and inserts small trainable rank-decomposition matrices alongside them. For large models this has been reported to cut trainable parameters by up to ~10,000x and GPU memory requirements by roughly 3x. LoRA adapters are small (typically 10-100MB) and can be swapped at runtime.
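
The savings come straight from the factorization: a frozen weight matrix W of shape d_out x d_in gets trainable matrices B (d_out x r) and A (r x d_in), so only r * (d_in + d_out) parameters are trained per targeted matrix. A minimal sketch with assumed layer dimensions:

```python
# Trainable-parameter comparison for one weight matrix with and without LoRA.
# The 4096x4096 attention projection and the rank values are assumed,
# illustrative numbers, not taken from any specific model card.

def lora_params(d_out: int, d_in: int, r: int) -> int:
    # B has shape (d_out, r), A has shape (r, d_in); W itself stays frozen.
    return d_out * r + r * d_in

d_out, d_in = 4096, 4096
full = d_out * d_in
for r in (8, 16, 64):
    low_rank = lora_params(d_out, d_in, r)
    print(f"r={r:3d}: {low_rank:,} trainable vs {full:,} frozen "
          f"({full / low_rank:.0f}x fewer per matrix)")
```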


Key hyperparameters: rank (r = 8-64 for most tasks; higher for more complex adaptation), alpha (a scaling factor, commonly set to about 2x the rank), and target modules (attention projections for most tasks; include MLP layers for deeper adaptation). Because the base model stays frozen, multiple LoRA adapters for different tasks can be trained from, and served with, the same base model.
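
A minimal configuration sketch using the Hugging Face peft library; the model name and the target-module names are assumptions (they vary by architecture), and the hyperparameters follow the rules of thumb above.

```python
# Sketch: attaching a LoRA adapter with peft. Module names like "q_proj"
# are specific to Llama-style architectures and are assumed here.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

config = LoraConfig(
    r=16,                      # rank: 8-64 for most tasks
    lora_alpha=32,             # scaling factor, ~2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```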


QLoRA


QLoRA combines 4-bit quantization with LoRA. It quantizes the frozen base model to 4 bits (NF4 format) and trains LoRA adapters in higher precision (typically bf16). This enables fine-tuning a 65B-parameter model on a single 48GB GPU, and the original QLoRA work reports performance that closely matches 16-bit full fine-tuning on the benchmarks evaluated.


Double quantization reduces memory further by quantizing the quantization constants. Paged optimizers use CPU memory for optimizer states during memory spikes. QLoRA makes fine-tuning accessible without expensive GPU clusters.
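
A sketch of a typical QLoRA setup with transformers, bitsandbytes, and peft, tying the pieces above together; the model name is an assumption and exact flag names can differ between library versions.

```python
# Sketch: QLoRA-style loading: 4-bit NF4 base model with double quantization,
# LoRA adapters trained in bf16, and a paged optimizer for training.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",         # assumed model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # recommended prep for quantized training
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Training arguments a Trainer/SFT loop would consume; the paged optimizer
# spills optimizer states to CPU memory during spikes.
args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_8bit",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```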


RLHF


Reinforcement Learning from Human Feedback (RLHF) aligns models with human preferences. The process has three stages: supervised fine-tuning on human demonstrations, reward model training on human preference comparisons, and reinforcement learning (typically PPO) of the fine-tuned policy against that reward model.
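
The reward-model stage is typically trained with a pairwise (Bradley-Terry) objective: the reward assigned to the human-preferred response should exceed the reward of the rejected one. A minimal PyTorch sketch of that loss, assuming scalar rewards have already been produced by a reward-model head:

```python
# Sketch: pairwise reward-model loss used in the second RLHF stage.
# r_chosen and r_rejected are scalar rewards the reward model assigns to the
# preferred and rejected responses for the same prompt (one pair per row).
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the preferred response's
    # reward above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy rewards for a batch of three comparison pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 0.8]), torch.tensor([0.5, 0.9, -0.1]))
print(loss.item())
```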


RLHF improves helpfulness, reduces harmful outputs, and produces models that follow instructions more accurately. The quality of preference data matters more than its quantity. DPO (Direct Preference Optimization) simplifies the pipeline by optimizing the policy directly on preference pairs with a classification-style loss, removing the separate reward model and RL loop.
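
DPO's loss can be written directly in terms of response log-probabilities under the trainable policy and a frozen reference model; a minimal sketch, assuming those log-probabilities are already computed per response:

```python
# Sketch: the DPO objective. Each tensor holds the summed log-probability of a
# whole response (chosen or rejected) under the trainable policy or the frozen
# reference model; beta controls how far the policy may drift from the reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1) -> torch.Tensor:
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Binary classification of "chosen beats rejected" on the implicit rewards.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```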


Data Preparation


High-quality training data is the single most important factor in fine-tuning outcomes. As a rough rule of thumb, at least 1,000 examples are needed for noticeable improvement on a focused task. Deduplicate, filter out low-quality examples, and balance the label distribution. Include adversarial examples for robustness, and evaluate on a held-out validation set.
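
A small sketch of the kind of cleaning pass described above (exact-duplicate removal, a simple quality filter, and a held-out split); field names and thresholds are illustrative assumptions, not recommendations.

```python
# Sketch: basic cleaning of an instruction-tuning dataset before fine-tuning.
# Field names ("prompt", "response") and thresholds are assumed for illustration.
import hashlib
import random

def clean_and_split(examples, min_chars=20, val_fraction=0.05, seed=0):
    seen, cleaned = set(), []
    for ex in examples:
        response = ex["response"].strip()
        if len(response) < min_chars:                 # drop trivially short outputs
            continue
        text = ex["prompt"].strip() + "\n" + response
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                            # exact-duplicate removal
            continue
        seen.add(digest)
        cleaned.append(ex)
    random.Random(seed).shuffle(cleaned)
    n_val = max(1, int(len(cleaned) * val_fraction))
    return cleaned[n_val:], cleaned[:n_val]           # train, held-out validation

train, val = clean_and_split([
    {"prompt": "Summarize the contract clause.",
     "response": "The clause limits liability to direct damages only."},
    {"prompt": "Summarize the contract clause.",
     "response": "The clause limits liability to direct damages only."},  # duplicate
    {"prompt": "Classify the ticket priority.",
     "response": "High priority: production outage affecting all users."},
])
print(len(train), len(val))
```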