What Is Normalization?
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables and defining relationships between them. The goal is to ensure that each piece of data is stored in exactly one place.
The Problem Normalization Solves
Consider an unnormalized table:
| OrderID | Customer | Address | Product | Price |
|---------|----------|---------|---------|-------|
| 1 | Alice | 123 Main St | Laptop | 1200 |
| 2 | Alice | 123 Main St | Mouse | 25 |
| 3 | Bob | 456 Oak Ave | Laptop | 1200 |
Problems:
Normal Forms
First Normal Form (1NF)
Each column contains atomic (indivisible) values, and each row is unique.
| Violation | Corrected |
|-----------|-----------|
| Products: "Laptop, Mouse, Keyboard" | One product per row |
| Phone: "555-0100, 555-0200" | Separate rows for each phone |
-- Violates 1NF
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
product_names TEXT -- "Laptop,Mouse,Keyboard"
);
-- 1NF compliant
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_name TEXT,
PRIMARY KEY (order_id, product_name)
);
Second Normal Form (2NF)
Must be in 1NF, and all non-key columns must depend on the entire primary key (not just part of it).
-- Violates 2NF (composite PK: order_id, product_id)
CREATE TABLE order_details (
order_id INTEGER,
product_id INTEGER,
product_name TEXT, -- Depends only on product_id, not on order_id
product_price DECIMAL, -- Depends only on product_id, not on order_id
quantity INTEGER, -- Depends on full PK
PRIMARY KEY (order_id, product_id)
);
-- 2NF compliant
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
price DECIMAL NOT NULL
);
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_id INTEGER REFERENCES products(id),
quantity INTEGER NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Third Normal Form (3NF)
Must be in 2NF, and no transitive dependencies (non-key columns must not depend on other non-key columns).
-- Violates 3NF
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER,
customer_name TEXT, -- Depends on customer_id, not on order_id
customer_email TEXT, -- Depends on customer_id, not on order_id
order_date DATE
);
-- 3NF compliant
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
order_date DATE NOT NULL
);
Boyce-Codd Normal Form (BCNF)
A stricter version of 3NF where every determinant must be a candidate key.
-- Violates BCNF
-- Each student has one major, each professor teaches one course,
-- but a professor can teach different courses in different majors
CREATE TABLE enrollment (
student_id INTEGER,
major TEXT,
professor TEXT,
course TEXT,
PRIMARY KEY (student_id, course)
-- professor -> course (professor determines course, but is not a key)
);
-- BCNF compliant: split into multiple tables
CREATE TABLE student_major (student_id INTEGER, major TEXT);
CREATE TABLE professor_course (professor TEXT, course TEXT);
CREATE TABLE enrollment (student_id INTEGER, professor TEXT, PRIMARY KEY (student_id, professor));
Normal Forms Comparison
| Normal Form | Rule | Violation Example |
|-------------|------|-------------------|
| 1NF | Atomic values | Multiple values in one column |
| 2NF | Full PK dependency | Column depends on part of composite PK |
| 3NF | No transitive dependency | Column depends on non-key column |
| BCNF | Every determinant is a key | Column determines another non-key column |
| 4NF | No multi-valued dependencies | Independent 1:N relationships in same table |
| 5NF | Join dependencies | Cannot reconstruct from smaller tables |
Denormalization: When to Break the Rules
Normalization is not always the best choice. Denormalization trades storage and update complexity for read performance.
When to Denormalize
2. **High-read, low-write workloads**: Blog posts, product catalogs.
3. **Distributed databases**: Avoiding JOINs across shards.
4. **Caching computed values**: Order totals, comment counts.
-- Denormalized for read performance
CREATE TABLE products_with_category (
id SERIAL PRIMARY KEY,
name TEXT,
price DECIMAL,
category_name TEXT, -- Denormalized from categories table
category_slug TEXT, -- Denormalized from categories table
review_count INTEGER, -- Cached aggregate
avg_rating DECIMAL -- Cached aggregate
);
Normalization in Practice
Most production databases aim for 3NF in transactional systems and selectively denormalize for performance. The common approach:
2. Profile performance with real queries.
3. Denormalize specific tables or add computed columns when queries are too slow.
4. Document all denormalization decisions and their rationale.
Summary
Normalization eliminates data redundancy and protects data integrity by enforcing single-truth storage. Aim for 3NF in transactional databases, ensure every non-key column depends on the key, the whole key, and nothing but the key. Denormalize selectively for read-heavy workloads, reporting, and distributed architectures, always documenting the trade-off being made.