Columnar databases store data by column rather than by row. This seemingly simple difference has profound implications for analytical query performance, compression, and storage efficiency. This article explains how columnar databases work, when to use them, and how they compare to traditional row-oriented databases.
Row-Oriented vs Columnar Storage
Row-Oriented Storage
Row-oriented databases (PostgreSQL, MySQL, SQL Server) store all columns of a row together on disk.
```text
Row 1: [id=1, name=Alice, age=30, city=NYC]
Row 2: [id=2, name=Bob, age=25, city=SFO]
Row 3: [id=3, name=Charlie, age=35, city=NYC]
```
**Best for**: Transactions (OLTP). When you need to read or write entire rows (SELECT * WHERE id = N), row-oriented is optimal.
Columnar Storage
Columnar databases store each column's values together on disk.
```text
Column id:   [1, 2, 3]
Column name: [Alice, Bob, Charlie]
Column age:  [30, 25, 35]
Column city: [NYC, SFO, NYC]
```
**Best for**: Analytics (OLAP). When you need to scan many rows but only a few columns (SELECT SUM(amount) WHERE date > ...), columnar is dramatically faster.
Compression Techniques
Columnar storage enables compression techniques that are far less effective on row-oriented data, because the values within a single column share a type and tend to be similar.
Run-Length Encoding (RLE)
RLE replaces consecutive identical values with a count. It is highly effective on sorted columns with low cardinality.
```text
City column before RLE: [NYC, NYC, NYC, SFO, SFO, NYC, NYC, NYC, NYC]
City column after RLE:  [NYC:3, SFO:2, NYC:4]
```
If the city column is sorted, compression improves further:
```text
Sorted city column: [NYC, NYC, NYC, NYC, NYC, NYC, NYC, SFO, SFO]
RLE after sorting:  [NYC:7, SFO:2]
```
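The RLE examples above can be sketched in a few lines of Python. This is a toy encoder for illustration, not how production engines pack runs on disk:

```python
def rle_encode(values):
    """Run-length encode a sequence into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

city = ["NYC", "NYC", "NYC", "SFO", "SFO", "NYC", "NYC", "NYC", "NYC"]
print(rle_encode(city))          # [('NYC', 3), ('SFO', 2), ('NYC', 4)]
print(rle_encode(sorted(city)))  # [('NYC', 7), ('SFO', 2)]
```

Sorting first merges the runs, which is why columnar engines let you choose a sort order for each table.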
Dictionary Encoding
Dictionary encoding replaces repeated values with smaller integer codes.
```text
Dictionary:
  0: NYC
  1: SFO
  2: CHI

Column stored as integers: [0, 0, 1, 0, 2, 2, 1]
```
The integer array can be further compressed with RLE or bit-packing.
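A minimal sketch of dictionary encoding, assigning codes in first-seen order (real engines typically build dictionaries per block and may sort them):

```python
def dict_encode(values):
    """Replace each value with a small integer code, assigned in first-seen order."""
    dictionary = {}   # value -> code
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

city = ["NYC", "NYC", "SFO", "NYC", "CHI", "CHI", "SFO"]
dictionary, codes = dict_encode(city)
print(dictionary)  # {'NYC': 0, 'SFO': 1, 'CHI': 2}
print(codes)       # [0, 0, 1, 0, 2, 2, 1]
```

With only 3 distinct values, each code fits in 2 bits, so bit-packing the code array gives a further 4x saving over one byte per code.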
Delta Encoding
Delta encoding stores the difference between consecutive values rather than the values themselves. It is effective for monotonically increasing values like timestamps or sequential IDs.
```text
Original: [1000, 1005, 1010, 1020, 1030]
Deltas:   [1000, 5, 5, 10, 10]
```
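Delta encoding and its inverse are a few lines each; this toy version keeps the first value verbatim, as in the example above:

```python
def delta_encode(values):
    """Store the first value, then differences between consecutive values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by accumulating the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

original = [1000, 1005, 1010, 1020, 1030]
deltas = delta_encode(original)
print(deltas)  # [1000, 5, 5, 10, 10]
assert delta_decode(deltas) == original
```

The win comes afterwards: small deltas need far fewer bits than the original values, so a bit-packing pass over the deltas compresses well.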
Bitmap Indexing
For low-cardinality columns, columnar databases create bitmap indexes. Each distinct value has a bitmap where each bit represents whether a row has that value.
```text
City: [NYC, SFO, NYC, CHI, SFO, NYC]

NYC: [1, 0, 1, 0, 0, 1]
SFO: [0, 1, 0, 0, 1, 0]
CHI: [0, 0, 0, 1, 0, 0]
```
Bitmap operations (AND, OR, NOT) are extremely fast using CPU bitwise instructions.
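A toy bitmap index, packing each value's bitmap into a Python integer so that OR and AND become single bitwise operations (production systems use compressed bitmap formats such as Roaring):

```python
city = ["NYC", "SFO", "NYC", "CHI", "SFO", "NYC"]

# Build one bitmap per distinct value; bit i is set if row i has that value
bitmaps = {}
for i, v in enumerate(city):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

# "city = NYC OR city = SFO" becomes a single bitwise OR
nyc_or_sfo = bitmaps["NYC"] | bitmaps["SFO"]
matching_rows = [i for i in range(len(city)) if (nyc_or_sfo >> i) & 1]
print(matching_rows)  # [0, 1, 2, 4, 5]
```

Combining predicates on several low-cardinality columns is just more AND/OR operations on these integers, which is why bitmap indexes are so cheap to intersect.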
Compression Ratios
Columnar databases typically achieve 5-10x compression compared to row-oriented storage on analytical workloads. This means more data fits in memory, further improving performance.
Query Performance
Columnar databases excel at queries that scan many rows but access few columns.
The Advantage
```sql
-- Analytical query: poor performance on a row-oriented DB,
-- which must read entire rows to access just a few columns
SELECT region, SUM(revenue)
FROM sales
WHERE year = 2026
GROUP BY region;
```
In a row-oriented database, this query reads all columns for every matching row, wasting I/O and memory on unused data.
In a columnar database:

1. Only the `region` and `revenue` columns are read for matching rows.
2. Columnar compression reduces the volume of data scanned.
3. The engine can use vectorized processing (SIMD instructions) on compressed column data.
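Back-of-envelope arithmetic shows how large the I/O gap is. The table shape and the 5x compression ratio below are assumptions for illustration, not measurements:

```python
# Hypothetical table: 100M rows, 20 columns, 8 bytes per value (assumption)
rows = 100_000_000
row_bytes = 20 * 8            # a full row: 160 bytes
needed_bytes = 2 * 8          # only region + revenue: 16 bytes

row_store_scan = rows * row_bytes            # a row store reads every column
col_store_scan = rows * needed_bytes         # a column store reads two columns
col_store_compressed = col_store_scan // 5   # assume a modest 5x compression

print(f"{row_store_scan / 2**30:.1f} GiB")        # 14.9 GiB
print(f"{col_store_compressed / 2**20:.1f} MiB")  # 305.2 MiB
```

Roughly a 50x reduction in bytes read before the query engine even starts aggregating.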
Vectorized Query Execution
Columnar databases use vectorized execution: instead of processing one row at a time, they process batches of rows (typically 1024 or 4096) using CPU SIMD instructions.
```python
import numpy as np

# Row-by-row (slow): one comparison and one addition per row
def sum_revenue_rowwise(rows, target_region):
    total = 0.0
    for row in rows:
        if row['region'] == target_region:
            total += row['revenue']
    return total

# Vectorized (fast): whole-column NumPy operations the CPU can run with SIMD
def sum_revenue_vectorized(region_column, revenue_column, target_region):
    mask = region_column == target_region  # SIMD comparison over the column
    return revenue_column[mask].sum()      # filter, then SIMD reduction
```
Columnar Database Platforms
ClickHouse
ClickHouse is an open-source columnar database designed for real-time analytics. It is known for exceptional query speed on large datasets.
```sql
-- ClickHouse: create a MergeTree table
CREATE TABLE sales (
    order_id    UInt64,
    order_date  Date,
    customer_id UInt32,
    amount      Float64,
    region      String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, region);
```

```sql
-- ClickHouse: aggregation query
SELECT
    region,
    toMonth(order_date) AS month,
    count() AS order_count,
    sum(amount) AS total_revenue
FROM sales
WHERE order_date >= '2026-01-01'
GROUP BY region, month
ORDER BY total_revenue DESC;
```
**Best for**: Real-time analytics, time-series data, event logging, monitoring dashboards.
Amazon Redshift
Redshift uses columnar storage with automatic compression encoding selection.
```sql
-- Redshift: create a columnar table with a sort key
CREATE TABLE sales (
    order_id   BIGINT ENCODE ZSTD,
    order_date DATE ENCODE DELTA,
    region     VARCHAR(50) ENCODE BYTEDICT,
    amount     DECIMAL(10,2) ENCODE ZSTD
)
DISTKEY(region)
SORTKEY(order_date, region);
```
When encodings are not specified explicitly, Redshift selects compression encodings automatically based on the data.
Google BigQuery
BigQuery uses the Capacitor columnar storage format, which is proprietary and automatically optimized.
```sql
-- BigQuery: partitioned and clustered table
CREATE TABLE mydataset.sales
PARTITION BY DATE(order_date)
CLUSTER BY region, customer_id
AS SELECT * FROM source_table;
```
BigQuery's on-demand query pricing charges by bytes processed (storage is billed separately), so columnar storage, partitioning, and clustering directly reduce query cost by minimizing bytes scanned.
Columnar vs Row-Oriented Comparison
| Aspect | Row-Oriented | Columnar |
|--------|--------------|----------|
| Read pattern | Full rows | Few columns, many rows |
| Write pattern | Single rows | Batches preferred |
| Compression | 1-2x | 5-10x |
| Range scan | Efficient | Very efficient |
| Point query | Very efficient | Less efficient |
| Inserts/updates | Efficient | Expensive |
| Concurrency | High | Lower |
| Use case | OLTP | OLAP / Analytics |
When to Use Columnar
- Analytical queries that aggregate over many rows but read only a few columns
- Large, append-mostly datasets such as events, logs, and time series
- Workloads where compression and scan speed matter more than single-row latency
When NOT to Use Columnar
- Transactional (OLTP) workloads with frequent single-row inserts and updates
- Point queries that fetch entire rows (`SELECT * ... WHERE id = N`)
- Workloads requiring high write concurrency
Hybrid Approaches
Some databases use hybrid storage to handle both OLTP and OLAP workloads.
PostgreSQL + Columnar Extensions
Extensions such as Citus's `citus_columnar` (or Hydra) add a `columnar` table access method to PostgreSQL:

```sql
-- PostgreSQL with the citus_columnar extension
CREATE EXTENSION citus_columnar;

-- Switch a table to columnar storage (PostgreSQL 15+ syntax)
ALTER TABLE large_analytics_table SET ACCESS METHOD columnar;
```
HTAP (Hybrid Transactional/Analytical Processing)
HTAP databases handle both workloads in a single system; examples include SAP HANA, SingleStore, and TiDB (whose TiFlash replicas store data in columnar form).
Conclusion
Columnar databases are essential for modern analytical workloads. Their storage format enables aggressive compression (5-10x), vectorized query execution, and dramatically faster queries when scanning many rows but few columns. Choose ClickHouse for real-time analytics, Redshift for cloud data warehousing, BigQuery for serverless analytics, and columnar extensions or HTAP databases when you need hybrid capabilities. Use row-oriented databases for transactional workloads and switch to columnar for analytics.