Columnar databases store data by column rather than by row. This seemingly simple difference has profound implications for analytical query performance, compression, and storage efficiency. This article explains how columnar databases work, when to use them, and how they compare to traditional row-oriented databases.


Row-Oriented vs Columnar Storage


Row-Oriented Storage


Row-oriented databases (PostgreSQL, MySQL, SQL Server) store all columns of a row together on disk.



Row 1: [id=1, name=Alice, age=30, city=NYC]
Row 2: [id=2, name=Bob, age=25, city=SFO]
Row 3: [id=3, name=Charlie, age=35, city=NYC]


**Best for**: Transactions (OLTP). When you need to read or write entire rows (SELECT * WHERE id = N), row-oriented is optimal.


Columnar Storage


Columnar databases store each column's values together on disk.



Column id:     [1, 2, 3]
Column name:   [Alice, Bob, Charlie]
Column age:    [30, 25, 35]
Column city:   [NYC, SFO, NYC]


**Best for**: Analytics (OLAP). When you need to scan many rows but only a few columns (SELECT SUM(amount) WHERE date > ...), columnar is dramatically faster.
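The difference can be sketched in plain Python: a row layout as a list of dicts, a column layout as a dict of lists (the data mirrors the example above; this is an illustration of access patterns, not of on-disk formats).

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Alice",   "age": 30, "city": "NYC"},
    {"id": 2, "name": "Bob",     "age": 25, "city": "SFO"},
    {"id": 3, "name": "Charlie", "age": 35, "city": "NYC"},
]

# Columnar: each column's values are stored contiguously.
columns = {
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age":  [30, 25, 35],
    "city": ["NYC", "SFO", "NYC"],
}

# Analytical query: average age. The row layout walks every record
# (touching id, name, and city along the way); the columnar layout
# scans a single contiguous list.
avg_row = sum(r["age"] for r in rows) / len(rows)
avg_col = sum(columns["age"]) / len(columns["age"])
assert avg_row == avg_col == 30.0
```

On disk the gap is far larger than in this sketch: the row layout forces the database to read every column's bytes for every record, while the columnar layout reads only the `age` column.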


Compression Techniques


Columnar storage enables compression techniques that are far less effective on row-oriented data, because the values within a single column share a type and tend to resemble one another.


Run-Length Encoding (RLE)


RLE replaces consecutive identical values with a count. It is highly effective on sorted columns with low cardinality.



City column before RLE: [NYC, NYC, NYC, SFO, SFO, NYC, NYC, NYC, NYC]
City column after RLE:  [NYC:3, SFO:2, NYC:4]


If the city column is sorted, compression improves further:



Sorted city column:     [NYC, NYC, NYC, NYC, NYC, NYC, NYC, SFO, SFO]
RLE after sorting:      [NYC:7, SFO:2]
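A minimal RLE encoder is a few lines of Python; `itertools.groupby` naturally collapses runs (the `(value, count)` pairs below correspond to the `value:count` notation above):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of identical consecutive values into (value, count) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

city = ["NYC", "NYC", "NYC", "SFO", "SFO", "NYC", "NYC", "NYC", "NYC"]
print(rle_encode(city))          # [('NYC', 3), ('SFO', 2), ('NYC', 4)]
print(rle_encode(sorted(city)))  # [('NYC', 7), ('SFO', 2)]
```

Note how sorting shrinks the output from three runs to two, which is why columnar engines let you choose a sort order for tables.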


Dictionary Encoding


Dictionary encoding replaces repeated values with smaller integer codes.



Dictionary:
  0: NYC
  1: SFO
  2: CHI

Column stored as integers: [0, 0, 1, 0, 2, 2, 1]

The integer array can be further compressed with RLE or bit-packing.
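A sketch of dictionary encoding in Python, assigning codes in order of first appearance (real engines typically also bit-pack the codes, which this sketch omits):

```python
def dict_encode(values):
    """Map each distinct value to a small integer code, assigned in order of first appearance."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

city = ["NYC", "NYC", "SFO", "NYC", "CHI", "CHI", "SFO"]
dictionary, codes = dict_encode(city)
print(dictionary)  # {'NYC': 0, 'SFO': 1, 'CHI': 2}
print(codes)       # [0, 0, 1, 0, 2, 2, 1]

# Decoding reverses the mapping.
decode = {code: v for v, code in dictionary.items()}
assert [decode[c] for c in codes] == city
```

With only three distinct cities, each code fits in 2 bits, versus 3+ bytes per city string.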


Delta Encoding


Delta encoding stores the difference between consecutive values rather than the values themselves. It is effective for monotonically increasing values like timestamps or sequential IDs.



Original: [1000, 1005, 1010, 1020, 1030]
Deltas:   [1000, 5, 5, 10, 10]
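Delta encoding and its inverse take only a few lines of Python; decoding is just a running sum:

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store the difference between each pair of neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """A running sum over the deltas restores the original values."""
    return list(accumulate(deltas))

timestamps = [1000, 1005, 1010, 1020, 1030]
deltas = delta_encode(timestamps)
print(deltas)  # [1000, 5, 5, 10, 10]
assert delta_decode(deltas) == timestamps
```

The deltas are small numbers, so they compress well when followed by bit-packing or a general-purpose codec.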


Bitmap Indexing


For low-cardinality columns, columnar databases can build bitmap indexes: each distinct value gets a bitmap in which each bit records whether the corresponding row holds that value.



City:    [NYC, SFO, NYC, CHI, SFO, NYC]
NYC:     [1, 0, 1, 0, 0, 1]
SFO:     [0, 1, 0, 0, 1, 0]
CHI:     [0, 0, 0, 1, 0, 0]


Bitmap operations (AND, OR, NOT) are extremely fast using CPU bitwise instructions.
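One way to sketch this in Python is to store each bitmap as an arbitrary-precision integer with bit i representing row i, so AND/OR/NOT become single operators (the `age_over_28` bitmap here is a hypothetical second predicate, not part of the example above):

```python
def build_bitmaps(column):
    """One integer bitmap per distinct value; bit i is set if row i holds that value."""
    bitmaps = {}
    for i, v in enumerate(column):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

city = ["NYC", "SFO", "NYC", "CHI", "SFO", "NYC"]
bitmaps = build_bitmaps(city)

# Hypothetical second predicate: bit i is set if row i has age > 28.
age_over_28 = 0b101101  # rows 0, 2, 3, 5

# WHERE city = 'NYC' AND age > 28: one bitwise AND over whole bitmaps.
match = bitmaps["NYC"] & age_over_28
rows = [i for i in range(len(city)) if match & (1 << i)]
print(rows)  # [0, 2, 5]
```

Production engines use the same idea with compressed bitmap formats (e.g. roaring bitmaps) so that the bitwise operations stay fast even over billions of rows.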


Compression Ratios


Columnar databases typically achieve 5-10x compression compared to row-oriented storage on analytical workloads. This means more data fits in memory, further improving performance.


Query Performance


Columnar databases excel at queries that scan many rows but access few columns.


The Advantage



-- Analytical query: poor performance on a row-oriented DB,
-- which must scan all rows but only needs a few columns
SELECT region, SUM(revenue)
FROM sales
WHERE year = 2026
GROUP BY region;


In a row-oriented database, this query reads all columns for every matching row, wasting I/O and memory on unused data.


In a columnar database:

1. Only the `year` column is scanned to find matching rows.
2. Only the `region` and `revenue` columns are read for matching rows.
3. Columnar compression reduces the data volume.
4. The database can use vectorized processing (SIMD instructions) on compressed column data.


Vectorized Query Execution


Columnar databases use vectorized execution: instead of processing one row at a time, they process batches of rows (typically 1024 or 4096) using CPU SIMD instructions.


    
# Vectorized processing vs row-by-row (Python with NumPy)
import numpy as np

regions = np.array(["NYC", "SFO", "NYC", "CHI", "NYC"])
revenue = np.array([100.0, 200.0, 150.0, 50.0, 75.0])
target_region = "NYC"

# Row-by-row (slow): a Python-level loop touches one value at a time
total = 0.0
for region, amount in zip(regions, revenue):
    if region == target_region:
        total += amount

# Vectorized (fast): each step is one pass over a whole column,
# which the underlying engine can execute with SIMD instructions
mask = regions == target_region   # vectorized comparison
total = revenue[mask].sum()       # masked selection and reduction
    
    

Columnar Database Platforms


ClickHouse


ClickHouse is an open-source columnar database designed for real-time analytics. It is known for exceptional query speed on large datasets.


-- ClickHouse: create a MergeTree table
CREATE TABLE sales (
    order_id UInt64,
    order_date Date,
    customer_id UInt32,
    amount Float64,
    region String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, region);


-- ClickHouse: aggregation query
SELECT
    region,
    toMonth(order_date) AS month,
    count() AS order_count,
    sum(amount) AS total_revenue
FROM sales
WHERE order_date >= '2026-01-01'
GROUP BY region, month
ORDER BY total_revenue DESC;


**Best for**: Real-time analytics, time-series data, event logging, monitoring dashboards.


Amazon Redshift


Redshift uses columnar storage with automatic compression encoding selection.


-- Redshift: create a columnar table with distribution and sort keys
CREATE TABLE sales (
    order_id BIGINT ENCODE ZSTD,
    order_date DATE ENCODE DELTA,
    region VARCHAR(50) ENCODE BYTEDICT,
    amount DECIMAL(10,2) ENCODE ZSTD
)
DISTKEY(region)
SORTKEY(order_date, region);


When encodings are not specified explicitly, Redshift analyzes sample data and selects them automatically.


Google BigQuery


BigQuery uses the Capacitor columnar storage format, which is proprietary and automatically optimized.


-- BigQuery: partitioned and clustered table
CREATE TABLE mydataset.sales
PARTITION BY DATE(order_date)
CLUSTER BY region, customer_id
AS SELECT * FROM source_table;


BigQuery's on-demand query pricing charges by bytes processed, so columnar storage, partitioning, and clustering directly reduce cost by minimizing the bytes scanned.


Columnar vs Row-Oriented Comparison


| Aspect | Row-Oriented | Columnar |
|--------|--------------|----------|
| Read pattern | Full rows | Few columns, many rows |
| Write pattern | Single rows | Batches preferred |
| Compression | 1-2x | 5-10x |
| Range scan | Efficient | Very efficient |
| Point query | Very efficient | Less efficient |
| Inserts/updates | Efficient | Expensive |
| Concurrency | High | Lower |
| Use case | OLTP | OLAP / Analytics |


When to Use Columnar


  • Reporting and dashboards.
  • Time-series analysis.
  • Log analytics.
  • Business intelligence workloads.
  • Data warehousing.
  • Any query that aggregates large datasets.

When NOT to Use Columnar


  • Transactional workloads with frequent single-row inserts and updates.
  • Applications requiring high concurrency with many small queries.
  • Systems needing real-time point lookups by primary key.
  • Write-heavy workloads (columnar engines compress data in batches, so individual inserts are expensive).

Hybrid Approaches


Some databases use hybrid storage to handle both OLTP and OLAP workloads.


PostgreSQL + Columnar Extensions


-- PostgreSQL with the Citus columnar extension (one of several
-- columnar table access methods available for PostgreSQL)
CREATE EXTENSION citus_columnar;

-- Switch an existing table to columnar storage (PostgreSQL 15+)
ALTER TABLE large_analytics_table SET ACCESS METHOD columnar;
    
    

HTAP (Hybrid Transactional/Analytical Processing)


HTAP databases handle both workloads in a single system:


  • **SingleStore**: Row-oriented for transactions, columnar for analytics, automatically routed.
  • **Amazon Aurora**: Row-oriented MySQL/PostgreSQL compatible with parallel query for analytical workloads.
  • **MySQL HeatWave**: In-memory columnar engine for MySQL with automatic query offloading.

Conclusion


Columnar databases are essential for modern analytical workloads. Their storage format enables aggressive compression (5-10x), vectorized query execution, and dramatically faster queries when scanning many rows but few columns. Choose ClickHouse for real-time analytics, Redshift for cloud data warehousing, BigQuery for serverless analytics, and columnar extensions or HTAP databases when you need hybrid capabilities. Use row-oriented databases for transactional workloads and switch to columnar for analytics.