The data landscape has evolved from simple databases to complex architectures spanning data lakes, data warehouses, and the emerging lakehouse paradigm. Understanding the differences between these architectures is essential for building a modern data platform.
Data Warehouse
A data warehouse is a centralized repository optimized for structured, processed data used in reporting and analytics.
Characteristics

- Schema-on-write: data is modeled and validated before it is loaded.
- Optimized for structured data and SQL analytics, with sub-second query performance.
- ACID transactions and excellent BI tool support.
- Low operational complexity relative to lake-based architectures.

When to Use a Data Warehouse

- BI dashboards and reporting that demand fast, consistent queries.
- Structured data with stable, well-understood schemas.

Limitations

- High storage cost makes retaining large volumes of raw or historical data expensive.
- Rigid schemas slow the onboarding of new or semi-structured sources.
- Limited support for ML/AI workloads.
Data Lake
A data lake stores data in its raw, native format. It is a single repository for all data, regardless of structure.
Characteristics

- Stores raw data in its native format; schema is applied on read.
- Low-cost object storage that scales to any data volume.
- Flexible ingestion: structured, semi-structured, and unstructured data land in one place.

When to Use a Data Lake

- Retaining raw data cheaply for future, as-yet-unknown use cases.
- ML/AI workloads that need access to unprocessed data.
- Ingesting diverse or rapidly changing sources.

Limitations

- No ACID transactions, so concurrent writes and partial failures are hard to manage.
- Query performance in the seconds-to-minutes range.
- Poor BI tool support without an additional query or serving layer.
Lakehouse Architecture
The lakehouse combines the flexibility of a data lake with the reliability and performance of a data warehouse. It stores data in object storage with a metadata layer that provides ACID transactions, schema enforcement, and performance optimization.
Key Innovations
1. **ACID transactions**: Reliable, concurrent reads and writes directly on object storage.
2. **Schema enforcement and evolution**: Enforce schemas on write while allowing controlled evolution.
3. **Performance optimization**: File layout, statistics, compaction, indexing, and caching.
4. **Unified batch and streaming**: The same storage serves both batch and streaming workloads (see the sketch after this list).
5. **Open formats**: Data stored in Apache Parquet with metadata tracked by open table formats.
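Innovation 4 is worth making concrete: with a table format such as Delta, the same files back both access patterns. A minimal PySpark sketch, with hypothetical paths (`/data/events`, the checkpoint location) chosen for illustration:

```python
# Minimal sketch: one Delta table serving both batch and streaming reads.
# The table and checkpoint paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("unified-batch-streaming")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Batch read: a point-in-time scan of the table
batch_df = spark.read.format("delta").load("/data/events")

# Streaming read: the same table consumed as an incremental stream
stream_df = spark.readStream.format("delta").load("/data/events")
query = (
    stream_df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start()
)
# query.awaitTermination()  # block until the stream stops
```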
Table Format Comparison
The lakehouse is enabled by three major open-source table formats: Delta Lake, Apache Iceberg, and Apache Hudi.
Delta Lake
Delta Lake, developed by Databricks, adds ACID transactions to Parquet files.
```sql
-- Delta Lake: Create a table
CREATE TABLE sales (
    order_id    BIGINT,
    customer_id STRING,
    amount      DOUBLE,
    order_date  DATE
) USING DELTA;

-- Delta Lake: Time travel query
SELECT * FROM sales TIMESTAMP AS OF '2026-05-01';

-- Delta Lake: Vacuum old versions (168 hours = the 7-day default retention)
VACUUM sales RETAIN 168 HOURS;
```
Best for: Databricks environments, general-purpose lakehouse deployments, and Spark-centric workloads.
Apache Iceberg
Iceberg was developed at Netflix and is now an Apache top-level project. It is designed for high-performance table management at scale.
```sql
-- Iceberg: Create a partitioned table
CREATE TABLE sales (
    order_id    BIGINT,
    customer_id STRING,
    amount      DECIMAL(10,2),
    order_date  DATE
) USING ICEBERG
PARTITIONED BY (month(order_date));

-- Iceberg: Range query; the month(order_date) partitioning prunes
-- files outside the requested range
SELECT * FROM sales
WHERE order_date BETWEEN '2026-05-01' AND '2026-05-12';

-- Iceberg: Schema evolution (add column)
ALTER TABLE sales ADD COLUMN discount DECIMAL(5,2);
```
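The range query above is served by partition pruning rather than a true incremental read; Iceberg's incremental reads select only the data committed between two table snapshots via the DataFrame API. A minimal sketch, with hypothetical snapshot IDs and assuming a SparkSession already configured with an Iceberg catalog:

```python
# Minimal sketch: Iceberg incremental read between two snapshots.
# The snapshot IDs and table name are hypothetical; `spark` is assumed
# to have an Iceberg catalog configured.
incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5310176312426311000")  # exclusive start
    .option("end-snapshot-id", "8924558786060583479")    # inclusive end
    .load("db.sales")
)
incremental_df.show()
```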
Best for: Multi-engine environments (Spark, Flink, Trino, Hive), cloud-agnostic deployments.
Apache Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) was developed at Uber for real-time data ingestion.
```sql
-- Hudi: Create a Copy-on-Write table
CREATE TABLE uber_trips (
    trip_id   BIGINT,
    driver_id STRING,
    fare      DOUBLE,
    trip_date DATE
) USING HUDI
TBLPROPERTIES (
    type = 'cow',                  -- Copy-on-Write (the default table type)
    primaryKey = 'trip_id',
    preCombineField = 'trip_date'  -- on duplicate keys, the latest trip_date wins
);

-- Hudi: Upsert new records
MERGE INTO uber_trips AS target
USING new_trips AS source
ON target.trip_id = source.trip_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Best for: Real-time ingestion, upsert-heavy workloads, incremental processing pipelines.
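For those incremental pipelines, Hudi can serve only the records written after a given commit. A minimal sketch, with a hypothetical table path and begin instant time:

```python
# Minimal sketch: Hudi incremental query pulling only records committed
# after a given instant time. The path and instant time are hypothetical;
# `spark` is an existing SparkSession with the Hudi bundle on the classpath.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20260501000000")
    .load("/data/uber_trips")
)
incremental_df.show()
```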
Medallion Architecture
The medallion architecture organizes lakehouse data into layers of progressively increasing quality.
Bronze Layer (Raw)
Raw data ingested from source systems, with no transformations applied. Schema-on-read; data is append-only.
```text
bronze/orders/
    order_date=2026-05-01/
        file1.parquet
        file2.parquet
    order_date=2026-05-02/
        file3.parquet
```
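Given this layout, here is a minimal sketch of an append-only bronze ingest. The landing path `/landing/orders` is hypothetical; the `order_raw_date` field matches the silver example further down, and `order_date` is derived only to drive the partition layout shown above:

```python
# Minimal sketch: append-only bronze ingestion with ingest metadata.
# Source and destination paths are hypothetical.
from pyspark.sql.functions import col, current_timestamp, input_file_name, to_date

raw_df = spark.read.json("/landing/orders")  # read raw files as-is; no cleaning

bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())           # ingest timestamp
    .withColumn("_source_file", input_file_name())             # provenance
    .withColumn("order_date", to_date(col("order_raw_date")))  # partition key only
)

# Append only: bronze rows are never updated or deleted
(bronze_df.write
    .format("delta")
    .mode("append")
    .partitionBy("order_date")
    .save("/data/bronze/orders"))
```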
Silver Layer (Cleaned)
Data is deduplicated, validated, and lightly transformed. Null handling, type casting, and simple joins. This layer serves as the source of truth for analysts.
```python
# Bronze to Silver transformation
from pyspark.sql.functions import col, to_date

def bronze_to_silver(spark):
    # Read the raw bronze table
    bronze_df = spark.read.format("delta").load("/data/bronze/orders")

    # Deduplicate, drop invalid records, and normalize types
    silver_df = (
        bronze_df
        .dropDuplicates(["order_id"])
        .filter(col("amount").isNotNull())
        .withColumn("order_date", to_date(col("order_raw_date")))
        .withColumn("amount", col("amount").cast("decimal(10,2)"))
    )

    # Overwrite the silver table with the cleaned result
    silver_df.write \
        .format("delta") \
        .mode("overwrite") \
        .save("/data/silver/orders")
```
Gold Layer (Aggregated)
Business-level aggregations and metrics. Data marts for specific teams or use cases.
```sql
-- Gold layer: Daily sales by region
CREATE TABLE gold.daily_sales_by_region
USING DELTA
AS
SELECT
    order_date,
    region,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM silver.orders
GROUP BY order_date, region;
```
Architecture Decision Matrix
| Factor | Warehouse | Data Lake | Lakehouse |
|--------|-----------|-----------|-----------|
| Data freshness | Hours to days | Minutes to hours | Minutes |
| Query performance | Sub-second | Seconds to minutes | Sub-second to seconds |
| Storage cost | High | Low | Low |
| Schema flexibility | Rigid | Flexible | Schema on write with evolution |
| ACID transactions | Yes | No | Yes |
| ML/AI support | Limited | Good | Good |
| BI tool support | Excellent | Poor | Good to excellent |
| Complexity | Low | Medium | Medium-high |
Migration Path
Moving from warehouse-only or lake-only architectures to a lakehouse:
1. **Stand up a data lake** on low-cost object storage and land raw data there.
2. **Continue using the warehouse** for BI and reporting.
3. **Introduce an open table format** (Delta, Iceberg, Hudi) on the lake for ACID properties (a conversion sketch follows this list).
4. **Move bronze/silver layers to the lake** while keeping gold aggregations in the warehouse.
5. **Adopt a lakehouse query engine** (Trino, Athena, Databricks SQL) for direct lake queries.
6. **Migrate gold layer to the lakehouse** once performance is acceptable.
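For step 3, one low-risk option is converting existing Parquet directories to Delta in place, since only a transaction log is added and the Parquet files themselves are not rewritten. A minimal sketch, with hypothetical paths and partition column:

```python
# Hypothetical sketch for step 3: convert an existing Parquet directory
# into a Delta table in place. The path and partition schema are
# assumptions, not from the original article.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

DeltaTable.convertToDelta(
    spark,
    "parquet.`/data/lake/orders`",  # existing Parquet directory
    "order_date DATE",              # partition schema of the existing layout
)
```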
Conclusion
Data warehouses remain essential for BI and reporting. Data lakes are ideal for raw storage and ML workloads. The lakehouse architecture bridges both worlds by bringing ACID transactions, schema enforcement, and performance optimization to object storage. Choose the medallion architecture (bronze, silver, gold) to organize lakehouse data. Start with a data warehouse, add a lake for raw data, and converge on a lakehouse as your platform matures.