The data landscape has evolved from simple databases to complex architectures spanning data lakes, data warehouses, and the emerging lakehouse paradigm. Understanding the differences between these architectures is essential for building a modern data platform.
Data Warehouse
A data warehouse is a centralized repository optimized for structured, processed data used in reporting and analytics.
Characteristics

- Schema-on-write: data is modeled and validated before it is loaded.
- Optimized for structured data and SQL analytics, with sub-second query performance.
- ACID transactions and excellent BI tool support.
- Low operational complexity relative to lake-based architectures.

When to Use a Data Warehouse

- BI dashboards and reporting that demand fast, consistent queries.
- Structured data with stable, well-understood schemas.

Limitations

- High storage cost makes retaining large volumes of raw or historical data expensive.
- Rigid schemas slow the onboarding of new or semi-structured sources.
- Limited support for ML/AI workloads.
Data Lake
A data lake stores data in its raw, native format. It is a single repository for all data, regardless of structure.
Characteristics

- Stores raw data in its native format; schema is applied on read.
- Low-cost object storage that scales to any data volume.
- Flexible ingestion: structured, semi-structured, and unstructured data land in one place.

When to Use a Data Lake

- Retaining raw data cheaply for future, as-yet-unknown use cases.
- ML/AI workloads that need access to unprocessed data.
- Ingesting diverse or rapidly changing sources.

Limitations

- No ACID transactions, so concurrent writes and partial failures are hard to manage.
- Query performance in the seconds-to-minutes range.
- Poor BI tool support without an additional query or serving layer.
Lakehouse Architecture
The lakehouse combines the flexibility of a data lake with the reliability and performance of a data warehouse. It stores data in object storage with a metadata layer that provides ACID transactions, schema enforcement, and performance optimization.
Key Innovations
1. **ACID transactions**: Reliable, concurrent reads and writes directly on object storage.
2. **Schema enforcement and evolution**: Enforce schemas on write while allowing controlled evolution.
3. **Performance optimization**: File layout, statistics, compaction, indexing, and caching.
4. **Unified batch and streaming**: The same storage serves both batch and streaming workloads (see the sketch after this list).
5. **Open formats**: Data stored in Apache Parquet with metadata tracked by open table formats.
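Innovation 4 is worth making concrete: with a table format such as Delta, the same files back both access patterns. A minimal PySpark sketch, with hypothetical paths (`/data/events`, the checkpoint location) chosen for illustration:

```python
# Minimal sketch: one Delta table serving both batch and streaming reads.
# The table and checkpoint paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("unified-batch-streaming")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Batch read: a point-in-time scan of the table
batch_df = spark.read.format("delta").load("/data/events")

# Streaming read: the same table consumed as an incremental stream
stream_df = spark.readStream.format("delta").load("/data/events")
query = (
    stream_df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start()
)
# query.awaitTermination()  # block until the stream stops
```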
Table Format Comparison
The lakehouse is enabled by three major open-source table formats: Delta Lake, Apache Iceberg, and Apache Hudi.
Delta Lake
Delta Lake, developed by Databricks, adds ACID transactions to Parquet files.
```sql
-- Delta Lake: Create a table
CREATE TABLE sales (
    order_id    BIGINT,
    customer_id STRING,
    amount      DOUBLE,
    order_date  DATE
) USING DELTA;

-- Delta Lake: Time travel query
SELECT * FROM sales TIMESTAMP AS OF '2026-05-01';

-- Delta Lake: Vacuum old versions (168 hours = the 7-day default retention)
VACUUM sales RETAIN 168 HOURS;
```
Best for: Databricks environments, general-purpose lakehouse deployments, and Spark-centric workloads.
Apache Iceberg
Iceberg was developed at Netflix and is now an Apache top-level project. It is designed for high-performance table management at scale.
```sql
-- Iceberg: Create a partitioned table
CREATE TABLE sales (
    order_id    BIGINT,
    customer_id STRING,
    amount      DECIMAL(10,2),
    order_date  DATE
) USING ICEBERG
PARTITIONED BY (month(order_date));

-- Iceberg: Range query; the month(order_date) partitioning prunes
-- files outside the requested range
SELECT * FROM sales
WHERE order_date BETWEEN '2026-05-01' AND '2026-05-12';

-- Iceberg: Schema evolution (add column)
ALTER TABLE sales ADD COLUMN discount DECIMAL(5,2);
```
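The range query above is served by partition pruning rather than a true incremental read; Iceberg's incremental reads select only the data committed between two table snapshots via the DataFrame API. A minimal sketch, with hypothetical snapshot IDs and assuming a SparkSession already configured with an Iceberg catalog:

```python
# Minimal sketch: Iceberg incremental read between two snapshots.
# The snapshot IDs and table name are hypothetical; `spark` is assumed
# to have an Iceberg catalog configured.
incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5310176312426311000")  # exclusive start
    .option("end-snapshot-id", "8924558786060583479")    # inclusive end
    .load("db.sales")
)
incremental_df.show()
```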
Best for: Multi-engine environments (Spark, Flink, Trino, Hive), cloud-agnostic deployments.
Apache Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) was developed at Uber for real-time data ingestion.
```sql
-- Hudi: Create a Copy-on-Write table
CREATE TABLE uber_trips (
    trip_id   BIGINT,
    driver_id STRING,
    fare      DOUBLE,
    trip_date DATE
) USING HUDI
TBLPROPERTIES (
    type = 'cow',                  -- Copy-on-Write (the default table type)
    primaryKey = 'trip_id',
    preCombineField = 'trip_date'  -- on duplicate keys, the latest trip_date wins
);

-- Hudi: Upsert new records
MERGE INTO uber_trips AS target
USING new_trips AS source
ON target.trip_id = source.trip_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Best for: Real-time ingestion, upsert-heavy workloads, incremental processing pipelines.
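For those incremental pipelines, Hudi can serve only the records written after a given commit. A minimal sketch, with a hypothetical table path and begin instant time:

```python
# Minimal sketch: Hudi incremental query pulling only records committed
# after a given instant time. The path and instant time are hypothetical;
# `spark` is an existing SparkSession with the Hudi bundle on the classpath.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20260501000000")
    .load("/data/uber_trips")
)
incremental_df.show()
```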
Medallion Architecture
The medallion architecture organizes lakehouse data into layers of progressively increasing quality.
Bronze Layer (Raw)
Raw data ingested from source systems, with no transformations applied. Schema-on-read; data is append-only.
```text
bronze/orders/
    order_date=2026-05-01/
        file1.parquet
        file2.parquet
    order_date=2026-05-02/
        file3.parquet
```
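Given this layout, here is a minimal sketch of an append-only bronze ingest. The landing path `/landing/orders` is hypothetical; the `order_raw_date` field matches the silver example further down, and `order_date` is derived only to drive the partition layout shown above:

```python
# Minimal sketch: append-only bronze ingestion with ingest metadata.
# Source and destination paths are hypothetical.
from pyspark.sql.functions import col, current_timestamp, input_file_name, to_date

raw_df = spark.read.json("/landing/orders")  # read raw files as-is; no cleaning

bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())           # ingest timestamp
    .withColumn("_source_file", input_file_name())             # provenance
    .withColumn("order_date", to_date(col("order_raw_date")))  # partition key only
)

# Append only: bronze rows are never updated or deleted
(bronze_df.write
    .format("delta")
    .mode("append")
    .partitionBy("order_date")
    .save("/data/bronze/orders"))
```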
Silver Layer (Cleaned)
Data is deduplicated, validated, and lightly transformed. Null handling, type casting, and simple joins. This layer serves as the source of truth for analysts.
```python
# Bronze to Silver transformation
from pyspark.sql.functions import col, to_date

def bronze_to_silver(spark):
    # Read the raw bronze table
    bronze_df = spark.read.format("delta").load("/data/bronze/orders")

    # Deduplicate, drop invalid records, and normalize types
    silver_df = (
        bronze_df
        .dropDuplicates(["order_id"])
        .filter(col("amount").isNotNull())
        .withColumn("order_date", to_date(col("order_raw_date")))
        .withColumn("amount", col("amount").cast("decimal(10,2)"))
    )

    # Overwrite the silver table with the cleaned result
    silver_df.write \
        .format("delta") \
        .mode("overwrite") \
        .save("/data/silver/orders")
```
Gold Layer (Aggregated)
Business-level aggregations and metrics. Data marts for specific teams or use cases.
```sql
-- Gold layer: Daily sales by region
CREATE TABLE gold.daily_sales_by_region
USING DELTA
AS
SELECT
    order_date,
    region,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM silver.orders
GROUP BY order_date, region;
```
Architecture Decision Matrix
| Factor | Warehouse | Data Lake | Lakehouse |
|--------|-----------|-----------|-----------|
| Data freshness | Hours to days | Minutes to hours | Minutes |
| Query performance | Sub-second | Seconds to minutes | Sub-second to seconds |
| Storage cost | High | Low | Low |
| Schema flexibility | Rigid | Flexible | Schema on write with evolution |
| ACID transactions | Yes | No | Yes |
| ML/AI support | Limited | Good | Good |
| BI tool support | Excellent | Poor | Good to excellent |
| Complexity | Low | Medium | Medium-high |
Migration Path
Moving from warehouse-only or lake-only architectures to a lakehouse:
1. **Stand up a data lake** on low-cost object storage and land raw data there.
2. **Continue using the warehouse** for BI and reporting.
3. **Introduce an open table format** (Delta, Iceberg, Hudi) on the lake for ACID properties (a conversion sketch follows this list).
4. **Move bronze/silver layers to the lake** while keeping gold aggregations in the warehouse.
5. **Adopt a lakehouse query engine** (Trino, Athena, Databricks SQL) for direct lake queries.
6. **Migrate gold layer to the lakehouse** once performance is acceptable.
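For step 3, one low-risk option is converting existing Parquet directories to Delta in place, since only a transaction log is added and the Parquet files themselves are not rewritten. A minimal sketch, with hypothetical paths and partition column:

```python
# Hypothetical sketch for step 3: convert an existing Parquet directory
# into a Delta table in place. The path and partition schema are
# assumptions, not from the original article.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

DeltaTable.convertToDelta(
    spark,
    "parquet.`/data/lake/orders`",  # existing Parquet directory
    "order_date DATE",              # partition schema of the existing layout
)
```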
Conclusion
Data warehouses remain essential for BI and reporting. Data lakes are ideal for raw storage and ML workloads. The lakehouse architecture bridges both worlds by bringing ACID transactions, schema enforcement, and performance optimization to object storage. Choose the medallion architecture (bronze, silver, gold) to organize lakehouse data. Start with a data warehouse, add a lake for raw data, and converge on a lakehouse as your platform matures.