A data warehouse is a central repository for integrated data from multiple sources, optimized for analytical queries and reporting. Unlike operational databases that handle transactions, data warehouses are designed for read-heavy, complex queries across large datasets.
Schema Design: Star vs Snowflake
Star Schema
The star schema is the simplest and most common data warehouse schema. It has one or more fact tables referencing multiple dimension tables.
Fact_Sales
|-- sale_id (PK)
|-- date_id (FK -> Dim_Date)
|-- product_id (FK -> Dim_Product)
|-- customer_id (FK -> Dim_Customer)
|-- store_id (FK -> Dim_Store)
|-- quantity
|-- amount
Dim_Date
|-- date_id (PK)
|-- date
|-- year
|-- quarter
|-- month
|-- day
Dim_Product
|-- product_id (PK)
|-- product_name
|-- category
|-- price
Star schemas are easy to understand, perform well with simple joins, and are the preferred schema for most OLAP tools.
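For example, a report of revenue by month and category touches every table through a single level of joins (table and column names follow the schema sketched above):

```sql
-- Monthly revenue by product category over the star schema above:
-- each dimension is exactly one join away from the fact table
SELECT
    d.year,
    d.month,
    p.category,
    SUM(s.amount) AS revenue
FROM Fact_Sales s
JOIN Dim_Date d    ON s.date_id = d.date_id
JOIN Dim_Product p ON s.product_id = p.product_id
GROUP BY d.year, d.month, p.category
ORDER BY d.year, d.month, p.category;
```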
Snowflake Schema
The snowflake schema normalizes dimension tables into multiple related tables. A product dimension might split into product, category, and department tables.
Fact_Sales
|-- product_id (FK -> Dim_Product)
Dim_Product
|-- product_id (PK)
|-- product_name
|-- category_id (FK -> Dim_Category)
Dim_Category
|-- category_id (PK)
|-- category_name
|-- department_id (FK -> Dim_Department)
Snowflake schemas save storage by eliminating redundancy in dimensions. However, they require more joins per query, which can hurt performance.
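A revenue-by-category report against the snowflake version needs an extra hop, because the category name now lives one table further from the fact table:

```sql
-- Same revenue-by-category report over the snowflake schema:
-- one additional join to reach category_name
SELECT
    c.category_name,
    SUM(s.amount) AS revenue
FROM Fact_Sales s
JOIN Dim_Product p  ON s.product_id = p.product_id
JOIN Dim_Category c ON p.category_id = c.category_id
GROUP BY c.category_name;
```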
Star vs Snowflake: When to Use
| Aspect | Star | Snowflake |
|--------|------|-----------|
| Query simplicity | Simple joins | Complex joins |
| Query performance | Faster | Slightly slower |
| Storage | More (denormalized) | Less (normalized) |
| ETL complexity | Simple | Moderate |
| Dimension updates | Must touch denormalized copies | Localized to one table |
Most modern data warehouses use star schemas with some denormalization for performance-critical dimensions.
ETL vs ELT
ETL (Extract, Transform, Load)
ETL transforms data before loading it into the warehouse. Transformations run in a dedicated staging area or ETL server, outside the warehouse itself.
Extract -> Transform -> Load
**When ETL makes sense**:
- Sensitive data must be masked or cleansed before it ever lands in the warehouse (e.g., for compliance).
- Warehouse compute is limited or expensive, so transformations are cheaper to run elsewhere.
- Transformations are complex and already well-served by a dedicated ETL tool.
ELT (Extract, Load, Transform)
ELT loads raw data into the warehouse first, then transforms it using the warehouse's compute power.
Extract -> Load -> Transform
**When ELT makes sense**:
- The warehouse offers cheap, elastic compute (typical of cloud platforms).
- Analysts want raw data available for exploration and later reprocessing.
- Transformation logic changes often and is easier to maintain and version as SQL (e.g., with dbt).
ELT is the dominant pattern for modern data warehouses: cloud platforms separate compute from storage, so transformation compute can scale elastically with the workload.
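A typical ELT transform is plain SQL run inside the warehouse, for example a `CREATE TABLE AS` over a raw landing table (`raw_sales` and its cleanup rules here are illustrative, not from any specific platform):

```sql
-- ELT: the raw data is already loaded; transform it in the warehouse.
-- raw_sales and its columns are hypothetical landing-zone names.
CREATE TABLE clean_sales AS
SELECT
    CAST(sale_id AS BIGINT)       AS sale_id,
    CAST(sale_ts AS DATE)         AS sale_date,
    UPPER(TRIM(store_code))       AS store_code,
    CAST(amount AS DECIMAL(10,2)) AS amount
FROM raw_sales
WHERE amount IS NOT NULL;
```

Because the transform is just SQL, it can be versioned, tested, and re-run against the raw data at any time.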
Modern Data Warehouse Platforms
Snowflake
Snowflake is a cloud-native data warehouse with unique architecture separating storage and compute.
Key features:
- Separation of storage and compute: independently sized "virtual warehouses" run queries against shared storage.
- Time travel: query or restore data as it existed at a past point in time.
- Zero-copy cloning: create writable copies of tables, schemas, or databases without duplicating storage.
- Auto-suspend and auto-resume: warehouses pause when idle, so you pay only for active compute.
-- Snowflake: Create a virtual warehouse
CREATE WAREHOUSE analytics_wh
WITH WAREHOUSE_SIZE = 'XSMALL'
AUTO_SUSPEND = 300  -- suspend after 300 seconds of inactivity
AUTO_RESUME = TRUE;
-- Snowflake: Clone a table with time travel
CREATE TABLE sales_restore CLONE sales
AT (TIMESTAMP => '2026-05-01 12:00:00'::TIMESTAMP);
Google BigQuery
BigQuery is Google's serverless data warehouse. There is no infrastructure to manage — storage and compute are fully managed.
Key features:
- Serverless: no clusters to size or manage; compute scales automatically per query.
- Columnar storage with partitioning and clustering to prune scanned data.
- BigQuery ML: train and run models with SQL, no separate ML infrastructure.
- Streaming inserts and federated queries over external sources.
-- BigQuery: Partitioned table for efficient queries
CREATE TABLE mydataset.sales_partitioned
(order_id INT64, amount FLOAT64, order_date DATE)
PARTITION BY order_date
OPTIONS(require_partition_filter = TRUE);
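With `require_partition_filter` enabled, BigQuery rejects any query that lacks a predicate on `order_date`; a valid query prunes down to just the requested partitions:

```sql
-- Scans only the partitions for January 2026;
-- omitting the order_date filter would make the query fail outright.
SELECT SUM(amount) AS january_revenue
FROM mydataset.sales_partitioned
WHERE order_date BETWEEN '2026-01-01' AND '2026-01-31';
```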
-- BigQuery ML: Create a linear regression model.
-- linear_reg needs an explicit label column (input_label_cols),
-- and features must be numeric/categorical, so the DATE is converted.
CREATE MODEL mydataset.sales_forecast
OPTIONS(model_type='linear_reg', input_label_cols=['amount']) AS
SELECT
  amount,
  EXTRACT(DAYOFYEAR FROM order_date) AS day_of_year
FROM mydataset.sales;
Amazon Redshift
Redshift is AWS's petabyte-scale data warehouse. It is based on a modified PostgreSQL and uses a leader-node/compute-node architecture.
Key features:
- Columnar, massively parallel (MPP) execution across compute nodes.
- Distribution keys and sort keys control how rows are placed and ordered for fast joins and range scans.
- Redshift Spectrum: query data in S3 without loading it into the cluster.
- Tight integration with the AWS ecosystem (S3, IAM, Glue).
-- Redshift: Create table with distribution and sort key
CREATE TABLE sales (
sale_id BIGINT DISTKEY,
customer_id INT,
sale_date DATE SORTKEY,
amount DECIMAL(10,2)
);
-- Redshift Spectrum: Query external data in S3
CREATE EXTERNAL TABLE spectrum.sales (
sale_id BIGINT,
amount DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales-parquet/';
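Once defined, the external table is queried like any other; Redshift scans the Parquet files in S3 directly:

```sql
-- Aggregate over the S3-backed external table;
-- no data is loaded into the cluster.
SELECT COUNT(*) AS sale_count, SUM(amount) AS total_amount
FROM spectrum.sales;
```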
Materialized Views
Materialized views pre-compute and store query results. They trade storage for query performance and are essential for dashboards and recurring reports.
-- Create a materialized view for daily sales aggregations
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
d.date,
p.category,
COUNT(*) as sale_count,
SUM(s.amount) as total_sales
FROM fact_sales s
JOIN dim_date d ON s.date_id = d.date_id
JOIN dim_product p ON s.product_id = p.product_id
GROUP BY d.date, p.category;
-- Refresh the materialized view
REFRESH MATERIALIZED VIEW daily_sales_summary;
-- Query the materialized view (much faster than the base query)
SELECT * FROM daily_sales_summary
WHERE date >= '2026-01-01';
Materialized views are most useful for:
- Dashboards and reports that repeatedly run the same aggregations.
- Expensive joins or aggregations over large fact tables.
- Workloads that can tolerate slightly stale data between refreshes.
Conclusion
Modern data warehousing has evolved from on-premise appliances to cloud-native platforms with elastic scaling, columnar storage, and separated compute and storage. Star schemas remain the standard for analytical modeling. ELT has largely replaced ETL for cloud warehouses. Snowflake, BigQuery, and Redshift each offer different trade-offs in terms of management overhead, scaling, and query performance. Choose based on your team's expertise and specific workload requirements.