A data warehouse is a central repository for integrated data from multiple sources, optimized for analytical queries and reporting. Unlike operational databases that handle transactions, data warehouses are designed for read-heavy, complex queries across large datasets.


Schema Design: Star vs Snowflake


Star Schema


The star schema is the simplest and most common data warehouse schema. It has one or more fact tables referencing multiple dimension tables.


  • **Fact tables**: Contain quantitative data (sales amounts, quantities, counts). They grow rapidly and are typically the largest tables in the warehouse.
  • **Dimension tables**: Contain descriptive attributes (customer names, product categories, dates). They are smaller and change slowly.

    Fact_Sales
      |-- sale_id (PK)
      |-- date_id (FK -> Dim_Date)
      |-- product_id (FK -> Dim_Product)
      |-- customer_id (FK -> Dim_Customer)
      |-- store_id (FK -> Dim_Store)
      |-- quantity
      |-- amount

    Dim_Date
      |-- date_id (PK)
      |-- date
      |-- year
      |-- quarter
      |-- month
      |-- day

    Dim_Product
      |-- product_id (PK)
      |-- product_name
      |-- category
      |-- price
    
    

Star schemas are easy to understand, perform well with simple joins, and are the preferred schema for most OLAP tools.
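To make the join pattern concrete, here is a sketch of a typical star-schema rollup; table and column names follow the sample schema above.

```sql
-- Sales by category and quarter: the fact table joins directly
-- to each dimension it needs (names from the sample schema above)
SELECT
    d.year,
    d.quarter,
    p.category,
    SUM(f.amount) AS total_sales
FROM Fact_Sales f
JOIN Dim_Date d ON f.date_id = d.date_id
JOIN Dim_Product p ON f.product_id = p.product_id
GROUP BY d.year, d.quarter, p.category
ORDER BY d.year, d.quarter, total_sales DESC;
```

Every dimension is exactly one join away from the fact table, which is what keeps star-schema queries simple for both analysts and query optimizers.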


Snowflake Schema


The snowflake schema normalizes dimension tables into multiple related tables. A product dimension might split into product, category, and department tables.


    
    Fact_Sales
      |-- product_id (FK -> Dim_Product)

    Dim_Product
      |-- product_id (PK)
      |-- product_name
      |-- category_id (FK -> Dim_Category)

    Dim_Category
      |-- category_id (PK)
      |-- category_name
      |-- department_id (FK -> Dim_Department)
    
    

Snowflake schemas save storage by eliminating redundancy in dimensions. However, they require more joins for queries, which can impact performance.
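For comparison, a sketch of the same kind of rollup against the snowflake layout above; the fact table's measure column (`amount`) is assumed, and each normalization level adds a join.

```sql
-- Sales by category under a snowflake schema: reaching category_name
-- requires traversing Dim_Product first (amount is assumed on Fact_Sales)
SELECT
    c.category_name,
    SUM(f.amount) AS total_sales
FROM Fact_Sales f
JOIN Dim_Product p ON f.product_id = p.product_id
JOIN Dim_Category c ON p.category_id = c.category_id
GROUP BY c.category_name;
```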


Star vs Snowflake: When to Use


| Aspect | Star | Snowflake |
|--------|------|-----------|
| Query simplicity | Simple joins | Complex joins |
| Query performance | Faster | Slightly slower |
| Storage | More (denormalized) | Less (normalized) |
| ETL complexity | Simple | Moderate |
| Dimension updates | In-place updates across denormalized rows | Easier partial updates |


Most modern data warehouses use star schemas with some denormalization for performance-critical dimensions.


ETL vs ELT


ETL (Extract, Transform, Load)


ETL transforms data before loading it into the warehouse. Transformations happen on a dedicated staging server.


    
    Extract -> Transform -> Load
    
    

**When ETL makes sense**:

  • Legacy systems with limited compute power.
  • Data requires significant cleaning before loading.
  • Compliance requirements mandate data masking before storage.

ELT (Extract, Load, Transform)


ELT loads raw data into the warehouse first, then transforms it using the warehouse's compute power.


    
    Extract -> Load -> Transform
    
    

**When ELT makes sense**:

  • Modern cloud data warehouses with elastic compute (Snowflake, BigQuery).
  • Raw data is needed for future analysis (data lakehouse pattern).
  • Transformations are exploratory and iterative.
  • You want to decouple data ingestion from transformation.

ELT is the dominant pattern for modern data warehouses because cloud warehouses separate compute from storage, making it practical to run large-scale transformations inside the warehouse itself.
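As a sketch of the "T" in ELT (table and column names here are illustrative): raw data is loaded as-is, then a cleaned table is derived inside the warehouse with plain SQL.

```sql
-- ELT transform step, run inside the warehouse after loading:
-- derive a typed, filtered table from a raw landing table
-- (raw_events and its columns are illustrative names)
CREATE TABLE clean_sales AS
SELECT
    CAST(order_id AS BIGINT)      AS order_id,
    CAST(amount AS DECIMAL(10,2)) AS amount,
    CAST(event_time AS DATE)      AS order_date
FROM raw_events
WHERE amount IS NOT NULL;
```

Because the raw table is retained, the transformation can be re-run or revised later without re-extracting from the source systems.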


Modern Data Warehouse Platforms


Snowflake


Snowflake is a cloud-native data warehouse with an architecture that separates storage from compute.


Key features:

  • **Storage**: Compressed columnar format in cloud object storage (S3, Azure Blob).
  • **Compute**: Virtual warehouses (clusters) that can scale independently and can be paused.
  • **Concurrency**: Multiple virtual warehouses can access the same data without contention.
  • **Time travel**: Query data as it existed at a point in the past; retention is configurable up to 90 days, depending on edition.
  • **Zero-copy cloning**: Instantly clone databases or schemas without copying data.

    -- Snowflake: Create a virtual warehouse
    CREATE WAREHOUSE analytics_wh
      WITH WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE;

    -- Snowflake: Clone a table as it existed at a point in time
    CREATE TABLE sales_restore CLONE sales
      AT (TIMESTAMP => '2026-05-01 12:00:00'::TIMESTAMP);
    
    

Google BigQuery


BigQuery is Google's serverless data warehouse. There is no infrastructure to manage: storage and compute are fully managed.


Key features:

  • **Automatic scaling**: No clusters or nodes to configure.
  • **Columnar storage**: Capacitor columnar storage format with automatic compression.
  • **BigQuery Omni**: Query data across AWS, Azure, and GCP.
  • **BI Engine**: In-memory analysis for sub-second query responses.
  • **Machine learning**: Run ML models directly in SQL with BigQuery ML.

    -- BigQuery: Partitioned table for efficient queries
    CREATE TABLE mydataset.sales_partitioned
      (order_id INT64, amount FLOAT64, order_date DATE)
    PARTITION BY order_date
    OPTIONS(require_partition_filter = TRUE);

    -- BigQuery ML: Linear regression predicting amount from date features
    -- (input_label_cols names the column to predict)
    CREATE MODEL mydataset.sales_forecast
      OPTIONS(model_type = 'linear_reg', input_label_cols = ['amount']) AS
    SELECT
      amount,
      EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
      EXTRACT(MONTH FROM order_date) AS month
    FROM mydataset.sales;
    
    

Amazon Redshift


Redshift is AWS's petabyte-scale data warehouse. It is based on a modified version of PostgreSQL and uses a leader node/compute node architecture.


Key features:

  • **Massively Parallel Processing (MPP)**: Data is distributed across compute nodes.
  • **Distribution styles**: KEY, EVEN, and ALL distribution control how data is distributed across slices.
  • **Sort keys**: Compound and interleaved sort keys improve query performance for filtered queries.
  • **Redshift Spectrum**: Query data directly in S3 without loading it into Redshift.
  • **Concurrency Scaling**: Automatically adds capacity for spikes in concurrent queries.

    -- Redshift: Create table with distribution and sort key
    CREATE TABLE sales (
        sale_id BIGINT DISTKEY,
        customer_id INT,
        sale_date DATE SORTKEY,
        amount DECIMAL(10,2)
    );

    -- Redshift Spectrum: Query external data in S3
    -- (assumes an external schema named "spectrum" already exists)
    CREATE EXTERNAL TABLE spectrum.sales (
        sale_id BIGINT,
        amount DECIMAL(10,2)
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales-parquet/';
    
    

Materialized Views


Materialized views pre-compute and store query results. They trade storage for query performance and are essential for dashboards and recurring reports.


    
    -- Create a materialized view for daily sales aggregations
    CREATE MATERIALIZED VIEW daily_sales_summary AS
    SELECT
        d.date,
        p.category,
        COUNT(*) AS sale_count,
        SUM(s.amount) AS total_sales
    FROM fact_sales s
    JOIN dim_date d ON s.date_id = d.date_id
    JOIN dim_product p ON s.product_id = p.product_id
    GROUP BY d.date, p.category;

    -- Refresh the materialized view
    REFRESH MATERIALIZED VIEW daily_sales_summary;

    -- Query the materialized view (much faster than the base query)
    SELECT * FROM daily_sales_summary
    WHERE date >= '2026-01-01';
    
    

Materialized views are most useful for:

  • Aggregations over large fact tables (monthly totals, category summaries).
  • Pre-joined tables that are expensive to join each time.
  • Data that does not need real-time freshness.

Conclusion


Modern data warehousing has evolved from on-premise appliances to cloud-native platforms with elastic scaling, columnar storage, and separated compute and storage. Star schemas remain the standard for analytical modeling. ELT has largely replaced ETL for cloud warehouses. Snowflake, BigQuery, and Redshift each offer different trade-offs in terms of management overhead, scaling, and query performance. Choose based on your team's expertise and specific workload requirements.