Data Warehousing Concepts and Modern Tools


Data Warehousing Concepts

A data warehouse centralizes data from multiple sources for analysis and reporting. It is optimized for read-heavy analytical queries.

Star Schema

A central fact table connected to dimension tables:

```sql
CREATE TABLE fact_sales (
    sale_id BIGSERIAL PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(12,2) GENERATED ALWAYS AS
        (quantity * unit_price) STORED
);

CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    date DATE NOT NULL, year SMALLINT, quarter SMALLINT,
    month SMALLINT, day SMALLINT, is_holiday BOOLEAN
);
```


Snowflake Schema

Dimensions are normalized into sub-dimensions, reducing redundancy and storage at the cost of extra joins at query time.
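
As a sketch, the product dimension might be normalized into product and category tables (the `dim_category` table and its column names here are illustrative, not from the schema above):

```sql
-- dim_product no longer stores category text; it references dim_category.
CREATE TABLE dim_category (
    category_key INT PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    category_key INT REFERENCES dim_category(category_key)
);

-- Queries now need one extra join to reach category attributes:
SELECT c.category_name, SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_category c ON c.category_key = p.category_key
GROUP BY c.category_name;
```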

ETL Pipeline

```python
import pandas as pd


class ETLPipeline:
    def __init__(self, source_engine, warehouse_engine):
        self.source_engine = source_engine
        self.warehouse_engine = warehouse_engine

    def extract(self, query):
        # chunksize makes read_sql return an iterator of DataFrames,
        # bounding memory on large result sets.
        return pd.read_sql(query, self.source_engine, chunksize=10000)

    def transform(self, df):
        df = df.drop_duplicates(subset=["order_id"])
        df["order_date"] = pd.to_datetime(df["order_date"])
        # Surrogate date key in YYYYMMDD form, matching dim_date.date_key.
        df["date_key"] = df["order_date"].dt.strftime("%Y%m%d").astype(int)
        return df

    def load(self, df, table_name):
        df.to_sql(table_name, self.warehouse_engine,
                  if_exists="append", index=False)
```



Modern Data Warehousing

Cloud data warehouses like Snowflake and BigQuery separate storage and compute, enabling elastic scaling. Materialized views pre-compute aggregations for dashboard queries.
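
A materialized view for a daily-sales dashboard might look like the following (PostgreSQL syntax; the view name `mv_daily_sales` is illustrative, and the table names assume the star schema above):

```sql
-- Pre-compute daily revenue per product so dashboards scan a small
-- aggregate instead of the full fact table.
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT d.date, f.product_key, SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.date, f.product_key;

-- Refresh after each ETL load. CONCURRENTLY avoids blocking readers,
-- but requires a unique index on the view.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales;
```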

Conclusion

Design star schemas for query performance, build resilient ETL pipelines, and leverage cloud warehouses for elastic scaling. Start simple and evolve the model as needs grow.