Chapter 8: Data Warehousing and Business Intelligence
8.1 Data Warehouse Architecture
A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence.
8.1.1 Components of a Data Warehouse
- Data Sources
- ETL (Extract, Transform, Load) Process
- Data Warehouse Database
- Data Marts
- Metadata Repository
- Query and Analysis Tools
8.1.2 Data Warehouse Models
- Star Schema
- Snowflake Schema
- Galaxy Schema
Example of a Star Schema:
CREATE TABLE DimCustomer (
CustomerKey INT PRIMARY KEY,
CustomerName VARCHAR(100),
City VARCHAR(50),
State VARCHAR(50)
);
CREATE TABLE DimProduct (
ProductKey INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(50)
);
CREATE TABLE DimDate (
DateKey INT PRIMARY KEY,
Date DATE,
Year INT,
Month INT,
Quarter INT
);
CREATE TABLE FactSales (
SalesKey INT PRIMARY KEY,
CustomerKey INT,
ProductKey INT,
DateKey INT,
SalesAmount DECIMAL(10,2),
FOREIGN KEY (CustomerKey) REFERENCES DimCustomer(CustomerKey),
FOREIGN KEY (ProductKey) REFERENCES DimProduct(ProductKey),
FOREIGN KEY (DateKey) REFERENCES DimDate(DateKey)
);
8.2 ETL Processes
ETL stands for Extract, Transform, and Load. It's the process of taking data from source systems and bringing it into the data warehouse.
8.2.1 Extract
Involves extracting data from homogeneous or heterogeneous data sources.
8.2.2 Transform
Involves converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database.
Transformation processes:
- Cleaning
- Standardizing
- Deduplicating
- Sorting
- Aggregating
8.2.3 Load
Involves loading the transformed data into the end target, usually the data warehouse database.
Example of an ETL process using SQL:
-- Extract
INSERT INTO StagingCustomers (CustomerID, Name, City, State)
SELECT CustomerID, CustomerName, City, State
FROM SourceSystem.Customers;
-- Transform
UPDATE StagingCustomers
SET City = UPPER(City),
State = UPPER(State);
-- Load
INSERT INTO DimCustomer (CustomerKey, CustomerName, City, State)
SELECT CustomerID, Name, City, State
FROM StagingCustomers;
8.3 OLAP vs. OLTP
8.3.1 OLTP (Online Transaction Processing)
- Handles day-to-day transactions
- Optimized for write operations
- Deals with current data
8.3.2 OLAP (Online Analytical Processing)
- Used for complex queries and analysis
- Optimized for read operations
- Deals with historical data
Key Differences:
| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Day-to-day operations | Data analysis |
| Data | Current, operational | Historical, consolidated |
| Database | Normalized, ER model | Denormalized, Star schema |
| Queries | Simple, standardized | Complex, ad hoc |
| Performance | Transaction speed | Query speed |
8.4 Business Intelligence
Business Intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of business information.
8.4.1 BI Tools
- Tableau
- Power BI
- QlikView
- SAP BusinessObjects
8.4.2 Key BI Features
- Reporting
- Dashboards
- Data Visualization
- Ad Hoc Analysis
- Predictive Analytics
8.5 Real-life Example: Building a Data Warehouse for a Multinational Corporation
Let's consider building a data warehouse for a multinational retail corporation to analyze sales data across different regions and product categories.
Step 1: Design the Data Warehouse Schema
We'll use a star schema for our data warehouse:
-- Dimension Tables
CREATE TABLE DimStore (
StoreKey INT PRIMARY KEY,
StoreID VARCHAR(10),
StoreName VARCHAR(100),
City VARCHAR(50),
Country VARCHAR(50),
OpenDate DATE
);
CREATE TABLE DimProduct (
ProductKey INT PRIMARY KEY,
ProductID VARCHAR(10),
ProductName VARCHAR(100),
Category VARCHAR(50),
SubCategory VARCHAR(50),
Brand VARCHAR(50)
);
CREATE TABLE DimDate (
DateKey INT PRIMARY KEY,
Date DATE,
Day INT,
Month INT,
Quarter INT,
Year INT,
IsHoliday BIT
);
-- Fact Table
CREATE TABLE FactSales (
SalesKey INT PRIMARY KEY,
StoreKey INT,
ProductKey INT,
DateKey INT,
SalesAmount DECIMAL(10,2),
Quantity INT,
Discount DECIMAL(5,2),
FOREIGN KEY (StoreKey) REFERENCES DimStore(StoreKey),
FOREIGN KEY (ProductKey) REFERENCES DimProduct(ProductKey),
FOREIGN KEY (DateKey) REFERENCES DimDate(DateKey)
);
Step 2: Implement ETL Processes
Here's a simplified ETL process using SQL:
-- Extract and transform store data
INSERT INTO DimStore (StoreKey, StoreID, StoreName, City, Country, OpenDate)
SELECT
ROW_NUMBER() OVER (ORDER BY StoreID),
StoreID,
StoreName,
City,
Country,
OpenDate
FROM SourceSystem.Stores;
-- Extract and transform product data
INSERT INTO DimProduct (ProductKey, ProductID, ProductName, Category, SubCategory, Brand)
SELECT
ROW_NUMBER() OVER (ORDER BY ProductID),
ProductID,
ProductName,
Category,
SubCategory,
Brand
FROM SourceSystem.Products;
-- Extract and transform sales data
INSERT INTO FactSales (SalesKey, StoreKey, ProductKey, DateKey, SalesAmount, Quantity, Discount)
SELECT
ROW_NUMBER() OVER (ORDER BY s.SalesID),
ds.StoreKey,
dp.ProductKey,
dd.DateKey,
s.SalesAmount,
s.Quantity,
s.Discount
FROM SourceSystem.Sales s
JOIN DimStore ds ON s.StoreID = ds.StoreID
JOIN DimProduct dp ON s.ProductID = dp.ProductID
JOIN DimDate dd ON s.SaleDate = dd.Date;
Step 3: Create OLAP Cubes
Using SQL Server Analysis Services (SSAS), we can create OLAP cubes for faster analysis:
<Cube name="Sales">
<MeasureGroups>
<MeasureGroup name="FactSales">
<Measures>
<Measure name="Sales Amount" column="SalesAmount"/>
<Measure name="Quantity" column="Quantity"/>
</Measures>
</MeasureGroup>
</MeasureGroups>
<Dimensions>
<Dimension name="Store" table="DimStore"/>
<Dimension name="Product" table="DimProduct"/>
<Dimension name="Date" table="DimDate"/>
</Dimensions>
</Cube>