Facts and Dimensions
The core part of OLAP is a so-called "multi-dimensional data model", which contains two types of tables; "Fact" table and "Dimension" tableA Fact table contains records each describe an instance of a transaction. Each transaction records contains categorical attributes (which describes contextual aspects of the transaction, such as space, time, user) as well as numeric attributes (called "measures" which describes quantitative aspects of the transaction, such as no of items sold, dollar amount).
A Dimension table contain records that further elaborates the contextual attributes, such as user profile data, location details ... etc.
In a typical setting of Multi-dimensional model ...
- Each fact table contains foreign keys that references the primary key of multiple dimension tables. In the most simple form, it is called a STAR schema.
- Dimension tables can contain foreign keys that references other dimensional tables. This provides a sophisticated detail breakdown of the contextual aspects. This is also called a SNOWFLAKE schema.
- Also this is not a hard rule, Fact table tends to be independent of other Fact table and usually doesn't contain reference pointer among each other.
- However, different Fact table usually share the same set of dimension tables. This is also called GALAXY schema.
- But it is a hard rule that Dimension table NEVER points / references Fact table
Each dimension can also be hierarchical so that the analysis can be done at different degree of granularity. For example, the time dimension can be broken down into days, weeks, months, quarter and annual; Similarly, location dimension can be broken down into countries, states, cities ... etc.
Here we first create a sales fact table that records each sales transaction.
# Setup the dimension tables
state_table <-
data.frame(key=c("CA", "NY", "WA", "ON", "QU"),
name=c("California", "new York", "Washington", "Ontario", "Quebec"),
country=c("USA", "USA", "USA", "Canada", "Canada"))
month_table <-
data.frame(key=1:12,
desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))
prod_table <-
data.frame(key=c("Printer", "Tablet", "Laptop"),
price=c(225, 570, 1120))
# Function to generate the Sales table
gen_sales <- function(no_of_recs) {
# Generate transaction data randomly
loc <- sample(state_table$key, no_of_recs,
replace=T, prob=c(2,2,1,1,1))
time_month <- sample(month_table$key, no_of_recs, replace=T)
time_year <- sample(c(2012, 2013), no_of_recs, replace=T)
prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2))
unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
amount <- unit*prod_table[prod,]$price
sales <- data.frame(month=time_month,
year=time_year,
loc=loc,
prod=prod,
unit=unit,
amount=amount)
# Sort the records by time order
sales <- sales[order(sales$year, sales$month),]
row.names(sales) <- NULL
return(sales)
}
# Now create the sales fact table
sales_fact <- gen_sales(500)
# Look at a few records
head(sales_fact)
month year loc prod unit amount
1 1 2012 NY Laptop 1 225
2 1 2012 CA Laptop 2 450
3 1 2012 ON Tablet 2 2240
4 1 2012 NY Tablet 1 1120
5 1 2012 NY Tablet 2 2240
6 1 2012 CA Laptop 1 225
Multi-dimensional Cube
Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension.# Build up a cube
revenue_cube <-
tapply(sales_fact$amount,
sales_fact[,c("prod", "month", "year", "loc")],
FUN=function(x){return(sum(x))})
# Showing the cells of the cude
revenue_cube
, , year = 2012, loc = CA
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Laptop 1350 225 900 675 675 NA 675 1350 NA 1575 900 1350
Printer NA 2280 NA NA 1140 570 570 570 NA 570 1710 NA
Tablet 2240 4480 12320 3360 2240 4480 3360 3360 5600 2240 2240 3360
, , year = 2013, loc = CA
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Laptop 225 225 450 675 225 900 900 450 675 225 675 1125
Printer NA 1140 NA 1140 570 NA NA 570 NA 1140 1710 1710
Tablet 3360 3360 1120 4480 2240 1120 7840 3360 3360 1120 5600 4480
, , year = 2012, loc = NY
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Laptop 450 450 NA NA 675 450 675 NA 225 225 NA 450
Printer NA 2280 NA 2850 570 NA NA 1710 1140 NA 570 NA
Tablet 3360 13440 2240 2240 2240 5600 5600 3360 4480 3360 4480 3360
, , year = 2013, loc = NY
.....
dimnames(revenue_cube)
$prod
[1] "Laptop" "Printer" "Tablet"
$month
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
$year
[1] "2012" "2013"
$loc
[1] "CA" "NY" "ON" "QU" "WA"
OLAP Operations
Here are some common operations of OLAP- Slice
- Dice
- Rollup
- Drilldown
- Pivot
# Slice
# cube data in Jan, 2012
revenue_cube[, "1", "2012",]
loc
prod CA NY ON QU WA
Laptop 1350 450 NA 225 225
Printer NA NA NA 1140 NA
Tablet 2240 3360 5600 1120 2240
# cube data in Jan, 2012
revenue_cube["Tablet", "1", "2012",]
CA NY ON QU WA
2240 3360 5600 1120 2240
"Dice" is about limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube. For example, we can focus in sales happening in [Jan/ Feb/Mar, Laptop/Tablet, CA/NY].
revenue_cube[c("Tablet","Laptop"),
c("1","2","3"),
,
c("CA","NY")]
, , year = 2012, loc = CA
month
prod 1 2 3
Tablet 2240 4480 12320
Laptop 1350 225 900
, , year = 2013, loc = CA
month
prod 1 2 3
Tablet 3360 3360 1120
Laptop 225 225 450
, , year = 2012, loc = NY
month
prod 1 2 3
Tablet 3360 13440 2240
Laptop 450 450 NA
, , year = 2013, loc = NY
month
prod 1 2 3
Tablet 3360 4480 6720
Laptop 450 NA 225
"Rollup" is about applying an aggregation function to collapse a number of dimensions. For example, we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't care where we sold our product).
apply(revenue_cube, c("year", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
prod
year Laptop Printer Tablet
2012 22275 31350 179200
2013 25200 33060 166880
"Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of granularity. For example, we want to focus in the annual and monthly revenue for each product and collapse the location dimension (ie: we don't care where we sold our product).
apply(revenue_cube, c("year", "month", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
, , prod = Laptop
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 2250 2475 1575 1575 2250 1800 1575 1800 900 2250 1350 2475
2013 2250 900 1575 1575 2250 2475 2025 1800 2025 2250 3825 2250
, , prod = Printer
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 1140 5700 570 3990 4560 2850 1140 2850 2850 1710 3420 570
2013 1140 4560 3420 4560 2850 1140 570 3420 1140 3420 3990 2850
, , prod = Tablet
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200 7840
2013 8960 11200 10080 7840 14560 10080 29120 15680 15680 8960 12320 22400
"Pivot" is about analyzing the combination of a pair of selected dimensions. For example, we want to analyze the revenue by year and month. Or we want to analyze the revenue by product and location.
apply(revenue_cube, c("year", "month"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885
2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500
apply(revenue_cube, c("prod", "loc"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
loc
prod CA NY ON QU WA
Laptop 16425 9450 7650 7425 6525
Printer 15390 19950 7980 10830 10260
Tablet 90720 117600 45920 34720 57120
I hope you can get a taste of the richness of data processing model in R.
However, since R is doing all the processing in RAM. This requires your data to be small enough so it can fit into the local memory in a single machine.