Monday, March 3, 2014

Estimating statistics via Bootstrapping and Monte Carlo simulation

We want to estimate some "statistics" (e.g. average income, 95 percentile height, variance of weight ... etc.) from a population.

It will be too tedious to enumerate all members of the whole population.  For efficiency reason, we randomly pick a number samples from the population, compute the statistics of the sample set to estimate the corresponding statistics of the population.  We understand the estimation done this way (via random sampling) can deviate from the population.  Therefore, in additional to our estimated statistics, we also include a "standard error" (how big our estimation may be deviated from the actual population statistics) or a "confidence interval" (a lower and upper bound of the statistics which we are confident about containing the true statistics).

The challenge is how do we estimate the "standard error" or the "confidence interval".  A straightforward way is to repeat the sampling exercise many times, each time we create a different sample set from which we compute one estimation.  Then we look across all estimations from different sample sets to estimate the standard error and confidence interval of the estimation.

But what if collecting data from a different sample set is expensive, or for any reason the population is no longer assessable after we collected our first sample set.  Bootstrapping provides a way to address this ...

Bootstrapping

Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set.  Each of the created sample set will follow the same data distribution of the original sample set, which in turns, follow the population.

R provides a nice "bootstrap" library to do this.

> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Lets say we care about the ninety percentile
> quantile(population.weight, 0.9)
     90% 
236.8105 
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of ninety percentile
> quantile(sample_set1, 0.9)
     90% 
232.3641 
> # Notice that the sample statistics deviates from the population statistics
> # We want to estimate how big is this deviation by using bootstrapping
> # I need to define my function to compute the statistics
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = sample_set1, statistic = ninety_percentile, R = 1000)


Bootstrap Statistics :
    original   bias    std. error
t1* 232.3641 2.379859     5.43342
> plot(boot_result)
> boot.ci(boot_result, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_result, type = "bca")

Intervals : 
Level       BCa          
95%   (227.2, 248.1 )  
Calculations and Intervals on Original Scale


Here is the visual output of the bootstrap plot

Bootstrapping is a powerful simulation technique for estimate any statistics in an empirical way.  It is also non-parametric because it doesn't assume any model as well as parameters and just use the original sample set to estimate the statistics. 

If we assume certain distribution model want to see the distribution of certain statistics.  Monte Carlo simulation provides a powerful way for this.

Monte Carlo Simulation

The idea is pretty simple, based on a particular distribution function (defined by a specific model parameters), we generate many sets of samples.  We compute the statistics of each sample set and see how the statistics distributed across different sample sets.

For example, given a normal distribution population, what is the probability distribution of the max value of 5 randomly chosen samples.

> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+     sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)


Here is the distribution of the "max(5)" statistics, which shows some right skewness

Bootstrapping and Monte Carlo simulation is a powerful tool to estimate statistics in an empirical manner, especially when we don't have an analytic form of solution.

Friday, December 27, 2013

Spark: Low latency, massively parallel processing framework

While Hadoop fits well in most batch processing workload, and is the primary choice of big data processing today, it is not optimized for other types of workload  due to its following limitation
  • Lack of iteration support
  • High latency due to persisting intermediate data onto disk
 For a more detail elaboration of the Hadoop limitation, refer to my previous post.

Nevertheless, the Map/Reduce processing paradigm is a proven mechanism for dealing with large scale data.  On the other hand, many of Hadoop's infrastructure piece such as HDFS, HBase has been mature over time.

In this blog post, we'll look at a different architecture called Spark, which has taken the strength of Hadoop and make improvement in a number of Hadoop's weakness, and provides a more efficient batch processing framework with a much lower latency (from the benchmark result, Spark (using RAM cache) claims to be 100x faster than Hadoop, and 10x faster than Hadoop when running on disk.  Although competing with Hadoop MapRed, Spark integrates well with other parts of Hadoop Ecosystem (such as HDFS, HBase ... etc.).  Spark has generated a lot of excitement in the big data community and represents a very promising parallel execution stack for big data analytics.

Berkeley Spark

Within the Spark cluster, there is a driver program where the application logic execution is started, with multiple workers which processing data in parallel.  Although this is not mandated, data is typically collocated with the worker and partitioned across the same set of machines within the cluster.  During the execution, the driver program will passed code/closure into the worker machine where processing of corresponding partition of data will be conducted.  The data will undergoing different steps of transformation while staying in the same partition as much as possible (to avoid data shuffling across machines).  At the end of the execution, actions will be executed at the worker and result will be returned to the driver program.


Underlying the cluster, there is an important Distributed Data Structure called RDD (Resilient Distributed Dataset), which is a logically centralized entity but physically partitioned across multiple machines inside a cluster based on some notion of key.  Controlling how different RDD are co-partitioned (with the same keys) across machines can reduce inter-machine data shuffling within a cluster.  Spark provides a "partition-by" operator which create a new RDD by redistributing the data in the original RDD across machines within the cluster.



RDD can optionally be cached in RAM and hence providing fast access.  Currently the granularity of caching is done at the RDD level, either the whole or none of the RDD is cached.  Cached is a hint but not a guarantee.  Spark will try to cache the RDD if sufficient memory is available in the cluster, based on LRU (Least Recent Use) eviction algorithm.

RDD provides an abstract data structure from which application logic can be expressed as a sequence of transformation processing, without worrying about the underlying distributed nature of the data.

Typically an application logic are expressed in terms of a sequence of TRANSFORMATION and ACTION.  "Transformation" specifies the processing dependency DAG among RDDs and "Action" specifies what the output will be (ie: the sink node of the DAG with no outgoing edge).  The scheduler will perform a topology sort to determine the execution sequence of the DAG, tracing all the way back to the source nodes, or node that represents a cached RDD.


Notice that dependencies in Spark come in two forms.  "Narrow dependency" means the all partitions of an RDD will be consumed by a single child RDD (but a child RDD is allowed to have multiple parent RDDs).  "Wide dependencies" (e.g. group-by-keys, reduce-by-keys, sort-by-keys) means a parent RDD will be splitted with elements goes to different children RDDs based on their keys.  Notice that RDD with narrow dependencies preserve the key partitioning between parent and child RDD.  Therefore RDD can be co-partitioned with the same keys (parent key range to be a subset of child key range) such that the processing (generating child RDD from parent RDD) can be done within a machine with no data shuffling across network.  On the other hand, RDD will wide dependencies involves data shuffling.


Narrow transformation (involves no data shuffling) includes the following operators
  • Map
  • FlatMap
  • Filter
  • Sample
Wide transformation (involves data shuffling) includes the following operators
  •  SortByKey
  • ReduceByKey
  • GroupByKey
  • CogroupByKey
  • Join
  • Cartesian
Action output the RDD to the external world and includes the following operators
  • Collect
  • Take(n)
  • Reduce
  • ForEach
  • Sample
  • Count
  • Save
The scheduler will examine the type of dependencies and group the narrow dependency RDD into a unit of processing called a stage.  Wide dependencies will span across consecutive stages within the execution and require the number of partition of the child RDD to be explicitly specified.


A typical execution sequence is as follows ...
  1. RDD is created originally from external data sources (e.g. HDFS, Local file ... etc)
  2. RDD undergoes a sequence of TRANSFORMATION (e.g. map, flatMap, filter, groupBy, join), each provide a different RDD that feed into the next transformation.
  3. Finally the last step is an ACTION (e.g. count, collect, save, take), which convert the last RDD into an output to external data sources
The above sequence of processing is called a lineage (outcome of the topological sort of the DAG).  Each RDD produced within the lineage is immutable.  In fact, unless if it is cached, it is used only once to feed the next transformation to produce the next RDD and finally produce some action output.

In a classical distributed system, fault resilience is achieved by replicating data across different machines together with a active monitoring system.  In case of any machine crashes, there is always another copy of data residing in a different machine from where recovery can take place.

Fault resiliency in Spark takes a different approach.  First of all, as a large scale compute cluster, Spark is not meant to be a large scale data cluster at all.  Spark makes two assumptions of its workload.
  • The processing time is finite (although the longer it takes, the cost of recovery after fault will be higher)
  • Data persistence is the responsibility of external data sources, which keeps the data stable within the duration of processing.
Spark has made a tradeoff decision that in case of any data lost during the execution, it will re-execute the previous steps to recover the lost data.  However, this doesn't mean everything done so far is discarded and we need to start from scratch at the beginning.  We just need to re-executed the corresponding partition in the parent RDD which is responsible for generating the lost partitions, in case of narrow dependencies, this resolved to the same machine.

Notice that the re-execution of lost partition is exactly the same as the lazy evaluation of the DAG, which starts from the leaf node of the DAG, tracing back the dependencies on what parent RDD is needed and then eventually track all the way to the source node.  Recomputing the lost partition is done is a similar way, but taking partition as an extra piece of information to determine which parent RDD partition is needed.

However, re-execution across wide dependencies can touch a lot of parent RDD across multiple machines and may cause re-execution of everything. To mitigate this, Spark persist the intermediate data output from a Map phase before it shuffle them to different machines executing the reduce phase.  In case of machine crash, the re-execution (from another surviving machine) just need to trace back to fetch the intermediate data from the corresponding partition of the mapper's persisted output.  Spark also provide a checkpoint API to explicitly persist intermediate RDD so re-execution (when crash) doesn't need to trace all the way back to the beginning.  In future, Spark will perform check-pointing automatically by figuring out a good balance between the latency of recovery and the overhead of check-pointing based on statistical result.

Spark provides a powerful processing framework for building low latency, massively parallel processing for big data analytics.  It supports API around the RDD abstraction with a set of operation for transformation and action for a number of popular programming language like Scala, Java and Python.

In future posts, I'll cover other technologies in the Spark stack including real-time analytics using streaming as well as machine learning frameworks.

Thursday, December 12, 2013

Escape local optimum trap

Optimization is a very common technique in computer science and machine learning to search for the best (or good enough) solution.  Optimization itself is a big topic and involves a wide range of mathematical techniques in different scenarios.


In this post, I will be focusing in local search, which is a very popular technique in searching for an optimal solution based on a series of greedy local moves.  The general setting of local search is as follows ...

1. Define an objective function
2. Pick an initial starting point
3. Repeat
     3.1 Find a neighborhood
     3.2 Locate a subset of neighbors that is a candidate move
     3.3 Select a candidate from the candidate set
     3.4 Move to the candidate

One requirement is that the optimal solution need to be reachable by a sequence of moves.  Usually this requires a proof that any arbitrary state is reachable by any arbitrary state through a sequence of moves.

Notice that there are many possible strategies for each steps in 3.1, 3.2, 3.3.  For example, one strategy can examine all members within the neighborhood, pick the best one (in terms of the objective function) and move to that.  Another strategy can randomly pick a member within the neighborhood, and move to the member if it is better than the current state.

Regardless of the strategies, the general theme is to move towards the members which is improving the objective function, hence the greedy nature of this algorithm.

One downside of this algorithm is that it is possible to be trapped in a local optimum, whose is the best candidate within its neighborhood, but not the best candidate from a global sense.

In the following, we'll explore a couple meta-heuristic techniques that can mitigate the local optimum trap.

Multiple rounds

We basically conduct k rounds of local search, each round getting a local optimum and then pick the best one out of these k local optimum.

Simulated Anealing

This strategy involves a dynamic combination of exploitation (better neighbor) and exploration (random walk to worse neighbor).  The algorithm works in the following way ...

1. Pick an initial starting point
2. Repeat until terminate condition
     2.1 Within neighborhood, pick a random member
     2.2 If neighbor is better than me
           move to the neighbor
         else
           With chance exp(-(myObj - neighborObj)/Temp)
               move to the worse neighbor
     2.3 Temp = alpha * Temp

Tabu Search

This strategy maintains a list of previously visited states (called Tabu list) and make sure these states will not be re-visited in subsequent exploration.  The search will keep exploring the best move but skipping the previously visited nodes.  This way the algorithm will explore the path that hasn't be visited before.  The search also remember the best state obtained so far.

1. Initialization
     1.1 Pick an initial starting point S
     1.2 Initial an empty Tabu list
     1.3 Set the best state to S
     1.4 Put S into the Tabu list
2. Repeat until terminate condition
     2.1 Find a neighborhood
     2.2 Locate a smaller subset that is a candidate move
     2.3 Remove elements that is already in Tabu list
     2.4 Select the best candidate and move there
     2.5 Add the new state to the Tabu list
     2.6 If the new state is better that best state
          2.6.1 Set the best state to this state
     2.7 If the Tabu list is too large
          2.7.1 Trim Tabu list by removing old items

Saturday, November 9, 2013

Diverse recommender

This is a continuation of my previous blog in recommendation systems, which describes some basic algorithm for building recommendation systems.  These techniques evaluate each item against user's interest independently and pick the topK items to construct the recommendation.  However, it suffers from the lack of diversity.  For example, the list may contain the same book with soft cover, hard cover, and Kindle version.  Since human's interests are usually diverse, a better recommendation list should contain items that covers a broad spectrum of user's interests, even though each element by itself is not the most aligned with the user's interests.

In this post, I will discuss a recommendation algorithm that consider diversity in its list of recommendation.

Topic Space

First of all, lets define a "topic space" where both the content and user will be map to.  Having a "topic space" is a common approach in recommendation because it can reduce dimensionality and resulting in better system performance and improved generality.

The set of topics in topic space can be extracted algorithmically using Text Mining techniques such as LDA, but for simplicity here we use a manual approach to define the topic space (topics should be orthogonal to each other, as highly correlated topics can distort the measures).  Lets say we have the following topics defined ...
  • Romance
  • Sports
  • Politics
  • Weather
  • Horror
  • ...

Content as Vector of topic weights

Once the topic space is defined, content author can assign topic weights to each content.  For example, movie can be assigned with genres and web page can be assigned with topics as well.  Notice that a single content can be assigned with multiple topics of different weights.  In other words, each content can be described as a vector of topic weights.

User as Vector of topic weights

On the other hand, user can also be represented as a vector of topic weights based on their interaction of content, such as viewing a movie, visiting a web page, buying a product ... etc.  Such interaction can have a positive or negative effect depends on whether the user like or dislike the content.  If the user like the content, the user vector will have the corresponding topic weight associated with the content increases by multiplying alpha (alpha > 1).  If the user dislike the content, the corresponding topic weight will be divided by alpha.  After each update, the user vector will be normalized to a unit vector.

Diversifying the recommendation

We use a utility function to model the diversity of the document and then maximize the utility function.



 










In practice, A is not computed from the full set of documents, which is usually huge.  The full set of documents is typically indexed using some kind of Inverted index technologies using the set of topics as keywords, each c[j,k] is represented as tf-idf.

User is represented as a "query", and will be sent to the inverted index as a search request.  Relevant documents (based on cosine distance measures w.r.t. user) will be return as candidate set D (e.g. top 200 relevant content).

To pick the optimal set of A out of D, we use a greedy approach as follows ...
  1. Start with an empty set A
  2. Repeat until |A| > H
  • Pick a doc i from D such that by adding it to the set A will maximize the utility function
  • Add doc i into A and remove doc i from D

Saturday, September 7, 2013

Exploration and Exploitation

"Cold Start" is a common problem happens quite frequent in recommendation system.  When a new item enters, there is no prior history that the recommendation system can use.  Since recommendation is an optimization engine which recommends item that matches the best with the user's interests.  Without prior statistics, the new item will hardly be picked up as recommendation and hence continuously lacking the necessary statistics that the recommendation system can use.

One example is movie recommendation where a movie site recommends movies to users based on their past viewing history.  When a new movie arrives the market, there isn't enough viewing statistics about the movie and therefore the new movie will not have a strong match score and won't be picked as a recommendation.  Because we never learn from those that we haven't recommended, the new movies will continuously not have any statistics and therefore will never be picked in future recommendations.

Another cold start example is online Ad serving when a new Ad enters the Ad repository.

Multilevel Granularity Prediction

One solution of cold-start problem is to leverage existing items that are "SIMILAR" to the new item; "similarity" is based on content attributes (e.g. actors, genres).  Notice that here we are using a coarser level of granularity (group of similar items) for prediction, which can be less accurate than a fine-grain model that use view statistics history for prediction.

In other words, we can make recommendation based on two models of different granularity.  Fine-grain model based on instance-specific history data is preferred because it usually has higher accuracy.  For cold-start problem when the new items don't have history data available, we will fall back to use the coarse-grain model based on other similar items to predict user's interests on the new items.

A common approach is to combine both models of different granularity using different weights where the weights depends on the confidence level of the fine-grain model.  For new items, the fine-grain model will have low confidence and therefore it gives more weights to the coarse-grain model.

However, in case we don't have a coarser level of granularity, or the coarse level is too coarse and doesn't give good prediction.  We have to use the fine-grain model to predict.  But how can we build up the instance-specific history for the fine-grain model when we are not sure if the new items are good recommendation for the user ?

Optimization under Uncertainty

 The core of our problem is we need to optimize under uncertainties.  We have two approaches
  1. Exploitation: Make the most optimal choice based on current data.  Because of uncertainty (high variation) the current data may deviate from its true expected value, we may end up picking a non-optimal choice.
  2. Exploration: Make a random choice or choices that we haven't made before.  The goal is to gather more data point and reduce the uncertainty.  This may waste our cycles of picking the optimal choice. 
 Lets start with a simple, multi-bandit problem.  There are multiple bandit in a Casino, each Bandit has a different probability to win.  If you know the true underlying winning probability of each bandit, you will pick the one with the highest winning probability and keep playing on that one.
Unfortunately, you don't know the underlying probability and has only a limited number of rounds to play.  How would you choose which bandit to play to maximize the total number of rounds you win.

Our strategy should strike a good balance between exploiting and exploring.  To measure how good the strategy is, there is a concept of "regret", which is the ratio of the two elements
  • Value you obtain by following Batch Optimal strategy (after you have done batch analysis and have a clear picture in the underlying probability distribution)
  • Value you obtain by following the strategy
We'll pick our strategy to do more Exploration initially when you have a lot of uncertainty, and gradually tune down the ratio of Exploration (and leverage more on Exploitation) as we collected more statistics.

Epsilon-Greedy Strategy 

In the "epsilon-greedy" strategy, at every play we throw a dice between explore and exploit.
With probability p(t) = k/t (where k is a constant and t is the number of tries so far),we pick a bandit randomly with equal chance (regardless of whether the bandit has been picked in the past).  And with probability 1 - p(t), we pick the bandit that has the highest probability of win based on passed statistics.

Epsilon-greedy has the desirable property of spending more time to explore initially and gradually reduce the portion as time passes.  However, it doesn't have a smooth transition between explore and exploit.  Also while it explores, it picks each bandit uniformly without giving more weight to the unexplored bandits.  While it exploits, it doesn't consider the confidence of probability estimation.

Upper Confidence Bound: UCB

In the more sophisticated UCB strategy, each bandit is associated with an estimated mean with a confidence interval.  In every play, we choose the bandit whose upper confidence bound (ie: mean + standard deviation) is the largest.

Initially each bandit has a zero mean and a large confidence interval.  As time goes, we estimated the mean p[i] of bandit i based on how many time it wins since we play the bandit i.  We also adjust the confidence interval (reducing deviation) as we play the bandit.
e.g. standard deviation is (p.(1-p)/n)^0.5

Notice that  the UCB model can be used in a more general online machine learning setting.  We require the machine learning model be able to output its estimation based on a confidence parameter.  As a concrete example, lets say a user is visiting our movie site and we want to recommend a movie to the user, based on a bunch of input features (e.g. user feature, query feature ... etc.).

We can do a first round selection (based on information retrieval technique) to identify movie candidate based on relevancy (ie: user's viewing history or user search query).  For each movie candidate, we can invoke the ML model to estimated interest level, as well as the 68% confidence boundary (the confidence level is arbitrary and need to be hand-tuned, 68% is roughly one standard deviation of a Gaussian distribution).  We then combine them by add the 68% confidence range as an offset to its estimation and recommend the movie that has the highest resulting value.

After recommendation, we monitor whether user click on it, view it ... etc. and the response will be fed back to our ML model as a new training data.  Our ML model is an online learning setting and will update the model with this new training data.  Over time the 68% confidence range will be reduced over time as more and more data is gathered.

Relationship with A/B Testing

For most web sites, we run experiments continuously to improve user experience by trying out different layouts, or to improve user's engagement by recommending different types of contents, or by trying out different things.  In general, we have an objective function that defines what aspects we are trying to optimize, and we run different experiments through A/B testing to try out different combinations of configuration to see which one will maximize our objective function.

When the number of experiments (combinations of different configuration) is small, then A/B is exploration mainly.  In a typical setting, we use the old user experience as a control experiment and use the new user experience as a treatment.  The goal is to test if the treatment causes any significant improvement from the control.  Certain percentage of production users (typically 5 - 10%) will be directed to the new experience and we measured whether the user engagement level (say this is our objective function) has increased significantly in a statistical sense.  Such splitting is typically done by hashing the user id (or browser cookies), and based on the range of the hash code falls to determine whether the user should get the new experience.  This hashing is consistent (same user will hash into the same bucket in subsequent request) and so the user will get the whole new user experience when visiting the web site.

When the number of experiments is large and new experiments comes out dynamically and unpredictable, traditional A/B testing model described above will not be able to keep track of all pairs of control and treatment combination.  In the case, we need to use the dynamic exploration/exploitation mechanism to find out the best user experience.

Using the UCB approach, we can treat each user experience as a bandit that the A/B test framework can choose from.  Throughout the process, A/B test framework will explore/exploit among different user experience to optimize for the objective function.  At any time, we can query the A/B testing framework to find out the latest statistics of each user experience.  This provides a much better way to look at large number of experiment result at the same time.

Saturday, August 17, 2013

Six steps in data science

"Data science" and "Big data" has become a very hot term in last 2 years.  Since the data storage price dropped significantly, enterprises and web companies has collected huge amount of their customer's behavioral data even before figuring out how to use them.  Enterprises also start realizing that these data, if use appropriately, can turn into useful insight that can guide its business along a more successful path.

From my observation, the effort of data science can be categorized roughly into the following areas.
  • Data acquisition
  • Data visualization
  • OLAP, Report generation
  • Response automation
  • Predictive analytics
  • Decision optimization

Data Acquisition

Data acquisition is about collecting data into the system.  This is an important step before any meaningful processing or analysis can begin.  Most companies start with collecting business transaction records from their OLTP system.  Typically there is an ETL (Extract/Transform/Load) process involved to ingest the raw data, clean the data and transform them appropriately.  Finally, the data is loaded into a data warehouse where the subsequent data analytics exercises are performed.  In today's big data world, the destination has been shifted from traditional data warehouse to Hadoop distributed file system (HDFS).

Data Visualization

Data visualization is usually the first step of analyzing the data.  It is typically done by plotting data in different ways to establish give a quick sense of its shape, in order to guide the data scientist to determine what subsequent analysis should be conducted.  This is also where a more experience data scientist can be distinguished from an less experienced one based on how fast they can spot common patterns or anomalies from data.  Most of the plotting package works only for data that fits into one single machine.  Therefore, in the big data world, data sampling is typically conducted first to reduce the data size, and then import to a single machine where R, SAS, SPSS can be used to visualized the sample data.

OLAP / Reporting

OLAP is about aggregating transaction data (e.g. revenue) along different dimensions (e.g. Month, Location, Product) where the enterprise defines its KPI/business metrics that measures the companies' performance.  This can be done either in an ad/hoc manner (OLAP), or in a predefined manner (Report template).  Report writer (e.g. Tableau, Microstrategy) are use to produce the reports.  The data is typically stored in regular RDBMS or a multidimensional cube which is optimizing for OLAP processing (ie: slice, dice, rollup, drilldown).  In the big data world, Hive provides a SQL-like access mechanism and is commonly used to access data stored in HDFS.  Most popular report writers have integrated with Hive (or declare plan to integrate with it) to access big data stored in HDFS.

Response Automation

Response automation is about leveraging domain-specific knowledge to encode a set of "rules" which include event/condition/action.  The system monitors all events observed, matches them against the conditions, (which can be a boolean expression of event attributes, or a sequence of event occurrence), and trigger appropriate actions.  In the big data world, automating such response is typically done by stream processing mechanism (such as Flume and Storm).  Notice that the "rule" need to be well-defined and unambiguous.

Predictive Analytics

Prediction is about estimating unknown data based on observed data through statistical/probabilistic approach.  Depends on the data type of the output, "prediction" can be subdivided into "classification" (when the output is a category) or "regression" (when the output is a number).

Prediction is typically done by first "train" a predictive model using historic data (where all input and output value is known).  This training is done via an iterative process where the performance of the model in each iteration is measured at the end of each iteration.  Additional input data or different model parameters will be used in next round of iteration.  When the predictive performance is good enough and no significant improvement is made between subsequent iterations, the process will stop and the best model created during the process will be used.

Once we have the predictive model, we can use it to predict information we haven't observed, either this is information that is hidden, or hasn't happen yet (ie: predicting future)

For a more detail description of what is involved in performing predictive analytics, please refer to my following blog.

Decision Optimization

Decision optimization is about making the best decision after carefully evaluating possible options against some measure of business objectives.  The business objective is defined by some "objective function" which is expressed as a mathematical formula of some "decision variables".  Through various optimization technique, the system will figure out what the decision variables should be in order ro maximize (or minimize) the value of the objective function.  The optimizing technique for discrete decision variables is typically done using exhaustive search, greedy/heuristic search, integer programming technique, while optimization for continuous decision variables is done using linear/quadratic programming.

From my observation, "decision optimization" is the end goal of most data science effort.  But "decision optimization" relies on previous effort.  To obtain the overall picture (not just observed data, but also hidden or future data) at the optimization phase, we need to make use of the output of the prediction.  And in order to train a good predictive model, we need to have the data acquisition to provide clean data.  All these efforts are inter-linked with each other in some sense.

Monday, July 29, 2013

OLAP operation in R

OLAP (Online Analytical Processing) is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions.  This is a well-established field in Business Intelligence / Reporting.  In this post, I will highlight the key ideas in OLAP operation and illustrate how to do this in R.

Facts and Dimensions

The core part of OLAP is a so-called "multi-dimensional data model", which contains two types of tables; "Fact" table and "Dimension" table

A Fact table contains records each describe an instance of a transaction.  Each transaction records contains categorical attributes (which describes contextual aspects of the transaction, such as space, time, user) as well as numeric attributes (called "measures" which describes quantitative aspects of the transaction, such as no of items sold, dollar amount).

A Dimension table contain records that further elaborates the contextual attributes, such as user profile data, location details ... etc.

In a typical setting of Multi-dimensional model ...
  • Each fact table contains foreign keys that references the primary key of multiple dimension tables.  In the most simple form, it is called a STAR schema.
  • Dimension tables can contain foreign keys that references other dimensional tables.  This provides a sophisticated detail breakdown of the contextual aspects.  This is also called a SNOWFLAKE schema.
  • Also this is not a hard rule, Fact table tends to be independent of other Fact table and usually doesn't contain reference pointer among each other.
  • However, different Fact table usually share the same set of dimension tables.  This is also called GALAXY schema.
  • But it is a hard rule that Dimension table NEVER points / references Fact table
 A simple STAR schema is shown in following diagram.


Each dimension can also be hierarchical so that the analysis can be done at different degree of granularity.  For example, the time dimension can be broken down into days, weeks, months, quarter and annual; Similarly, location dimension can be broken down into countries, states, cities ... etc.

Here we first create a sales fact table that records each sales transaction.

# Setup the dimension tables

state_table <- 
  data.frame(key=c("CA", "NY", "WA", "ON", "QU"),
             name=c("California", "new York", "Washington", "Ontario", "Quebec"),
             country=c("USA", "USA", "USA", "Canada", "Canada"))

month_table <- 
  data.frame(key=1:12,
            desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
            quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))

prod_table <- 
  data.frame(key=c("Printer", "Tablet", "Laptop"),
            price=c(225, 570, 1120))

# Function to generate the Sales table
gen_sales <- function(no_of_recs) {

  # Generate transaction data randomly
  loc <- sample(state_table$key, no_of_recs, 
                replace=T, prob=c(2,2,1,1,1))
  time_month <- sample(month_table$key, no_of_recs, replace=T)
  time_year <- sample(c(2012, 2013), no_of_recs, replace=T)
  prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2))
  unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
  amount <- unit*prod_table[prod,]$price

  sales <- data.frame(month=time_month,
                      year=time_year,
                      loc=loc,
                      prod=prod,
                      unit=unit,
                      amount=amount)

  # Sort the records by time order
  sales <- sales[order(sales$year, sales$month),]
  row.names(sales) <- NULL
  return(sales)
}

# Now create the sales fact table
sales_fact <- gen_sales(500)

# Look at a few records
head(sales_fact)

  month year loc   prod unit amount
1     1 2012  NY Laptop    1    225
2     1 2012  CA Laptop    2    450
3     1 2012  ON Tablet    2   2240
4     1 2012  NY Tablet    1   1120
5     1 2012  NY Tablet    2   2240
6     1 2012  CA Laptop    1    225


Multi-dimensional Cube

Now, we turn this fact table into a hypercube with multiple dimensions.  Each cell in the cube represents an aggregate value for a unique combination of each dimension.  


 
# Build up a cube
revenue_cube <- 
    tapply(sales_fact$amount, 
           sales_fact[,c("prod", "month", "year", "loc")], 
           FUN=function(x){return(sum(x))})

# Showing the cells of the cude
revenue_cube

, , year = 2012, loc = CA

         month
prod         1    2     3    4    5    6    7    8    9   10   11   12
  Laptop  1350  225   900  675  675   NA  675 1350   NA 1575  900 1350
  Printer   NA 2280    NA   NA 1140  570  570  570   NA  570 1710   NA
  Tablet  2240 4480 12320 3360 2240 4480 3360 3360 5600 2240 2240 3360

, , year = 2013, loc = CA

         month
prod         1    2    3    4    5    6    7    8    9   10   11   12
  Laptop   225  225  450  675  225  900  900  450  675  225  675 1125
  Printer   NA 1140   NA 1140  570   NA   NA  570   NA 1140 1710 1710
  Tablet  3360 3360 1120 4480 2240 1120 7840 3360 3360 1120 5600 4480

, , year = 2012, loc = NY

         month
prod         1     2    3    4    5    6    7    8    9   10   11   12
  Laptop   450   450   NA   NA  675  450  675   NA  225  225   NA  450
  Printer   NA  2280   NA 2850  570   NA   NA 1710 1140   NA  570   NA
  Tablet  3360 13440 2240 2240 2240 5600 5600 3360 4480 3360 4480 3360

, , year = 2013, loc = NY

.....

dimnames(revenue_cube)

$prod
[1] "Laptop"  "Printer" "Tablet" 

$month
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

$year
[1] "2012" "2013"

$loc
[1] "CA" "NY" "ON" "QU" "WA"


OLAP Operations

Here are some common operations of OLAP
  • Slice
  • Dice
  • Rollup
  • Drilldown
  • Pivot
"Slice" is about fixing certain dimensions to analyze the remaining dimensions.  For example, we can focus in the sales happening in "2012", "Jan", or we can focus in the sales happening in "2012", "Jan", "Tablet".

# Slice
# cube data in Jan, 2012
revenue_cube[, "1", "2012",]

         loc
prod        CA   NY   ON   QU   WA
  Laptop  1350  450   NA  225  225
  Printer   NA   NA   NA 1140   NA
  Tablet  2240 3360 5600 1120 2240
 
# cube data in Jan, 2012
revenue_cube["Tablet", "1", "2012",]

  CA   NY   ON   QU   WA 
2240 3360 5600 1120 2240 


 "Dice" is about limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube.  For example, we can focus in sales happening in [Jan/ Feb/Mar, Laptop/Tablet, CA/NY].

revenue_cube[c("Tablet","Laptop"), 
             c("1","2","3"), 
             ,
             c("CA","NY")]


, , year = 2012, loc = CA

        month
prod        1    2     3
  Tablet 2240 4480 12320
  Laptop 1350  225   900

, , year = 2013, loc = CA

        month
prod        1    2    3
  Tablet 3360 3360 1120
  Laptop  225  225  450

, , year = 2012, loc = NY

        month
prod        1     2    3
  Tablet 3360 13440 2240
  Laptop  450   450   NA

, , year = 2013, loc = NY

        month
prod        1    2    3
  Tablet 3360 4480 6720
  Laptop  450   NA  225


"Rollup" is about applying an aggregation function to collapse a number of dimensions.  For example, we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). 

apply(revenue_cube, c("year", "prod"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


      prod
year   Laptop Printer Tablet
  2012  22275   31350 179200
  2013  25200   33060 166880
 


"Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of granularity.  For example, we want to focus in the annual and monthly revenue for each product and collapse the location dimension (ie: we don't care where we sold our product).

apply(revenue_cube, c("year", "month", "prod"), 
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


, , prod = Laptop

      month
year      1    2    3    4    5    6    7    8    9   10   11   12
  2012 2250 2475 1575 1575 2250 1800 1575 1800  900 2250 1350 2475
  2013 2250  900 1575 1575 2250 2475 2025 1800 2025 2250 3825 2250

, , prod = Printer

      month
year      1    2    3    4    5    6    7    8    9   10   11   12
  2012 1140 5700  570 3990 4560 2850 1140 2850 2850 1710 3420  570
  2013 1140 4560 3420 4560 2850 1140  570 3420 1140 3420 3990 2850

, , prod = Tablet

      month
year       1     2     3     4     5     6     7     8     9    10    11    12
  2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200  7840
  2013  8960 11200 10080  7840 14560 10080 29120 15680 15680  8960 12320 22400



"Pivot" is about analyzing the combination of a pair of selected dimensions.  For example, we want to analyze the revenue by year and month.  Or we want to analyze the revenue by product and location.

apply(revenue_cube, c("year", "month"), 
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


      month
year       1     2     3     4     5     6     7     8     9    10    11    12
  2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885
  2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500
 

apply(revenue_cube, c("prod", "loc"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


         loc
prod         CA     NY    ON    QU    WA
  Laptop  16425   9450  7650  7425  6525
  Printer 15390  19950  7980 10830 10260
  Tablet  90720 117600 45920 34720 57120



I hope you can get a taste of the richness of data processing model in R.

However, since R is doing all the processing in RAM.  This requires your data to be small enough so it can fit into the local memory in a single machine.