Monday, August 13, 2012

BIG Data Analytics Pipeline

"Big Data Analytics" has recently been one of the hottest buzzwords.  It is a combination of "Big Data" and "Deep Analysis".  The former is a phenomenon of Web2.0 where a lot of transaction and user activity data has been collected which can be mined for extracting useful information.  The later is about using advanced mathematical/statistical technique to build models from the data.  In reality I've found these two areas are quite different and disjoint and people working in each area have a pretty different background.

Big Data Camp

People working in this camp typically come from Hadoop, PIG/Hive background.  They usually have implemented some domain-specific logic to process large amount of raw data.  Often the logic is relatively straight-forward based on domain-specific business rules.

From my personal experience, most of the people working in big data come from a computer science and distributed parallel processing system background but not from the statistical or mathematical discipline.

Deep Analysis Camp

On the other hand, people working in this camp usually come from statistical and mathematical background which the first thing being taught is how to use sampling to understand a large population's characteristic.  Notice the magic of "sampling" is that the accuracy of estimating the large population depends only in the size of sample but not the actual size of the population.  In their world, there is never a need to process all the data in the population in the first place.  Therefore, Big Data Analytics is unnecessary under this philosophy.

Typical Data Processing Pipeline

Learning from my previous projects, I observe most data processing pipeline fall into the following pattern.

In this model, data is created from the OLTP (On Line Transaction Processing) system, flowing into the BIG Data Analytics system which produced various output; including data mart/cubes for OLAP (On Line Analytic Processing), reports for the consumption of business executives, and predictive models that feedback decision support for OLTP.

Big Data + Deep Analysis

The BIG data analytics box is usually done in a batch fashion (e.g. once a day), usually we see big data processing and deep data analysis happens at different stages of this batch process.

The big data processing part (colored in orange) is usually done using Hadoop/PIG/Hive technology with classical ETL logic implementation.  By leveraging the Map/Reduce model that Hadoop provides, we can linearly scale up the processing by adding more machines into the Hadoop cluster.  Drawing cloud computing resources (e.g. Amazon EMR) is a very common approach to perform this kind of tasks.

The deep analysis part (colored in green) is usually done in R, SPSS, SAS using a much smaller amount of carefully sampled data that fits into a single machine's capacity (usually less than couple hundred thousands data records).  The deep analysis part usually involve data visualization, data preparation, model learning (e.g. Linear regression and regularization, K-nearest-neighbour/Support vector machine/Bayesian network/Neural network, Decision Tree and Ensemble methods), model evaluation.  For those who are interested, please read up my earlier posts on these topics.

Implementation Architecture

There are many possible ways to implement the data pipeline described above.  Here is one common implementation that works well in many projects.

In this architecture, "Flume" is used to move data from OLTP system to Hadoop File System HDFS.  A workflow scheduler (typically a cron-tab entry calling a script) will periodically run to process the data using Map/Reduce.  The data has two portions:  a) Raw transaction data from HDFS  b) Previous model hosted in some NOSQL server.  Finally the "reducer" will update the previous model which will be available to the OLTP system.

For most the big data analytic projects that I got involved, the above architecture works pretty well.  I believe projects requiring real-time feedback loop may see some limitation in this architecture.  Real-time big data analytics is an interesting topic which I am planning to discuss in future posts.

Wednesday, August 8, 2012

Measuring similarity and distance function

Measuring similarity or distance between two data points is very fundamental to many Machine Learning algorithms such as K-Nearest-Neighbor, Clustering ... etc.  Depends on the nature of the data point, various measurement can be used.


Distance between numeric data points

When the dimension of data point is numeric, the general form is called Minkowski distance

( (x1 - x2)p + (y1 - y2)p )1/p

When p = 2, this is equivalent to Euclidean distance.  When p = 1, this is equivalent to Manhattan distance.

This measure is independent of the underlying data distribution.  But what if the value along the x-dimension is much bigger than that from the y-dimension.  So we need to bring all of them into the same scale first.  A common way is to perform a z-transform where each data point first subtract the mean value, and then divide the standard deviation.

(x1, y1) becomes ( (x1μx)/σx , (y1μy)/σy )

This measure, although taking into consideration of the distribution of each dimension, it assumes the dimension are independent of each other.  But what if x-dimension and y-dimension has some correlation.  To consider correlation between different dimensions, we use ...

Mahalanobis distance = (v 1 -  v 2)T.CovMatrix.(v 1 -  v 2)  where v 1  = (x1, y1)

If we care about the direction of the data rather than the magnitude, then cosine distance is a common approach.  It computes the dot product of the two data points divided by the product of their magnitude.  Cosine distance, together with term/document matrix, is commonly used to measure the similarity between documents.


Distance between categorical data points

Since there is no ordering between categorical value, we can only measure whether the categorical value is the same or not.  Basically we are measuring the degree of overlapping of attribute values.  Hamming distance can be used to measure how many attributes need to changed in order to match each other.  We can calculate the ratio to determine how similar (or difference) between two data points using simple matching coefficient:
noOfMatchAttributes / noOfAttributes

However, when the data point contains asymmetric binary data attributes, equality of certain value doesn't mean anything.  For example, lets say the data point represents a user with attributes represent each movie.  The data point contains a high dimensional binary value representing whether the user has seen the movie.  (1 represent yes and 0 represent no).  Given that most users only see a very small portion of all movies, if both user hasn't seen a particular movie (both value is zero), it doesn't indicate any similarity between the user.  On the other hand, if both user saw the same movie (both value is one), it implies a lot of similarity between the user.  In this case, equality of one should carry a much higher weight than equality of zero.  This lead to Jaccard similarity :
noOfOnesInBoth / (noOfOnesInA + noOfOnesInB - noOfOnesInAandB)

Besides matching or not, if category is structured as a Tree hierarchy, then the distance of two category can be quantified by path length of their common parent.  For example, "/product/spot/ballgame/basketball" is closer to "/product/spot/ballgame/soccer/shoes" than "/product/luxury/handbags" because the common parent has a longer path.


Similarity between instances containing mixed types of attributes

When the data point contain a mixed of attributes, we can calculate the similarity of each attribute (or group the attributes of the same type), and then combine them together using some weighted average.

But we have to be careful when treating asymmetric attributes where its presence doesn't mean anything.

combined_similarity(x, y) = Σover_k[wk * δk * similarity(xk, yk)] / Σover_kk)

where Σover_k(wk) = 1

Distance between sequence (String, TimeSeries)

In case each attribute represent an element of a sequence, we need a different way to measure the distance.  For example, lets say each data point is a string (which contains a sequence of characters), then edit distance is a common measurement.  Basically, edit distance is how many "modifications" (which can be insert, modify, delete) is needed to change stringA into stringB.  This is usually calculated by using dynamic programming technique.

Time Series is another example of sequence data.  Similar to the concept of edit distance, Dynamic Time Warp is about distorting the time dimension by adding more data points in both time series such that their square error between corresponding pairs is minimized.  Where to add these data points are solved using a similar dynamic programming technique.  Here is a very good paper that describe the details.


 Distance between nodes in a network

In a homogenous undirected graph (nodes are of the same type), distance between nodes can be measured by the shortest path.

In a bi-partite graph, there are two types of nodes in which each node only connects to the other type.  (e.g. People joining Communities).  Similarity between nodes (of same type) can be measured by analyzing how similar their connected communities are.

SimRank is an iterative algorithm that compute the similarity of each type of nodes by summing the similarity between all pairs of other type of nodes that it has connected, while other type of nodes' similarity is computed in the same way.

We can also use a probabilistic approach such as RandomWalk to determine the similarity.  Each people node will pass a token (label with the people's name) along a randomly picked community node which it is connected to (weighted by the strength of connectivity).  Each community node will propagated back the received token back to a randomly picked people.  Now the people who received the propagated token may drop the token (with a chance beta) or propagated to a randomly chosen community again.  This process continues under all the tokens are die out (since they have a chance of being dropped).  After that, we obtain the trace Matrix and compute the similarity based on the dot product of the tokens it receives.


Distance between population distribution

Instead of measuring distance between individual data points, we can also compare a collection of data points (ie: population) and measure the distance between them.  In fact, one important part of statistics is to measure the distance between two groups of samples and see if the "difference" is significant enough to conclude they are from different populations.

Lets say the population contains members that belongs to different categories and we want to measure if population A and population B have same or different proportions of members across these categories, then we can use Chi-Square or KL-Divergence to measure their distance.

In case every member of the population has two different numeric attributes (e.g. weight and height), and we want to infer one attribute from the other if they are correlated, correlation coefficient is a measure that quantify their degree of correlation; whether these two attributes are moving along the same direction (heavier people are taller), different direction (heavier people are shorter), or independent.  The correlation coefficient ranges from -1 (negatively correlated) to 0 (no correlation) to 1 (positively correlated).

If the two attributes are categorical (rather than numeric), then mutual information is a common way to measure their dependencies and give a good sense of whether knowing the value of one attribute can help inferring the other attribute.

Now if there are two judges who rank a collection of items and we are interested in the degree of agreement of their ranking order.  We can use Spearman's rank coefficient to measure their degree of consensus in the ranking order.