<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7994087232040033267</id><updated>2012-01-26T23:34:25.119-08:00</updated><category term='Bigtable'/><category term='Software as a service'/><category term='Memcached'/><category term='predictive analytics'/><category term='data mining'/><category term='scalability'/><category term='Cache'/><category term='Multi-tenancy'/><category term='stream'/><category term='Coherence'/><category term='CEP'/><category term='SOA'/><category term='Business Intelligence'/><category term='Dryad'/><category term='SaaS'/><category term='NOSQL'/><category term='map reduce'/><category term='Cost optimization'/><category term='Hadoop'/><category term='Network latency'/><category term='HBase'/><category term='Gigaspace'/><category term='machine learning'/><category term='scatter gather'/><category term='Scalable'/><category term='Pregel'/><category term='BSP'/><category term='Cloud computing'/><category term='Cassandra'/><title type='text'>Pragmatic Programming Techniques</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>89</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1516544936509223853</id><published>2012-01-14T07:45:00.000-08:00</published><updated>2012-01-15T16:41:51.068-08:00</updated><title type='text'>Machine Learning: Ensemble Methods</title><content type='html'>The ensemble method is a popular approach in machine learning based on the idea of combining multiple models.  For example, by mixing different machine learning algorithms (e.g. SVM, logistic regression, Bayesian network), an ensemble can automatically favor the algorithmic model that best fits the data.  On the other hand, by mixing different parameter sets of the same algorithmic model (e.g. random forest, boosted trees), it can favor the best set of parameters for that model.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Bagging&lt;/span&gt;&lt;br /&gt;Bagging is based on the idea of learning multiple models from different sets of sample data, obtained by random sampling (with replacement) from the same training set.  After the models are learned, we use a voting scheme to predict future data.  For a classification problem, we take the majority class voted by all models.  For a regression problem, we take the average of the estimated outputs of all models.&lt;br /&gt;&lt;br /&gt;The models don't have to be equally weighted.  We can use a weighted average of the individual models to come up with the final model.  m = w1.m1 + w2.m2 + w3.m3 + ...&lt;br /&gt;&lt;br /&gt;To obtain the weights w1, w2 ... we can again use machine learning.  
In this case, we first use the training data set to train the individual models.  After that, we use the validation data set to train the weights.  Concretely, we feed each data point from the validation set to each model and combine their outputs into the final prediction.  Based on how the ensemble prediction deviates from the actual outcome, we can learn the optimal set of weights.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Boosting&lt;/span&gt;&lt;br /&gt;Boosting extends the idea of bagging by putting more emphasis on the training data that is wrongly predicted.  In boosting, weighted sampling is used.  Initially every training example is equally weighted, but with each iteration the examples that are wrongly classified have their weights increased.  In addition, each model is weighted according to how well it predicts the data in its round.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Each training example carries a sample weight (initially all the same)&lt;/li&gt;&lt;li&gt;In each iteration, draw a sample (with replacement) according to the weights and train a model on it&lt;/li&gt;&lt;li&gt;Compute the weighted error rate e of that model&lt;/li&gt;&lt;li&gt;For each example that is wrongly predicted, multiply its sample weight by (1-e)/e, which is greater than 1 when e is below 0.5, then renormalize the weights&lt;/li&gt;&lt;li&gt;Weight the trained model according to its error.  For example, log((1-e)/e)&lt;/li&gt;&lt;li&gt;Stop when e is small enough&lt;/li&gt;&lt;li&gt;Combine the models into the overall ensemble according to their model weights.&lt;/li&gt;&lt;/ol&gt;Gradient Boosting Method is one of the most powerful and popular boosting methods.  It is based on incrementally adding a function that fits the residuals.&lt;br /&gt;&lt;br /&gt;Find a function F to fit y ~ F(x) using a gradient descent approach.&lt;br /&gt;&lt;br /&gt;Start with some initial guess of the function F.  Compute the loss based on the residual. &lt;br /&gt;loss = L(y - F(x)) where L is some loss function&lt;br /&gt;&lt;br /&gt;Take the partial derivative of the loss with respect to the function F, and compute its value at every data point.  
Use another machine learning model to learn another function g(x) that predicts the negative of that partial derivative.  Then update F(x) &amp;lt;- F(x) + a.g(x) where a is the learning rate.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Other ways of Sampling&lt;/span&gt;&lt;br /&gt;Instead of sampling the training data, we can also sample the attributes (e.g. randomly pick a subset of the input variables).  Random Forest is an example of this approach.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Sliding window&lt;/span&gt;&lt;br /&gt;Machine learning rests on the important assumption that the future repeats the patterns of the past.  In practice, the older the data, the less well this assumption holds.  In other words, we should put more weight on recent history than on long-term history.&lt;br /&gt;&lt;br /&gt;Ensemble methods also give us an elegant way to decay the weight of old data.  In this case, we maintain a sliding window of models (say the last 7 days).  We learn a new model daily and expire the oldest model, which was computed 7 days ago.  With M1 as the most recent model and a decay factor a between 0 and 1, the combined model is&lt;br /&gt;&lt;br /&gt;M = M1 + M2.a + M3.a^2 + ... + M7.a^6&lt;br /&gt;M = M / (1 + a + a^2 + ... 
+ a^6)</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1516544936509223853/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1516544936509223853' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1516544936509223853'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1516544936509223853'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2012/01/machine-learning-ensemble-methods.html' title='Machine Learning: Ensemble Methods'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-5274453680574951921</id><published>2011-09-01T12:59:00.000-07:00</published><updated>2011-09-04T16:35:42.168-07:00</updated><title type='text'>Recommendation Engine</title><content type='html'>&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;In a classical model of a recommendation system, there are "users" and "items".  Each user has associated metadata (or content) such as age, gender, race and other demographic information.  Each item also has metadata, such as text description, price, weight, etc.  
On top of that, there are interactions (or transactions) between users and items, such as userA downloading or purchasing movieB, userX giving a rating of 5 to productY, etc.&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;a href="http://1.bp.blogspot.com/-5jJbvLcccrM/Tl_mi56QEtI/AAAAAAAAAkU/sHFcM2rT1Qk/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 152px;" src="http://1.bp.blogspot.com/-5jJbvLcccrM/Tl_mi56QEtI/AAAAAAAAAkU/sHFcM2rT1Qk/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5647485945080976082" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Now given all the metadata of users and items, as well as their interactions over time, can we answer the following questions ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;What is the probability that userX purchases itemY ?&lt;/li&gt;&lt;li&gt;What rating will userX give to itemY ?&lt;/li&gt;&lt;li&gt;What are the top k unseen items that should be recommended to userX ?&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Content-based Approach&lt;/span&gt;&lt;br /&gt;In this approach, we make use of the metadata to categorize users and items and then match them at the category level.  One example is recommending jobs to candidates: we can do an IR/text search to match the user's resume with the job descriptions.  Another example is recommending an item that is "similar" to one that the user has purchased.  Similarity is measured according to the items' metadata, and various distance functions can be used.  The goal is to find the k nearest neighbors of the item we know the user likes.
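&lt;br /&gt;&lt;br /&gt;As a rough illustration of the nearest-neighbor idea above, here is a minimal Python sketch.  It assumes each item's metadata has already been encoded as a numeric feature vector and uses cosine similarity as the similarity measure; the names cosine_similarity, k_nearest_items and ITEM_FEATURES, as well as the feature values, are hypothetical and not taken from any particular library.&lt;br /&gt;
&lt;pre&gt;
```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def k_nearest_items(liked_item, item_features, k):
    """Return the ids of the k items most similar to liked_item, best first."""
    target = item_features[liked_item]
    scored = [
        (cosine_similarity(target, features), item_id)
        for item_id, features in item_features.items()
        if item_id != liked_item
    ]
    # Highest similarity first; ties are broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:k]]


# Hypothetical metadata vectors: [action, comedy, romance]
ITEM_FEATURES = {
    "movieA": [1.0, 0.0, 0.0],
    "movieB": [0.9, 0.1, 0.0],
    "movieC": [0.0, 1.0, 0.2],
    "movieD": [0.0, 0.1, 1.0],
}
```
&lt;/pre&gt;
With these made-up vectors, k_nearest_items("movieA", ITEM_FEATURES, 1) picks movieB, the item whose metadata is closest to movieA.  Which features and which distance function to use depends on the application.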
&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Collaborative Filtering Approach&lt;/span&gt;&lt;br /&gt;In this approach, we look purely at the interactions between user and item, and use that to perform our recommendation.  The interaction data can be represented as a matrix.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-g4Kek53agA8/Tl_s7r5mmMI/AAAAAAAAAkk/q43IJZ9mIsQ/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 114px;" src="http://2.bp.blogspot.com/-g4Kek53agA8/Tl_s7r5mmMI/AAAAAAAAAkk/q43IJZ9mIsQ/s320/P1.png" alt="" id="BLOGGER_PHOTO_ID_5647492967886657730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that each cell represents the interaction between user and item.  For example, the cell can contain the rating that user gives to the item (in the case the cell is a numeric value), or the cell can be just a binary value indicating whether the interaction between user and item has happened.  (e.g. a "1" if userX has purchased itemY, and "0" otherwise.&lt;br /&gt;&lt;br /&gt;The matrix is also extremely sparse, meaning that most of the cells are unfilled.  We need to be careful about how we treat these unfilled cells, there are 2 common ways ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Treat these unknown cells as "0".  Make them equivalent to user giving a rate "0".  This may or may not be a good idea depends on your application scenarios.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Guess what the missing value should be.  For example, to guess what userX will rate itemA given we know his has rate on itemB, we can look at all users (or those who is in the same age group of userX) who has rate both itemA and itemB, then compute an average rating from them.  Use the average rating of itemA and itemB to interpolate userX's rating on itemA given his rating on itemB.
&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;User-based Collaborative Filtering&lt;br /&gt;&lt;/span&gt;In this model, we do the following&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Find a group of users that is “similar” to user X&lt;/li&gt;&lt;li&gt;Find all movies liked by this group that haven’t been seen by user X&lt;/li&gt;&lt;li&gt;Rank these movies and recommend them to user X&lt;/li&gt;&lt;/ol&gt;&lt;a href="http://4.bp.blogspot.com/-06YIrYjJ1m4/TmBnFznqoxI/AAAAAAAAAks/HD2vhDWetdg/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 151px;" src="http://4.bp.blogspot.com/-06YIrYjJ1m4/TmBnFznqoxI/AAAAAAAAAks/HD2vhDWetdg/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5647627282176189202" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This introduces the concept of user-to-user similarity, which is basically the similarity between 2 row vectors of the user/item matrix.   To compute the K nearest neighbors of a particular user, a naive implementation is to compute the "similarity" to all other users and pick the top K.&lt;br /&gt;&lt;br /&gt;Different similarity functions can be used.  Jaccard similarity is defined as the number of movies that both users have seen (the intersection) divided by the number of movies that either of them has seen (the union).   Pearson similarity first normalizes each user's ratings (by subtracting the user's mean rating) and then computes the cosine similarity.
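&lt;br /&gt;&lt;br /&gt;The two similarity functions can be sketched in a few lines of Python.  This is only an illustration under some assumptions: each user is represented either as a set of movie ids (for Jaccard) or as a dict mapping movie id to rating (for Pearson), and the Pearson version centers the ratings over the movies both users have rated, which is one common choice among several.  The function names are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
import math


def jaccard_similarity(seen_a, seen_b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(seen_a), set(seen_b)
    union = a.union(b)
    if not union:
        return 0.0  # two empty sets: define the similarity as 0
    return len(a.intersection(b)) / len(union)


def pearson_similarity(ratings_a, ratings_b):
    """Mean-center both users' ratings on co-rated movies, then take the cosine.

    Assumption: the mean is computed over the co-rated movies only.
    """
    common = sorted(set(ratings_a).intersection(ratings_b))
    if not common:
        return 0.0
    mean_a = sum(ratings_a[m] for m in common) / len(common)
    mean_b = sum(ratings_b[m] for m in common) / len(common)
    dev_a = [ratings_a[m] - mean_a for m in common]
    dev_b = [ratings_b[m] - mean_b for m in common]
    dot = sum(x * y for x, y in zip(dev_a, dev_b))
    norm = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(sum(y * y for y in dev_b))
    if norm == 0:
        return 0.0  # one user rated every co-rated movie identically
    return dot / norm
```
&lt;/pre&gt;
For instance, two users who have seen {m1, m2, m3} and {m2, m3, m4} share 2 of 4 distinct movies, so their Jaccard similarity is 0.5.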
&lt;br /&gt;&lt;br /&gt;There are two problems with this approach&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Compare userX and userY is expensive as they have millions of attributes&lt;/li&gt;&lt;li&gt;Find top k similar users to userX require computing all pairs of userX and userY&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;Location Sensitive Hashing and Minhash&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;To resolve problem 1, we approximate the similarity using a cheap estimation function, called minhash.  The idea is to find a hash function h() such that the probability of h(userX) = h(userY) is proportion to the similarity of userX and userY.  And if we can find 100 of h() function, we can just count the number of such function where h(userX) = h(userY) to determine how similar userX is to userY.  The idea is depicted as follows ...&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-NvaT7CYf9dk/TmHEHmZwuYI/AAAAAAAAAlc/y9CpjWVepIQ/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 344px;" src="http://2.bp.blogspot.com/-NvaT7CYf9dk/TmHEHmZwuYI/AAAAAAAAAlc/y9CpjWVepIQ/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5648011042546039170" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;It will be expensive to permute the rows if the number of rows is large.  Remember that the purpose of h(c1) is to return row number of the first row that is 1.  So we can scan each row of c1 to see if it is 1, if so we apply a function newRowNum = hash(rowNum) to simulate a permutation.  Take the minimum of the newRowNum seen so far.
&lt;br /&gt;&lt;br /&gt;As an optimization, instead of doing one column at a time, we can do it a row at the time, the algorithm is as follows&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/-6lGM-gmXbRo/TmG4MBZNoDI/AAAAAAAAAlM/pekmpz2n2RI/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 275px; height: 320px;" src="http://1.bp.blogspot.com/-6lGM-gmXbRo/TmG4MBZNoDI/AAAAAAAAAlM/pekmpz2n2RI/s320/P1.png" alt="" id="BLOGGER_PHOTO_ID_5647997924371439666" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;To solve problem 2, we need to avoid computing all other users' similarity to userX.  The idea is to hash users into buckets such that similar users will be fall into the same bucket.  Therefore, instead of computing all users, we only compute the similarity of those users who is in the same bucket of userX.&lt;br /&gt;&lt;br /&gt;The idea is to horizontally partition the column into b bands, each with r rows.   By pick the parameter b and r, we can control the likelihood (function of similarity) that they will fall into the same bucket in at least one band.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-EwZkU5S95HU/TmHC_xlsGhI/AAAAAAAAAlU/f7rt6A0iLl0/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 193px;" src="http://3.bp.blogspot.com/-EwZkU5S95HU/TmHC_xlsGhI/AAAAAAAAAlU/f7rt6A0iLl0/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5648009808598276626" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Item-based Collaboration Filter&lt;/span&gt;&lt;br /&gt;If we transpose the user/item matrix and do the same thing, we can compute the item to item similarity.  In this model, we do the following ...
&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Find the set of movies that user X likes (from interaction data)&lt;/li&gt;&lt;li&gt;Find a group of movies that are similar to the set of movies that we know user X likes&lt;/li&gt;&lt;li&gt;Rank these movies and recommend them to user X&lt;/li&gt;&lt;/ol&gt;&lt;a href="http://2.bp.blogspot.com/-YEEM5PYuTAI/TmHGPmDYICI/AAAAAAAAAls/8VIm0-7PyYM/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 200px; height: 125px;" src="http://2.bp.blogspot.com/-YEEM5PYuTAI/TmHGPmDYICI/AAAAAAAAAls/8VIm0-7PyYM/s200/P1.png" alt="" id="BLOGGER_PHOTO_ID_5648013378914361378" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;It turns out that item-based collaborative filtering has more benefits than computing user-to-user similarity, for the following reasons ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The number of items is typically smaller than the number of users&lt;/li&gt;&lt;li&gt;While a user's taste changes over time, so the user similarity matrix needs to be updated more frequently, item-to-item similarity tends to be more stable and requires fewer updates.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Singular Value Decomposition&lt;br /&gt;&lt;/span&gt;If we look back at the matrix, we can see that matrix multiplication is equivalent to mapping an item from the item space to the user space.  In other words, if we view each of the existing items as an axis in the user space (notice that each user is a vector of their ratings on existing items), then multiplying a new item by the matrix gives a vector of the same form as a user.  We can then compute the dot product of this projected new item with a user to determine their similarity.  It turns out that this is equivalent to mapping the user to the item space and computing the dot product there.
&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-y9btoLjP_iM/TmJQUkWh6cI/AAAAAAAAAl0/gC-yWmMYTmU/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 274px;" src="http://4.bp.blogspot.com/-y9btoLjP_iM/TmJQUkWh6cI/AAAAAAAAAl0/gC-yWmMYTmU/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5648165196961802690" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In other words, multiply the matrix is equivalent to mapping between item space and user space. Now lets imagine there is a hidden concept space in between.  Instead of jumping directly from user space to item space, we can think of jumping from user space to a concept space, and then to the item space.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-aewfHUe898c/TmJQ7T3SwTI/AAAAAAAAAl8/EyXh1KEKMTg/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 270px;" src="http://4.bp.blogspot.com/-aewfHUe898c/TmJQ7T3SwTI/AAAAAAAAAl8/EyXh1KEKMTg/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5648165862550716722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that here we first map the user space to the concept space and also map the item space to the concept space.  Then we match both user and item at the concept space.  This is a generalization of our recommender.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can use SVD to factor the matrix into 2 parts.  Let P be the m by n matrix (m rows and n columns).  P = UDV where U is an m by m matrix, each column represents the eigenvectors of P*transpose(P).  And V is an n by n matrix with each row represents the eigenvector of transpose(P)*P.  
D is an m by n diagonal matrix containing the singular values of P, which are the square roots of the non-zero eigenvalues of P*transpose(P) (or, equivalently, of transpose(P)*P).&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In other words, we can decompose P into U*squareroot(D) and squareroot(D)*V.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Notice that D can be thought of as the strength of each "concept" in the concept space.  The values are ordered by decreasing magnitude.  If we remove some of the weakest concepts by setting them to zero, we reduce the number of non-zero elements in D, which effectively generalizes the concept space (making it focus on the important concepts).&lt;br /&gt;&lt;div&gt;&lt;br /&gt;Computing the SVD of a matrix with large dimensions is expensive.  Fortunately, if our goal is to compute an SVD approximation (with k non-zero diagonal values), we can use the &lt;a href="http://www.stanford.edu/group/mmds/slides2010/Martinsson.pdf"&gt;random projection mechanism as described here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;Association Rule Based&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;In this model, we use the market-basket association rule algorithm to discover rules like ...&lt;/div&gt;&lt;div&gt;{item1, item2} =&amp;gt; {item3, item4, item5}&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We represent each user as a basket and each viewing as an item (notice that we ignore the rating and use a binary value).  After that, we use an association rule mining algorithm to detect frequent item sets and the association rules.  
Then for each user, we match the items the user has previously viewed against the set of rules to determine what other movies we should recommend.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;Evaluate the recommender&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;After we have a recommender, how do we evaluate its performance ?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The basic idea is to separate the data into a training set and a test set.  For the test set, we remove certain user-to-movie interactions (change certain cells from 1 to 0), pretending the user hasn't seen the item.  Then we use the training set to train a recommender and feed the test set (with the removed interactions) to the recommender.  The performance is measured by how much the recommended items overlap with the ones that we have removed.  In other words, a good recommender should be able to recover the set of items that we have removed from the test set.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;Leverage tagging information on items&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;In some cases, items have explicit tags associated with them (we can consider the tags a user-annotated concept space added to the items).  Consider each item as described by a vector of tags.  Users can then be auto-tagged based on the items they have interacted with.  For example, if userX purchases itemY, which is tagged with Z1 and Z2, then the weights of Z1 and Z2 in her existing tag vector are increased.  
We can use a time decay mechanism to update the user's tag vector as follows ...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;current_user_tag = alpha * item_tag + (1 - alpha) * prev_user_tag&lt;br /&gt;&lt;br /&gt;To recommend items to the user, we simply rank the items by the dot product of the user tag vector and the item tag vector (which equals the cosine similarity when both vectors are normalized to unit length) and take the top k.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/5274453680574951921/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=5274453680574951921' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5274453680574951921'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5274453680574951921'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2011/09/recommendation-engine.html' title='Recommendation Engine'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-5jJbvLcccrM/Tl_mi56QEtI/AAAAAAAAAkU/sHFcM2rT1Qk/s72-c/P1.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-5328332568332849178</id><published>2011-08-28T21:37:00.000-07:00</published><updated>2011-08-28T22:14:00.824-07:00</updated><title type='text'>Scale Independently in the Cloud</title><content type='html'>Deploying a large-scale system nowadays is quite different from the days when the data center was the only choice.  A traditional deployment typically involves an intensive performance modeling exercise to accurately predict the resource requirements of the production system.  The accuracy is very important because it is expensive and slow to make changes after deployment.&lt;br /&gt;&lt;br /&gt;This performance modeling typically involves the following steps.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Build a graph model based on the component interactions.&lt;/li&gt;&lt;li&gt;Express the mathematical relationship between the input traffic, the resource consumption at the processing node (CPU and memory, based on the processing algorithm), and the output traffic (which will become the input of downstream processing nodes)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Model the external workload as a random variable (with a workload distribution function)&lt;/li&gt;&lt;li&gt;Run a simulation exercise to compute the corresponding workload distribution function for each link and node.  The workload units include CPU, memory and network requirements (latency and bandwidth).&lt;/li&gt;&lt;li&gt;Based on business requirements, pick a peak external load target (say 95%).  Vary the external workload from 0 to the max workload and compute the corresponding range of workload at each node and link in the graph.&lt;/li&gt;&lt;li&gt;The max CPU, memory and I/O of each node define the capacity that needs to be provisioned for that node.  
The max value of each link defines the network bandwidth / latency requirement of that link&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-eMBozepwj94/TlsapTefHLI/AAAAAAAAAkE/jZVl_c5RFro/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 108px;" src="http://2.bp.blogspot.com/-eMBozepwj94/TlsapTefHLI/AAAAAAAAAkE/jZVl_c5RFro/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5646135854744149170" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that the resources are typically provisioned for the peak load target, which means they are idle most of the time, hurting the efficiency of the overall system.  On the other hand, SaaS-based systems introduce a more dynamic relationship (anyone can call anyone) between components, which makes this traditional way of performance modeling more challenging.  The performance modeling exercise needs to be conducted whenever new clients or new services are introduced into the system, resulting in a non-trivial ongoing maintenance cost.&lt;br /&gt;&lt;br /&gt;Thanks to cloud computing, the underlying dynamics and economics have shifted significantly over the last few years, and capacity planning is now quite different from before.&lt;br /&gt;&lt;br /&gt;First of all, making a wrong capacity estimate is less costly when deploying additional resources takes minutes rather than months.   Instead of attempting to construct the full picture of the system, the cloud practice is to focus on each individual component and make sure each can "scale independently".  The steps are as follows ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Each component scales independently using horizontal scaling.  ie:  f(a.x) = a.f(x)&lt;/li&gt;&lt;li&gt;Instead of establishing a formal mathematical model, just deploy the system in the cloud, adjust the input workload and measure the utilization at each node and link (e.g. 
AWS CloudWatch)&lt;/li&gt;&lt;li&gt;Based on the utilization measurements, define the initial deployment capacity based on average load (not peak load).&lt;/li&gt;&lt;li&gt;Use auto-scaling to adjust the pool size of independent components according to runtime workload.&lt;/li&gt;&lt;li&gt;Sync workload is typically fronted by a load balancer.   Async workload is fronted by scalable queues.   Output can be a callout, stored in a queue, or stored in scalable storage &lt;/li&gt;&lt;/ol&gt;&lt;a href="http://4.bp.blogspot.com/-VhH_yZzf4Hs/TlsfEEBG9DI/AAAAAAAAAkM/Rpkj1ElghnQ/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 157px;" src="http://4.bp.blogspot.com/-VhH_yZzf4Hs/TlsfEEBG9DI/AAAAAAAAAkM/Rpkj1ElghnQ/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5646140712497378354" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;By focusing on "scale independently", each component can plug and play with other components much more easily, because the components make fewer assumptions about each other when each one can dynamically adjust its capacity according to run-time need.  This results in a system that is not only more scalable but also more flexible.
&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/5328332568332849178/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=5328332568332849178' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5328332568332849178'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5328332568332849178'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2011/08/scale-independently-in-cloud.html' title='Scale Independently in the Cloud'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-eMBozepwj94/TlsapTefHLI/AAAAAAAAAkE/jZVl_c5RFro/s72-c/P1.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-2924746439738554798</id><published>2011-07-09T16:35:00.000-07:00</published><updated>2011-07-09T18:05:04.899-07:00</updated><title type='text'>Fraud Detection Methods</title><content type='html'>Online electronic fraud has become increasingly problematic for many companies offering services on the web.  
Here I am trying to generalize a set of techniques that I have found useful in the past.&lt;br /&gt;&lt;br /&gt;To be effective in combating fraud, the first thing a company needs is an overall top-down strategy for dealing with it, including ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Have a clearly defined security objective, a good understanding of the fraudsters' motivation, as well as the consequences of fraud.&lt;/li&gt;&lt;li&gt;Have an effective analytic method in place to detect fraud immediately when it happens.&lt;/li&gt;&lt;li&gt;Have a responsive handling process in place to react immediately after fraud is detected.&lt;/li&gt;&lt;li&gt;Have a preventive process in place to feed newly discovered fraud patterns back into the system.&lt;/li&gt;&lt;/ol&gt;In the following discussion I will focus on the technical side of the analytic methods, but I want to reiterate that the process side is equally (or even more) important for the whole effort of combating fraud to be effective.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Setting Objectives and Targets&lt;/span&gt;&lt;br /&gt;Setting the objectives upfront is very important for guiding the subsequent design of the technical mechanism, especially when making tradeoff decisions between false positives and false negatives. A high false negative rate means fraud goes through undetected, while a high false positive rate inconveniences your existing customers and causes an unnecessarily large manual investigation effort.&lt;br /&gt;&lt;br /&gt;From another angle, some companies look at fraud detection methods as a way to optimize the use of existing resources for manual investigation, which is usually the last resort for handling fraud. These companies usually have a fixed-size team of fraud investigators.  
If these people spend too much time on legitimate transactions, there will be less time left to investigate the real fraudulent transactions. Therefore, the analytical methods aim at guiding the manual investigation effort to those transactions with a higher chance of fraud.&lt;br /&gt;&lt;br /&gt;Notice that fraud detection is a game of continuous improvement. At each iteration, there is a baseline (usually the current best method) and an improvement threshold. The method at each iteration is supposed to beat the baseline by at least the improvement threshold. In the first iteration, the baseline can be very low (e.g. a simple random guess). At each iteration, the baseline is raised until the company's objectives and targets have been satisfactorily met.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Instrumenting Analytical Methods&lt;/span&gt;&lt;br /&gt;Depending on the nature of the business and the motivation of the fraudsters, the characteristics of fraud can be very different. It is very important to understand them before designing the best mechanism to combat them.&lt;br /&gt;&lt;br /&gt;Here is a high level decision process to determine the correct method&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-Qv9-KM_kVws/Thjq1yuCtmI/AAAAAAAAAj8/AU0w-Kg-jcA/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 381px;" src="http://4.bp.blogspot.com/-Qv9-KM_kVws/Thjq1yuCtmI/AAAAAAAAAj8/AU0w-Kg-jcA/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5627505944268289634" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;a) Rule-based approach&lt;/span&gt;&lt;br /&gt;Sometimes the attack pattern is well-defined (e.g. fraudulent credit card transactions tend to have a higher-than-usual spending amount as well as a higher-than-usual transaction rate). Such attack patterns can usually be extracted from domain experts in the business.  
The best way to implement a solution in this case is to encode such knowledge as rules, or even hard-wire it into the application code for efficiency reasons.&lt;br /&gt;&lt;br /&gt;Notice that rules need to be maintained as new attack patterns are discovered or old attack patterns become obsolete. A rule engine is a pretty common way to keep such domain knowledge in a declarative form so it can be easily maintained.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;b) Classification approach&lt;/span&gt;&lt;br /&gt;If we have training examples for both the normal case and the fraud case, classification methods (based on machine learning) can perform very well. Such analytic methods include logistic regression, decision trees (random forest), support vector machines, Bayesian networks (naive Bayes), neural networks, etc.&lt;br /&gt;&lt;br /&gt;To compare the performance of different classification methods, a confusion matrix is commonly used. It is a 2 by 2 matrix counting the true positives, false positives, true negatives and false negatives. Based on the costs associated with false positives and false negatives, we can pick the method (or ensemble of multiple methods) that achieves the minimal cost.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;c) One-Class Model approach&lt;/span&gt;&lt;br /&gt;If we only have training examples for normal cases but no fraud examples, we can still learn a model from the normal data and then compute the distance between the transaction data and the model we learned. We flag the transaction as fraud if the distance exceeds a domain-specific threshold. Here the distance function between the model and a data point needs to be defined. Commonly used choices include statistical methods, where the model is the mean and standard deviation of the normal data and the p-value serves as the distance function.  
On the other hand, Euclidean distance, Jaccard distance and cosine distance are also commonly used.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;d) Density based methods and clustering methods&lt;/span&gt;&lt;br /&gt;If we know nothing about the fraud patterns and don't even have training examples for normal cases, then we can make some assumptions about the distribution of the data, such as fraud data being less dense than normal data. In other words, a fraudulent transaction will have fewer neighbors within a certain radius. If this assumption is reasonable, then we can use density-based methods to predict fraudulent transactions, for example by counting the number of neighbors within radius r, or measuring the distance to the kth nearest neighbour. We can also use a clustering method to learn clusters and flag transactions that are too distant from their cluster center as fraud.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Determining input signals&lt;/span&gt;&lt;br /&gt;In my experience, determining the right signal is the most important part of the whole process. Sometimes we use raw input attributes as the signal, while other times we need to combine multiple attributes to provide the signal.&lt;br /&gt;&lt;br /&gt;For example, as we take raw measurements at different points in time, the input signal may involve computing the rate of change of these raw measurements over time. In other words, it is not adequate to just look at each data point in isolation; we need to aggregate raw measurements in a domain specific way.&lt;br /&gt;&lt;br /&gt;In my past experience, a large portion of fraud detection cases are about how to deal with account takeover transactions (stolen identities and impersonation). Usually, detecting a sudden change of behavior (e.g. 
change point detection) is an effective approach to deal with this kind of fraud.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Time dimension&lt;/span&gt;&lt;br /&gt;Instead of looking at each fraud in isolation, in many cases we need to look at the "context" under which fraud is evaluated. As discussed above for detecting a sudden change of behavior, it is quite common to use the past data of a user to build a normal model and evaluate the recent transactions against it to determine whether they are fraudulent. In other words, we compare his/her current behavior with the past.&lt;br /&gt;&lt;br /&gt;Besides the "time dimension", we can look into other contexts as well. For example, we can look at the behavior of the user's peer group, treating the deviation of one person's behavior from that of the peer group as an indication of a stolen identity.&lt;br /&gt;&lt;br /&gt;Notice that the normal pattern may also evolve over time, although we usually don't expect such change to be sudden or rapid. To cater for such slow drift, the normal model needs to be continuously adjusted as well. A pretty common technique is to compute a long-term behavioral signature based on a longer time span of transactional data (e.g. 6 months) and a short-term behavioral signature based on a shorter time span of data. Then the short-term signature is compared with the long-term signature using a distance function, and fraud is flagged if the distance exceeds a pre-defined threshold. It is also important to have an incremental update mechanism for the long-term signature rather than recomputing it from scratch at every update. A common approach is to use an exponential time-decay function such as ...&lt;br /&gt;M[t+1] = a * M[t] + (1 - a) * S[t]
&lt;br /&gt;where 0 &amp;lt; a &amp;lt; 1&lt;br /&gt;M[t] is the model at time t&lt;br /&gt;S[t] is the transaction at time t&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The importance of Domain Experts&lt;/span&gt;&lt;br /&gt;Sophisticated machine learning algorithms are pretty powerful in providing a generalized solution for a broad range of problems. However, in my past experience, I have yet to see many cases where a sophisticated machine learning algorithm can beat domain expertise. In many projects, a simple algorithm with deep domain expertise outperforms sophisticated analytical methods significantly. Therefore, the common pattern that I recommend is to build a rule-based solution at the core and augment it with machine learning analytical methods.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-2924746439738554798?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/2924746439738554798/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=2924746439738554798' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2924746439738554798'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2924746439738554798'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2011/07/fraud-detection-methods.html' title='Fraud Detection Methods'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-Qv9-KM_kVws/Thjq1yuCtmI/AAAAAAAAAj8/AU0w-Kg-jcA/s72-c/P1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-2400415893429084182</id><published>2011-04-21T22:29:00.000-07:00</published><updated>2011-04-22T16:04:48.872-07:00</updated><title type='text'>K-Means Clustering in Map Reduce</title><content type='html'>Unsupervised machine learning has broad application in many e-commerce sites, and one common usage is to find clusters of consumers with common behaviors. Among clustering methods, K-means is the most basic and also an efficient one.&lt;br /&gt;&lt;br /&gt;K-Means clustering involves the following logical steps&lt;br /&gt;&lt;br /&gt;1) Determine the value of k&lt;br /&gt;2) Determine the initial k centroids&lt;br /&gt;3) Repeat until convergence&lt;br /&gt;- Determine membership:  Assign each point to the closest centroid&lt;br /&gt;- Update centroid position:  Compute new centroid position from assigned members&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Determine the value of K&lt;/span&gt;&lt;br /&gt;This is basically asking the question: "How many clusters are you interested in discovering?"&lt;br /&gt;So the answer is specific to the problem domain.&lt;br /&gt;&lt;br /&gt;One way is to try different values of K. At some point, we'll see that increasing K doesn't help much to improve the overall quality of the clustering. That is then the right value of K.&lt;br /&gt;&lt;br /&gt;Here the overall quality of the clustering is measured by the average distance from each data point to the centroid of its assigned cluster (the smaller the better).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Determine the initial K centroids&lt;/span&gt;&lt;br /&gt;We need to pick K centroids to start the algorithm.  
One way to pick them is to randomly choose K points from the whole data set.&lt;br /&gt;&lt;br /&gt;However, picking a good set of centroids can reduce the number of subsequent iterations. By "good" I mean the K centroids should be as far apart from each other as possible, or even better, the initial K centroids should be close to the final K centroids. As you can see, choosing K random points is reasonable but not optimal.&lt;br /&gt;&lt;br /&gt;Another approach is to take a small random sample from the input data set and do hierarchical clustering within this smaller set to obtain the initial centroids (note that hierarchical clustering does not scale to large data sets).&lt;br /&gt;&lt;br /&gt;We can also partition the space into overlapping regions using the canopy clustering technique (described below) and pick the center of each canopy as an initial centroid.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Iteration&lt;/span&gt;&lt;br /&gt;Each iteration is implemented as a Map/Reduce job.  
First of all, we need a control program on the client side to initialize the centroid positions, kick off the iterations of Map/Reduce jobs and determine when the iteration should end ...&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"&gt;&lt;code&gt;kmeans(data) {&lt;br /&gt;  initial_centroids = pick(k, data)&lt;br /&gt;  upload(data)&lt;br /&gt;  writeToS3(initial_centroids)&lt;br /&gt;  old_centroids = initial_centroids&lt;br /&gt;  while (true){&lt;br /&gt;    map_reduce()&lt;br /&gt;    new_centroids = readFromS3()&lt;br /&gt;    if change(new_centroids, old_centroids) &amp;lt; delta {&lt;br /&gt;      break&lt;br /&gt;    } else {&lt;br /&gt;      old_centroids = new_centroids&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;  result = readFromS3()&lt;br /&gt;  return result&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Within each iteration, most of the processing is done in the Map task, which determines the membership of each point and computes a partial sum of the member points of each cluster.&lt;br /&gt;&lt;br /&gt;The reducer does the easy job of aggregating all the partial sums to compute the updated centroid positions, and then writes them into a shared store (S3 in this case) where they can be picked up by the Map/Reduce job of the next round.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-cAYr3kBEFlQ/TbEopL57W_I/AAAAAAAAAjY/MYYMGBJ6UYI/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 279px;" src="http://3.bp.blogspot.com/-cAYr3kBEFlQ/TbEopL57W_I/AAAAAAAAAjY/MYYMGBJ6UYI/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5598300499833740274" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" 
&gt;Complexity Analysis&lt;/span&gt;&lt;br /&gt;Most of the work is done by the Mapper and the workload is pretty balanced. So the time complexity of each iteration will be O(k*n/p), where k is the number of clusters, n is the number of data points and p is the number of machines. Note that the factor of k comes from the closest_centroid() function, which compares each data point with each intermediate centroid as follows ...&lt;br /&gt;&lt;pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"&gt;&lt;code&gt;closest_centroid(point, listOfCentroids) {&lt;br /&gt;  bestCentroid = listOfCentroids[0]&lt;br /&gt;  minDistance = INFINITY&lt;br /&gt;  for each centroid in listOfCentroids {&lt;br /&gt;    distance = dist(point, centroid)&lt;br /&gt;    if distance &amp;lt; minDistance {&lt;br /&gt;      minDistance = distance&lt;br /&gt;      bestCentroid = centroid&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;  return bestCentroid&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;If we partition the space into proximity regions, we only need to compare each point with the centroids within the same proximity region and can treat all other centroids as being at infinite distance. In other words, we don't have to compare each point with all k centroids.&lt;br /&gt;&lt;br /&gt;Canopy clustering provides such a partitioning mechanism.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Canopy Clustering&lt;/span&gt;&lt;br /&gt;To define a proximity region (canopy), we can draw a circle (or hypersphere) centered at a data point. Points outside this sphere are considered to be too far away.&lt;br /&gt;&lt;br /&gt;However, if we apply this definition to every point, then we will have as many proximity regions as there are points, which ends up not saving much processing.  
We also observe that points which are very close to each other can stay in the same region without each of them creating its own. Therefore, we can draw a smaller circle (radius T2) within the big circle (radius T1, with the same center) such that data points within the small circle are not allowed to form their own proximity regions.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-CPemjTn74Lw/TbGlvWyRe3I/AAAAAAAAAjg/KEqmSN6v594/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 198px;" src="http://2.bp.blogspot.com/-CPemjTn74Lw/TbGlvWyRe3I/AAAAAAAAAjg/KEqmSN6v594/s320/P1.png" alt="" id="BLOGGER_PHOTO_ID_5598438044787112818" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that proximity regions can overlap with each other, and the degree of overlap is affected by the choice of T1. The choice of T2 affects how many canopies will be formed. Picking the right values of T1 and T2 is domain-specific, and also depends on the number of clusters and the volume of the space. If there is a small number of clusters within a big space, then a bigger T1 should be chosen.&lt;br /&gt;&lt;br /&gt;To create the canopies (and mark the data points with the canopies), we do the following steps ...&lt;br /&gt;1) Create the canopy centers, with one scan&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Keep a list of canopies, initially an empty list&lt;/li&gt;&lt;li&gt;Scan each data point. If it is within T2 distance of an existing canopy, discard it.  
Otherwise, add this point to the list of canopies&lt;/li&gt;&lt;/ul&gt;&lt;a href="http://2.bp.blogspot.com/-S9Ty4WNwk5I/TbIBNL-d2_I/AAAAAAAAAjo/MOPga4NmfDA/s1600/P1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 330px;" src="http://2.bp.blogspot.com/-S9Ty4WNwk5I/TbIBNL-d2_I/AAAAAAAAAjo/MOPga4NmfDA/s400/P1.png" alt="" id="BLOGGER_PHOTO_ID_5598538612839668722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;2) Assign data points to the canopies, with another scan&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Start with the list of canopies from the last step&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Scan each data point. If it is within T1 of canopy A, add A as an assigned canopy of the data point. Notice that a data point can be assigned to multiple canopies&lt;/li&gt;&lt;li&gt;When done, each data point will look like &amp;lt;x1, c1&amp;gt;, i.e. the point together with its assigned canopies&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;a href="http://2.bp.blogspot.com/-2cTcbzx3cZQ/TbIBcAsF7WI/AAAAAAAAAjw/Qs5SPSWMXds/s1600/P2.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 303px;" src="http://2.bp.blogspot.com/-2cTcbzx3cZQ/TbIBcAsF7WI/AAAAAAAAAjw/Qs5SPSWMXds/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5598538867507850594" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that the input data points now carry an extra attribute that contains the assigned canopies. When comparing a point with the intermediate centroids, we only need to compare it against centroids within the same canopy.  
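&lt;br /&gt;&lt;br /&gt;As a rough illustration of the two scans described above, here is a minimal Python sketch. The function names (create_canopies, assign_canopies), the choice of Euclidean distance and the sample points are assumptions made for illustration only and do not come from any particular library. T1 is the larger (assignment) radius and T2 the smaller (creation) radius.&lt;br /&gt;&lt;br /&gt;

```python
import math


def dist(a, b):
    # Euclidean distance between two points given as tuples (an assumption;
    # any domain-specific distance could be used instead)
    return math.dist(a, b)


def create_canopies(points, t2):
    # First scan: a point becomes a new canopy center only if it is
    # farther than t2 from every existing canopy center
    centers = []
    for p in points:
        if all(dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def assign_canopies(points, centers, t1):
    # Second scan: tag each point with the indices of all canopies whose
    # center lies within t1 (a point may belong to several canopies)
    return [(p, [i for i, c in enumerate(centers) if t1 >= dist(p, c)])
            for p in points]


# Hypothetical sample data, chosen only to exercise both scans
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (10.0, 0.0)]
centers = create_canopies(points, t2=1.0)
tagged = assign_canopies(points, centers, t1=4.0)
```

In this sketch the second point is discarded as a canopy center because it lies within T2 of the first one, while the overlapping T1 regions let the first three points share two canopies.&lt;br /&gt;&lt;br /&gt;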
Here is the modified version of the closest_centroid() function ...&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"&gt;&lt;code&gt;closest_centroid(point, listOfCentroids) {&lt;br /&gt;  bestCentroid = listOfCentroids[0]&lt;br /&gt;  minDistance = INFINITY&lt;br /&gt;  for each cent in listOfCentroids {&lt;br /&gt;    // skip centroids that share no canopy with this point&lt;br /&gt;    if (not point.myCanopy.intersects(cent.myCanopy)) {&lt;br /&gt;      continue&lt;br /&gt;    }&lt;br /&gt;    distance = dist(point, cent)&lt;br /&gt;    if distance &amp;lt; minDistance {&lt;br /&gt;      minDistance = distance&lt;br /&gt;      bestCentroid = cent&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;  return bestCentroid&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-2400415893429084182?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/2400415893429084182/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=2400415893429084182' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2400415893429084182'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2400415893429084182'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2011/04/k-means-clustering-in-map-reduce.html' title='K-Means Clustering in Map Reduce'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-cAYr3kBEFlQ/TbEopL57W_I/AAAAAAAAAjY/MYYMGBJ6UYI/s72-c/P1.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-3187787432347206329</id><published>2011-03-19T18:47:00.000-07:00</published><updated>2011-03-19T20:39:35.795-07:00</updated><title type='text'>Compare Machine Learning models with ROC Curve</title><content type='html'>ROC Curve is a common method to compare performance between different models. It can also be used to pick trade-off decisions between "false positives" and "false negatives". ROC curve is defined as a plot of "true positive rate" against "false positive rate". However, I don't find the ROC concept intuitive and struggled for a while to grasp it.&lt;br /&gt;&lt;br /&gt;Here is my attempt to explain ROC curve from a different angle. We use a binary classification example to illustrate the idea (i.e. predicting whether a patient has cancer or not).&lt;br /&gt;&lt;br /&gt;First of all, no predictive model is 100% correct. The desirable state is that a person who actually has cancer gets a positive test result, and a person who actually has no cancer gets a negative test result.  
Since the test is imperfect, it is possible that a person who actually has cancer tests negative (i.e. fail to detect) or a person who actually has no cancer tests positive (i.e. false alarm).&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-6y4jktKu-YM/TYVhQ1ShJ6I/AAAAAAAAAiw/hcxHH9y8SPk/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 176px;" src="http://4.bp.blogspot.com/-6y4jktKu-YM/TYVhQ1ShJ6I/AAAAAAAAAiw/hcxHH9y8SPk/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5585977854633519010" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In reality, there is always a tradeoff between the false negative rate and the false positive rate. People can tune the decision threshold to adjust them (e.g. in "random forest", we can predict positive whenever more than 30% of the decision trees predict positive). Usually, the threshold is set based on the consequence or cost of misclassification (e.g. in this example, a failure to detect has a much higher cost than a false alarm).&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-NYD2KAx6pg0/TYVu2mKU66I/AAAAAAAAAi4/gK0IpQYznZ0/s1600/P2.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 202px;" src="http://2.bp.blogspot.com/-NYD2KAx6pg0/TYVu2mKU66I/AAAAAAAAAi4/gK0IpQYznZ0/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5585992797058821026" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This can also be used to compare model performance. A good model is one that has both a low false positive rate and a low false negative rate, which is indicated by the size of the gray area below (the smaller the better).&lt;br /&gt;&lt;br /&gt;"Random guess" is the worst prediction model and is used as a baseline for comparison.  
The decision threshold of a random guess is a number between 0 and 1 that determines whether the prediction is positive or negative.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-Nz9gIYM512A/TYVzLE8adMI/AAAAAAAAAjI/usbJWYLeIuw/s1600/P3.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 283px;" src="http://2.bp.blogspot.com/-Nz9gIYM512A/TYVzLE8adMI/AAAAAAAAAjI/usbJWYLeIuw/s320/P3.png" alt="" id="BLOGGER_PHOTO_ID_5585997546965857474" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;ROC Curve is basically what I have described above with one transformation: the y-axis changes from "fail to detect" to 1 - "fail to detect", which now becomes "success to detect". Honestly I don't understand why this representation is better though.&lt;br /&gt;&lt;br /&gt;Now, the ROC curve will look as follows ...&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-1PvjF2Pq4mQ/TYV2ouqSwJI/AAAAAAAAAjQ/KHmQqQdAyMA/s1600/P4.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 183px;" src="http://4.bp.blogspot.com/-1PvjF2Pq4mQ/TYV2ouqSwJI/AAAAAAAAAjQ/KHmQqQdAyMA/s400/P4.png" alt="" id="BLOGGER_PHOTO_ID_5586001354915233938" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-3187787432347206329?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/3187787432347206329/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=3187787432347206329' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/3187787432347206329'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/3187787432347206329'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2011/03/compare-machine-learning-models-with.html' title='Compare Machine Learning models with ROC Curve'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-6y4jktKu-YM/TYVhQ1ShJ6I/AAAAAAAAAiw/hcxHH9y8SPk/s72-c/p1.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-4965787605309153381</id><published>2011-03-17T22:59:00.000-07:00</published><updated>2011-03-17T23:18:53.617-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='predictive analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Predictive Analytics Conference 2011</title><content type='html'>&lt;span style="font-family:arial;"&gt;I attended the San Francisco Predictive Analytic conference this week and got a chance to chat with some of the best data mining practitioners in the country.  
Here is a summary of my key takeaways.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;How is the division of labor between human and machine?&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;Another way to ask this question is how “machine learning” and “domain expertise” work together and complement each other, since each has different strengths and weaknesses.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;Machine learning is very good at processing a large amount of data in an unbiased way, while a human is unable to process the same data volume and human judgment is usually biased. However, a machine cannot look beyond the data it is given. For example, if the predictive power is low, machine learning methods cannot distinguish whether this is because the data is not clean, because the wrong model was chosen, or because some important input feature is not captured. Domain expertise must be brought in to figure out the problem.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;So the consensus is that data mining / machine learning is simply a toolbox that can be used to augment human domain expertise, but can never replace it. For example, the domain expert can throw a large number of input features at the machine learning model, which can determine the subset that is most influential. But if the domain expert doesn’t recognize an important input feature (and so doesn’t capture it), there is no way the machine learning model can figure out what is missing, or even recognize that something is missing.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;On the other hand, humans are also very good at recognizing visual data patterns. “Data visualization” techniques can be a powerful means to get a good sense of the data and quickly identify the areas where drilldown analysis should be conducted.  
Of course, visualization is limited to low-dimensional data, as humans cannot comprehend more than a handful of dimensions.  Humans are also easily biased, so they may find patterns that are actually coincidences.  By working together, human and machine complement each other very well.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;What are some of the key design decisions in data mining?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-family:arial;"&gt;Balance false positives against false negatives based on the cost / consequence of making a wrong decision.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:arial;"&gt;We don’t have to use a single method from beginning to end.  We can use different methods at different stages of the analysis.  For example, in a multi-class (A, B, C) problem, we can use a decision tree to distinguish A from not-A (i.e. B, C) and then use a support vector machine to separate B and C.  As another example, we can use a decision tree to determine the best input attributes to be used by a neural network.&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;What is the most powerful / most commonly used supervised machine learning modeling technique?&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;The general answer is that each modeling technique has its strengths and weaknesses, and none of them wins in all situations.  
So understanding their respective strengths and weaknesses is important for picking the right one.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-family:arial;" &gt;Generalized Linear Regression&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;a href="http://horicky.blogspot.com/2009/11/machine-learning-with-linear-model.html"&gt;Linear and Logistic regression&lt;/a&gt; are based on fitting a linear plane to a set of data points such that the error between predicted output and actual output is minimized (the squared error for linear regression, and typically the log-loss, i.e. maximum likelihood, for logistic regression).  They are by far the most commonly used techniques, one for numeric output and the other for categorical output.  They have a long history in statistics and are supported in pretty much all commercial and open source data mining tools.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;Linear and Logistic regression models require a certain amount of data preparation, such as missing data handling.  They also assume that the output (or the logit of the output) is a linear combination of input features and, for linear regression, that the error is normally distributed.  However, real-life scenarios are not always linear.  To deal with non-linearity, input terms are mixed (usually by cross-multiplication) in different ways to generate additional input terms called “interactions”.  This process is largely trial and error and can generate a huge number of combinations.  Nevertheless, they do a reasonably good job in a wide spectrum of business problems and are well-understood by statisticians and data miners.  
They are also commonly used as a baseline for comparison with other models.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;font-family:arial;" &gt;Neural Network&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:arial;"&gt;&lt;a href="http://horicky.blogspot.com/2009/11/machine-learning-with-linear-model.html"&gt;Neural Network&lt;/a&gt; is based on multiple layers of perceptrons (each is like a logistic regression unit).  There is typically one hidden layer (so the number of layers is 3) with N perceptrons (where N is determined by trial and error).  Because of the extra layer and the logistic (sigmoid) function in the neural network, it can handle non-linearity very well.  If it has good predictors in its input data, a neural network can achieve very high prediction performance.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;Similar to linear regression, a neural network requires careful data preparation to remove noisy data as well as redundant input attributes (those that are highly correlated).  Neural networks also take much longer to train than other methods.  In addition, the model that a neural network has learned is hard to explain or make sense of.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;font-family:arial;" &gt;Support Vector Machine&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:arial;"&gt;&lt;a href="http://horicky.blogspot.com/2009/11/support-vector-machine.html"&gt;Support Vector Machine&lt;/a&gt; is a binary classifier (input features are numeric).  It is based on finding a linear plane that separates the two output classes such that the margin is maximized.  The optimal solution is expressed in terms of the dot product of vectors.  
If the points are not linearly separable, we can use a function to transform the points to a higher-dimensional space in which they are linearly separable.  The math shows that the dot product (after transforming to the high-dimensional space) can be generalized into a Kernel function (the Radial Basis function being the most common one).  Although the underlying math is not easy for everyone to understand, SVM has demonstrated outstanding performance in a wide spectrum of problems and has recently become one of the most effective methods.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;Despite its powerful capability, SVM is not broadly implemented in commercial products because of patent issues, as AT&amp;amp;T holds a patent on SVM.  In addition, non-linear kernel functions (such as the most common Radial Basis function) are difficult to implement in a parallel programming model such as Map/Reduce.  SVM is undergoing active research, and a derivative, Support Vector Regression, can be used to predict numeric output.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Tree Ensembles&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:arial;"&gt;This combines “ensemble methods” with “decision trees”.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;A &lt;a href="http://horicky.blogspot.com/2008/02/classification-via-decision-tree.html"&gt;decision tree&lt;/a&gt; is a first-generation machine learning algorithm based on a greedy approach.  For a classification problem, a decision tree tries to split a branch where the combined “purity” (measured by either the Gini index or Entropy) after the split is maximized.  For a regression problem, a decision tree tries to split where the combined “between-class-variance” divided by the “within-class-variance” is maximized.  
This is equivalent to maximizing the F-value after the split.  The splitting continues until a terminating condition is reached, such as too few members remaining in the branch, or the gain from a further split being insignificant.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;Decision trees are very good at dealing with missing values (simply not using that value in learning and going down both paths in scoring).  A decision model captured as a decision tree is also very comprehensible and explainable.  However, a decision tree is relatively sensitive to noise and can easily overfit the data.  Although the learning mechanism is easy to understand, a single decision tree doesn’t perform very well in general and is rarely used in real systems.  However, when decision trees are used together with ensemble methods, they become extraordinarily powerful, as these weaknesses are largely mitigated.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;The idea of an ensemble is simple.  Instead of learning one model, we learn multiple models and combine the estimates of the individual learners (e.g. we let them vote on categorical output and compute the average for numeric output).&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;There are two main models for creating different learners.  One is called “bagging”, which basically draws samples (with replacement) from the training set and then has the same tree algorithm learn on the different sample data sets.  The other is called “boosting”, which runs a sequence of iterations where samples are drawn from the training set according to a probability distribution in which the items wrongly predicted in the last round have a higher chance of being selected.  
In other words, the algorithm pays more attention to learning from wrongly classified examples.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;&lt;br /&gt;It turns out that tree ensembles are the most popular method at this moment, as they achieve very good prediction across the board, are easy to understand, and can be implemented in Map/Reduce. Google recently published &lt;a href="http://research.google.com/pubs/archive/36296.pdf"&gt;a good paper on their PLANET project&lt;/a&gt;, which implements tree ensembles on Map/Reduce.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-4965787605309153381?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/4965787605309153381/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=4965787605309153381' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4965787605309153381'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4965787605309153381'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2011/03/predictive-analytics-conference-2011.html' title='Predictive Analytics Conference 2011'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1394175011370671044</id><published>2010-12-05T08:36:00.000-08:00</published><updated>2010-12-06T00:25:29.417-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Business Intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='scalability'/><title type='text'>BI at large scale</title><content type='html'>As more and more data is being collected from pretty much everything a user does, such as transaction activities, social interactions, information searches ... enterprises have been actively looking into ways to turn this vast amount of raw data into useful information.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;BI process flow&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_j6mB7TMmJJY/TPvSTN2OV7I/AAAAAAAAAiQ/EG6_m6lWkTk/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 289px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/TPvSTN2OV7I/AAAAAAAAAiQ/EG6_m6lWkTk/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5547258593613338546" border="0" /&gt;&lt;/a&gt;It includes the following stages of processing&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;ETL:&lt;/span&gt;  Extract operational data (from inside the enterprise or from external sources) into a data warehouse (typically organized in a Star/Snowflake schema with Fact and Dimension tables).&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Data exploration&lt;/span&gt;:  Get insight into the data using simple visualization tools (e.g. 
histogram, summary statistics) or sophisticated OLAP tools (slice, dice, rollup, drilldown)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Report generation&lt;/span&gt;:  Produce executive reports&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Data mining&lt;/span&gt;: Extract patterns from the underlying data to form models (e.g. &lt;a href="http://horicky.blogspot.com/2009/05/machine-learning-probabilistic-model.html"&gt;Bayesian networks&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2009/11/machine-learning-with-linear-model.html"&gt;linear regression&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2009/11/machine-learning-with-linear-model.html"&gt;neural networks&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2008/02/classification-via-decision-tree.html"&gt;decision trees&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2009/11/support-vector-machine.html"&gt;support vector machines&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2009/05/machine-learning-nearest-neighbor.html"&gt;nearest neighbors&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2009/10/machine-learning-association-rule.html"&gt;association rules&lt;/a&gt;, &lt;a href="http://horicky.blogspot.com/2009/11/principal-component-analysis.html"&gt;principal component analysis&lt;/a&gt;)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Feedback&lt;/span&gt;: The model will be used to assist business decision making (predicting the future)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The gap in processing BIG data&lt;/span&gt;&lt;br /&gt;Many data mining and machine learning algorithms are available in both commercial packages (e.g. SAS, SPSS) and open source libraries (e.g. Weka, R).  Nevertheless, most of these ML algorithm implementations are based on fitting all data in memory and are not designed to process big data (e.g. 
Terabyte data volumes).&lt;br /&gt;&lt;br /&gt;On the other hand, massively parallel processing platforms such as &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;Hadoop, Map/Reduce&lt;/a&gt; have, over the last few years, proven themselves in processing Terabyte or even Petabyte ranges of data.  Although many &lt;a href="http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html"&gt;sequential algorithms can be restructured to run in map reduce&lt;/a&gt;, including &lt;a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf"&gt;a big portion of machine learning algorithms&lt;/a&gt;, corresponding ML implementations in massively parallel form are not yet readily available.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Approach 1:  Apache Mahout&lt;/span&gt;&lt;br /&gt;One approach is to "re-implement" the ML algorithms in Map/Reduce, and this is the path of the &lt;a href="http://mahout.apache.org/"&gt;Apache Mahout project&lt;/a&gt;.  Mahout seems to have &lt;a href="https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms"&gt;implemented an impressive list of algorithms&lt;/a&gt;, although I haven't used them for my projects yet.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Approach 2:  Ensemble of parallel independent learners&lt;/span&gt;&lt;br /&gt;This is an alternative path that doesn't require re-implementation of existing algorithms.  It works in the following way.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Draw samples from the Big data into many sample data sets, each of which can fit into the memory of a single, individual learner.&lt;/li&gt;&lt;li&gt;Assign each sample data set to an individual learner, which uses existing algorithms to learn a model.  
After learning, each individual learner keeps its own learned model&lt;br /&gt;&lt;/li&gt;&lt;li&gt;When a decision / prediction request is received, each individual learner will come up with its own prediction, and the results are then combined in some way.  (e.g. for a classification task, the learners will vote for the predicted class and the majority wins.  For regression, the average of the estimated values will be used as the predicted output value)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;a href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TPybnJqgcNI/AAAAAAAAAig/RnZsobzCvag/s1600/P2.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 244px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/TPybnJqgcNI/AAAAAAAAAig/RnZsobzCvag/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5547479937925017810" border="0" /&gt;&lt;/a&gt;&lt;a href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TPya7eHvCDI/AAAAAAAAAiY/0a4cdlX-Acg/s1600/P2.png"&gt;&lt;br /&gt;&lt;/a&gt;I also found that this approach can smoothly fade out outdated models.  As users' behavior may change over time, so does the validity of a learned model.  With this ensemble approach, I can have multiple learners each learn their model periodically.   Every time a prediction is needed, I will pick the latest k models and combine them into the final prediction using a time-decayed weighted voting scheme.  Outdated models will automatically slide out of the k-size window.&lt;br /&gt;&lt;br /&gt;One gotcha of the sampling approach is the handling of rare events (since you may lose those rare events in sampling).  
In this case, stratified sampling (instead of simple random sampling) should be used.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1394175011370671044?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1394175011370671044/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1394175011370671044' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1394175011370671044'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1394175011370671044'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/12/bi-at-large-scale.html' title='BI at large scale'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_j6mB7TMmJJY/TPvSTN2OV7I/AAAAAAAAAiQ/EG6_m6lWkTk/s72-c/p1.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-2107343426798480052</id><published>2010-11-05T13:42:00.000-07:00</published><updated>2010-11-06T21:40:07.142-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CEP'/><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='scalability'/><category scheme='http://www.blogger.com/atom/ns#' term='stream'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Map Reduce and Stream Processing</title><content type='html'>The &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;Hadoop Map/Reduce&lt;/a&gt; model is very good at processing large amounts of data in parallel.  It provides a general partitioning mechanism (based on the key of the data) to distribute aggregation workload across different machines.  Basically, &lt;a href="http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html"&gt;map/reduce algorithm design&lt;/a&gt; is all about how to select the right key for the record at different stages of processing.&lt;br /&gt;&lt;br /&gt;However, the "time dimension" has very different characteristics compared to other dimensional attributes of data, especially when real-time data processing is concerned.  It presents a different set of challenges to the batch-oriented Map/Reduce model.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Real-time processing demands a very low response latency, which means not much data can be accumulated along the "time" dimension before processing.&lt;/li&gt;&lt;li&gt;Data collected from multiple sources may not have all arrived at the point of aggregation.&lt;/li&gt;&lt;li&gt;In the standard model of Map/Reduce, the reduce phase cannot start until the map phase is completed.  Also, all the intermediate data is persisted to disk before being downloaded to the reducer.  
All of these add significant latency to the processing.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;a href="http://horicky.blogspot.com/2009/11/what-hadoop-is-good-at.html"&gt;Here&lt;/a&gt; is a more detailed description of &lt;a href="http://horicky.blogspot.com/2009/11/what-hadoop-is-good-at.html"&gt;this high latency characteristic of Hadoop&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Although Hadoop Map/Reduce is designed for batch-oriented workloads, certain applications, such as fraud detection, ad display, and network monitoring, require real-time response while processing large amounts of data, and people have started to look at various ways of tweaking Hadoop to fit a more real-time processing environment.  Here I try to look at some techniques to perform low-latency parallel processing based on the Map/Reduce model.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;General stream processing model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TNWCnSfpLmI/AAAAAAAAAiA/x3tUY33zczY/s1600/P2.png"&gt;&lt;img style="float: left; margin: 0pt 10px 10px 0pt; cursor: pointer; width: 200px; height: 150px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TNWCnSfpLmI/AAAAAAAAAiA/x3tUY33zczY/s200/P2.png" alt="" id="BLOGGER_PHOTO_ID_5536474928412962402" border="0" /&gt;&lt;/a&gt;In this model, data are produced at various OLTP systems, which update the transaction data store and also asynchronously send additional data for analytic processing.  The analytic processing will write its output to a decision model, which will feed information back to the OLTP system for real-time decision making.&lt;br /&gt;&lt;br /&gt;Notice the "asynchronous nature" of the analytic processing, which is decoupled from the OLTP system; this way the OLTP system won't be slowed down waiting for the analytic processing to complete.  
Nevertheless, we still need to perform the analytic processing ASAP; otherwise the decision model will not be very useful, as it won't reflect the current picture of the world.  What latency is tolerable is application specific.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Micro-batch in Map/Reduce&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TNWCCfBYOPI/AAAAAAAAAhw/ab40q2TYELQ/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 395px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TNWCCfBYOPI/AAAAAAAAAhw/ab40q2TYELQ/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5536474296120522994" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;One approach is to cut the data into small batches based on a time window (e.g. every hour) and submit the data collected in each batch to a Map/Reduce job.  A staging mechanism is needed so that the OLTP application can continue independently of the analytic processing.  A job scheduler is used to regulate the producer and consumer so each of them can proceed independently.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Continuous Map/Reduce&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here let's imagine some possible modifications of the Map/Reduce execution model to cater for real-time stream processing.  Unlike the &lt;a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.pdf"&gt;Hadoop online prototype (HOP)&lt;/a&gt;, I am not trying to preserve backward compatibility with Hadoop.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Long running&lt;/span&gt;&lt;br /&gt;The first modification is to make the mapper and reducer long-running.  Therefore, we cannot wait for the end of the map phase before starting the reduce phase, as the map phase never ends.  
This implies the mapper pushes the data to the reducer once it completes its processing and lets the reducer sort the data.  A downside of this approach is that it offers no opportunity to run the combine() function on the map side to reduce the bandwidth utilization.  It also shifts more workload to the reducer, which now needs to do the sorting.&lt;br /&gt;&lt;br /&gt;Notice there is a tradeoff between latency and optimization.  Optimization requires more data to be accumulated at the source (i.e. the Mapper) so local consolidation (i.e. combine) can be performed.  Unfortunately, low latency requires the data to be sent ASAP, so not much accumulation can be done.&lt;br /&gt;&lt;br /&gt;HOP suggests an adaptive flow control mechanism such that data is pushed out to the reducer ASAP until the reducer is overloaded and pushes back (using some sort of flow control protocol).  Then the mapper will buffer the processed messages and perform combine() before sending them to the reducer.  This approach automatically shifts the aggregation workload back and forth between the reducer and the mapper.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Time Window:  Slice and Range&lt;/span&gt;&lt;br /&gt;There is a "time slice" concept and a "time range" concept.  "Slice" defines a time window over which results are accumulated before the reduce processing is executed.  This is also the minimum amount of data that the mapper should accumulate before sending to the reducer.&lt;br /&gt;&lt;br /&gt;"Range" defines the time window over which results are aggregated.  It can be a landmark window, which has a well-defined starting point, or a jumping window (consider a moving landmark scenario).  
It can also be a sliding window, where a fixed-size window ending at the current time is aggregated.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TNWilIApI8I/AAAAAAAAAiI/G3OfFShCqTU/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 161px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TNWilIApI8I/AAAAAAAAAiI/G3OfFShCqTU/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5536510075610932162" border="0" /&gt;&lt;/a&gt;After receiving a specific time slice from every mapper, the reducer can start the aggregation processing and combine the result with the previous aggregation result.  The slice can be dynamically adjusted based on the amount of data sent from the mapper.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; font-weight: bold;font-size:100%;" &gt;Incremental processing&lt;/span&gt;&lt;br /&gt;Notice that the reducer needs to compute the aggregated slice value after receiving all records of the same slice from all mappers.  After that it calls the user-defined merge() function to merge the slice value with the range value.  In case the range needs to be refreshed (e.g. on reaching a jumping window boundary), the init() function will be called to get a refreshed range value.  
If the range value needs to be updated (when a certain slice value falls outside a sliding range), the unmerge() function will be invoked.&lt;br /&gt;&lt;br /&gt;Here is an example of how we keep track of the average hit rate (i.e. total hits per hour) within a 24-hour sliding window, with updates happening every hour (i.e. a one-hour slice).&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;# Called for each hit record&lt;br /&gt;map(k1, hitRecord) {&lt;br /&gt;   site = hitRecord.site&lt;br /&gt;   # look up the current slice for this key&lt;br /&gt;   slice = lookupSlice(site)&lt;br /&gt;   if (now - slice.time &gt; 60.minutes) {&lt;br /&gt;       # Notify reducer that the whole slice for this site has been sent&lt;br /&gt;       advance(site, slice)&lt;br /&gt;&lt;/code&gt;&lt;code&gt;        slice = lookupSlice(site)&lt;br /&gt;&lt;/code&gt;&lt;code&gt;    }&lt;br /&gt;   emitIntermediate(site, slice, 1)&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;combine(site, slice, countList) {&lt;br /&gt;   hitCount = 0&lt;br /&gt;   for count in countList {&lt;br /&gt;       hitCount += count&lt;br /&gt;   }&lt;br /&gt;   # Send the message to the downstream node&lt;br /&gt;   emitIntermediate(site, slice, hitCount)&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;# Called when the reducer has received the full slice from all mappers&lt;br /&gt;reduce(site, slice, countList) {&lt;br /&gt;   hitCount = 0&lt;br /&gt;   for count in countList {&lt;br /&gt;       hitCount += count&lt;br /&gt;   }&lt;br /&gt;   sv = 
SliceValue.new&lt;br /&gt;   sv.hitCount = hitCount&lt;br /&gt;   return sv&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Called at each jumping window boundary&lt;br /&gt;init(slice) {&lt;br /&gt;   rangeValue = RangeValue.new&lt;br /&gt;   rangeValue.hitCount = 0&lt;br /&gt;   return rangeValue&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Called after each reduce()&lt;br /&gt;merge(rangeValue, slice, sliceValue) {&lt;br /&gt;   rangeValue.hitCount += sliceValue.hitCount&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Called when a slice falls out of the sliding window&lt;br /&gt;unmerge(rangeValue, slice, sliceValue) {&lt;br /&gt;   rangeValue.hitCount -= sliceValue.hitCount&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-2107343426798480052?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/2107343426798480052/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=2107343426798480052' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2107343426798480052'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2107343426798480052'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html' title='Map Reduce and Stream Processing'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j6mB7TMmJJY/TNWCnSfpLmI/AAAAAAAAAiA/x3tUY33zczY/s72-c/P2.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-744734752066582382</id><published>2010-10-15T23:09:00.000-07:00</published><updated>2010-10-16T11:26:23.069-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='scatter gather'/><category scheme='http://www.blogger.com/atom/ns#' term='BSP'/><category scheme='http://www.blogger.com/atom/ns#' term='map reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='scalability'/><category scheme='http://www.blogger.com/atom/ns#' term='Dryad'/><category scheme='http://www.blogger.com/atom/ns#' term='Gigaspace'/><category scheme='http://www.blogger.com/atom/ns#' term='Pregel'/><title type='text'>Scalable System Design Patterns</title><content type='html'>Looking back after 2.5 years since &lt;a href="http://horicky.blogspot.com/2008/02/scalable-system-design.html"&gt;my previous post on scalable system design techniques&lt;/a&gt;, I've observed an emergence of a set of commonly used design patterns.  Here is my attempt to capture and share them.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Load Balancer&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In this model, there is a dispatcher that determines which worker instance will handle the request based on different policies.  
The application should ideally be "stateless" so that any worker instance can handle the request.&lt;br /&gt;&lt;br /&gt;This pattern is deployed in almost every medium to large web site setup.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TLnj_mWL50I/AAAAAAAAAgg/JFPsfGcAenI/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 267px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TLnj_mWL50I/AAAAAAAAAgg/JFPsfGcAenI/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5528700699338860354" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Scatter and Gather&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In this model, the dispatcher multicasts the request to all workers in the pool.  Each worker computes a local result and sends it back to the dispatcher, which consolidates the results into a single response and sends it back to the client.&lt;br /&gt;&lt;br /&gt;This pattern is used by search engines like Yahoo and Google to handle users' keyword search requests.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TLlDyOK60HI/AAAAAAAAAfI/JreI7fqvohA/s1600/P2.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 231px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TLlDyOK60HI/AAAAAAAAAfI/JreI7fqvohA/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5528524547650408562" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Result Cache&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In this model, the dispatcher first looks up whether the request has been made before and, if so, returns the previous result in order to avoid the actual execution.&lt;br /&gt;&lt;br /&gt;This pattern is commonly used in large enterprise applications.  
Memcached is a very commonly deployed cache server.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_j6mB7TMmJJY/TLlEpBawVMI/AAAAAAAAAfQ/Jp8vbVYnF0s/s1600/P3.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 231px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/TLlEpBawVMI/AAAAAAAAAfQ/Jp8vbVYnF0s/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5528525489119974594" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Shared Space&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This model is also known as "Blackboard"; all workers monitor information in the shared space and contribute partial knowledge back to the blackboard.  The information is continuously enriched until a solution is reached.&lt;br /&gt;&lt;br /&gt;This pattern is used in JavaSpace and also in the commercial product GigaSpace.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TLlFf-b8lPI/AAAAAAAAAfY/Poy8V0eH1gA/s1600/P4.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 278px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/TLlFf-b8lPI/AAAAAAAAAfY/Poy8V0eH1gA/s400/P4.png" alt="" id="BLOGGER_PHOTO_ID_5528526433212470514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Pipe and Filter&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This model is also known as "Data Flow Programming"; all workers are connected by pipes across which the data flows.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.eaipatterns.com/PipesAndFilters.html"&gt;This pattern&lt;/a&gt; is a very common EAI pattern.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_j6mB7TMmJJY/TLlGIM4IDiI/AAAAAAAAAfg/nQgVADmUl5w/s1600/P5.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 289px;" 
src="http://4.bp.blogspot.com/_j6mB7TMmJJY/TLlGIM4IDiI/AAAAAAAAAfg/nQgVADmUl5w/s400/P5.png" alt="" id="BLOGGER_PHOTO_ID_5528527124283526690" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Map Reduce&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This model targets batch jobs where disk I/O is the major bottleneck.  It uses a distributed file system so that disk I/O can be done in parallel.&lt;br /&gt;&lt;br /&gt;This pattern is used in many of Google's internal applications, and is also implemented in the open source &lt;a href="http://hadoop.apache.org/"&gt;Hadoop &lt;/a&gt;parallel processing framework.  I also find this pattern &lt;a href="http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html"&gt;can be used in many application design scenarios&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TLlHPyMkTII/AAAAAAAAAf4/McnK_GGkYpw/s1600/P7.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 278px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/TLlHPyMkTII/AAAAAAAAAf4/McnK_GGkYpw/s400/P7.png" alt="" id="BLOGGER_PHOTO_ID_5528528354072087682" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Bulk Synchronous Parallel&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This model is based on lock-step execution across all workers, coordinated by a master.  
Each worker repeats the following steps until the exit condition is reached, which is when there are no more active workers.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Each worker reads data from its input queue&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Each worker performs local processing based on the data it has read&lt;/li&gt;&lt;li&gt;Each worker pushes its local result along its direct connections&lt;/li&gt;&lt;/ol&gt;This pattern has been used in Google's &lt;a href="http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html"&gt;Pregel graph processing model&lt;/a&gt; as well as the &lt;a href="http://incubator.apache.org/hama/"&gt;Apache Hama&lt;/a&gt; project.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_j6mB7TMmJJY/TLnhYZH7PTI/AAAAAAAAAgY/YHy5K8H6hZA/s1600/P8.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 274px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/TLnhYZH7PTI/AAAAAAAAAgY/YHy5K8H6hZA/s400/P8.png" alt="" id="BLOGGER_PHOTO_ID_5528697826751233330" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Execution Orchestrator&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This model is based on an intelligent scheduler / orchestrator that schedules ready-to-run tasks (based on a dependency graph) across a cluster of dumb workers.&lt;br /&gt;&lt;br /&gt;This pattern is used in &lt;a href="http://research.microsoft.com/en-us/projects/dryad/"&gt;Microsoft's Dryad project&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TLlH_a9WOMI/AAAAAAAAAgI/41l0bvV3fkE/s1600/P8.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 281px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/TLlH_a9WOMI/AAAAAAAAAgI/41l0bvV3fkE/s400/P8.png" alt="" id="BLOGGER_PHOTO_ID_5528529172467955906" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Although I have tried to cover the whole set of commonly used 
design patterns for building large scale systems, I am sure I have missed some other important ones.  Please drop me a comment with your feedback.&lt;br /&gt;&lt;br /&gt;Also, there is a whole set of scalability patterns around the data tier that I haven't covered here.  These include &lt;a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;some very basic patterns underlying NOSQL&lt;/a&gt;.  It is also worth &lt;a href="http://horicky.blogspot.com/2010/10/bigtable-model-with-cassandra-and-hbase.html"&gt;taking a deep look at some leading implementations&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-744734752066582382?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/744734752066582382/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=744734752066582382' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/744734752066582382'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/744734752066582382'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/10/scalable-system-design-patterns.html' title='Scalable System Design Patterns'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_j6mB7TMmJJY/TLnj_mWL50I/AAAAAAAAAgg/JFPsfGcAenI/s72-c/p1.png' height='72' 
width='72'/><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-5630585881148053320</id><published>2010-10-06T20:17:00.000-07:00</published><updated>2010-11-28T09:59:44.627-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Cassandra'/><category scheme='http://www.blogger.com/atom/ns#' term='Bigtable'/><category scheme='http://www.blogger.com/atom/ns#' term='NOSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='HBase'/><title type='text'>BigTable Model with Cassandra and HBase</title><content type='html'>Recently, in a number of "scalability discussion" meetings, I've seen the following pattern coming up repeatedly ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;To make your app scalable, you try to make your app layer “stateless”.&lt;/li&gt;&lt;li&gt;OK, so you move the "state" out of your application layer to a shared DB, or shared data layer.&lt;/li&gt;&lt;li&gt;Now, how do we make the data tier scalable?  By definition, we cannot make the data tier stateless.&lt;/li&gt;&lt;li&gt;OK, now let's think about how to "partition" your data and spread it across multiple machines in such a way that the workload is balanced.&lt;/li&gt;&lt;li&gt;Now there are more boxes, and what if some of them crash?&lt;/li&gt;&lt;li&gt;OK, we should replicate the data across machines.&lt;/li&gt;&lt;li&gt;Now, how do we keep this data in sync ...&lt;/li&gt;&lt;li&gt;And then Cloud computing gets into the picture (as it always does).  Now we not only have a pool of machines, but the pool size can also grow and shrink according to workload fluctuation (you don't want to pay for something idle, right?).&lt;/li&gt;&lt;li&gt;Now we need to figure out how we should "redistribute" the data as we add machines to the pool or remove machines from it.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;This is an area where NOSQL shines.  In the last 18 months, NOSQL has become one of the hottest topics in the software industry.  
It has been introduced as a solution to large scale data storage problems in the range of Terabytes or Petabytes.  Dozens of NOSQL products have come to the market, but two leaders, HBase and Cassandra, seem to stand out from the rest in terms of their adoption.&lt;br /&gt;&lt;br /&gt;Given the increasing demand for an explanation of these two products recently, I decided to write a post on them.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-family:times new roman;" &gt;I will not repeat the basic theory of NOSQL here; for a foundation of the &lt;/span&gt;&lt;a style="font-style: italic; font-family: times new roman;" href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;distributed system theory underlying the NOSQL design&lt;/a&gt;&lt;span style="font-style: italic;font-family:times new roman;" &gt;, please refer to &lt;/span&gt;&lt;a style="font-style: italic; font-family: times new roman;" href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;my earlier blog&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Both HBase and Cassandra are based on the Google BigTable model, so let's first introduce some key characteristics underlying Bigtable.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Fundamentally Distributed&lt;/span&gt;&lt;br /&gt;BigTable is built from the ground up on a "highly distributed", "share nothing" architecture.  Data is meant to be stored across a large number of unreliable, commodity server boxes by "partitioning" and "replication".  Data partitioning means the data is partitioned by its key and stored on different servers.  
Replication means the same data element is replicated multiple times on different servers.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TK1uoEOxDRI/AAAAAAAAAeI/NO1o3ciSc44/s1600/P3.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 186px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TK1uoEOxDRI/AAAAAAAAAeI/NO1o3ciSc44/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5525193952462966034" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Column Oriented&lt;/span&gt;&lt;br /&gt;Unlike a traditional RDBMS implementation, where each "row" is stored contiguously on disk, BigTable stores each column contiguously on disk.  The underlying assumption is that in most cases not all columns are needed for data access; a column oriented layout allows more records to fit in a disk block and hence can reduce the disk I/O.&lt;br /&gt;&lt;br /&gt;A column oriented layout is also very effective for storing very sparse data (many cells have a NULL value) as well as multi-value cells.  The following diagram illustrates the difference between a Row-oriented layout and a Column-oriented layout.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TK1npAatLqI/AAAAAAAAAd4/TscPInSeUoo/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 239px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TK1npAatLqI/AAAAAAAAAd4/TscPInSeUoo/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5525186272037777058" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Variable number of Columns&lt;/span&gt;&lt;br /&gt;In an RDBMS, each row must have a fixed set of columns defined by the table schema, and therefore it is not easy to support columns with multi-value attributes.  
The BigTable model introduces the "Column Family" concept such that a row has a fixed number of "column families", but within a "column family" a row can have a variable number of columns that can be different in each row.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TK3rvDfWSFI/AAAAAAAAAeY/kD1z9eBG_iw/s1600/P5.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 162px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TK3rvDfWSFI/AAAAAAAAAeY/kD1z9eBG_iw/s400/P5.png" alt="" id="BLOGGER_PHOTO_ID_5525331511476635730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In the Bigtable model, the basic data storage unit is a cell (addressed by a particular row and column).  Bigtable allows multiple timestamped versions of data within a cell.  In other words, a user can address a data element by the rowid, column name and timestamp.  At the configuration level, Bigtable allows the user to specify how many versions can be stored within each cell, either by count (how many) or by freshness (how old).&lt;br /&gt;&lt;br /&gt;At the physical level, BigTable stores each column family contiguously on disk (imagine one file per column family), and physically sorts the data by rowid, column name and timestamp.  After that, the sorted data is compressed so that a disk block can store more data.  
Moreover, since data within a column family usually has a similar pattern, data compression can be very effective.&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TK3tXVqVcbI/AAAAAAAAAeg/yIckcEOCR1Q/s1600/P4.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 215px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TK3tXVqVcbI/AAAAAAAAAeg/yIckcEOCR1Q/s400/P4.png" alt="" id="BLOGGER_PHOTO_ID_5525333303060951474" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Note: &lt;/span&gt; Although not shown in this example, the rowids of different column families can be of completely different types.  For example, in the above example, I can have another column family "UserIdx" whose rowid is a string (the user's name) and whose columns have a columnKey of u1, u2 (ie: the row id of the User column family) and a columnValue of null (ie: not used).   This is a common technique for building an index at the application level.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Sequential write&lt;/span&gt;&lt;br /&gt;The BigTable model is highly optimized for write operations (insert/update/delete) by using sequential writes (no disk seek is needed).  Basically, a write happens by first appending a transaction entry to a log file (hence the disk write I/O is sequential with no disk seek), followed by writing the data into an in-memory Memtable.  If the machine crashes and all in-memory state is lost, the recovery step will bring the Memtable up to date by replaying the updates in the log file.&lt;br /&gt;&lt;br /&gt;All the latest updates are therefore stored in the Memtable, which grows until it reaches a size threshold; the Memtable is then flushed to disk as an SSTable (sorted by the string key).  
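&lt;br /&gt;&lt;br /&gt;As a rough illustration of this write path, here is a minimal sketch in Python.  It is my own toy model, not how BigTable, HBase or Cassandra actually implement it: the class name TinyStore and the flush_threshold parameter are made up for this example, and plain in-memory lists stand in for the log file and the on-disk SSTables.&lt;br /&gt;&lt;br /&gt;

```python
import json


class TinyStore:
    """Toy write path: append to a log, update a Memtable, flush to sorted SSTables."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical size limit, in entries
        self.log = []        # stands in for the append-only log file
        self.memtable = {}   # in-memory table holding the latest writes
        self.sstables = []   # immutable lists of (key, value) pairs, each sorted by key

    def write(self, key, value):
        # 1. Sequential append to the log, so the write survives a crash.
        self.log.append(json.dumps([key, value]))
        # 2. Apply the write to the in-memory Memtable.
        self.memtable[key] = value
        # 3. Flush once the Memtable reaches the size threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the Memtable as an SSTable sorted by key, then start afresh.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.log = []  # flushed entries no longer need to be replayed

    def recover(self):
        # After a crash the Memtable is gone; rebuild it by replaying the log.
        self.memtable = {}
        for line in self.log:
            key, value = json.loads(line)
            self.memtable[key] = value


store = TinyStore(flush_threshold=3)
for k, v in [("u3", "c"), ("u1", "a"), ("u2", "b"), ("u9", "z")]:
    store.write(k, v)
store.memtable = {}  # simulate a crash that wipes the in-memory state
store.recover()
print(store.sstables, store.memtable)
```

&lt;br /&gt;The demo at the bottom triggers one flush after the third write; the fourth write then lives only in the log and the Memtable, and is rebuilt from the log after the simulated crash.&lt;br /&gt;&lt;br /&gt;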
Over a period of time there will be multiple SSTables on the disk that store the data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt; Merged read&lt;/span&gt;&lt;br /&gt;Whenever a read request is received, the system will first look up the Memtable by row key to see if it contains the data.  If not, it will look at the on-disk SSTables to see if the row key is there.  We call this the "merged read" as the system needs to look at multiple places for the data.  To speed up the detection, each SSTable has a companion Bloom filter that can rapidly detect the absence of a row key.  In other words, only when the Bloom filter returns positive will the system do a detailed lookup within the SSTable.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Periodic Data Compaction&lt;/span&gt;&lt;br /&gt;As you can imagine, the read operation can be quite inefficient when there are too many SSTables scattered around.  Therefore, the system periodically merges the SSTables.  Notice that since each SSTable is individually sorted by key, a simple "merge sort" is sufficient to merge multiple SSTables into one.  The merge mechanism is based on a logarithmic property: two SSTables of the same size are merged into a single SSTable of double the size.  
Therefore the number of SSTables is O(logN), where N is the number of rows.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TK1uBvZAfzI/AAAAAAAAAeA/LRykejZpRaI/s1600/P2.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 252px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TK1uBvZAfzI/AAAAAAAAAeA/LRykejZpRaI/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5525193294033747762" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;After looking at the common part, let's look at the differences between HBase and Cassandra.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;HBase&lt;/span&gt;&lt;br /&gt;Based on the BigTable model, HBase uses the Hadoop Filesystem (HDFS) as its data storage engine.  The advantage of this approach is that HBase doesn't need to worry about data replication, data consistency and resiliency because HDFS already handles them.  Of course, the downside is that it is also constrained by the characteristics of HDFS, which is not optimized for random read access.  In addition, there will be extra network latency between the DB server and the file server (which is the data node of Hadoop).&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TK4AnbAbagI/AAAAAAAAAew/OrK54uwnp5M/s1600/P6.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 291px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TK4AnbAbagI/AAAAAAAAAew/OrK54uwnp5M/s400/P6.png" alt="" id="BLOGGER_PHOTO_ID_5525354470094629378" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In the HBase architecture, data is stored in a farm of Region Servers.  
The "key-to-server" mapping is needed to locate the corresponding server, and this mapping is stored as a "Table" similar to other user data tables.&lt;br /&gt;&lt;br /&gt;Before a client does any DB operation, it needs to first locate the corresponding region server.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The client contacts a predefined Master server, which replies with the endpoint of a region server that holds a "Root Region" table.&lt;/li&gt;&lt;li&gt;The client contacts that region server, which replies with the endpoint of a second region server that holds a "Meta Region" table, which contains a mapping from "user table" to "region server".&lt;/li&gt;&lt;li&gt;The client contacts this second region server, passing along the user table name.  This second region server will look up its meta region and reply with the endpoint of a third region server that holds a "User Region", which contains a mapping from "key range" to "region server".&lt;/li&gt;&lt;li&gt;The client contacts this third region server, passing along the row key that it wants to look up.  This third region server will look up its user region and reply with the endpoint of a fourth region server that holds the data the client is looking for.&lt;/li&gt;&lt;li&gt;The client caches the results along this process, so subsequent requests don't need to go through this multi-step process again to resolve the corresponding endpoint.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;In HBase, the in-memory data storage (what we refer to as "Memtable" in the paragraphs above) is implemented in &lt;a href="http://horicky.blogspot.com/2009/10/notes-on-memcached.html"&gt;Memcache&lt;/a&gt;.  The on-disk data storage (what we refer to as "SSTable" in the paragraphs above) is implemented as an HDFS file residing on a Hadoop data node server.  The Log file is also stored as an HDFS file.  
(I feel storing a transaction log file remotely will hurt performance)&lt;br /&gt;&lt;br /&gt;Also in the HBase architecture, there is a special machine playing the "role of master", which monitors and coordinates the activities of all region servers (the heavy-duty worker nodes).  To the best of my knowledge, the master node is a single point of failure at this moment.&lt;br /&gt;&lt;br /&gt;For a more detailed architecture description, Lars George has a very good explanation of the &lt;a href="http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html"&gt;log file implementation&lt;/a&gt; as well as the &lt;a href="http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html"&gt;data storage architecture&lt;/a&gt; of HBase.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Cassandra&lt;/span&gt;&lt;br /&gt;Also based on the BigTable model, Cassandra uses the DHT (distributed hash table) model to partition its data, following the design described in the &lt;a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf"&gt;Amazon Dynamo paper&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Consistent Hashing via O(1) DHT&lt;/span&gt;&lt;br /&gt;Each machine (node) is associated with a particular id that is distributed in a keyspace (e.g. 128 bit).  Every data element is also associated with a key (in the same key space).  A server owns all the data whose key lies between its id and the preceding server's id.&lt;br /&gt;&lt;br /&gt;Data is also replicated across multiple servers.  Cassandra offers multiple replication schemes, including storing the replicas in neighbor servers (whose ids succeed that of the server owning the data), or a rack-aware strategy that takes the physical location of the servers into account.  
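&lt;br /&gt;&lt;br /&gt;To make the ownership rule concrete, here is a minimal sketch in Python of a hash ring with neighbor replication.  It is only an illustration under my own assumptions, not Cassandra's actual code: the names ring_position and Ring are made up, and using MD5 to place names on a 128-bit ring is a choice I made for this example.&lt;br /&gt;&lt;br /&gt;

```python
import bisect
import hashlib


def ring_position(name):
    # Map a node name or a data key onto a 128-bit key space (MD5 is an assumption).
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Sort the nodes by their position (id) on the ring.
        self.entries = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.entries]

    def replicas(self, key, n=3):
        # The owner is the first node whose id is at or after the key's position,
        # wrapping around to the start of the ring; replicas go to its successors.
        start = bisect.bisect_left(self.positions, ring_position(key))
        count = min(n, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42", n=3))
```

&lt;br /&gt;The first name in the returned list is the owner of the key and the rest are the succeeding neighbors that hold its replicas.  When a node joins or leaves, only the keys between it and its neighbors change owner.&lt;br /&gt;&lt;br /&gt;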
The simple partition strategy is as follows ...&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TK4tcDyi7sI/AAAAAAAAAe4/aCd9EC9fX74/s1600/P7.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 238px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/TK4tcDyi7sI/AAAAAAAAAe4/aCd9EC9fX74/s400/P7.png" alt="" id="BLOGGER_PHOTO_ID_5525403752907075266" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tunable Consistency Level&lt;/span&gt;&lt;br /&gt;Unlike HBase, Cassandra allows you to choose the consistency level that is suitable for your application, so you can gain more scalability if you are willing to trade off some data consistency.&lt;br /&gt;&lt;br /&gt;For example, it allows you to choose how many ACKs to receive from different replicas before considering a WRITE to be successful.  Similarly, you can choose how many replica responses must be received for a READ before returning the result to the client.&lt;br /&gt;&lt;br /&gt;By choosing the appropriate numbers for W and R, you can choose the level of consistency you like.  For example, to achieve Strict Consistency, we just need to pick W and R such that W + R &gt; N, where N is the number of replicas.  This includes the possibilities of (W = one and R = all), (R = one and W = all), and (W = quorum and R = quorum).  Of course, if you don't need strict consistency, you can choose smaller values for W and R and gain higher availability.  Regardless of what consistency level you choose, the data will be eventually consistent thanks to the "hinted handoff", "read repair" and "anti-entropy sync" mechanisms described below.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Hinted Handoff&lt;/span&gt;&lt;br /&gt;The client performs a write by sending the request to any Cassandra node, which will act as the proxy for the client.  
This proxy node will locate the N corresponding nodes that hold the data replicas and forward the write request to all of them.  If any node has failed, it will pick a random node as a handoff node and write the request with a hint telling it to forward the write request back to the failed node after it recovers.  The handoff node will then periodically check for the recovery of the failed node and forward the write to it.  Therefore, the original node will eventually receive all the write requests.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Conflict Resolution&lt;/span&gt;&lt;br /&gt;Since writes can reach different replicas, the timestamp of the data is used to resolve conflicts.  In other words, the latest timestamp wins and pushes the earlier timestamps into earlier versions (they are not lost).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Read Repair&lt;/span&gt;&lt;br /&gt;When the client performs a "read", the proxy node will issue N reads but only wait for R responses and return the one with the latest version.  If some nodes respond with an older version, the proxy node will send the latest version to them asynchronously, so these left-behind nodes will still eventually catch up with the latest version.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Anti-Entropy data sync&lt;/span&gt;&lt;br /&gt;To ensure the data is still in sync even when no READ or WRITE occurs on the data, replica nodes periodically gossip with each other to figure out whether any of them is out of sync.  For each key range of data, each member in the replica group computes a Merkle tree (a hash encoding tree where the difference can be located quickly) and sends it to its neighbors.  By comparing the received Merkle tree with its own tree, each member can quickly determine which data portion is out of sync. 
If so, it will send the diff to the left-behind members.&lt;br /&gt;&lt;br /&gt;Anti-entropy is the "catch-all" way to guarantee eventual consistency, but it is also pretty expensive and therefore is not done frequently.  By combining the data sync with read repair and hinted handoff, we can keep the replicas pretty up-to-date.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;BigTable trade offs&lt;/span&gt;&lt;br /&gt;To retain the scalability features of BigTable, some of the basic features that an RDBMS provides are missing in the BigTable model.  Here we highlight the rough edges of Bigtable.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1) Primitive transaction support&lt;/span&gt;&lt;br /&gt;Transaction protection is only guaranteed within a single row.  In other words, you cannot start an atomic transaction to modify multiple rows.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;2) Primitive isolation support&lt;/span&gt;&lt;br /&gt;While you are reading a row, other people may modify the same row and update it before you do.  Your view is no longer current, but your later update can easily wipe out other people's changes.&lt;br /&gt;&lt;br /&gt;There are many techniques for isolating concurrent updates, including pessimistic approaches like locking or optimistic approaches that use a vector clock as the version stamp.  But to the best of my understanding, there is no robust test-and-set operation in the BigTable model (there is some getLock mechanism in HBase which I haven't looked into); my impression is that there is no easy way to check that no concurrent update has happened in between.&lt;br /&gt;&lt;br /&gt;Because of this limitation, I think the BigTable model is more suitable for those applications where concurrent updates to the same row are very rare, or where some inconsistency is tolerable at the application level.  
Fortunately, there are still a lot of applications falling into this bucket.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3) No indexes&lt;/span&gt;&lt;br /&gt;Notice that data within BigTable are all physically sorted by rowid, column name and timestamp.  There is no index from a column value to its containing rowid.&lt;br /&gt;&lt;br /&gt;This model is quite different from an RDBMS, where you typically define a table and worry about defining the indexes later.  There is no such "index" concept in BigTable and you need to carefully plan out the physical sorting order of your data layout.&lt;br /&gt;&lt;br /&gt;The lack of indexes turns out to be quite inconvenient and many people using Bigtable end up building their own index at the application level.  This usually results in a highly denormalized data model with lots of column families that store links to other tables.  Any update to the base data needs to carefully update these other column families as well.  From a performance angle, this is actually better than maintaining an index in an RDBMS because Bigtable is optimized for writes.  However, since it is now the application logic that maintains the index, this can be a source of application bugs.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;4) No referential integrity enforcement&lt;/span&gt;&lt;br /&gt;As mentioned above, since you are building an artificial index at the application level, you need to maintain the integrity of your index as well.  This includes updating your index when the base data is inserted, modified or deleted.  
This kind of handling logic traditionally resides at the RDBMS level, but since BigTable has no such referential integrity concept, this responsibility now lands on your application logic.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;5) Lack of surrounding tools&lt;/span&gt;&lt;br /&gt;As NOSQL or BigTable is very new, the tools surrounding it are definitely not comparable to those of the RDBMS world at this moment.  Such tools include report generation, BI, data warehouse ... etc.&lt;br /&gt;&lt;br /&gt;I observe a general trend that most NOSQL products are moving towards providing an ODBC / JDBC interface so that they can integrate more easily with the existing tool market.  But at this moment, to the best of my understanding, such interfaces are not widespread yet.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Design Patterns for Bigtable model&lt;br /&gt;&lt;/span&gt;Because the Bigtable model is so different, its data model design methodology is also quite different from traditional RDBMS schema design.  Here is a sequence of steps that are pretty common ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1) Identify all your query scenarios&lt;/span&gt;&lt;br /&gt;Since there is no index concept, you have to plan out carefully how your data is physically sorted.  Therefore it is important to find all your query use cases first.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2) Define your "entity table" and its corresponding column families&lt;/span&gt;&lt;br /&gt;For an &lt;span&gt;entity table&lt;/span&gt;, it is pretty common to have one column family storing all the entity attributes, and multiple column families to store the links to other entities.  (e.g. 
A "UserTable" may contain a column family "baseInfo" to store all attributes of the user, a column family "friend" to store the links to other users, and a column family "company" to store links to a CompanyTable)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3) Define your "index table"&lt;/span&gt;&lt;br /&gt;The "index table" is what your application builds to support reverse lookup.  The "key" is typically based on the search criteria you have identified in your query scenarios.  It is not uncommon for each query to have its own specific index table.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;4) Make sure your application logic updates the index correctly&lt;/span&gt;&lt;br /&gt;Since the index table has to be maintained by application logic, you need to check that it is done correctly.  In many cases, this can be quite a source of bugs.&lt;br /&gt;&lt;br /&gt;It is important to realize that NOSQL is not advocating a replacement of RDBMS, which has been proven in many lines of application.  NOSQL should be considered a complementary technology for some niche areas that RDBMS does not cover well.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;5) Design your update to be idempotent&lt;/span&gt;&lt;br /&gt;&lt;a href="http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/"&gt;Max's post&lt;/a&gt; has a good articulation of this.  The basic idea is that, due to the eventual consistency model based on "read/repair" and "quorum update", it is possible that a failed update is in fact successful.  Here is how this can happen.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The client issues a quorum update.&lt;/li&gt;&lt;li&gt;The server distributes the update to all replica servers, but unfortunately doesn't get more than half of them to respond successfully.  
So it returns a failure to the client.&lt;/li&gt;&lt;li&gt;Nevertheless, the update has been received by a minority of the replicas (in other words, they don't roll back even though the update was not successful).&lt;/li&gt;&lt;li&gt;Later, if the client reads from one of this minority, it will get this update (even though it has failed).  What's more, since this update has a later version, it will read-repair the other copies (i.e. further propagate the failed update).&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Therefore, the usual recovery is that the user should retry the operation when it fails, and the application logic needs to deal with potentially duplicated updates.  One way is to find some way to detect duplicates and ignore them once detected.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-5630585881148053320?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/5630585881148053320/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=5630585881148053320' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5630585881148053320'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5630585881148053320'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/10/bigtable-model-with-cassandra-and-hbase.html' title='BigTable Model with Cassandra and HBase'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_j6mB7TMmJJY/TK1uoEOxDRI/AAAAAAAAAeI/NO1o3ciSc44/s72-c/P3.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-7402865418627103302</id><published>2010-08-29T09:49:00.000-07:00</published><updated>2011-07-13T23:39:51.670-07:00</updated><title type='text'>Designing algorithms for Map Reduce</title><content type='html'>Since the emergence of the Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model.  The result is pretty encouraging and I've found Map/Reduce is applicable in a wide spectrum of application scenarios.&lt;br /&gt;&lt;br /&gt;So I wanted to write down my findings, but then found the scope is too broad and I haven't spent enough time exploring different problem domains.  Finally, I realized that there is no way for me to completely cover what Map/Reduce can do in all areas, so I am just dumping out what I know at this moment over the long weekend when I have an extra day.&lt;br /&gt;&lt;br /&gt;Notice that Map/Reduce is good for "data parallelism", which is different from "task parallelism". &lt;a href="http://horicky.blogspot.com/2008/11/design-for-parallelism.html"&gt;Here is a description&lt;/a&gt; of their difference and a general parallel processing design methodology.&lt;br /&gt;&lt;br /&gt;I'll cover the abstract Map/Reduce processing model below.  
For a detailed description of the &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;implementation of Hadoop framework&lt;/a&gt;, please refer to &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;my earlier blog here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Abstract Processing Model&lt;/span&gt;&lt;br /&gt;There is no formal definition of the Map/Reduce model.  Based on the Hadoop implementation, we can think of it as a "distributed merge-sort engine".  The general processing flow is as follows.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Input data is "split" across multiple mapper processes which execute in parallel&lt;/li&gt;&lt;li&gt;The result of each mapper is partitioned by key and locally sorted&lt;/li&gt;&lt;li&gt;Mapper results with the same key will land on the same reducer and be consolidated there&lt;/li&gt;&lt;li&gt;A merge sort happens at the reducer, so all keys arriving at the same reducer are sorted&lt;/li&gt;&lt;/ul&gt;&lt;a href="http://4.bp.blogspot.com/_j6mB7TMmJJY/TNBQJUikfGI/AAAAAAAAAgo/2VZCtGMBBgE/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 235px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/TNBQJUikfGI/AAAAAAAAAgo/2VZCtGMBBgE/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5535012063101090914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Within the processing flow, user-defined functions can be plugged into the framework.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;map(key1, value1)  -&amp;gt;  emit(key2, value2)&lt;/li&gt;&lt;li&gt;reduce(key2, value2_list)  -&amp;gt;  emit(key2, aggregated_value2)&lt;/li&gt;&lt;li&gt;combine(key2, value2_list)  -&amp;gt;  emit(key2, combined_value2)&lt;/li&gt;&lt;li&gt;partition(key2)  return reducerNo&lt;/li&gt;&lt;/ul&gt;Designing an algorithm for map/reduce is about how to morph your problem into a distributed sorting problem and fit 
your algorithm into the user-defined functions above.&lt;br /&gt;&lt;br /&gt;To analyze the complexity of the algorithm, we need to understand the processing cost, especially the cost of network communication in such a highly distributed system.&lt;br /&gt;&lt;br /&gt;Let's first consider the communication between the input data split and the Mapper.  To minimize this overhead, we need to run the mapper logic at the data split (without moving the data).  How well we do this depends on how the input data is stored and whether we can run the mapper code there.  For HDFS and Cassandra, we can run the mapper at the storage node, and the scheduling algorithm of the JobTracker will assign the mapper to the data split that it is collocated with, hence significantly reducing the data movement.  Other data stores such as Amazon S3 don't allow execution of mapper logic at the storage node and therefore incur more data traffic.&lt;br /&gt;&lt;br /&gt;The communication between Mapper and Reducer cannot be collocated because it depends on the emitted key.  The only mechanism available is the combine() function, which can perform a local consolidation and hence reduce the data sent to the reducer.&lt;br /&gt;&lt;br /&gt;Finally, the communication between the reducer and the output data store depends on the store's implementation.  For HDFS, the data is triply replicated and hence the cost of writing can be high.  Cassandra (a NOSQL data store) allows configurable latency with various degrees of data consistency trade-off.  
Fortunately, in most cases the volume of result data after Map/Reduce processing is not high.&lt;br /&gt;&lt;br /&gt;Now, let's see how to fit various kinds of algorithms into the Map/Reduce model ...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Map-Only&lt;/span&gt;&lt;br /&gt;"Embarrassingly parallel" problems are those where the same processing is applied to each data element independently; in other words, there is no need to consolidate or aggregate individual results.&lt;br /&gt;&lt;br /&gt;These kinds of problems can be expressed as a Map-only job (by setting the number of reducers to zero).  In this case, the Mapper's emitted result will go directly to the output format.&lt;br /&gt;&lt;br /&gt;Some examples of map-only jobs are ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Distributed grep&lt;/li&gt;&lt;li&gt;Document format conversion&lt;/li&gt;&lt;li&gt;ETL&lt;/li&gt;&lt;li&gt;Input data sampling&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Sorting&lt;/span&gt;&lt;br /&gt;As we described above, Hadoop is fundamentally a distributed sorting engine, so using it for sorting is a natural fit.&lt;br /&gt;&lt;br /&gt;For example, we can use an identity function for both map() and reduce(); the output is then equivalent to sorting the input data.  Notice that we are using a single reducer here, so the merge is still sequential although the sorting is done at the mappers in parallel.&lt;br /&gt;&lt;br /&gt;We can perform the merge in parallel by using multiple reducers.  In this case, the output of each reducer is sorted, and we may need to do a final merge on all the reducers' output.  Another way is to use a customized partition() function such that the keys are partitioned by range.  
In this case, each reducer sorts a particular range and the final result is just the concatenation of each reducer's sorted result.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;partition(key) {&lt;br /&gt;  # Width of the key range handled by each reducer&lt;br /&gt;  range = (KEY_MAX - KEY_MIN) / NUM_OF_REDUCERS&lt;br /&gt;  reducer_no = floor((key - KEY_MIN) / range)&lt;br /&gt;  # Keep key == KEY_MAX inside the last reducer&lt;br /&gt;  return min(reducer_no, NUM_OF_REDUCERS - 1)&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Inverted Indexes&lt;/span&gt;&lt;br /&gt;The map reduce model originated at Google, which has a lot of scenarios for building large-scale inverted indexes.  Building an inverted index is about parsing different documents to build a word -&amp;gt; document index for keyword search.&lt;br /&gt;&lt;br /&gt;In fact, the inverted index is pretty general and can be applied in many scenarios.  To build an inverted index, we can feed the mapper each document (or lines within a document).  The Mapper will parse the words in the document to emit [word, doc] pairs along with other metadata such as where in the document the word occurs ... etc.  The reducer can simply be an identity function that just dumps out the list, or it can perform some statistical aggregation per word.&lt;br /&gt;&lt;br /&gt;In a more general form of inverted index, there is a "container" and "element" concept.  
The Map and Reduce functions are organized in the following pattern.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;map(key, container) {&lt;br /&gt;  for each element in container {&lt;br /&gt;      element_meta =&lt;br /&gt;           extract_metadata(element, container)&lt;br /&gt;      emit(element, [container_id, element_meta])&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;reduce(element, container_ids) {&lt;br /&gt;  element_stat =&lt;br /&gt;       compute_stat(container_ids)&lt;br /&gt;  emit(element, [element_stat, container_ids])&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;In a text index, we are not just counting the raw frequency of the terms but also adjusting the weighting based on the frequency distribution, so common words carry less significance when they appear in a document.  The final value after normalization is called &lt;a href="http://horicky.blogspot.com/2009/01/solving-tf-idf-using-map-reduce.html"&gt;TF-IDF (term frequency times inverse document frequency) and can be computed using Map Reduce as well&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Simple Statistics Computation&lt;br /&gt;&lt;/span&gt;Computing max, min and count is very straightforward since these operations are commutative and associative.  Each mapper performs the local computation and sends the result to a single reducer to do the final computation.&lt;br /&gt;&lt;br /&gt;A combine function is typically used to reduce the network traffic.  Notice that the input to the combine function must look the same as the input to the reducer function and the output of the combine function must look the same as the output of the map function.  
There is also no guarantee that the combiner function will be invoked at all.&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;class Mapper {&lt;br /&gt;  buffer&lt;br /&gt;&lt;br /&gt;  map(key, number) {&lt;br /&gt;      buffer.append(number)&lt;br /&gt;      if (buffer.is_full) {&lt;br /&gt;          flush()&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  # Called once at the end of the input split&lt;br /&gt;  close() {&lt;br /&gt;      flush()&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  # Emit the local max, then empty the buffer&lt;br /&gt;  flush() {&lt;br /&gt;      if (!buffer.is_empty) {&lt;br /&gt;          emit(1, compute_max(buffer))&lt;br /&gt;          buffer.clear()&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class Reducer {&lt;br /&gt;  reduce(key, list_of_local_max) {&lt;br /&gt;      # Start below any possible value so negative numbers work too&lt;br /&gt;      global_max = -INFINITY&lt;br /&gt;      for local_max in list_of_local_max {&lt;br /&gt;          if local_max &amp;gt; global_max {&lt;br /&gt;              global_max = local_max&lt;br /&gt;          }&lt;br /&gt;      }&lt;br /&gt;      emit(1, global_max)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class Combiner {&lt;br /&gt;  combine(key, list_of_local_max) {&lt;br /&gt;      local_max = maximum(list_of_local_max)&lt;br /&gt;      emit(1, local_max)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;Computing avg is done in a similar way, except that instead of computing the local avg, we compute the local sum and local count.  The reducer divides the final sum by the final count to come up with the final avg.&lt;br /&gt;&lt;br /&gt;Computing a histogram is pretty common in statistics and can give a quick idea of the data distribution.  A typical approach is to divide the range of numbers into different intervals.  
The mapper emits a count for the interval that each number falls into, and the reducer sums up the counts of each interval.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;class Mapper {&lt;br /&gt;  interval_start = [0, 20, 40, 60, 80]&lt;br /&gt;&lt;br /&gt;  map(key, number) {&lt;br /&gt;      # Scan from the last interval down to find the one containing number&lt;br /&gt;      i = NO_OF_INTERVALS - 1&lt;br /&gt;      while (i &amp;gt;= 0) {&lt;br /&gt;          if (number &amp;gt;= interval_start[i]) {&lt;br /&gt;              emit(i, 1)&lt;br /&gt;              break&lt;br /&gt;          }&lt;br /&gt;          i = i - 1&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class Reducer {&lt;br /&gt;  reduce(interval, counts) {&lt;br /&gt;      total_counts = 0&lt;br /&gt;      for each count in counts {&lt;br /&gt;          total_counts += count&lt;br /&gt;      }&lt;br /&gt;      emit(interval, total_counts)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class Combiner {&lt;br /&gt;  combine(interval, counts) {&lt;br /&gt;      # Sum rather than count the entries, since the combiner may run more than once&lt;br /&gt;      partial_count = 0&lt;br /&gt;      for each count in counts {&lt;br /&gt;          partial_count += count&lt;br /&gt;      }&lt;br /&gt;      emit(interval, partial_count)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;Notice that a non-uniform distribution of values across intervals may cause an unbalanced workload among reducers and hence undermine the degree of parallelism.  We'll address this in the later part of this post.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;In-Mapper Combine&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.morganclaypool.com/doi/abs/10.2200/S00274ED1V01Y201006HLT007?journalCode=hlt"&gt;Jimmy Lin, in his excellent book&lt;/a&gt;, talks about a technique called "in-mapper combine", which regains control at the application level over when the combine takes place.  
The general idea is to maintain a HashMap to buffer the intermediate results and to have separate logic that determines when to actually emit the data from the buffer.  The general code structure is as follows ...&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;class Mapper {&lt;br /&gt;  buffer&lt;br /&gt;&lt;br /&gt;  init() {&lt;br /&gt;      buffer = HashMap.new&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  map(key, data) {&lt;br /&gt;      elements = process(data)&lt;br /&gt;      for each element {&lt;br /&gt;          ....&lt;br /&gt;          check_and_put(buffer, k2, v2)&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  check_and_put(buffer, k2, v2) {&lt;br /&gt;      # Fold v2 into the value already buffered for k2;&lt;br /&gt;      # combine() stands for the application-specific merge&lt;br /&gt;      if buffer.contains(k2) {&lt;br /&gt;          buffer[k2] = combine(buffer[k2], v2)&lt;br /&gt;      } else {&lt;br /&gt;          buffer[k2] = v2&lt;br /&gt;      }&lt;br /&gt;      if buffer.full {&lt;br /&gt;          flush(buffer)&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  # Emit everything buffered so far, then empty the buffer&lt;br /&gt;  flush(buffer) {&lt;br /&gt;      for each k2 in buffer.keys {&lt;br /&gt;          emit(k2, buffer[k2])&lt;br /&gt;      }&lt;br /&gt;      buffer.clear()&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  close() {&lt;br /&gt;      flush(buffer)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;SQL Model&lt;/span&gt;&lt;br /&gt;The SQL model can be used to extract data from the data source.  
It contains a number of primitives.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Projection / Filter &lt;/span&gt;&lt;br /&gt;This logic is typically implemented in the Mapper&lt;br /&gt;&lt;ul&gt;&lt;li&gt;result = SELECT c1, c2, c3, c4 FROM source WHERE conditions&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Aggregation / Group by&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt; / Having&lt;/span&gt;&lt;br /&gt;This logic is typically implemented in the Reducer&lt;br /&gt;&lt;ul&gt;&lt;li&gt;SELECT sum(c3) as s1, avg(c4) as s2 ... FROM result GROUP BY c1, c2 HAVING conditions&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The above example can be realized by the following map/reduce job&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;class Mapper {&lt;br /&gt;  map(k, rec) {&lt;br /&gt;      select_fields =&lt;br /&gt;          [rec.c1, rec.c2, rec.c3, rec.c4]&lt;br /&gt;      group_fields =&lt;br /&gt;          [rec.c1, rec.c2]&lt;br /&gt;      if (filter_condition == true) {&lt;br /&gt;          emit(group_fields, select_fields)&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;class Reducer {&lt;br /&gt;  reduce(group_fields, list_of_rec) {&lt;br /&gt;      s1 = 0&lt;br /&gt;      s2 = 0&lt;br /&gt;      for each rec in list_of_rec {&lt;br /&gt;          s1 += rec.c3&lt;br /&gt;          s2 += rec.c4&lt;br /&gt;      }&lt;br /&gt;      # avg(c4) = sum of c4 / number of records in the group&lt;br /&gt;      s2 = s2 / list_of_rec.size&lt;br /&gt;      if (having_condition == true) {&lt;br /&gt;          emit(group_fields, [s1, s2])&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Data Joins&lt;/span&gt;&lt;br 
/&gt;Joining two data sets is a very common operation in the relational data model and is very mature in RDBMS implementations.  The common join mechanisms in a centralized DB architecture are as follows&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Nested loop join&lt;/span&gt; -- This is the most basic and naive mechanism and is organized as two loops.  The outer loop reads from data set1; the inner loop scans through the whole of data set2 and compares each record with the record just read from data set1.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Indexed join&lt;/span&gt; -- An index (e.g. B-Tree index) is built for one of the data sets (say data set2, which is the smaller one).  The join scans through data set1 and looks up the index to find the matching records of data set2.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Merge join&lt;/span&gt; -- Pre-sort both data sets so they are arranged physically in increasing order.  The join is realized by just merging the two data sets.  a) Locate the first record in both data set1 and set2, which holds their corresponding minimum key  b) In the one with the smaller minimum key (say data set1), keep scanning until finding the next key which is bigger than the minimum key of the other data set (i.e. data set2), and call this the next minimum key of data set1.  c) Switch positions and repeat the whole thing until one of the data sets is exhausted.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Hash / Partition join&lt;/span&gt; -- Partition data set1 and data set2 into smaller pieces and apply another join algorithm to the smaller pieces.  
A linear scan with a hash() function is typically performed to partition the data sets such that data in set1 and data in set2 with the same key will land on the same partition.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Semi join&lt;/span&gt; -- This is mainly used to join two sets of data that are stored at different locations, where the goal is to reduce the amount of data transfer such that only the full records that appear in the final join result are sent across.  a) The machine holding data set2 sends its key set to the machine holding data set1.  b) The machine holding data set1 does a join and sends back the records in data set1 that match one of the sent-over keys.  c) The machine holding data set2 does a final join against the data sent back.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The map reduce environment has corresponding joins.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;General reducer-side join&lt;/span&gt;&lt;br /&gt;This is the most basic one: records from data set1 and set2 with the same key will land on the same reducer, which will then do a cartesian product.  
The downside of this model is that the reducer needs to have enough memory to hold all records of each key.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;map(k1, rec) {&lt;br /&gt;  # rec.type tags the source data set ('A' or 'B')&lt;br /&gt;  emit(rec.key, rec)&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;reduce(k2, list_of_rec) {&lt;br /&gt;  list_of_typeA = []&lt;br /&gt;  list_of_typeB = []&lt;br /&gt;  for each rec in list_of_rec {&lt;br /&gt;      if (rec.type == 'A') {&lt;br /&gt;          list_of_typeA.append(rec)&lt;br /&gt;      } else {&lt;br /&gt;          list_of_typeB.append(rec)&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  # Compute the cartesian product&lt;br /&gt;  for recA in list_of_typeA {&lt;br /&gt;      for recB in list_of_typeB {&lt;br /&gt;          emit(k2, [recA, recB])&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Optimized reducer-side join&lt;/span&gt;&lt;br /&gt;You can "secondary sort" on the data type within each key by defining a customized partition function.  
In this model, you arrange for the data type that has fewer records per key to arrive first, and you only need to hold the records of that type in memory.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;map(k1, rec) {&lt;br /&gt;  emit([rec.key, rec.type], rec)&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;partition(key_pair) {&lt;br /&gt;  # Partition on rec.key only, so both types land on the same reducer&lt;br /&gt;  return super.partition(key_pair[0])&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Assumes records are grouped by rec.key and sorted by type, 'A' first&lt;br /&gt;reduce(k2, list_of_rec) {&lt;br /&gt;  list_of_typeA = []&lt;br /&gt;  for each rec in list_of_rec {&lt;br /&gt;      if (rec.type == 'A') {&lt;br /&gt;          list_of_typeA.append(rec)&lt;br /&gt;      } else { # receive records of typeB&lt;br /&gt;          for recA in list_of_typeA {&lt;br /&gt;              emit(k2, [recA, rec])&lt;br /&gt;          }&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;While being very flexible, the downside of the reducer-side join is that all data needs to be transferred from the mapper to the reducer and the result then written to HDFS.  A map-side join exploits some special arrangement of the input files such that the join is performed at the mapper.  The advantage of doing it in the mapper is that we can exploit the collocation of the Map/Reduce framework such that the mapper is allocated an input split on its local machine, hence reducing the data transfer from the disk to the mapper.  
After the map-side join, the result is written directly to the output HDFS files, which eliminates the data transfer between the mapper and the reducer.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Map-side partition join&lt;/span&gt;&lt;br /&gt;This model requires the 2 data sets to be partitioned into 2 sets of partition files (same number of partitions for each set).  The size of a partition is such that it can fit into the memory of the Mapper machine.  We also need to configure the Map/Reduce job such that the partition file is not split; in other words, the whole partition is assigned to one mapper task.&lt;br /&gt;&lt;br /&gt;The mapper will detect the partition of the input file and then read the corresponding partition file of the other data set into an in-memory hashtable.  After that, the mapper will look up the hashtable to do the join.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;class Mapper {&lt;br /&gt;  table = Hashtable.new&lt;br /&gt;&lt;br /&gt;  init() {&lt;br /&gt;      partition = detect_input_filename()&lt;br /&gt;      # Load the matching partition of data set2, keyed by join key&lt;br /&gt;      table = load("hdfs://dataset2/" + partition)&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  map(k1, rec1) {&lt;br /&gt;      rec2 = table[rec1.key]&lt;br /&gt;      if (rec2 != nil) {&lt;br /&gt;          emit(rec1.key, [rec1, rec2])&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;Map-side partition merge join&lt;/span&gt;&lt;br /&gt;In addition, if the partition files are also sorted, then the mapper can use a merge join, which has an even smaller memory footprint.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida 
Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;class Mapper {&lt;br /&gt;  rec2_key = nil&lt;br /&gt;  next_rec2 = nil&lt;br /&gt;  list_of_rec2 = []&lt;br /&gt;  file = nil&lt;br /&gt;&lt;br /&gt;  init() {&lt;br /&gt;      partition = detect_input_filename()&lt;br /&gt;      file = open("hdfs://dataset2/" + partition, "r")&lt;br /&gt;      next_rec2 = file.read()&lt;br /&gt;      fill_rec2_list()&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  # Fill list_of_rec2 with the next run of rec2 records sharing one key&lt;br /&gt;  fill_rec2_list() {&lt;br /&gt;      list_of_rec2 = []&lt;br /&gt;      rec2_key = next_rec2.key&lt;br /&gt;      while (next_rec2 != nil and next_rec2.key == rec2_key) {&lt;br /&gt;          list_of_rec2.append(next_rec2)&lt;br /&gt;          next_rec2 = file.read()&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  map(k1, rec1) {&lt;br /&gt;      # Advance dataset2 until its key catches up, stopping at end of file&lt;br /&gt;      while (rec1.key &amp;gt; rec2_key and next_rec2 != nil) {&lt;br /&gt;          fill_rec2_list()&lt;br /&gt;      }&lt;br /&gt;&lt;br /&gt;      if (rec1.key == rec2_key) {&lt;br /&gt;          for rec2 in list_of_rec2 {&lt;br /&gt;              emit(rec1.key, [rec1, rec2])&lt;br /&gt;          }&lt;br /&gt;      }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Memcache join&lt;/span&gt;&lt;br /&gt;This model is very straightforward: the second data set is loaded into a distributed hash table (like memcache), which has effectively unlimited size.  
The mapper will receive an input split from the first data set and then look up memcache for the corresponding record of the other data set.&lt;br /&gt;&lt;br /&gt;There are also some other more sophisticated join mechanisms, such as the semi-join described in this &lt;a href="http://delivery.acm.org/10.1145/1810000/1807273/p975-blanas.pdf"&gt;paper&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Graph Algorithms&lt;/span&gt;&lt;br /&gt;Many problems can be modeled as a graph of Nodes and Edges.  In the &lt;a href="http://horicky.blogspot.com/2010/03/search-engine-basics.html"&gt;Search engine environment&lt;/a&gt;, computing the rank of a document using &lt;a href="http://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/HITS_algorithm"&gt;HITS &lt;/a&gt;can be modeled &lt;a href="http://www.umiacs.umd.edu/%7Ejimmylin/Cloud9/docs/content/pagerank.html"&gt;as a sequence of iterations of Map/Reduce jobs&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In the past, I have blogged about a number of very basic graph algorithms in map reduce, including &lt;a href="http://horicky.blogspot.com/2010/02/nosql-graphdb.html"&gt;topological sort, finding shortest path, minimum spanning tree&lt;/a&gt; etc. and also how to &lt;a href="http://horicky.blogspot.com/2010/08/mapreduce-to-recommend-people.html"&gt;recommend people connection&lt;/a&gt; using Map/Reduce.&lt;br /&gt;&lt;br /&gt;Because graph traversal is inherently sequential, I am not sure Map/Reduce is the best parallel processing model for graph processing.  Another problem is that, due to the "stateless nature" of the map() and reduce() functions, the whole graph needs to be transferred between mapper and reducer, which incurs significant communication costs.  Jimmy Lin has described a clever technique called Schimmy, which uses a special partitioning function to let the reducer retain ownership of nodes across map/reduce jobs.  
I have described this technique as well as a general model of &lt;a href="http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html"&gt;Map/Reduce graph processing&lt;/a&gt; in a previous blog post.&lt;br /&gt;&lt;br /&gt;I think a parallel programming model specific to Graph processing will perform much better.  &lt;a href="http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html"&gt;Google's Pregel model&lt;/a&gt; is a good example of that.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Machine Learning&lt;/span&gt;&lt;br /&gt;Many machine learning algorithms involve multiple iterations of parallel processing that fit very well into the Map/Reduce model.&lt;br /&gt;&lt;br /&gt;For example, we can &lt;a href="http://horicky.blogspot.com/2009/05/machine-learning-probabilistic-model.html"&gt;use map reduce to calculate the statistics for probabilistic methods&lt;/a&gt; such as naive Bayes.&lt;br /&gt;&lt;br /&gt;As a simple example, K-Means clustering can also be computed in the following way.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Input: A set of points, with K initial centroids&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Output: K final centroids&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Iterate until there is no more change of membership&lt;br /&gt;&lt;ol&gt;&lt;li&gt;For each point, assign it to be a member of the closest centroid&lt;/li&gt;&lt;li&gt;Re-compute each centroid from its assigned point members&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TIKbty6qceI/AAAAAAAAAdg/HsVnG1IlwXI/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 194px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TIKbty6qceI/AAAAAAAAAdg/HsVnG1IlwXI/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5513140104919151074" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;For a complete list 
of Machine learning algorithms and how they can be implemented using the Map/Reduce model, &lt;a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf"&gt;here is a very good paper&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Matrix arithmetic&lt;/span&gt;&lt;br /&gt;A lot of real-life relationships can be represented as a Matrix.  One example is the vector space model of Information Retrieval, where the columns represent docs and the rows represent terms.  Another example is the social network graph, where both the columns and the rows represent people and the binary value of each cell represents a "friend" relationship.  In this case, the nonzero cells of M + M.M represent all the people that I can reach within 2 degrees.&lt;br /&gt;&lt;br /&gt;Processing of dense matrices is very easy to parallelize.  But since the sequential version of an operation such as multiplication is O(N^3), it is not that interesting for matrices of large size (rows and columns in the millions).&lt;br /&gt;&lt;br /&gt;A lot of real-world graph problems can be represented as sparse matrices.  So my interest is to focus more on the processing of sparse matrices.  
I don't have much to share at this moment but I hope this is something I will blog about in future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-7402865418627103302?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/7402865418627103302/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=7402865418627103302' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7402865418627103302'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7402865418627103302'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html' title='Designing algorithms for Map Reduce'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_j6mB7TMmJJY/TNBQJUikfGI/AAAAAAAAAgo/2VZCtGMBBgE/s72-c/p1.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-2597567348364138197</id><published>2010-08-28T17:58:00.000-07:00</published><updated>2010-08-28T21:04:45.725-07:00</updated><title type='text'>The Limitations of SPARQL</title><content type='html'>&lt;span style="font-size:100%;"&gt;&lt;span style="font-family: georgia;"&gt;Recently, I have been looking at RDF model and try to compare that with the property graph 
model that &lt;a href="http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html"&gt;I mentioned in a previous post&lt;/a&gt;.  I also looked at the SPARQL query model.  While I think it is a very powerful query language based on variable bindings, I also observed several cases that it doesn't handle well.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: georgia;"&gt;Note that I have only used SPARQL in very simple examples and don't claim to be an expert in this area.  I am hoping my post here can invite other SPARQL experts to share their experience.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: georgia;"&gt;Here are the limitations that I have seen.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-family: georgia;font-size:130%;" &gt;Support of Negation&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: georgia;font-size:100%;" &gt;&lt;br /&gt;Because of the “Open World” assumption, SPARQL doesn’t support “negation” well, which means expressing "negation" in SPARQL is not easy.&lt;br /&gt;&lt;/span&gt;&lt;ul style="font-family: georgia;"&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;Find all persons who are Bob’s friends but don’t know Java&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;Find all persons who know Bob but don't know Alice&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold; font-family: georgia;font-size:130%;" &gt;Support of Path Expression&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: georgia;font-size:100%;" &gt;&lt;br /&gt;In SPARQL, expressing a variable length path is not easy.&lt;br /&gt;&lt;/span&gt;&lt;ul style="font-family: georgia;"&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;Find all posts written by Bob’s direct and indirect friends (everyone reachable from 
Bob)&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: georgia;font-size:100%;" &gt; &lt;span style="font-weight: bold;font-size:130%;" &gt;Predicates cannot have Properties&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This may be an RDF limitation that SPARQL inherits.  Since RDF represents everything in Triples, it is easy to implement properties of a Node using extra Triples, but it is very difficult to implement properties on Edges.&lt;br /&gt;&lt;br /&gt;In SPARQL, there is no direct way to attach a property to a “predicate”.&lt;br /&gt;&lt;/span&gt;&lt;ul style="font-family: georgia;"&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;Bob has known Peter for 5 years&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: georgia;font-size:100%;" &gt; &lt;span style="font-weight: bold;font-size:130%;" &gt;RDF inference Rule&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Inference rules are built around RDFS and OWL, which focus mainly on type and set relationships, and are implemented using a Rule: (conditions =&gt; derived triple) expression.  But it is not easy to express a derived triple whose object’s value is an expression over existing triples.&lt;br /&gt;&lt;/span&gt;&lt;ul style="font-family: georgia;"&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;Family income is the sum of all individual members’ incomes&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: georgia;font-size:100%;" &gt; &lt;span style="font-weight: bold;font-size:130%;" &gt;Support of Fuzzy Matches with Ranked results&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;SPARQL is based on a boolean query model, which is designed for exact matches.  
Expressing a fuzzy match with ranked results is very difficult.&lt;br /&gt;&lt;/span&gt;&lt;ul style="font-family: georgia;"&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;Find the top 20 posts that are “similar” to this post, ranked by degree of similarity (let's say similarity is measured by the number of common tags that the 2 posts share)&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;span style="font-family: georgia;"&gt;I am also very interested to see if there is any large scale deployment of RDF graphs in real-life scenarios.  I am not aware of any popular social network site using RDF to store its social graph or social activities.  I guess this may be due to the scalability of today's RDF implementations.  I may be wrong though.&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-2597567348364138197?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/2597567348364138197/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=2597567348364138197' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2597567348364138197'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/2597567348364138197'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/08/limitations-of-sparql.html' title='The Limitations of SPARQL'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-5066917425646523579</id><published>2010-08-04T21:02:00.001-07:00</published><updated>2010-08-04T22:22:06.559-07:00</updated><title type='text'>Map/Reduce to recommend people connection</title><content type='html'>One common feature of Social Network sites is to recommend people connections, e.g. "People you may know" from Linkedin.  The basic idea is very simple: if person A and person B don't know each other but they have a lot of common friends, then the system should recommend person B to person A and vice versa.&lt;br /&gt;&lt;br /&gt;From a graph theory perspective, for each person who is 2-degree reachable from person A, we count how many distinct paths (with 2 connecting edges) exist between this person and person A.  We rank this list by the number of paths and show the top 10 persons that person A should connect with.&lt;br /&gt;&lt;br /&gt;We show how we can use Map/Reduce to compute this top-10 connection list for every person.  The problem can be stated as:  For every person X, we determine a list of persons X1, X2 ... X10, which are the top 10 persons that person X has common friends with.&lt;br /&gt;&lt;br /&gt;The social network graph is generally very sparse.  
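&lt;br /&gt;&lt;br /&gt;As a point of reference before distributing the work, the path counting described above can be sketched on a single machine.  This is a minimal Python sketch, not the Map/Reduce job itself; the function name recommend, the top_n parameter and the sample graph are assumptions made up for illustration, and it assumes every connection is listed in both directions.&lt;br /&gt;&lt;br /&gt;

```python
from collections import Counter


def recommend(adjacency, person, top_n=10):
    """Rank people two hops away from `person` by number of common friends."""
    friends = set(adjacency[person])
    counts = Counter()
    for friend in friends:
        for candidate in adjacency.get(friend, []):
            # Skip the person themselves and anyone already connected.
            if candidate != person and candidate not in friends:
                # Each (person, friend, candidate) triple is one 2-edge path.
                counts[candidate] += 1
    return counts.most_common(top_n)


# Hypothetical sample graph; every edge is listed in both directions.
graph = {
    "ricky": ["jay", "peter"],
    "jay": ["ricky", "susan"],
    "peter": ["dave", "ricky", "susan"],
    "susan": ["jay", "peter"],
    "dave": ["peter"],
}

print(recommend(graph, "ricky"))
```

&lt;br /&gt;&lt;br /&gt;On this sample graph, susan (two common friends with ricky) is ranked ahead of dave (one common friend).  The Map/Reduce jobs aim to produce the same kind of ranking, but for every person at once.&lt;br /&gt;&lt;br /&gt;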
Here we assume the input records form an adjacency list sorted by name.&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;"ricky" =&amp;gt; ["jay", "peter", "phyllis"]&lt;br /&gt;"peter" =&amp;gt; ["dave", "jack", "ricky", "susan"]&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;We use two rounds of Map/Reduce jobs to compute the top-10 list.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;First Round MR Job&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The purpose of this MR job is to compute the number of distinct paths between all pairs of people who are 2 degrees separated from each other.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;In &lt;span style="font-weight: bold;"&gt;Map()&lt;/span&gt;, we do a cartesian product for all pairs of friends (since these friends may be connected in 2 degrees).  We also need to eliminate the pairs if they already have a direct connection.  Therefore, the Map() function should also emit pairs of directly connected persons.  We need to order the key space such that all keys with the same pair of people will go to the same reducer.  In addition, we need the pair with a direct connection to come before the pairs with 2 degrees of separation.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In &lt;span style="font-weight: bold;"&gt;Reduce()&lt;/span&gt;, we know all the key pairs reaching the same reducer will be sorted.  So the directly connected pair will come before the 2-degree pairs.  
So the reducer just needs to check if the first pair is a directly connected one and, if so, skip the rest.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;Input record ...  person -&amp;gt; connection_list&lt;br /&gt;e.g. "ricky" =&amp;gt; ["jay", "john", "mitch", "peter"]&lt;br /&gt;also the connection list is sorted by alphabetical order&lt;br /&gt;&lt;br /&gt;def map(person, connection_list)&lt;br /&gt;  # Compute a cartesian product using nested loops&lt;br /&gt;  for each friend1 in connection_list&lt;br /&gt;     # Emit the direct (one-degree) pair with marker 0 so the reducer&lt;br /&gt;     # can eliminate 2-degree pairs that are already connected&lt;br /&gt;     emit([person, friend1, 0], 1)&lt;br /&gt;     for each friend2 &amp;gt; friend1 in connection_list&lt;br /&gt;         emit([friend1, friend2, 1],  1)&lt;br /&gt;&lt;br /&gt;def partition(key)&lt;br /&gt;  #use the first two elements of the key to choose a reducer&lt;br /&gt;  return super.partition([key[0], key[1]])&lt;br /&gt;&lt;br /&gt;def reduce(person_pair, frequency_list)&lt;br /&gt;  # Check if this is a new pair&lt;br /&gt;  if @current_pair != [person_pair[0], person_pair[1]]&lt;br /&gt;      @current_pair = [person_pair[0], person_pair[1]]&lt;br /&gt;      # Skip all subsequent keys of this pair if the two persons&lt;br /&gt;      # already know each other; reset the flag otherwise&lt;br /&gt;      @skip = (person_pair[2] == 0)&lt;br /&gt;&lt;br /&gt;  if !@skip&lt;br /&gt;      path_count = 0&lt;br /&gt;      for each count in frequency_list&lt;br /&gt;          path_count += count&lt;br /&gt;      emit(@current_pair, path_count)&lt;br /&gt;&lt;br /&gt;Output record ... person_pair =&amp;gt; path_count&lt;br /&gt;e.g. 
["jay", "john"] =&amp;gt; 5&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Second Round MR Job&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The purpose of this MR job is to rank the connections for every person by the number of distinct paths between them.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;In &lt;span style="font-weight: bold;"&gt;Map()&lt;/span&gt;,  we rearrange the input records so they will be sorted by path count before reaching the reducer&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In &lt;span style="font-weight: bold;"&gt;Reduce()&lt;/span&gt;, all the connections of a person arrive sorted, so we just need to aggregate the top 10 into a list and then write the list out.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;Input record = Output record of round 1&lt;br /&gt;&lt;br /&gt;def map(person_pair, path_count)&lt;br /&gt;  # Emit both directions so each person sees the other as a candidate&lt;br /&gt;  emit([person_pair[0], path_count], person_pair[1])&lt;br /&gt;  emit([person_pair[1], path_count], person_pair[0])&lt;br /&gt;&lt;br /&gt;def partition(key)&lt;br /&gt;  #use the first element of the key to choose a reducer&lt;br /&gt;  return super.partition(key[0])&lt;br /&gt;&lt;br /&gt;def reduce(connection_count_pair, candidate_list)&lt;br /&gt;  # Check if this is a new person&lt;br /&gt;  if @current_person != connection_count_pair[0]&lt;br /&gt;      # Flush the finished list of the previous person, if any&lt;br /&gt;      emit(@current_person, @top_ten) if @current_person != nil&lt;br /&gt;      @top_ten = []&lt;br /&gt;      @current_person = connection_count_pair[0]&lt;br /&gt;&lt;br /&gt;  #Pick the top ten candidates to connect with&lt;br /&gt;  #(keys are assumed to arrive in descending order of path_count)&lt;br /&gt;  for each candidate in candidate_list&lt;br /&gt;      break if @top_ten.size &amp;gt;= 10&lt;br /&gt;      @top_ten.append([candidate, connection_count_pair[1]])&lt;br /&gt;&lt;br /&gt;# The list of the last person must be emitted when the reducer closes&lt;br /&gt;&lt;br /&gt;Output record ... 
person -&amp;gt; candidate_count_list&lt;br /&gt;e.g.  "ricky" =&amp;gt; [["jay", 5],  ["peter", 3] ...]&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-5066917425646523579?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/5066917425646523579/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=5066917425646523579' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5066917425646523579'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5066917425646523579'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/08/mapreduce-to-recommend-people.html' title='Map/Reduce to recommend people connection'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1344573185220550859</id><published>2010-07-20T05:34:00.000-07:00</published><updated>2010-07-20T08:13:20.147-07:00</updated><title type='text'>Graph Processing in Map Reduce</title><content type='html'>In &lt;a href="http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html"&gt;my previous post about Google's Pregel model&lt;/a&gt;, a general pattern of parallel graph processing can be expressed as multiple iterations of processing until a termination condition is reached.  
Within each iteration, the same processing happens at a set of nodes (ie: context nodes).&lt;br /&gt;&lt;br /&gt;Each context node performs a sequence of steps independently (hence achieving parallelism)&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Aggregate all incoming messages received from its direct inward arcs during the last iteration&lt;br /&gt;&lt;/li&gt;&lt;li&gt;With this aggregated message, perform some local computation (ie: update the local state of the node and its direct outward arcs)&lt;/li&gt;&lt;li&gt;Pass the result of local computation along all outward arcs to its direct neighbors&lt;/li&gt;&lt;/ol&gt;This processing pattern can be implemented using the Map/Reduce model, using one MR job for each iteration.  The sequence is a little different from above.  Typically a mapper will perform (2) and (3), where it emits the message using its neighbor's node id as key.  The reducer will be responsible for performing (1).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Issue of using Map/Reduce&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;However, due to the functional programming nature of Map() and Reduce(), M/R does not automatically retain "state" between jobs.  To retain the graph across iterations, the mapper needs to explicitly pass along the corresponding portion of the graph to the reducer, in addition to the messages themselves.  
Similarly, the reducer needs to handle the two different types of data passed along.&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;map(id, node) {&lt;br /&gt;  # Pass the graph structure itself along to the reducer&lt;br /&gt;  emit(id, node)&lt;br /&gt;  partial_result = local_compute(node)&lt;br /&gt;  for each neighbor in node.outE.inV {&lt;br /&gt;      emit(neighbor.id, partial_result)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;reduce(id, list_of_msg) {&lt;br /&gt;  node = null&lt;br /&gt;  result = 0&lt;br /&gt;&lt;br /&gt;  for each msg in list_of_msg {&lt;br /&gt;      # Tell the graph structure apart from the messages&lt;br /&gt;      if type_of(msg) == Node&lt;br /&gt;          node = msg&lt;br /&gt;      else&lt;br /&gt;          result = aggregate(result, msg)&lt;br /&gt;      end&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  node.value = result&lt;br /&gt;  emit(id, node)&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;The downside of this approach is that a substantial amount of I/O processing and bandwidth is consumed just passing the graph itself around.&lt;br /&gt;&lt;br /&gt;Google's Pregel model provides an alternative message distribution model so that state can be retained at the processing node across iterations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The Schimmy Trick&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a &lt;a href="http://www.umiacs.umd.edu/%7Ejimmylin/publications/Lin_Schatz_MLG2010.pdf"&gt;recent research paper&lt;/a&gt;, Jimmy Lin and Michael Schatz use a clever partition() algorithm in Map/Reduce which can achieve "stickiness" of graph distribution as well as maintain a sorted order of node ids on disk.&lt;br /&gt;&lt;br /&gt;The whole graph is broken down into multiple files and stored in HDFS.  
Each file contains multiple records and each record describes a node and its corresponding adjacency list.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;id -&gt; [nodeProps, [[arcProps, toNodeId], [arcProps, toNodeId] ...]]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In addition, the records are physically sorted within the file by their node id.&lt;br /&gt;&lt;br /&gt;There will be as many reducers as there are files, so each Reducer task is assigned one of these files.  In turn, the partition() function assigns all nodes within a file to land on its associated reducer.&lt;br /&gt;&lt;br /&gt;The mapper does the same thing as before, except the first line in the method is removed as it no longer needs to emit the graph.&lt;br /&gt;&lt;br /&gt;The reducer will receive all the messages emitted from the mapper, which are sorted by the Map/Reduce framework by key (which happens to be the node id).  At the same time, the reducer can open the corresponding file in HDFS, which also maintains a sorted list of nodes based on their ids.  
The reducer can just read the HDFS file sequentially on each reduce() call and be confident that all preceding nodes in the file have already received their corresponding messages.&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; border: 1px dashed rgb(153, 153, 153); line-height: 14px; padding: 5px; overflow: auto; width: 100%;"&gt;&lt;code&gt;reduce(id, list_of_msg) {&lt;br /&gt;   nodeInFile = readFromFile()&lt;br /&gt;&lt;br /&gt;   # Emit preceding nodes that receive no message&lt;br /&gt;   while(nodeInFile.id &amp;lt; id)&lt;br /&gt;       emit(nodeInFile.id, nodeInFile)&lt;br /&gt;       nodeInFile = readFromFile()&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   result = 0&lt;br /&gt;&lt;br /&gt;   for each msg in list_of_msg {&lt;br /&gt;       result = aggregate(result, msg)&lt;br /&gt;   }&lt;br /&gt;&lt;br /&gt;   nodeInFile.value = result&lt;br /&gt;   emit(id, nodeInFile)&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;Although the Schimmy trick provides an improvement over the classical way of map/reduce, it only eliminates the transfer of the graph between the mapper and the reducer.  At each iteration, the mapper still needs to read the whole graph from HDFS to the mapper node and the reducer still needs to write the whole graph back to HDFS, which maintains a 3-way replication for each file.&lt;br /&gt;&lt;br /&gt;Hadoop provides a co-location mechanism for the mapper and tries to assign to each mapper files that sit on the same machine.  
However, this co-location mechanism is not available for the reducer, so the reducer still needs to write the graph back over the network.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Pregel Advantage&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Since the Pregel model retains worker state (the same worker is responsible for the same set of nodes) across iterations, the graph can be loaded in memory once and reused across iterations.  This will reduce I/O overhead as there is no need to read and write to disk at each iteration.  For fault resilience, there will be a periodic checkpoint where every worker writes its in-memory state to disk.&lt;br /&gt;&lt;br /&gt;Also, Pregel (with its stateful characteristic) only sends locally computed results (but not the graph structure) over the network, which implies minimal bandwidth consumption.&lt;br /&gt;&lt;br /&gt;Of course, Pregel is very new and relatively immature as compared to Map/Reduce.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1344573185220550859?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1344573185220550859/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1344573185220550859' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1344573185220550859'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1344573185220550859'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html' title='Graph Processing in Map Reduce'/><author><name>Ricky 
Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-471839538167791257</id><published>2010-07-12T21:43:00.001-07:00</published><updated>2010-07-13T11:53:48.249-07:00</updated><title type='text'>Google Pregel Graph Processing</title><content type='html'>A lot of real life problems can be expressed in terms of entities related to each other and are best captured using graphical models.  Well defined graph theory can be applied to process the graph and return interesting results.  The general processing patterns can be categorized into the following ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Capture&lt;/span&gt; (e.g. When John is connected to Peter in a social network, a link is created between two Person nodes)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Query&lt;/span&gt; (e.g. Find out all of John's friends of friends whose age is less than 30 and who are married)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Mining&lt;/span&gt; (e.g. Find out the most influential person in Silicon Valley)&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Distributed and Parallel Graph Processing &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Although using a Graph to represent a relationship network is not new, the size of networks has increased so dramatically in the past decade that storing the whole graph in one place is often impractical.  Therefore, the graph needs to be broken down into multiple partitions and stored in different places.  
Traditional graph algorithms that assume the whole graph resides in memory no longer apply.  We need to redesign the algorithms so that they work in a distributed environment.  On the other hand, by breaking the graph into different partitions, we can manipulate the graph in parallel to speed up the processing.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Property Graph Model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The paper “&lt;a href="http://arxiv.org/pdf/1006.2361v1"&gt;Constructions from Dots and Lines&lt;/a&gt;” by Marko A. Rodriguez and Peter Neubauer illustrates the idea very well.  Basically, a graph contains nodes and arcs.&lt;br /&gt;&lt;br /&gt;A node has a "type" which defines a set of properties (name/value pairs) that the node can be associated with.&lt;br /&gt;&lt;br /&gt;An arc defines a directed relationship between nodes, and hence contains the fromNode and toNode as well as a set of properties defined by the "type" of the arc.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TDv2YzZBDTI/AAAAAAAAAcg/GM2qUvp3vJE/s1600/p1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 156px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TDv2YzZBDTI/AAAAAAAAAcg/GM2qUvp3vJE/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5493255076480879922" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;General Parallel Graph Processing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Most graph processing algorithms can be expressed in terms of a combination of "traversal" and "transformation".&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Parallel Graph Traversal&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A "traversal" can be expressed as a path which contains a sequence of segments.  
Each segment contains a traversal from a node to an arc, followed by a traversal from an arc to a node.  In &lt;a href="http://arxiv.org/pdf/1006.2361v1"&gt;Marko and Peter's model&lt;/a&gt;, a Node (Vertex) contains a collection of "inE" and another collection of "outE".  On the other hand, an Arc (Edge) contains one "inV" and one "outV".  So to express a "Friend-of-a-friend" relationship over a social network, we can use the following&lt;br /&gt;&lt;br /&gt;./outE[@type='friend']/inV/outE[@type='friend']/inV&lt;br /&gt;&lt;br /&gt;Loops can also be expressed in the path.  To express all persons that are reachable from this person, we can use the following&lt;br /&gt;&lt;br /&gt;.(/outE[@type='friend']/inV)*[@cycle='infinite']&lt;br /&gt;&lt;br /&gt;On the implementation side, a traversal can be processed in the following way&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Start with a set of "context nodes", which can be defined by a list of node ids or by a search criterion (in this case, the search result determines the starting context nodes)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Repeat until all segments in the path are exhausted.  Perform a walk from all context nodes in parallel.  Evaluate all outward arcs (i.e. outE) against the conditions (i.e. @type='friend').  The nodes that these arcs point to (i.e. inV) become the context nodes of the next round&lt;/li&gt;&lt;li&gt;Return the final context nodes&lt;/li&gt;&lt;/ol&gt;Such a traversal path can also be used to express inferred (or derived) relationships, which don't have a physical arc stored in the graph model.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Parallel Graph Transformation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The main goal of graph transformation is to modify the graph.  This includes modifying the properties of existing nodes and arcs, creating new arcs / nodes and removing existing arcs / nodes.  
The modification logic is provided by a user-defined function, which will be applied to all active nodes.&lt;br /&gt;&lt;br /&gt;The graph transformation process can be implemented in the following steps&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Start with a set of "active nodes", which can be defined by a list of node ids or by a search criterion (in this case, the search result determines the starting active nodes)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Repeat until there are no more active nodes.  Execute the user-defined transformation, which modifies the properties of the active nodes and their outward arcs.  It can also remove outward arcs or create new arcs that point to existing or new nodes (in other words, the graph connectivity can be modified).  It can also send messages to other nodes (the messages will be picked up in the next round) as well as receive messages sent from other nodes in the previous round.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Return the transformed graph, or a traversal can be performed to return a subset of the transformed graph.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt; &lt;span style="font-weight: bold;font-size:130%;" &gt;Google's Pregel&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Pregel can be thought of as a generalized parallel graph transformation framework.  In this model, the most basic (atomic) unit is a "node" that contains its properties, its outward arcs (and their properties) as well as the id (just the id) of the node that each outward arc points to.  
The node also has a logical inbox to receive all messages sent to it.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/TDwHsGs3rpI/AAAAAAAAAcw/u1iQoz4Z1gI/s1600/P2.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 200px; height: 153px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/TDwHsGs3rpI/AAAAAAAAAcw/u1iQoz4Z1gI/s200/P2.png" alt="" id="BLOGGER_PHOTO_ID_5493274099779612306" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/TDwHcuaCsLI/AAAAAAAAAco/bM6H645LqEI/s1600/P2.png"&gt;&lt;br /&gt;&lt;/a&gt;The whole graph is broken down into multiple "partitions", each containing a large number of nodes.  A partition is a unit of execution and typically has an execution thread associated with it.  A "worker" machine can host multiple "partitions".&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/TDx5y6cuY5I/AAAAAAAAAdA/kBuHFQewS4I/s1600/P3.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 236px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/TDx5y6cuY5I/AAAAAAAAAdA/kBuHFQewS4I/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5493399561075319698" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The execution model is based on the &lt;span style="font-weight: bold; font-style: italic;"&gt;BSP (Bulk Synchronous Parallel) model&lt;/span&gt;.  In this model, there are multiple processing units proceeding in parallel in a sequence of "supersteps".  Within each "superstep", each processing unit first receives all messages delivered to it from the preceding "superstep", then manipulates its local data and may queue up messages that it intends to send to other processing units.  
This happens asynchronously and simultaneously among all processing units.  The queued-up messages will be delivered to the destination processing units but won't be seen until the next "superstep".  When all the processing units have finished message delivery (hence the synchronization point), the next superstep can be started, and the cycle repeats until the termination condition has been reached.&lt;br /&gt;&lt;br /&gt;Notice that, depending on the graph algorithm, the assignment of nodes to partitions may affect overall performance.  Pregel provides a default assignment where partition = nodeId % N, but users can override this assignment algorithm if they want.  In general, it is a good idea to put close-neighbor nodes into the same partition so that messages between these nodes don't need to cross the network, which reduces communication overhead.  Of course, this also means that traversing the neighboring nodes all happens within the same machine, which hinders parallelism.  This usually is not a problem when the context nodes are very diverse.  In my experience of parallel graph processing, coarse-grain parallelism is preferred over fine-grain parallelism as it reduces communication overhead.&lt;br /&gt;&lt;br /&gt;The complete picture of execution can be implemented as follows:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/TDyCgInOV6I/AAAAAAAAAdI/rBiBaNOgbeI/s1600/P4.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 244px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/TDyCgInOV6I/AAAAAAAAAdI/rBiBaNOgbeI/s400/P4.png" alt="" id="BLOGGER_PHOTO_ID_5493409134064588706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The basic processing unit is a "thread" associated with each partition, running inside a worker.  
Each worker receives messages from the previous "superstep" in its "inQ" and dispatches each message to the partition in which the destination node resides.  After that, a user-defined "compute()" function is invoked on each node of the partition.  Notice that there is a single thread per partition, so nodes within a partition are executed sequentially, and the order of execution is non-deterministic.&lt;br /&gt;&lt;br /&gt;The "master" plays a central role in coordinating the execution of supersteps in sequence.  It signals the beginning of a new superstep to all workers after learning that all of them have completed the previous one.  It also pings each worker to learn its processing status and periodically issues a "checkpoint" command to all workers, which then save their partitions to a persistent graph store.  Pregel doesn't define or mandate the graph storage model, so any persistence mechanism should work well.  There is a "load" phase at the beginning where each partition starts empty and reads a slice of the graph storage.  For each node read from the storage, a "partition()" function is invoked; the node is loaded into the current partition if the function returns the current partition, otherwise the node is queued to the partition it is assigned to.&lt;br /&gt;&lt;br /&gt;Fault resilience is achieved through the checkpoint mechanism, where each worker is instructed to save its in-memory graph partition to the graph storage periodically (at the beginning of a superstep).  If a worker is detected to be dead (not responding to the "ping" message from the master), the master will instruct the surviving workers to take over the partitions of the failed worker.  The whole processing will be reverted to the previous checkpoint and proceed again from there (even the healthy workers need to redo the previous processing).  
The Pregel paper mentions a potential optimization: re-execute only the processing of the failed partitions from the previous checkpoint by replaying the previously received messages.  Of course, this requires keeping a log of all messages received between nodes at every superstep since the previous checkpoint.  This optimization, however, relies on the algorithm being deterministic (in other words, the same input executed at a later time produces the same output).&lt;br /&gt;&lt;br /&gt;A further optimization is available in Pregel to reduce network bandwidth usage.  Messages destined for the same node can be combined using a user-defined "combine()" function, which is required to be associative and commutative.  This is similar to the combine() method in Google's Map/Reduce model.&lt;br /&gt;&lt;br /&gt;In addition, each node can also emit an "aggregate value" at the end of "compute()".  The worker invokes a user-defined "aggregate()" function that aggregates all nodes' aggregate values into a partition-level aggregate value, and so on all the way up to the master.  The final aggregated value will be made available to all nodes in the next superstep.  The aggregate value can be used to calculate summary statistics across nodes as well as to coordinate the progress of the processing units.&lt;br /&gt;&lt;br /&gt;I think the Pregel model is general enough for a large portion of classical graph algorithms.  
I'll cover how we map these traditional algorithms onto Pregel in subsequent postings.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Reference&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.slideshare.net/slidarko/graph-windycitydb2010"&gt;http://www.slideshare.net/slidarko/graph-windycitydb2010&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-471839538167791257?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/471839538167791257/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=471839538167791257' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/471839538167791257'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/471839538167791257'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html' title='Google Pregel Graph Processing'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j6mB7TMmJJY/TDv2YzZBDTI/AAAAAAAAAcg/GM2qUvp3vJE/s72-c/p1.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1396688454230585577</id><published>2010-06-12T09:46:00.000-07:00</published><updated>2010-06-12T18:46:40.072-07:00</updated><title 
type='text'>Strategy for software engineering position interview</title><content type='html'>From my experience in hiring and attending job interviews, employers generally look at 5 areas when hiring a software engineer ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Technical knowledge of a specific technology product&lt;/li&gt;&lt;li&gt;Experience in the business problem domain&lt;/li&gt;&lt;li&gt;General technical and architecture sense&lt;/li&gt;&lt;li&gt;Personality and working style&lt;/li&gt;&lt;li&gt;IQ, problem solving skills&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;While 1 to 4 are very straightforward to answer, the most interesting and challenging part is item (5) because there is effectively no time to prepare.  Depending on how well you manage pressure, your brain can go totally blank during the interview session.&lt;br /&gt;&lt;br /&gt;I personally believe that a short interview doesn't necessarily reflect a person's ability to do the actual work.  I have worked with many people who are high performers at work but poor interviewees.  So don't lose confidence because you fail an interview; there is a factor of luck involved in this process.&lt;br /&gt;&lt;br /&gt;Nevertheless, most employers are willing to accept a higher false-negative rate, but the chance of a false positive must be very low.  Because the pass rate of these IQ and algorithm questions is generally pretty low, they are a very effective means of filtering candidates.  Therefore, being able to tackle this kind of question well is a critical success factor in interviewing.&lt;br /&gt;&lt;br /&gt;Notice that there is no substitute for "good knowledge", "high IQ" and "the ability to speak/think under a pressured environment", but I've found there are some very useful techniques and strategies.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;1. 
Rephrase the question slowly in your own words&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This helps you make sure you fully understand the question and clarify whether there are any hidden assumptions you have made.  Repeating the question "slowly" also gives you more time to think.&lt;br /&gt;&lt;br /&gt;From the interviewer's perspective, he/she can see clearly the candidate's ability to digest a problem.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;2.  Construct a Visual model of the problem&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Use a whiteboard, or paper (if this is a phone interview), to diagram the problem as you perceive it.  Our brains are better at understanding pictures than words, so a diagram is very useful for coming up with solution ideas.&lt;br /&gt;&lt;br /&gt;From the interviewer's perspective, he/she gets a clear picture of how you analyze the problem.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;3.  Use a special, simple case to guide you&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Never try to tackle the general problem first.  Start with a super-simple special case and think about how you would solve it.  This reduces the number of things that you need to consider and lets you focus on the core part of the problem.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;4.  Start with a very naive solution as a baseline&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Tell the interviewer that you want to start with a very naive solution to establish a baseline.  The naive solution can usually be constructed using a brute-force approach (try all combinations until you find a matching solution).  After that, analyze the complexity of this naive solution as a baseline for future comparison.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;5. 
Improve your solution&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At this stage, you need to evolve and improve your solution.  Here are some general techniques.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Divide and conquer&lt;/span&gt;:  Decompose the problem into smaller ones and solve each sub-problem separately, then combine the solutions for the overall problem.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Reduce to well-known algorithm models&lt;/span&gt;:  Try to model your problem in terms of well-studied computer science models (e.g. Tree, Graph, Search, Sort) and then apply well-known algorithms to solve it.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Recursive structure&lt;/span&gt;:  Try to structure your problem in a recursive form.  Find the solution for the base case and then expand the solution in a recursive manner.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Greedy method&lt;/span&gt;: Try to modify attributes of your current best solution to see if you can get a better one.  Watch out for being trapped in a locally optimal solution.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Approximation&lt;/span&gt;:  Instead of finding an exact solution, see if it is acceptable to find an approximate solution.  A probabilistic approach (e.g. try 100 random combinations and pick the best outcome) is one option.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;6.  Keep talking while you think&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Don't wait until you fully figure out the answer.  Keep talking while you are thinking about the solution so the interviewer understands how you analyze things; you are also showing how well you can express your thoughts.  It is also easier for the interviewer to guide you or give you hints.  
And finally, you may impress the interviewer even if you cannot get to the exact answer.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;7.  Generalize your solution for the final answer&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After you find a working solution for the simple case, extend the simple case to see how you would solve it.  See if you can find a general pattern in what the solution looks like.  Once you find it, generalize your solution for the general problem.&lt;br /&gt;&lt;br /&gt;From the interviewer's perspective, he/she can assess whether the candidate can think at different levels of abstraction, and the candidate's ability to apply a solution to a broader scope of problems.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;8.  Remember the interview doesn't end at the office building&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;You always have the chance to think through a question that you haven't answered satisfactorily after you walk out of the office.  
Submit a solution via email once you get back home (do it ASAP though), along with a thank-you note to the interviewer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1396688454230585577?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1396688454230585577/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1396688454230585577' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1396688454230585577'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1396688454230585577'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/06/strategy-for-software-engineering.html' title='Strategy for software engineering position interview'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-8064097644179052780</id><published>2010-05-02T08:52:00.000-07:00</published><updated>2010-05-02T12:08:00.143-07:00</updated><title type='text'>The NOSQL Debate</title><content type='html'>I attended the Stanford InfoLab conference, where there were 2 panel discussions on cloud computing: transaction processing and analytic processing.&lt;br /&gt;&lt;br /&gt;The session turned out to be a debate between people from academia and open source developers.  
Both sides have their points ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Background of RDBMS&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;RDBMS is based on a clear separation between application logic and data management.  This loose coupling allows application and DB technologies to evolve independently.&lt;br /&gt;&lt;br /&gt;This philosophy drives the DBMS architecture to be more general in order to support a wide range of applications.  Over many years of evolution, it has developed a well-defined data model and query model based on relational algebra.  It also has a well-defined transaction processing model.&lt;br /&gt;&lt;br /&gt;Applications, for their part, also benefit from having a unified data model.  They have more freedom to switch their DB vendor without too many code changes.&lt;br /&gt;&lt;br /&gt;For OLTP applications, the RDBMS model has proven to be highly successful in many large enterprises.  The success of both Oracle and MySQL speaks to that.&lt;br /&gt;&lt;br /&gt;For analytic applications, the RDBMS model is also widely used for implementing data warehousing based on a STAR schema model (composed of Fact tables and Dimension tables).&lt;br /&gt;&lt;br /&gt;This model also puts DBAs in a very important position in the enterprise.  They are equipped with sophisticated management tools and specialized knowledge to control the commonly shared DB schema.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Background of NOSQL&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In the last 10 years, a few highly successful web 2.0 companies have seen their applications grow beyond what a centralized RDBMS can handle.  Partitioning data across many servers and spreading the workload across them seems to be a reasonable first thing to try.  
Many RDBMS solutions provide a Master/Slave replication mechanism where one can spread the READ-ONLY workload across many servers, and it helps.&lt;br /&gt;&lt;br /&gt;In a partitioned environment, applications need to be more tolerant of data integrity issues, especially when data is asynchronously replicated between servers.  &lt;a href="http://en.wikipedia.org/wiki/CAP_theorem"&gt;The famous CAP theorem&lt;/a&gt; from Eric Brewer captures the essence of the tradeoff for highly scalable applications: since they must have partition tolerance, they have to choose between "Consistency" and "Availability".&lt;br /&gt;&lt;br /&gt;Fortunately, most of these web-scale applications have a high tolerance for data integrity issues, so they choose "availability" over "consistency".  This is a major deviation from the transaction processing model of RDBMS, which typically weighs "consistency" much higher than "availability".&lt;br /&gt;&lt;br /&gt;On the other hand, the data structures used in these web-scale applications (e.g. user profile, preference, social graph ... etc) are much richer than the simple row/column/table model.  Some of the operations involve navigation within a large social graph, which requires too many join operations in an RDBMS model.&lt;br /&gt;&lt;br /&gt;The "higher tolerance of data integrity" as well as "efficient support for rich data structures" challenge some of the very fundamental assumptions of RDBMS.  Amazon has experimented with its large-scale key/value store called &lt;a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html"&gt;Dynamo&lt;/a&gt;, and Google has its &lt;a href="http://labs.google.com/papers/bigtable.html"&gt;BigTable column-oriented storage model&lt;/a&gt;; both are designed from the ground up with a distributed architecture in mind.&lt;br /&gt;&lt;br /&gt;Since then, there have been many open source clones based on these two models.  
To give this trend a name, Eric Evans from Rackspace coined the term "NOSQL".  In my opinion, this term does not accurately reflect the underlying technologies, but it nevertheless provides a marketing banner for all non-RDBMS technologies to gather under, even those (e.g. CouchDB, Neo4j) that were not originally trying to tackle the scalability problem.&lt;br /&gt;&lt;br /&gt;Today, NOSQL provides an alternative, non-relational approach to databases.&lt;br /&gt;&lt;br /&gt;For analytical applications, it also adopts a highly parallel, brute-force processing model based on Map/Reduce.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The Debate&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is relatively little criticism of the data model aspects.  Jun Rao from IBM Lab summarized the key difference between the two philosophies.  The traditional RDBMS approach is to first figure out the right model, and then provide an implementation and hope it is scalable.  The modern NOSQL approach does the opposite by first figuring out what a highly scalable architecture looks like and then layering a data model on top of that.  Basically, people in both camps agree that we should use a data model that is optimized for the application's access patterns, weakening the "one-size-fits-all" philosophy of RDBMS.&lt;br /&gt;&lt;br /&gt;Most of the debate is centered around the transaction processing model itself.  Basically, RDBMS proponents think the NOSQL camp hasn't spent enough time understanding the theoretical foundation of the transaction processing model.  The new "eventual consistency" model is not well-defined, and different implementations may differ significantly from each other.  This means that figuring out all this inconsistent behavior becomes the application developers' responsibility, which makes their lives much harder.  
A DB whose behavior is hard to reason about can be very dangerous if the application makes wrong assumptions about the underlying data integrity guarantees.&lt;br /&gt;&lt;br /&gt;While agreeing that application developers now have more to worry about, the NOSQL camp argues that this is actually a benefit because it gives the domain-specific optimization opportunities back to the application developers, who are now no longer constrained by a one-size-fits-all model.  But they admit that making such optimization decisions requires a lot of experience and can be very error-prone and dangerous if not done by experts.&lt;br /&gt;&lt;br /&gt;On the other hand, the academics also note that the NOSQL movement may seem fashionable and cool to new technologists who may not have the expertise and skills.  The community as a whole should articulate the pros and cons of NOSQL.&lt;br /&gt;&lt;br /&gt;Notice that this is not the first time this debate has taken place.  &lt;a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext"&gt;Stonebraker has written a very good article&lt;/a&gt; from the RDBMS side.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-8064097644179052780?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/8064097644179052780/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=8064097644179052780' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/8064097644179052780'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/8064097644179052780'/><link rel='alternate' type='text/html' 
href='http://horicky.blogspot.com/2010/05/nosql-debate.html' title='The NOSQL Debate'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-3119703151870998276</id><published>2010-03-02T11:23:00.000-08:00</published><updated>2010-03-02T12:08:49.117-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Multi-tenancy'/><category scheme='http://www.blogger.com/atom/ns#' term='Cloud computing'/><title type='text'>Two approaches on Multi-tenancy in Cloud</title><content type='html'>Continuing from &lt;a href="http://horicky.blogspot.com/2009/08/multi-tenancy-in-cloud-computing.html"&gt;my previous blog post on how multi-tenancy relates to cloud computing&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;My thinking has changed: I now think the Amazon approach (Hypervisor isolation) and the Salesforce approach (DB isolation) are both valid but attract different audiences.&lt;br /&gt;&lt;br /&gt;First of all, increasing efficiency through sharing is a fundamental value proposition underlying all cloud computing initiatives.  There is no debate that ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We should "share resources" to increase utilization and hence improve efficiency&lt;/li&gt;&lt;li&gt;We should accommodate highly dynamic growth and shrinkage requirements rapidly and smoothly&lt;/li&gt;&lt;li&gt;We should "isolate" tenants so there is no leakage of sensitive information&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;But at which layer should we facilitate that?  
At the Hypervisor level or the DB level?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Hypervisor level&lt;/span&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt; Isolation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A Hypervisor is a very low-level layer of software that maps the physical machine to a virtualized machine on which a regular OS runs.  When the regular OS issues privileged operations against the VM, they are intercepted by the Hypervisor, which maps them to the underlying hardware.  The Hypervisor also provides some traditional OS functions, such as scheduling to determine which VM to run.  A Hypervisor can be considered a very lean OS that sits very close to the bare hardware.&lt;br /&gt;&lt;br /&gt;Depending on the specific implementation, the Hypervisor introduces an extra layer of indirection and hence incurs a certain percentage of overhead.  If we need a VM with less capacity than a physical machine, the Hypervisor allows us to partition the hardware at a finer granularity and hence improve efficiency by having more tenants running on the same physical machine.  For light-usage tenants, this gain in efficiency should offset the loss from the overhead.&lt;br /&gt;&lt;br /&gt;Since the Hypervisor focuses on low-level system primitives, it provides the cleanest separation and hence lessens security concerns.  In addition, by intercepting at the lowest layer, the Hypervisor retains the machine model that existing system/network admins are familiar with.  Since the application is completely agnostic to the presence of the Hypervisor, this minimizes the change required to move existing apps into the cloud and makes cloud adoption easier. &lt;br /&gt;&lt;br /&gt;Of course, the downside is that virtualization introduces a certain percentage of overhead.  
And the tenant still needs to pay for the smallest VM even when none of its users is using it.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;DB level Isolation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here is another school of thought: if tenants are running the same kind of application, the only difference is the data each tenant stores.  Why can't we just introduce an extra attribute "tenantId" in every table and then append a "where tenantId = $thisTenantId" to every query?  In other words, add a hidden column and modify each submitted query.&lt;br /&gt;&lt;br /&gt;In addition, the cloud provider usually needs to re-architect the underlying data layer and move to a distributed and partitioned DB.  Some of the more sophisticated providers also need to invest in developing intelligent data placement algorithms based on workload patterns.&lt;br /&gt;&lt;br /&gt;In this approach, the degree of isolation is only as good as the rewritten query.  In my opinion, this doesn't seem to be hard, although it is less proven than the Hypervisor approach.&lt;br /&gt;&lt;br /&gt;The advantage of DB level isolation is that there is no VM overhead and no minimum charge to the tenant.&lt;br /&gt;&lt;br /&gt;However, we should compare these 2 approaches not just from a resource utilization / efficiency perspective, but from other perspectives as well, such as ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Freedom of choice on technology stack&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Hypervisor isolation gives its tenants maximum freedom in the underlying technology stack.  Each tenant can choose the stack that best fits its application's needs and in-house IT skills.  The tenant is also free to move to the latest technologies as they evolve.&lt;br /&gt;&lt;br /&gt;This freedom of choice comes with a cost though.  
The tenant needs to hire system administrators to configure and maintain the technology stack.&lt;br /&gt;&lt;br /&gt;In DB level isolation, the tenants live within a set of predefined data schemas and application flows.  So their degree of freedom is limited to whatever set of parameters the cloud provider exposes.  Also, the tenants' applications are "locked in" to the cloud provider's framework, and a tight coupling and dependency is created between the tenant and the cloud provider.&lt;br /&gt;&lt;br /&gt;Of course, the advantage is that no administration of the technology stack is needed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Reuse of Domain Specific Logic&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Since it focuses on the lowest layer of resource sharing, Hypervisor isolation provides no reuse at the app logic level.  Tenants need to build their own technology stack from the ground up and write their own application logic.&lt;br /&gt;&lt;br /&gt;In the DB isolation approach, the cloud provider pre-defines a set of templates for DB schemas and application flow logic based on its domain expertise (it is important that the cloud provider be the recognized expert in that field).  The tenant can leverage the cloud provider's domain expertise and focus purely on business operations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I think each approach will attract a very different (and clearly disjoint) set of audiences.&lt;br /&gt;&lt;br /&gt;Notice that DB-level isolation commoditizes everything and makes it very hard to create product feature differentiation.  If I am a technology startup company trying to develop a killer product, then my core value is my domain expertise.  In this case, I won't go with DB-level isolation, which imposes too many constraints on my ability to distinguish my product from "anyone else".  
Hypervisor level isolation is much better because I can outsource the infrastructure layer and focus on my core value.&lt;br /&gt;&lt;br /&gt;On the other hand, if I am operating a business but not building a product, then I would like to outsource all supporting functions, including my applications.  In this case, I would pick the best app framework provided by the market leader and follow their best practices (and be very willing to live by their constraints).  DB level isolation is more compelling in this case.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-3119703151870998276?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/3119703151870998276/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=3119703151870998276' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/3119703151870998276'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/3119703151870998276'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/03/two-approaches-on-multi-tenancy-in.html' title='Two approaches on Multi-tenancy in Cloud'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-3017803827319816847</id><published>2010-03-01T13:25:00.001-08:00</published><updated>2010-03-03T10:07:46.531-08:00</updated><title type='text'>Search Engine Basics</title><content type='html'>I have received the question "how does search work?" a couple of times recently, so I will try to document the whole process.  This is intended to highlight the key concepts rather than specific implementation details, which are much more complicated and sophisticated than what is described here.&lt;br /&gt;&lt;br /&gt;A very basic search engine includes a number of processing phases.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Crawling: to discover the web pages on the internet&lt;/li&gt;&lt;li&gt;Indexing: to build an index to facilitate query processing&lt;/li&gt;&lt;li&gt;Query Processing: to extract the most relevant pages based on the user's query terms&lt;/li&gt;&lt;li&gt;Ranking: to order the results based on relevancy&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/S4140VTxW0I/AAAAAAAAAbk/jWcEZ8zC-rY/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 340px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/S4140VTxW0I/AAAAAAAAAbk/jWcEZ8zC-rY/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5444140365029399362" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that each element in the above diagram reflects a logical functional unit, not a physical boundary.  For example, the processing unit in each orange box is in fact executed across many machines in parallel.  
Similarly, each of the data store elements is physically spread across many machines based on key partitioning.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Vector Space Model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here we use the "Vector Space Model", where each document is modeled as a multi-dimensional vector (each word represents a dimension).  If we put all documents together, we form a matrix where the rows are documents and the columns are words, and each cell contains the &lt;a href="http://horicky.blogspot.com/2009/01/solving-tf-idf-using-map-reduce.html"&gt;TF/IDF value&lt;/a&gt; of the word within the document.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/S46fNdfOGBI/AAAAAAAAAcU/ujpleeWteho/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 239px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/S46fNdfOGBI/AAAAAAAAAcU/ujpleeWteho/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5444464053140199442" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;To determine the similarity between 2 documents, we can compute the dot product of their 2 vectors; the result represents the degree of similarity.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Crawler&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Crawler's job is to collect web pages on the internet.  This is typically done by a farm of crawlers, which do the following&lt;br /&gt;&lt;br /&gt;Starting from a set of seed URLs, repeat the following ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Pick the URL that has the highest traversal priority.&lt;/li&gt;&lt;li&gt;Download the page content from the URL into the content repository (which can be a distributed file system, or DHT), and update the entry in the doc index&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Discover new URL links from the downloaded pages.  
Add the link relationships to the link index and add these links to the traversal candidates&lt;/li&gt;&lt;li&gt;Prioritize the traversal candidates&lt;/li&gt;&lt;/ol&gt;The content repository can be any distributed file system; here let's say it is a DHT.&lt;br /&gt;&lt;br /&gt;There are a number of considerations.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How do we make sure different crawlers are working on different sets of content (rather than crawling the same page twice)?  When a crawler detects an overlap (the URL already exists in the page repository with a fairly recent timestamp), it skips this URL and picks the next best URL to crawl.&lt;/li&gt;&lt;li&gt;How does the crawler determine which candidate to crawl next?  We can use a heuristic algorithm based on some utility function (e.g. we can pick the URL candidate which has the highest page rank score)&lt;/li&gt;&lt;li&gt;How frequently do we re-crawl?  We can track the rate of change of each page to determine how often to crawl it.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;Indexer&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Indexer's job is to build the inverted index that the query processor uses to serve online search requests.&lt;br /&gt;&lt;br /&gt;First the indexer will build the "forward index"&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The indexer will parse the documents from the content repository into a token stream.&lt;/li&gt;&lt;li&gt;Build up a "hit list" which describes each occurrence of the token within the document (e.g. position in the doc, font size, is it a title, anchor text ... 
etc).&lt;/li&gt;&lt;li&gt;Apply various "filters" to the token stream (like stop word filters to remove words like "a", "the", or a stemming filter to normalize words "happy", "happily", "happier" into "happy")&lt;/li&gt;&lt;li&gt;Compute the term frequency within the document.&lt;/li&gt;&lt;/ol&gt;From the forward index, the indexer will proceed to build the inverted (reverse) index, typically through a Map/Reduce mechanism.  The result will be keyed by word and stored in a DHT.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;Ranker&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Ranker's job is to compute the rank of a document based on how many in-links point to the document as well as the rank of the referrers (hence a recursive definition).  Two popular ranking algorithms are "Page Rank" and "HITS".&lt;br /&gt;&lt;ul style="font-weight: bold;"&gt;&lt;li&gt;Page Rank Algorithm&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: normal;"&gt;Page rank is a global rank mechanism.  It is precomputed upfront and is independent of the query.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/S46R5QlyJ_I/AAAAAAAAAcE/VPN-rl0fPuM/s1600-h/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 351px; height: 400px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/S46R5QlyJ_I/AAAAAAAAAcE/VPN-rl0fPuM/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5444449412429522930" border="0" /&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;HITS Algorithm&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;In HITS, every page plays a dual role: a "hub" role and an "authority" role, with a corresponding rank for each.  Hub rank measures the quality of the page's outlinks.  A good hub is one that points to many good authorities.  Authority rank measures the quality of the page's content.  
A good authority is one that has many good hubs pointing to it.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/S46bN88QRlI/AAAAAAAAAcM/YwCOJtSezd8/s1600-h/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 247px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/S46bN88QRlI/AAAAAAAAAcM/YwCOJtSezd8/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5444459663536965202" border="0" /&gt;&lt;/a&gt;Notice that HITS doesn't pre-compute the hub and authority scores.  Instead it invokes a regular search engine (which only does TF/IDF matching but not ranking) to get a set of initial results (typically of a predefined fixed size) and then expands this result set by following the outlinks of the initial results.  It also adds a fixed number of inlinks (sampled from the pages that link into the initial result set) to the expanded result set.  After this expansion, it runs an iterative algorithm to compute the authority ranks and hub ranks.  It then uses a combination of these 2 ranks to calculate the final rank of each page; usually a high hub rank weighs more than a high authority rank.&lt;br /&gt;&lt;br /&gt;Notice that the HITS algorithm is performed at query time and not pre-computed upfront.  The advantage of HITS is that it is sensitive to the query (as compared to PageRank, which is not).  The disadvantage is that it performs ranking per query and is hence expensive.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Query Processor&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When a user inputs a search query (containing multiple words), the query is treated as a "query document".  Its relevancy to each document is computed and combined with the rank of the document to return an ordered list of results.&lt;br /&gt;&lt;br /&gt;There are many ways to compute the relevancy.  
We can consider only the documents that contain all the terms specified in the query.  In this model, we look up, for each term in the query, its list of document ids and then intersect these lists.  If each document list is ordered by document id, the intersection can be computed pretty efficiently.&lt;br /&gt;&lt;br /&gt;Alternatively, we can return the union (instead of the intersection) of all the document lists and order the documents by a combination of the page rank and the TF/IDF score.  Documents that have more terms in common with the query will have a higher TF/IDF score.&lt;br /&gt;&lt;br /&gt;In some cases, an automatic query result feedback loop can be used to improve the relevancy.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;In the first round, the search engine performs a search (as described above) based on the user's query&lt;/li&gt;&lt;li&gt;Construct a second round query by expanding the original query with additional terms found in the returned documents that ranked high in the first round result&lt;/li&gt;&lt;li&gt;Perform a second round of query and return the result.&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;Outstanding Issues&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Fighting spammers is a continuous battle for search engines.  Because of the financial value of showing up on the first page of the search results, many spammers try to manipulate their pages.  An early technique was to modify a page to repeat the terms many, many times (trying to increase the TF/IDF score).  The evolution of Page rank has mitigated this to some degree because page rank is based on "out-of-page" information that is much harder for the site owner to manipulate.&lt;br /&gt;&lt;br /&gt;But people use link farms to game the page rank algorithms.  The idea is to trade links between different domains.  
There is active research in this area on how to catch these patterns and discount their ranks.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-3017803827319816847?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/3017803827319816847/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=3017803827319816847' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/3017803827319816847'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/3017803827319816847'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/03/search-engine-basics.html' title='Search Engine Basics'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j6mB7TMmJJY/S4140VTxW0I/AAAAAAAAAbk/jWcEZ8zC-rY/s72-c/p1.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-4333539264484963263</id><published>2010-02-17T14:52:00.000-08:00</published><updated>2010-02-18T16:13:26.064-08:00</updated><title type='text'>Spatial Index RTree</title><content type='html'>For location-based search, it is very common to search for objects based on their spatial location, e.g. 
find all restaurants within 5 miles of my current location, or find all schools within the zipcode 95110, etc.&lt;br /&gt;&lt;br /&gt;Every spatial object can be represented by an object id, a minimum bounding rectangle (MBR), as well as other attributes.  So the space can be represented by a collection of spatial objects.  (Here we use 2 dimensions to illustrate the idea, but the concept can be extended to N dimensions.)&lt;br /&gt;&lt;br /&gt;A query can be represented as another rectangle.  The query is about locating the spatial objects whose MBR overlaps with the query rectangle.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S3x3OviiiVI/AAAAAAAAAbE/vw4fAu4GmBI/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 239px; height: 320px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S3x3OviiiVI/AAAAAAAAAbE/vw4fAu4GmBI/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5439353545119926610" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;RTree is a spatial indexing technique such that, given a query rectangle, we can quickly locate the matching spatial objects.&lt;br /&gt;&lt;br /&gt;The concept is similar to BTree.  We group spatial objects that are close to each other and form a tree whose intermediate nodes contain "close-by" objects.  
Since the MBR of a parent node contains the MBRs of all its children, the objects are close by when their parent's MBR is minimized.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/S3x9e6SjD5I/AAAAAAAAAbU/BFJnyaUvjio/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 330px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/S3x9e6SjD5I/AAAAAAAAAbU/BFJnyaUvjio/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5439360419953315730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Search&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Starting from the root, we examine each child's MBR to see if it overlaps with the query rectangle.  We skip the whole subtree if there is no overlap; otherwise, we recurse the search by drilling into that child.&lt;br /&gt;&lt;br /&gt;Notice that, unlike other tree algorithms which traverse down only one path, our search here needs to traverse down multiple paths when several children overlap the query.  Therefore, we need to structure the tree to minimize overlap as high in the tree as possible.  This means we want to minimize the sum of MBR areas along each path (from the root to the leaf) as much as possible.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Insert&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To insert a new spatial object, start from the root node, pick the child node whose MBR would be extended the least if the new spatial object were added, and walk down this path until reaching a leaf node.&lt;br /&gt;&lt;br /&gt;If the leaf node has space left, insert the object into the leaf node and update the MBR of the leaf node as well as of all its ancestors.  Otherwise, split the leaf node into two (create a new leaf node and move some of the content of the original leaf node to this new one).  
Then add the newly created leaf node to the parent of the original leaf node.  If the parent has no space left, the parent will be split as well.&lt;br /&gt;&lt;br /&gt;If the split goes all the way to the root, the original root will then be split and a new root is created.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Delete&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To delete a spatial node, first search for the leaf node that contains it.  Remove the spatial node from the leaf node's content and update its MBR and its parent's MBR all the way to the root.  If the leaf node now has fewer than m entries (m being the minimum number of entries a node must hold), then we need to condense the node by marking the leaf node to be deleted.  We then remove the leaf node from its parent's content and update the parent's MBR.  If the parent now has fewer than m entries, we mark the parent to be deleted as well and remove the parent from the parent's parent.  At this point, all the nodes that are marked for deletion have been removed from the RTree.&lt;br /&gt;&lt;br /&gt;Notice that the content of these deleted nodes is not all garbage, since they still have some children that are valid nodes (but were removed from the tree).  Now we need to reinsert all these valid nodes back into the tree.&lt;br /&gt;&lt;br /&gt;Finally, if the root node contains only one child, we throw away the original root and make its only child the new root.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Update&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Update happens when an existing spatial node changes its dimension.  One way is to just change the spatial node's MBR but not change the RTree.  
A better way (but more expensive) is to delete the node, modify its MBR and then insert it back into the RTree.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-4333539264484963263?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/4333539264484963263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=4333539264484963263' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4333539264484963263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4333539264484963263'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/02/spatial-index-rtree.html' title='Spatial Index RTree'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j6mB7TMmJJY/S3x3OviiiVI/AAAAAAAAAbE/vw4fAu4GmBI/s72-c/p1.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-8288008898526095618</id><published>2010-02-11T08:39:00.000-08:00</published><updated>2010-02-15T12:50:56.892-08:00</updated><title type='text'>Cloud MapReduce Tricks</title><content type='html'>Cloud Map/Reduce, developed by Huan Liu and Dan Orban, offers some good lessons in how to design applications on top of a Cloud environment, which has its own set of characteristics and constraints.&lt;br 
/&gt;&lt;br /&gt;Although it provides the same map/reduce programming model to the application developer, the underlying implementation architecture of Cloud MR is drastically different from Hadoop.  For a description of Hadoop internals, see &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;this earlier post&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Built on top of a Cloud OS (which is &lt;a href="http://horicky.blogspot.com/2009/11/amazon-cloud-computing.html"&gt;Amazon AWS&lt;/a&gt;), Cloud MR enjoys the inherent scalability and resiliency, which greatly simplifies its architecture.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Cloud MR doesn't need to design central coordinator components (like the NameNode and JobTracker in the Hadoop environment).  It simply stores the job progress status information in the distributed metadata store (SimpleDB).&lt;/li&gt;&lt;li&gt;Cloud MR doesn't need to worry about scalability in the communication path and how data can be moved efficiently between nodes; all of this is taken care of by the underlying Cloud OS&lt;/li&gt;&lt;li&gt;Cloud MR doesn't need to worry about disk I/O issues because all storage is effectively remote and taken care of by the Cloud OS.&lt;/li&gt;&lt;/ol&gt;On the other hand, the Cloud OS has a set of constraints and limitations that the design of Cloud MR has to deal with&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Network latency and throughput&lt;/span&gt; :  20 - 100 ms for SQS access, SimpleDB domain write throughput is 30 - 40 items/sec&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Eventual consistency&lt;/span&gt; :  2 simultaneous requests to dequeue from SQS can both get the same message.  
SQS sometimes reports empty when there are still messages in the queue.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Cloud MR uses a "&lt;span style="font-weight: bold;"&gt;parallel path&lt;/span&gt;" technique to overcome the network throughput issue.  The basic idea is to read/write data across multiple network paths so that the effective throughput is the aggregate of the individual paths.&lt;br /&gt;&lt;br /&gt;Cloud MR uses a "&lt;span style="font-weight: bold;"&gt;double check&lt;/span&gt;" technique to overcome the consistency issue.  The writer writes status into multiple places and the reader reads from multiple places as well.  If the reader reads inconsistent results from different places, that means the eventually consistent state hasn't arrived yet, so it needs to retry later.  When the states read from different places agree with each other, eventual consistency has arrived and the state is valid.&lt;br /&gt;&lt;br /&gt;The following describes the technical details of Cloud MR ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Cloud MR Architecture&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S3ji-ZdDrQI/AAAAAAAAAa0/BHx4ugK5pos/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 296px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S3ji-ZdDrQI/AAAAAAAAAa0/BHx4ugK5pos/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5438346111662402818" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;SimpleDB is used to store job status.  The client submits jobs to SimpleDB, and the Map and Reduce workers update and extract job status from SimpleDB.  
The actual data of each job is stored in SQS (which can also point to an object stored in S3).&lt;br /&gt;&lt;br /&gt;So the job progresses in the following way&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Job Client Processing Cycle&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Store data in many S3 file objects&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Create a mapper task request for each file split (each mapper task request contains a reference to the S3 object and the byte range).&lt;/li&gt;&lt;li&gt;Create an input queue in SQS and enqueue each mapper task request to it.&lt;/li&gt;&lt;li&gt;Create a master reducer queue, a result output queue as well as multiple partition queues.&lt;/li&gt;&lt;li&gt;Create one reducer task request for each partition queue.  Each reducer task request contains a pointer to the partition queue.&lt;/li&gt;&lt;li&gt;Enqueue the reducer task requests to the master reducer queue&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Create a job request that contains a mapper task count S as well as references to all the SQS queues created so far.&lt;/li&gt;&lt;li&gt;Add the job request into SimpleDB&lt;/li&gt;&lt;li&gt;Invoke AWS commands to start the EC2 instances for Mappers and Reducers, passing along queue and SimpleDB locations as "user data" to the EC2 instances.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;From this point onwards, poll SimpleDB for the job progress status&lt;/li&gt;&lt;li&gt;When the job is complete, download the result from the output queue and S3&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-weight: bold;"&gt;Mapper Processing Cycle&lt;br /&gt;&lt;/span&gt;&lt;ol&gt;&lt;li&gt;Take a mapper task request from the SQS input queue&lt;/li&gt;&lt;li&gt;Fetch the file split and parse out each record&lt;/li&gt;&lt;li&gt;Invoke the user-defined map() function.  For each emitted intermediate key k1, compute (hash(k1) % no_of_partitions).  
Enqueue the intermediate record to the corresponding partition queue.&lt;/li&gt;&lt;li&gt;When done with the mapper task request, write a commit record containing the worker id, the map request id and the number of records processed per partition (in other words, R[i][j] where i is the map request and j is the partition no.). &lt;/li&gt;&lt;li&gt;Remove the map task request from the SQS input queue&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Due to the eventual consistency model, a Mapper cannot stop processing just because it sees an empty input queue.  Instead, it counts the commit records until the number of unique map request ids equals the mapper task count S in the job request.  When this happens, it enters the reduce phase.&lt;br /&gt;&lt;br /&gt;It is possible that a Mapper worker crashes before it finishes its mapper task, in which case another mapper will re-process the map task request (after the SQS timeout).  Or, because of the eventual consistency model, two mappers may end up working on the same file split simultaneously.  
In both cases, duplicate records may end up in the partition queues.&lt;br /&gt;&lt;br /&gt;To facilitate duplicate elimination, each intermediate record emitted by a mapper is tagged with [map request id, worker id, a unique number]&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Reducer Processing Cycle&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Monitor SimpleDB and wait until commit records from all mappers are seen.&lt;/li&gt;&lt;li&gt;Dequeue a reducer task request from the master reducer queue&lt;/li&gt;&lt;li&gt;Go to the corresponding partition queue and dequeue each intermediate record&lt;/li&gt;&lt;li&gt;Invoke the user-defined reduce() function and write the reducer output to the output queue&lt;/li&gt;&lt;li&gt;When done with the reducer task request, write a commit record in a similar way to the Mapper worker&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Remove the reduce task request from the master reducer queue&lt;/li&gt;&lt;/ol&gt;To eliminate duplicated intermediate messages in the partition queue, each Reducer first queries SimpleDB for all the commit records written by the successful mapper workers.  If different workers worked on the same mapper request, there may be multiple commit records with the same mapper request id.  In this case, the Reducer arbitrarily picks a winner, and all intermediate records tagged with that mapper request id but not with the winner's worker id are discarded.&lt;br /&gt;&lt;br /&gt;Similar to the Mapper, Reducer j does not stop reading from the partition queue just because it appears empty; it keeps reading until the number of messages reaches the sum of R[i][j] over all i.&lt;br /&gt;&lt;br /&gt;Due to eventual consistency, it is possible that multiple reducers dequeue the same reducer task request from the master reducer queue and then take messages from the same partition queue.  
Since they are competing for the same partition queue, one of them will find the queue empty before it reaches the sum of R[i][j] over i.  After a certain timeout period, that reducer writes a "suspect conflict" record (containing its worker id) to SimpleDB.  If it finds that another reducer has written such a record, it knows there is another reducer working on the same partition.   The worker with the lowest id wins: a reducer keeps reading until it sees a conflict record with a lower id than its own, then it drops its current processing and picks up another reducer task.  All the records read by the loser come back to the queue after the timeout period and are picked up by the winner.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Network Latency and Throughput&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;One SimpleDB-specific implementation constraint is that its read and write throughput are very asymmetric.  While reads are very fast, writes are slow.  To mitigate this asymmetry, Cloud MR uses multiple domains in SimpleDB.  When it writes to SimpleDB, it randomly picks one domain and writes to it.  This way, the write workload is spread across multiple domains.  When it reads from SimpleDB, it reads every domain and aggregates the results (since only one of the domains will have the result).&lt;br /&gt;&lt;br /&gt;To overcome the latency of SQS, Cloud MR uses a buffering technique on the Mapper side to batch up intermediate messages (destined for the same partition).  A message buffer is 8k in size (the maximum size of an SQS message).  
When the buffer is full (or after some timeout period), a designated thread flushes the buffer by writing a composite message (which contains all the buffered intermediate records) into SQS.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/S3SZUcyuRqI/AAAAAAAAAas/7DB8Rr-dlPU/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 242px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/S3SZUcyuRqI/AAAAAAAAAas/7DB8Rr-dlPU/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5437139226748405410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The reducer side works in a similar way: multiple read threads dequeue messages from the partition queue and put them in a read buffer, from which the Reducer reads the intermediate messages.  Notice that it is possible for 2 threads to read the same message from the partition queue (remember the eventual consistency scenario described above).  To eliminate potential duplicates, the reducer examines the unique number tagged on each message and discards any message it has seen before.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Differences from Hadoop&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Map/Reduce developers familiar with the Hadoop implementation will find that Cloud MR behaves in a similar way.  But there are a number of differences that I want to highlight here.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Reducer key is not sorted&lt;/span&gt;:  Unlike Hadoop, which guarantees that keys arriving at the same partition (or reducer) are in sorted order, Cloud MR doesn't provide such a feature.  
Applications need to do their own sorting if they need the keys in sorted order.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;For a detailed technical description of Cloud MR, as well as how it compares with Hadoop, read the &lt;a href="http://sites.google.com/site/huanliu/cloudmapreduce.pdf"&gt;original paper from Huan Liu and Dan Orban&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-8288008898526095618?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/8288008898526095618/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=8288008898526095618' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/8288008898526095618'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/8288008898526095618'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/02/cloud-mapreduce-tricks.html' title='Cloud MapReduce Tricks'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j6mB7TMmJJY/S3ji-ZdDrQI/AAAAAAAAAa0/BHx4ugK5pos/s72-c/p1.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-7379177962536417975</id><published>2010-02-04T08:06:00.001-08:00</published><updated>2010-02-09T16:48:55.591-08:00</updated><title type='text'>NoSQL 
GraphDB</title><content type='html'>I received some constructive criticism of &lt;a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;my previous blog on NoSQL patterns&lt;/a&gt;: I covered only the key/value store and left out Graph DB.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The Property Graph Model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A property graph is a collection of Nodes and Directed Arcs.  Each node represents an entity and has a unique id as well as a Node Type.  The Node Type defines the set of metadata that the node has.  Each arc represents a unidirectional relationship between two entities and has an Arc Type.  The Arc Type defines the set of metadata that the arc has.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S2r3aNqaFKI/AAAAAAAAAaM/8tTvfe5ulRg/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 211px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S2r3aNqaFKI/AAAAAAAAAaM/8tTvfe5ulRg/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5434427930092115106" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;General Graph Processing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I found that many graph algorithms follow a general processing pattern.  There are multiple rounds of (sequential) processing iterations.  Within each iteration, there is a set of active nodes that perform local processing in parallel.  The local processing can modify the node's properties, add or remove links to other nodes, and send messages across links.  
All message passing is done after the local processing.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S3D3VdqTmpI/AAAAAAAAAac/7fs5lf1ENhk/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 376px; height: 400px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S3D3VdqTmpI/AAAAAAAAAac/7fs5lf1ENhk/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5436116698347575954" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This model is similar to the &lt;a href="http://portal.acm.org/citation.cfm?id=1582716.1582723"&gt;Google Pregel model&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Notice that this model maps well onto a parallel computing environment, where the processing of the set of active nodes can be spread across multiple processors (or multiple machines in a cluster).&lt;br /&gt;&lt;br /&gt;Notice that all messages from incoming links arrive before the local processing changes any links.  On the other hand, all messages to outgoing links are sent after the local processing has changed the links.  
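&lt;br /&gt;&lt;br /&gt;As a rough illustration (not taken from Pregel or from any particular graph engine), the round structure described here can be sketched in Python.  The names run_rounds and propagate_max, and the example algorithm (each node adopts the largest value it has heard of), are assumptions made only for this sketch.&lt;br /&gt;&lt;br /&gt;

```python
from collections import defaultdict


def run_rounds(links, values, compute, max_rounds=50):
    """Run processing rounds until no node is active.

    links:   dict node -> list of out-going neighbour nodes
    values:  dict node -> node property (mutated by local processing)
    compute: local processing; returns the message to send out, or None
    """
    inbox = {node: [] for node in links}   # round 0: every node is active
    rounds = 0
    while inbox and max_rounds > rounds:
        outbox = defaultdict(list)
        # Local processing: each active node only touches its own state.
        # Done sequentially here; the nodes are independent of each other,
        # so this loop is the part that could be spread across machines.
        for node, messages in inbox.items():
            message = compute(node, values, messages)
            if message is not None:
                for neighbour in links[node]:
                    outbox[neighbour].append(message)
        # Messages are delivered only after all local processing is done.
        # Nodes that received a message form the active set of the next round.
        inbox = dict(outbox)
        rounds += 1
    return rounds


def propagate_max(node, values, messages):
    """Adopt the largest value seen so far; stay silent if nothing changed."""
    best = max([values[node]] + messages)
    changed = best > values[node]
    values[node] = best
    # In the first round (no messages yet) every node announces its value.
    return best if changed or not messages else None


links = {"a": ["b"], "b": ["c"], "c": ["a"]}
values = {"a": 3, "b": 7, "c": 1}
rounds = run_rounds(links, values, propagate_max)
print(values, rounds)
```

In this sketch a node becomes inactive simply by not receiving any message, which is one possible way to decide the active set of the next round.&lt;br /&gt;&lt;br /&gt;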
The term "local processing" means that a node cannot modify the properties of other nodes or other links.&lt;br /&gt;&lt;br /&gt;Because two nodes can simultaneously modify the link between them, the following conflicts can happen&lt;br /&gt;&lt;ul&gt;&lt;li&gt;One node deletes a link while the other node modifies the link's properties.&lt;/li&gt;&lt;li&gt;The nodes on both sides modify the properties of the link between them&lt;/li&gt;&lt;/ul&gt;A conflict resolution mechanism needs to be attached to each Arc Type to determine how such a conflict is resolved, either by deciding who the winner is or by deciding how to merge the effects.&lt;br /&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="font-size:100%;"&gt;&lt;a href="http://neo4j.org/"&gt;Neo4j&lt;/a&gt; provides a restricted, single-threaded graph traversal model&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;At each round, the set of active nodes is always a single node&lt;/li&gt;&lt;li&gt;The set of active nodes of the next round is determined by the traversal policy (breadth-first or depth-first), but is still a single node&lt;/li&gt;&lt;li&gt;It offers a callback function to determine whether this node should be included in the result set&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span&gt;&lt;span style="font-size:100%;"&gt;&lt;a href="http://wiki.github.com/tinkerpop/gremlin/"&gt;Gremlin&lt;/a&gt;, on the other hand, provides an interactive graph traversal model where the user can step through each iteration.  
It uses an XPath-like syntax to express the navigation.&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size:100%;"&gt;The node is expressed as Node(id, inE, outE, properties) &lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span&gt;&lt;span style="font-size:100%;"&gt;The arc is expressed as Arc(id, type, inV, outV, properties)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span&gt;&lt;span style="font-size:100%;"&gt;As an example, suppose the graph represents a relationship between 2 types of entities:&lt;/span&gt;&lt;/span&gt;  Person writes Book.  Then, given a person, her co-authors can be expressed by the following XPath expression&lt;br /&gt;&lt;pre&gt;&lt;code&gt;./outE[@type='write']/inV/inE[@type='write']/outV&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Path Algebra&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;a href="http://markorodriguez.com/Lectures_files/mit-ll-2009.pdf"&gt;Marko Rodriguez has described a set of matrix operations&lt;/a&gt; that apply when the graph is represented as an adjacency matrix.  Graph algorithms can then be described in an algebraic form.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Traverse operation&lt;/span&gt; can be expressed as matrix multiplication.  If A is the adjacency matrix of the graph, then A.A represents the connections via paths of length 2.&lt;br /&gt;&lt;br /&gt;Similarly, &lt;span style="font-weight: bold; font-style: italic;"&gt;Merge operation&lt;/span&gt; can be expressed as matrix addition.  For example, (A + A.A + A.A.A) represents connectivity within 3 degrees of reach.&lt;br /&gt;&lt;br /&gt;Consider the special case where the graph represents a relationship between 2 types of entities, e.g. A represents an authoring relationship (person writes a book).  
Then A.(A.transpose())  represents the co-author relationship (person co-authors with another person).&lt;br /&gt;&lt;br /&gt;Marko also introduces a set of &lt;span style="font-weight: bold; font-style: italic;"&gt;Filter operations&lt;/span&gt; (not filter / clip filter / column filter / row filter / vertex filter)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-size:130%;"&gt;Map Reduce&lt;/span&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Depending on the traversal strategy inherent in a graph algorithm, algorithms with a higher sequential dependency don't fit well into parallel computing.  For example, graph algorithms with a breadth-first search nature fit better into the parallel computing paradigm than those with a depth-first search nature.  Likewise, performing a search from all nodes fits parallel computing better than performing a search from one particular start node.&lt;br /&gt;&lt;br /&gt;There are different storage representations of a graph: incidence list, incidence matrix, adjacency list and adjacency matrix.  For a sparse graph (which is the majority of cases), let's assume the adjacency list is used as the storage model in the subsequent discussion.&lt;br /&gt;&lt;br /&gt;In other words, the graph is represented as a list of records, where each record is [node, connected_nodes].&lt;br /&gt;&lt;br /&gt;There are many graph algorithms and it is not my intent to give an exhaustive list.  Below are the ones that I have used in my previous projects and that can be translated into Map/Reduce form.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Topological Sort&lt;/span&gt; is commonly used to sort out a work schedule based on a dependency tree.  
It can be done as follows ...&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# Topological Sort&lt;br /&gt;&lt;br /&gt;# Input records&lt;br /&gt;[node_id, prerequisite_ids]&lt;br /&gt;&lt;br /&gt;# Output records&lt;br /&gt;&lt;/code&gt;&lt;code&gt;[node_id, prerequisite_ids, dependent_ids]&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;class BuildDependentsJob {&lt;br /&gt; map(node, prerequisite_ids) {&lt;br /&gt;   for each prerequisite_id in prerequisite_ids {&lt;br /&gt;     emit(prerequisite_id, node)&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; reduce(node, dependent_ids) {&lt;br /&gt;   emit(node, [node, prerequisite_ids, dependent_ids])&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;class BuildReadyToRunJob {&lt;br /&gt; map(node, node) {&lt;br /&gt;   if ! done?(node) and node.prerequisite_ids.empty? 
{&lt;br /&gt;     result_set.append(node)&lt;br /&gt;     done(node)&lt;br /&gt;     for each dependent_id in dependent_ids {&lt;br /&gt;       emit(dependent_id, node)&lt;br /&gt;     }&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; reduce(node, done_prerequsite_ids) {&lt;br /&gt;   remove_prerequisites(node, done_prerequsite_ids)&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Topological Sort main program&lt;br /&gt;main() {&lt;br /&gt; JobClient.submit(BuildDependentsJob.new)&lt;br /&gt; Result result = []&lt;br /&gt;&lt;br /&gt; result_size_before_job = 0&lt;br /&gt; result_size_after_job = 1&lt;br /&gt;&lt;br /&gt; while (result_size_before_job &amp;lt; result_size_after_job) {&lt;br /&gt;   result_size_before_job = result.size&lt;br /&gt;   JobClient.submit(BuildReadyToRunJob.new)&lt;br /&gt;   result_size_after_job = result.size&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; return result&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Minimum spanning tree&lt;/span&gt; is a pretty common algorithm, the Prim's algorithm looks like the following.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# Minimum Spanning Tree (MST) using Prim's algorithm&lt;br /&gt;&lt;br /&gt;Adjacency Matrix, W[i][j] represents weights&lt;br /&gt;W[i][j] = infinity if node i, j is disconnected&lt;br /&gt;&lt;br /&gt;MST has nodes in array N = [] and arcs A = []&lt;br /&gt;E[i] = minimum weighted edge connecting to the skeleton&lt;br /&gt;D[i] = weight of E[i]&lt;br /&gt;&lt;br /&gt;Initially, pick a random node r into N[]&lt;br /&gt;N = [r] and A = []&lt;br /&gt;D[r] = 0; D[i] = W[i][r];&lt;br /&gt;&lt;br /&gt;Repeat until N[] contains all nodes&lt;br /&gt; Pick node k outside N[] where D[k] is 
minimum&lt;br /&gt; Add node k to N; Add E[k] to A&lt;br /&gt; for all node p connected to node k&lt;br /&gt;   if W[p][k] &amp;lt; D[p]&lt;br /&gt;     D[p] = W[p][k]&lt;br /&gt;     E[p] = k&lt;br /&gt;   end&lt;br /&gt; end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;We skip the Map/Reduce form of MST here because it is very similar to that of another popular algorithm, &lt;span style="font-weight: bold;"&gt;single source shortest path.&lt;/span&gt;  The Map/Reduce form of SSSP based on Dijkstra's algorithm is as follows ...&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# Single Source Shortest Path (SSSP) based on Dijkstra&lt;br /&gt;&lt;br /&gt;Adjacency Matrix, W[i][j] represents weights of arc&lt;br /&gt;  connecting node i to node j&lt;br /&gt;W[i][j] = infinity if node i, j is disconnected&lt;br /&gt;&lt;br /&gt;SSSP has nodes in array N = []&lt;br /&gt;L[i] = Length of minimum path so far from the source node&lt;br /&gt;Path[i] = Identified shortest path from source to i&lt;br /&gt;&lt;br /&gt;Initially, put the source node s into N[]&lt;br /&gt;N = [s]&lt;br /&gt;L[s] = 0; L[i] = W[s][i];&lt;br /&gt;Path[i] = arc[s][i] for all nodes directly connected&lt;br /&gt;from source.&lt;br /&gt;&lt;br /&gt;Repeat until N[] contains all nodes&lt;br /&gt; Pick node k outside N[] where L[k] is minimum&lt;br /&gt; Add node k to N;&lt;br /&gt; for all node p connected from node k {&lt;br /&gt;   if L[k] + W[k][p] &amp;lt; L[p] {&lt;br /&gt;     L[p] = L[k] + W[k][p]&lt;br /&gt;     Path[p] = Path[k].append(Arc[k][p])&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;end repeat&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# Here is what the map/reduce pseudo code would look like&lt;br /&gt;&lt;br /&gt;class FindMinimumJob {&lt;br /&gt; map(node_id, 
path_length) {&lt;br /&gt;   if not N.contains(node_id) {&lt;br /&gt;     emit(1, [path_length, node_id])&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; reduce(k, v) {&lt;br /&gt;   min_length, min_node = minimum(v)&lt;br /&gt;   N.add(min_node)&lt;br /&gt;   for each node in min_node.connected_nodes {&lt;br /&gt;     emit(node, min_node)&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;class UpdateMinPathJob {&lt;br /&gt; map(node, min_node) {&lt;br /&gt;   if L[min_node] + W[min_node][node] &amp;lt; L[node] {&lt;br /&gt;     update L[node] = &lt;/code&gt;&lt;code&gt;L[min_node] + W[min_node][node]&lt;/code&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;      Path[node] =&lt;br /&gt;       Path[min_node].append(arc(min_node, node))&lt;/code&gt;&lt;br /&gt;&lt;code&gt;    }&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;# Single Source Shortest Path main program&lt;br /&gt;main() {&lt;br /&gt; init()&lt;br /&gt; while (not N.contains(V)) {&lt;br /&gt;   JobClient.submit(FindMinimumJob.new)&lt;br /&gt;&lt;/code&gt;&lt;code&gt;    JobClient.submit(UpdateMinPathJob.new)&lt;br /&gt;&lt;/code&gt;&lt;code&gt;  }&lt;br /&gt;&lt;br /&gt; return Path&lt;br /&gt;}&lt;/code&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The same SSSP problem can also be solved using &lt;span style="font-weight: bold;"&gt;breadth-first search&lt;/span&gt;.  
The intuition is to grow a frontier outward from the source at each iteration, updating the shortest known distance from the source for every node the frontier reaches.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# Single Source Shortest Path (SSSP) using BFS&lt;br /&gt;&lt;br /&gt;Adjacency Matrix, W[i][j] represents weights of arc&lt;br /&gt;  connecting node i to node j&lt;br /&gt;W[i][j] = infinity if node i, j is disconnected&lt;br /&gt;&lt;br /&gt;Frontier nodes in array F&lt;br /&gt;L[i] = Length of minimum path so far from the source node&lt;br /&gt;Path[i] = Identified shortest path from source to i&lt;br /&gt;&lt;br /&gt;Initially,&lt;br /&gt;F = [s]&lt;br /&gt;L[s] = 0; L[i] = infinity for every other node i;&lt;br /&gt;Path[s] = [] (the empty path); the first round fills in&lt;br /&gt;the nodes directly connected from source.&lt;br /&gt;&lt;br /&gt;# input is all nodes in the frontier F&lt;br /&gt;# output is frontier of next round FF&lt;br /&gt;&lt;br /&gt;class GrowFrontierJob {&lt;br /&gt; map(node) {&lt;br /&gt;   for each to_node in node.connected_nodes {&lt;br /&gt;     emit(to_node, [node, L[node] + W[node][to_node]])&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; reduce(node, from_list) {&lt;br /&gt;   for each from in from_list {&lt;br /&gt;     from_node = from[0]&lt;br /&gt;     length_via_from_node = from[1]&lt;br /&gt;     if (length_via_from_node &amp;lt; L[node]) {&lt;br /&gt;       L[node] = length_via_from_node&lt;br /&gt;       Path[node] =&lt;br /&gt;         Path[from_node].append(arc(from_node, node))&lt;br /&gt;       FF.add(node)&lt;br /&gt;     }&lt;br /&gt;   }&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Single Source Shortest Path BFS main program&lt;br /&gt;main() {&lt;br /&gt; init()&lt;br /&gt; while (F is non-empty) {&lt;br /&gt;   JobClient.set_input(F)&lt;br /&gt;  
 JobClient.submit(GrowFrontierJob.new)&lt;br /&gt;   copy FF to F&lt;br /&gt;   clear FF&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; return Path&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-7379177962536417975?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/7379177962536417975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=7379177962536417975' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7379177962536417975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7379177962536417975'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/02/nosql-graphdb.html' title='NoSQL GraphDB'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j6mB7TMmJJY/S2r3aNqaFKI/AAAAAAAAAaM/8tTvfe5ulRg/s72-c/p1.png' height='72' width='72'/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1021984589750791684</id><published>2010-01-15T12:53:00.000-08:00</published><updated>2010-01-15T22:04:48.385-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SOA'/><category scheme='http://www.blogger.com/atom/ns#' term='Cloud computing'/><category scheme='http://www.blogger.com/atom/ns#' 
term='SaaS'/><category scheme='http://www.blogger.com/atom/ns#' term='Scalable'/><category scheme='http://www.blogger.com/atom/ns#' term='Software as a service'/><title type='text'>RoadMap to SaaS</title><content type='html'>I have observed a pattern across multiple enterprises moving from a traditional web app to SaaS.  Here I try to capture this pattern and a number of lessons learned.  I use a J2EE Web App architecture to illustrate, but the same principles apply to other technology platforms.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 1: Some working Web App&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At the very beginning, we have a web application that works well.  We analyze the functions of the web application and group the implementation classes accordingly&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S1D-v7MJ9GI/AAAAAAAAAYs/l6gvgdsp7SQ/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 242px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S1D-v7MJ9GI/AAAAAAAAAYs/l6gvgdsp7SQ/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5427117650277430370" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 2: Separate functionality across processes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We analyze the functions and partition them into different processes (JVMs).  The partitioning needs to be coarse-grained, and the processes communicate with each other via a service interface exposed by a Facade class.  The service interface can be any remote object invocation interface or XML over HTTP.  &lt;a href="http://horicky.blogspot.com/2009/05/restful-design-patterns.html"&gt;RESTful web service&lt;/a&gt; is the de facto choice for the service interface. 
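&lt;br /&gt;&lt;br /&gt;A minimal sketch of such a Facade follows, written in Python rather than Java for brevity.  The class names (InventoryService, BillingService, OrderFacade) and the place_order operation are hypothetical, chosen only for illustration; the point is that callers in other processes see one coarse-grained operation whose input and output are plain, serializable data.&lt;br /&gt;&lt;br /&gt;

```python
import json


class InventoryService:
    """Fine-grained class that stays private to this process."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def reserve(self, item, quantity):
        if quantity > self._stock.get(item, 0):
            return False
        self._stock[item] -= quantity
        return True


class BillingService:
    """Another internal class; the prices are made-up sample data."""

    def __init__(self, prices):
        self._prices = dict(prices)

    def charge(self, item, quantity):
        return self._prices[item] * quantity


class OrderFacade:
    """The only entry point that other processes are meant to call."""

    def __init__(self, inventory, billing):
        self._inventory = inventory
        self._billing = billing

    def place_order(self, request_json):
        # One coarse-grained call replaces several fine-grained remote calls.
        request = json.loads(request_json)
        item, quantity = request["item"], request["quantity"]
        if not self._inventory.reserve(item, quantity):
            return json.dumps({"status": "rejected", "reason": "out of stock"})
        amount = self._billing.charge(item, quantity)
        return json.dumps({"status": "accepted", "amount": amount})


facade = OrderFacade(InventoryService({"book": 2}), BillingService({"book": 10}))
first = json.loads(facade.place_order('{"item": "book", "quantity": 2}'))
second = json.loads(facade.place_order('{"item": "book", "quantity": 1}'))
print(first, second)
```

In a real deployment the JSON string would travel over HTTP (for example as the body of a RESTful request) rather than through a direct method call.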
&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1EAHU48yuI/AAAAAAAAAY0/-mFTcQiGNBk/s1600-h/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 242px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1EAHU48yuI/AAAAAAAAAY0/-mFTcQiGNBk/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5427119151824816866" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 3: Move different processes to different machines&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To scale out beyond a single server's capacity, we move each process to a separate machine.  Notice that the machine can be a physical machine, or a virtual machine running on top of a hypervisor.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/S1EA6O4gTjI/AAAAAAAAAY8/86cC4a_ga28/s1600-h/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 217px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/S1EA6O4gTjI/AAAAAAAAAY8/86cC4a_ga28/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5427120026385665586" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 4: Build service pools&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If the service itself is stateless, we can easily scale out the service capacity by putting multiple machines (running the same service) into a server pool.  A network load balancer is used to spread the workload evenly across the member servers.&lt;br /&gt;&lt;br /&gt;When the workload increases, more machines can be added to the pool to boost the overall capacity of the service.  
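&lt;br /&gt;&lt;br /&gt;To make the idea concrete, here is a minimal sketch of a pool with round-robin dispatch, assuming stateless members.  ServicePool, add_member, dispatch and make_server are hypothetical names used for illustration only; a real network load balancer works at the connection or request level and also deals with health checks, which this sketch ignores.&lt;br /&gt;&lt;br /&gt;

```python
class ServicePool:
    """Spreads requests evenly over interchangeable (stateless) members."""

    def __init__(self, members):
        self._members = list(members)
        self._next = 0

    def add_member(self, member):
        # Because members hold no state, a new one can serve traffic right away.
        self._members.append(member)

    def dispatch(self, request):
        # Round robin: each request goes to the next member in turn.
        member = self._members[self._next % len(self._members)]
        self._next += 1
        return member(request)


def make_server(name):
    # A stand-in for a remote server: it just reports who handled the request.
    return lambda request: f"{name} handled {request}"


pool = ServicePool([make_server("s1"), make_server("s2")])
handled = [pool.dispatch(i) for i in range(2)]
pool.add_member(make_server("s3"))          # workload grows: add capacity
handled += [pool.dispatch(i) for i in range(2, 5)]
print(handled)
```

Because no member keeps per-client state, it does not matter which server a given request lands on, and that is what makes growing the pool this simple.&lt;br /&gt;&lt;br /&gt;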
Elastic scalability provided by cloud computing providers makes growing and shrinking the pool even more rapid, and hence lets us handle a dynamic workload more effectively.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1FEPfyw0MI/AAAAAAAAAZE/VNALnskYtI8/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 231px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1FEPfyw0MI/AAAAAAAAAZE/VNALnskYtI8/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5427194058981298370" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 5: Scale the data tier by partitioning&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After we scale out the processing tier, we find that the data tier becomes the bottleneck.  So we also need to distribute the data access workload by partitioning the database according to the data key.  Here are some classical techniques for &lt;a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;how we can build a distributed database&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/S1FIGBOsvsI/AAAAAAAAAZM/T62YDJ1d7Zc/s1600-h/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 193px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/S1FIGBOsvsI/AAAAAAAAAZM/T62YDJ1d7Zc/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5427198294204661442" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 6: Add Cache to reduce server load&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If the application has a high read/write ratio, or has some tolerance for stale data, we can add a cache layer to reduce the number of hits on the actual services.  
Clients will check the cache before sending the request to the service.&lt;br /&gt;&lt;br /&gt;We need to make sure the cached items remain fresh.  There are various schemes to achieve this, e.g. cached items can expire after a certain timeout, or an explicit invalidation request can be made for specific cached items when the corresponding backend data has changed.&lt;br /&gt;&lt;br /&gt;We can use a local cache (residing on the same machine as the client) or a distributed cache engine such as &lt;a href="http://horicky.blogspot.com/2009/10/notes-on-memcached.html"&gt;Memcached&lt;/a&gt; or &lt;a href="http://horicky.blogspot.com/2010/01/notes-on-oracle-coherence.html"&gt;Oracle Coherence&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S1FKeDBb6XI/AAAAAAAAAZU/jOxO3NC2hLA/s1600-h/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 200px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S1FKeDBb6XI/AAAAAAAAAZU/jOxO3NC2hLA/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5427200906026019186" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 7: Consider which service to expose to public&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At this point, we want to expose some of the services to the public, either because this can bring revenue to our company or because it can facilitate better B2B integration with our business partners.  There are many considerations in deciding what to expose, such as security, scalability, service level agreements, utilization tracking ... 
etc.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1FNs4867tI/AAAAAAAAAZc/-A8z3AtFpiU/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 216px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1FNs4867tI/AAAAAAAAAZc/-A8z3AtFpiU/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5427204459555647186" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 8: Deploy an ingress service gateway&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Once we decide which services to expose, we use a specialized ingress service gateway to handle the concerns outlined above.  Most XML service gateways are equipped with message validation, security checking, message transformation and routing logic.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/S1FOjKzjrJI/AAAAAAAAAZk/30xvp9x8kac/s1600-h/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 191px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/S1FOjKzjrJI/AAAAAAAAAZk/30xvp9x8kac/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5427205392061148306" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 9: Deploy an egress service gateway&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We not only provide services to the public, but may also consume other public services.  In this case we deploy an egress service gateway, which can help to look up the service provider endpoints, extract the service policies of the public service providers ... 
etc.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/S1FPrgz6H3I/AAAAAAAAAZs/O-tszsjfnhk/s1600-h/P4.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 202px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/S1FPrgz6H3I/AAAAAAAAAZs/O-tszsjfnhk/s400/P4.png" alt="" id="BLOGGER_PHOTO_ID_5427206634918780786" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;One important function of the egress service gateway is to &lt;span style="font-weight: bold; font-style: italic;"&gt;manage my dependencies&lt;/span&gt; on external service providers.  It typically keeps a list of equivalent service providers together with their availability and response time, and routes my request to one of them according to my selection criteria (e.g. cheapest, most reliable, lowest latency ... etc.)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1FQmItpJaI/AAAAAAAAAZ0/lj455A6PB0g/s1600-h/P5.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 235px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/S1FQmItpJaI/AAAAAAAAAZ0/lj455A6PB0g/s400/P5.png" alt="" id="BLOGGER_PHOTO_ID_5427207642062333346" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 10: Implement service version control&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;My service will evolve after it is exposed to the public.  In the ideal case, only the service implementation changes but not the service interface, so there is no need to change the client code.&lt;br /&gt;&lt;br /&gt;But most likely it is not that ideal.  Chances are that I will need to change the service interface or the message format.  
In this case, I may need to run multiple versions of the service simultaneously to make sure I am not breaking any existing clients.  This means I need the ingress service gateway to be intelligent about routing the client request to the right version of the service implementation.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/S1FS-h8E3-I/AAAAAAAAAZ8/5L_MhyNtTPc/s1600-h/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 189px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/S1FS-h8E3-I/AAAAAAAAAZ8/5L_MhyNtTPc/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5427210260173873122" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;A typical way is to maintain a matrix of versions and keep track of their compatibilities.  For example, we can use a release convention such that all minor releases are required to be backward compatible but major releases are not.  With this compatibility matrix, the ingress gateway can determine the client version from its incoming request and route it to the server which has the latest compatible version.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Stage 11: Outsource infrastructure to public Cloud provider&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Purchasing the necessary hardware and maintaining it can be very costly, especially when there is idle time in the usage of computing resources.  
Idle time is usually unavoidable because we need to provision resources for the peak workload, so there is idle time during non-peak hours.&lt;br /&gt;&lt;br /&gt;For a more efficient use of computing resources, we can consider a public cloud provider such as &lt;a href="http://horicky.blogspot.com/2009/11/amazon-cloud-computing.html"&gt;Amazon AWS&lt;/a&gt; or &lt;a href="http://horicky.blogspot.com/2009/11/cloud-computing-patterns.html"&gt;Microsoft Azure&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;But it is important to note that running an application in the cloud may require redesigning the application to cope with &lt;a href="http://horicky.blogspot.com/2009/08/traditional-saas-vs-cloud-enabled-saas.html"&gt;some unique characteristics of the cloud environment&lt;/a&gt;, such as &lt;a href="http://horicky.blogspot.com/2009/08/skinny-straw-in-cloud-shake.html"&gt;how to deal with high network latency and bandwidth cost&lt;/a&gt;, as well as how to &lt;a href="http://horicky.blogspot.com/2009/01/design-pattern-for-eventual-consistency.html"&gt;design the application to live with an eventually consistent DB&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1021984589750791684?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1021984589750791684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1021984589750791684' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1021984589750791684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1021984589750791684'/><link rel='alternate' type='text/html' 
href='http://horicky.blogspot.com/2010/01/roadmap-to-saas.html' title='RoadMap to SaaS'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j6mB7TMmJJY/S1D-v7MJ9GI/AAAAAAAAAYs/l6gvgdsp7SQ/s72-c/p1.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-7462264716460971116</id><published>2010-01-12T15:23:00.000-08:00</published><updated>2010-01-13T16:32:18.580-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Memcached'/><category scheme='http://www.blogger.com/atom/ns#' term='Cache'/><category scheme='http://www.blogger.com/atom/ns#' term='Coherence'/><title type='text'>Notes on Oracle Coherence</title><content type='html'>Oracle Coherence is a distributed cache that is functionally comparable to &lt;a href="http://horicky.blogspot.com/2009/10/notes-on-memcached.html"&gt;Memcached&lt;/a&gt;.  On top of the basic cache API functions, it has some additional capabilities that are attractive for building large scale enterprise applications.&lt;br /&gt;&lt;br /&gt;The API is based on the Java Map (Hashtable) interface, which provides key/value store semantics where the value can be any Java Serializable object.  
Coherence allows data to be stored in multiple caches, each identified by a unique name (which they call a "named cache").&lt;br /&gt;&lt;br /&gt;The code examples below are extracted from &lt;a href="http://www.jugs.ch/html/events/slides/080310_AnEngineers_Introduction_to_Oracle_Coherence.ppt"&gt;the great presentation from Brian Oliver of Oracle&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The common usage pattern is to first locate a cache by its name, and then act on the cache.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Basic cache function (Map, JCache)&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Get data by key&lt;/li&gt;&lt;li&gt;Update data by key&lt;/li&gt;&lt;li&gt;Remove data by key&lt;/li&gt;&lt;/ul&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;NamedCache nc = CacheFactory.getCache("mine");&lt;br /&gt;&lt;br /&gt;Object previous = nc.put("key", "hello world");&lt;br /&gt;&lt;br /&gt;Object current = nc.get("key");&lt;br /&gt;&lt;br /&gt;int size = nc.size();&lt;br /&gt;&lt;br /&gt;Object value = nc.remove("key");&lt;br /&gt;&lt;br /&gt;Set keys = nc.keySet();&lt;br /&gt;&lt;br /&gt;Set entries = nc.entrySet();&lt;br /&gt;&lt;br /&gt;boolean exists = nc.containsKey("key");&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Cache Modification Event Listener (ObservableMap)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You can register an event listener on a cache to catch certain change events that happen within the cache.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;New cache item is inserted&lt;/li&gt;&lt;li&gt;Existing cache item is deleted&lt;/li&gt;&lt;li&gt;Existing cache item is updated&lt;/li&gt;&lt;/ul&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; 
font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;NamedCache nc = CacheFactory.getCache("stocks");&lt;br /&gt;&lt;br /&gt;nc.addMapListener(new MapListener() {&lt;br /&gt;   public void entryInserted(MapEvent mapEvent) {&lt;br /&gt;       ...&lt;br /&gt;   }&lt;br /&gt;&lt;br /&gt;   public void entryUpdated(MapEvent mapEvent) {&lt;br /&gt;       ...&lt;br /&gt;   }&lt;br /&gt;&lt;br /&gt;   public void entryDeleted(MapEvent mapEvent) {&lt;br /&gt;       ...&lt;br /&gt;   }&lt;br /&gt;});&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;View of Filtered Cache (QueryMap)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You can also define a "view" by providing a "filter", which is basically a boolean function.  Only items that this function evaluates to true will be visible in this view.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;NamedCache nc = CacheFactory.getCache("people");&lt;br /&gt;&lt;br /&gt;Set keys =&lt;br /&gt;   nc.keySet(new LikeFilter("getLastName", "%Stone%"));&lt;br /&gt;&lt;br /&gt;Set entries =&lt;br /&gt;   nc.entrySet(new EqualsFilter("getAge", 35));&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Continuous Query Support (ContinuousQueryCache)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The view can also be used as a "continuous query".  
All newly arriving data that fulfills the filter criteria will be included automatically in the view.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;NamedCache nc = CacheFactory.getCache("stocks");&lt;br /&gt;&lt;br /&gt;NamedCache expensiveItems =&lt;br /&gt;   new ContinuousQueryCache(nc,&lt;br /&gt;                       new GreaterFilter("getPrice", 1000));&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Parallel Query Support (InvocableMap)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We can also spread a query execution and partial aggregation across all nodes and have them execute in parallel, followed by the final aggregation.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;NamedCache nc = CacheFactory.getCache("stocks");&lt;br /&gt;&lt;br /&gt;Double total =&lt;br /&gt;   (Double)nc.aggregate(AlwaysFilter.INSTANCE,&lt;br /&gt;                      new DoubleSum("getQuantity"));&lt;br /&gt;&lt;br /&gt;Set symbols =&lt;br /&gt;   (Set)nc.aggregate(new EqualsFilter("getOwner", "Larry"),&lt;br /&gt;                     new DistinctValues("getSymbol"));&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Parallel Execution Processing Support (InvocableMap)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We can ship a piece of processing logic to all nodes, which will execute the processing in parallel.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; 
overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;NamedCache nc = CacheFactory.getCache("stocks");&lt;br /&gt;&lt;br /&gt;nc.invokeAll(new EqualsFilter("getSymbol", "ORCL"),&lt;br /&gt;            new StockSplitProcessor());&lt;br /&gt;&lt;br /&gt;class StockSplitProcessor extends AbstractProcessor {&lt;br /&gt;   public Object process(Entry entry) {&lt;br /&gt;       Stock stock = (Stock)entry.getValue();&lt;br /&gt;       stock.quantity *= 2;&lt;br /&gt;       entry.setValue(stock);&lt;br /&gt;       return null;&lt;br /&gt;   }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Implementation Architecture&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Oracle Coherence runs on a cluster of identical server machines connected via a network. Within each server, there are multiple layers of software that provide a unified data storage and processing abstraction over a distributed environment.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/S01T8bkUOBI/AAAAAAAAAYk/EVY_4tA-cRw/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 325px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/S01T8bkUOBI/AAAAAAAAAYk/EVY_4tA-cRw/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5426085423708649490" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Smart Data Proxy&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The application typically runs within a node of the cluster as well.  
The cache interface is implemented by a set of smart data proxies, which know the location of the master (primary) and slave (backup) copies of the data based on its key.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Read through with 2 level cache&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When the client reads data from the proxy, it first tries to find the data in a local cache (also called the "near cache", within the same machine). If it is not found, the smart proxy will then look in the distributed cache (also called the L2 cache) for the corresponding copy.  Since this is a read, either a master or a slave copy is fine.  If the smart proxy cannot find the data in the distributed cache, it will look up the data from the backend DB.  The returned data will then propagate back to the client and the caches will be populated.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Master/Slave data partitioning&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Updating data (insert, update, delete) is done in the reverse way.  Under the master/slave architecture, all updates will go to the corresponding master node that owns that piece of data.  Coherence supports two modes of update: "Write through" and "Write behind".  "&lt;span style="font-weight: bold;"&gt;Write through&lt;/span&gt;" will update the DB backend immediately after updating the master copy, but before updating the slave copy, and therefore keeps the DB always up to date.  "&lt;span style="font-weight: bold;"&gt;Write behind&lt;/span&gt;" will update the slave copy and then the DB in an asynchronous fashion.  
Data loss is possible in "write behind" mode, but it has a higher throughput because multiple writes can be merged into a single write, resulting in fewer writes to the DB.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Moving processing logic towards data&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;While extracting data from the cache to the application is a typical way of processing data, it is not very scalable when a large volume of data needs to be processed.  Instead of shipping the data to the processing logic, a much more efficient way is to ship the processing logic to where the data resides.  This is exactly why Oracle Coherence provides an InvocableMap interface, where the client can provide a "processor" class that gets shipped to every node so that processing can be conducted with local data.  Moving code towards the data distributed across many nodes also enables parallel processing, because now every node can conduct local processing in parallel.&lt;br /&gt;&lt;br /&gt;The processor logic is shipped into the processing queue of the execution node, which has an active processor dequeue the processor object and execute it.  Notice that this execution is performed in a serial manner; in other words, the processor will completely finish a processing job before proceeding to the next job.  
There is no need to worry about multi-threading issues or to use locks, and therefore there are no deadlock issues.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-7462264716460971116?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/7462264716460971116/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=7462264716460971116' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7462264716460971116'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7462264716460971116'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2010/01/notes-on-oracle-coherence.html' title='Notes on Oracle Coherence'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_j6mB7TMmJJY/S01T8bkUOBI/AAAAAAAAAYk/EVY_4tA-cRw/s72-c/p1.png' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-5949856658973784713</id><published>2009-12-21T09:17:00.000-08:00</published><updated>2009-12-21T17:23:38.290-08:00</updated><title type='text'>Takeaways on Responsive Design</title><content type='html'>I recently attended a great talk by Kent Beck, who is the thought leader in Extreme Programming, Test-Driven Development and code refactoring.  
This talk, "Responsive Design", outlines a set of key principles for creating a design that is "malleable" so it can respond quickly to future changes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Anticipating Changes&lt;/span&gt;&lt;br /&gt;I don't think one can design a system without knowing the context of the problem.  You have to know what problem your solution is trying to solve.  Of course, you may not know the details of every aspect, or you may know that certain parts of the requirements may undergo some drastic changes.  In these areas where changes are anticipated, you need to build more flexibility into your design.&lt;br /&gt;&lt;br /&gt;In my opinion, knowing what you know well and what you don't know is important.  Good designers usually have a good instinct for distinguishing between the "known" and the "unknown", and adjust the flexibility of their design along the way as more information is gathered.&lt;br /&gt;&lt;br /&gt;As more information is gathered, the dynamics of "change anticipation" also evolve.  Certain parts of your system have fewer anticipated changes because there are fewer unknowns, so now you can trade off some flexibility for efficiency or simplicity.  On the other hand, you may discover that certain parts of the system have more anticipated changes, and so even more flexibility is needed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Safe Steps&lt;/span&gt;&lt;br /&gt;If the system is already in production, making changes to the architecture is harder because there are other systems that may already depend on it.  Any change to it may break those other systems.  
"Safe Steps" is about how we can design the changes to an existing system with minimal impact on the other systems that depend on it.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Design for Evolution&lt;/span&gt;&lt;br /&gt;One important aspect of designing a system is to look not just at what the end result should be, but also at what the evolution path of the system should look like.  The key idea is that a time dimension is introduced here and the overall cost and risk should be summed along the time dimension.&lt;br /&gt;&lt;br /&gt;In other words, it is not about whether you have designed a solution that finally meets the business requirement.  What is important is how your solution brings value to the business as it evolves over time.  A good design is a live animal that can breathe and evolve together with your business.&lt;br /&gt;&lt;br /&gt;Kent also talked about 4 key design approaches under different conditions.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;1. Leap&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;"Leap" is a brave approach where you go ahead and design and implement the new system.  When complete, swap out the existing components for the new ones.  This approach requires a very good understanding of the functionality of the system you want to build, how the existing system works, and how other systems depend on the existing system.&lt;br /&gt;&lt;br /&gt;"Leap" can be a very effective approach if the system is very self-contained with clearly defined responsibilities.  But in general, this approach is somewhat high-risk and is the exception rather than the norm for large enterprise applications.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;2. Parallel&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;"Parallel" takes a "wrap and replace" approach.  
The new system is designed to run in parallel with the existing system so that migration can be conducted gradually in a risk-containing manner.  If any problem happens during the migration, the whole system can be switched back to the original system immediately, so the risk is contained.&lt;br /&gt;&lt;br /&gt;After 100% of the clients have been migrated to the new system for some period of time, the old system can be shut down without the clients even noticing it.&lt;br /&gt;&lt;br /&gt;The "Parallel" approach still requires you to have a clear understanding of what you want to build.  But it relieves you from knowing the dependencies of the existing system.  Of course, the design may be more complicated because it needs to run in parallel with the existing system and has to deal with things like data consistency and synchronization issues.&lt;br /&gt;&lt;br /&gt;"Parallel" is the predominant approach that I've seen people use in reality.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;3. Stepping Stone&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;"Stepping Stone" is very useful when you don't exactly know your destination, but you know that there are some intermediate steps that you have to take.  In this approach, the designer focuses on those intermediate steps that will lead to the final destination.&lt;br /&gt;&lt;br /&gt;"Stepping Stone" requires the designer to have a "scope of variability" in mind about the final solution and then identify some common ground across the perceived variability.  Knowing what you don't know is important to define the "stepping stone".&lt;br /&gt;&lt;br /&gt;This approach is also very useful for designing the evolution path of the system, since you need to look at how the "stepping stone" will provide value to the existing systems while it evolves.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;4. 
Simplification&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;"Simplification" is about how you can simplify your intermediate design by breaking down the requirement into multiple phases.  I personally don't think the designer has the flexibility to change the ultimate requirement, but she definitely can break down the ultimate requirement into multiple phases so she can control the evolution path of her design.  In other words, by simplifying the requirement in each phase, she can pick the challenges that she wants to tackle in different phases.&lt;br /&gt;&lt;br /&gt;"Simplification" is also an important skill for experienced designers.  The only way to tackle any complex system beyond a human's brain power is to break the original complexity down into simpler systems and tackle them in incremental steps.&lt;br /&gt;&lt;br /&gt;"Simplification" is also an important abstraction skill, where an experienced designer can generalize a specific problem case into a generic problem pattern for which a generic solution can be found (e.g. 
design pattern), and then customize a specific solution from the design pattern.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-5949856658973784713?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/5949856658973784713/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=5949856658973784713' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5949856658973784713'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5949856658973784713'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/12/takeaways-on-responsive-design.html' title='Takeaways on Responsive Design'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-5582612444398481723</id><published>2009-11-28T21:19:00.000-08:00</published><updated>2009-11-30T16:34:35.907-08:00</updated><title type='text'>Query Processing for NOSQL DB</title><content type='html'>The recent rise of NoSQL provides an alternative model for building extremely large scale storage systems.  
Nevertheless, compared to the more mature RDBMS, NoSQL has some fundamental limitations that we need to be aware of.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;It calls for a more relaxed data consistency model&lt;/li&gt;&lt;li&gt;It provides only primitive querying and searching capability&lt;/li&gt;&lt;/ol&gt;There are techniques we can employ to mitigate each of these issues.  Regarding the data consistency concern, I have discussed a number of &lt;a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;design patterns in my previous blog&lt;/a&gt; for implementing systems with different strengths of consistency guarantee.&lt;br /&gt;&lt;br /&gt;Here I would like to try to tackle the second issue.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;So what is the problem ?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Many of the NoSQL DBs today are based on the DHT (Distributed Hash Table) model, which provides hashtable access semantics.  To access or modify any object data, the client is required to supply the primary key of the object, and the DB will then look up the object using an equality match on the supplied key.&lt;br /&gt;&lt;br /&gt;For example, if we use a DHT to implement a customer DB, we can choose the customer id as the key.  
And then we can get/set/operate on any customer object if we know its id&lt;br /&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;cust_data = DHT.get(cust_id)&lt;/li&gt;&lt;li&gt;DHT.set(cust_id, modified_cust_data)&lt;/li&gt;&lt;li&gt;DHT.execute(cust_id, func(cust) {cust.add_credit(200)})&lt;/li&gt;&lt;/ul&gt;In the real world, we may want to search data based on attributes other than its primary key, we may want to search attributes based on a "greater/less than" relationship, or we may want to combine multiple search criteria using a boolean expression.&lt;br /&gt;&lt;br /&gt;Using our customer DB example, we may want to ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Lookup customers within a zip code&lt;/li&gt;&lt;li&gt;Lookup customers whose income is above 200K&lt;/li&gt;&lt;li&gt;Lookup customers using the keywords "chief executives"&lt;/li&gt;&lt;/ul&gt;Although query processing and indexing techniques are pretty common in the RDBMS world, they are seriously lacking in the NoSQL world because of the very nature of the "distributed architecture" underlying most NoSQL DBs.&lt;br /&gt;&lt;br /&gt;It seems to me that the responsibility of building an indexing and query mechanism lands on the NoSQL user.  Therefore, I want to explore some possible techniques to handle this.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Companion SQL-DB&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A very straightforward approach to providing querying capability is to augment NoSQL with an RDBMS or TextDB (for keyword search).  e.g. 
We add the metadata of the object into a RDBMS so we can query its metadata using standard SQL query.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/SxKcnXYD5GI/AAAAAAAAAX0/5uC0_Uo1fU8/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 274px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/SxKcnXYD5GI/AAAAAAAAAX0/5uC0_Uo1fU8/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5409558302529152098" border="0" /&gt;&lt;/a&gt;Of course, this requires the RDBMS to be large enough to store the search-able attributes of each object.  Since we only store the attributes required for search, rather than the whole object into the RDBMS, this turns out to be a very practical and common approach.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Scatter/Gather Local Search&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Some of the NOSQL DB provides indexing and query processing mechanism within the local DB.  
In this case, we can have the query processor broadcast the query to every node in the DHT, where a local search is conducted and the results are sent back to the query processor, which aggregates them into a single response.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SxKg-khvbpI/AAAAAAAAAYE/dOUwXxxxmmI/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 171px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SxKg-khvbpI/AAAAAAAAAYE/dOUwXxxxmmI/s320/P2.png" alt="" id="BLOGGER_PHOTO_ID_5409563099242917522" border="0" /&gt;&lt;/a&gt;Notice that the search is happening in parallel across all nodes in the DHT.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Distributed B-Tree&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;B+Tree is a common indexing structure used in RDBMS.   A distributed version of B+Tree can also be used in a DHT environment.  The basic idea is to hash the search-able attribute to locate the root node of the B+ Tree.  The "value" of the root node contains the ids of its children nodes.  So the client can then issue another DHT lookup call to find the appropriate child node.  Continuing this process, the client eventually navigates down to the leaf node, where the id of the object matching the search criteria is found.  
Then the client will issue another DHT lookup to extract the actual object.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SxNVUPHGegI/AAAAAAAAAYU/gKeMbb0lipc/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 205px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SxNVUPHGegI/AAAAAAAAAYU/gKeMbb0lipc/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5409761383544158722" border="0" /&gt;&lt;/a&gt;Caution is needed when a B+Tree node is updated due to a split/merge caused by object creation and deletion.  Ideally, this should be done in an atomic fashion.  &lt;a href="http://www.vldb.org/pvldb/1/1453922.pdf"&gt;This paper&lt;/a&gt; from Microsoft, HP and Toronto U describes a distributed transaction protocol to provide the required atomicity.  A distributed transaction is an expensive operation but its use here is justified because B+ tree updates rarely involve more than a single machine.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Prefix Hash Table&lt;/span&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt; (distributed Trie)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Trie is an alternative data structure, where every path (from the root) contains the prefix of the key.   Basically, every node in the Trie contains all the data whose key is prefixed by it.  
Berkeley and Intel Research &lt;a href="http://berkeley.intel-research.net/sylvia/pht.pdf"&gt;have a paper&lt;/a&gt; describing this mechanism.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/SxQnQMBEnFI/AAAAAAAAAYc/fg8x2kdZH3Y/s1600/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 323px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/SxQnQMBEnFI/AAAAAAAAAYc/fg8x2kdZH3Y/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5409992211435920466" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;1.  Lookup a key&lt;/span&gt;&lt;br /&gt;To locate a particular key, we start with its one-digit prefix and do a DHT lookup to see if we get a leaf node.  If so, we search within this leaf node as we know the key must be contained inside.  If it is not a leaf node, we extend the prefix with an extra digit and repeat the whole process again.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# Locate the key next to input key&lt;br /&gt;def locate(key)&lt;br /&gt; leaf = locate_leaf_node(key)&lt;br /&gt; return leaf.locate(key)&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;# Locate leaf node containing input key&lt;br /&gt;def locate_leaf_node(key)&lt;br /&gt; # Try successively longer prefixes until a leaf node is reached&lt;br /&gt; for i in 1 .. key.length&lt;br /&gt;   node = DHT.lookup(key.prefix(i))&lt;br /&gt;   return node if node.is_leaf?&lt;br /&gt; end&lt;br /&gt; raise "no leaf node found for key"&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;2.  
Range Query&lt;/span&gt;&lt;br /&gt;A range query can be performed by first locating the leaf node that contains the start key and then walking in ascending order until we exceed the end key.  Note that we can walk from one leaf node to the next by following the leaf node chain.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# Collect the keys that fall between startkey and endkey&lt;br /&gt;def range(startkey, endkey)&lt;br /&gt; result = Array.new&lt;br /&gt; leaf = locate_leaf_node(startkey)&lt;br /&gt; while leaf != nil&lt;br /&gt;   result.concat(leaf.range(startkey, endkey))&lt;br /&gt;   # Stop once this leaf already reaches the end key&lt;br /&gt;   break unless leaf.largestkey &amp;lt; endkey&lt;br /&gt;   leaf = leaf.nextleaf&lt;br /&gt; end&lt;br /&gt; return result&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;To speed up the search, we can use a parallel search mechanism.  Instead of walking from the start key in a sequential manner, we can find the common prefix of the start key and end key (as we know all the results are under its subtree) and perform a parallel search of the leaf nodes of this subtree.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;3.  Insert and Delete keys&lt;/span&gt;&lt;br /&gt;To insert a new key, we first identify the leaf node that should contain the inserted key.  If the leaf node has available capacity (fewer than B keys), then we simply add it there.  Otherwise, we need to split the leaf node into two branches and redistribute its existing keys to the newly created child nodes.&lt;br /&gt;&lt;br /&gt;To delete a key, we similarly identify the leaf node that contains the deleted key and then remove it from there.   
This may cause some parent nodes to have fewer than B + 1 keys, so we may need to merge some child nodes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Combining Multiple Search Criteria&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When we have multiple criteria in the search, each criterion may use a different index that resides within a different set of machines in the DHT. Multiple criteria can be combined using boolean operators such as OR / AND.  Performing an OR operation is very straightforward because we just need to union the results of each individual index search that is performed separately.   On the other hand, performing an AND operation is trickier because we need to deal with the situation where each individual criterion may have a large number of matches but their intersection is small.  The challenge is:  how can we efficiently perform an intersection between two potentially very large sets ?&lt;br /&gt;&lt;br /&gt;One naive implementation is to send all matched object ids of each criterion to a server that performs the set intersection.   If each data set is large, this approach may consume a lot of bandwidth for sending across all the potential object ids.&lt;br /&gt;&lt;br /&gt;A number of techniques are described &lt;a href="http://issg.cs.duke.edu/search/search.pdf"&gt;in this paper&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;1. Bloom Filter&lt;/span&gt;&lt;br /&gt;Instead of sending the whole set of matched object ids, we can send a more compact representation called a "Bloom Filter".  A &lt;a href="http://en.wikipedia.org/wiki/Bloom_filter"&gt;Bloom filter&lt;/a&gt; is a compact representation of a set that can be used for testing set membership.  
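&lt;br /&gt;&lt;br /&gt;To make the idea concrete, here is a minimal Ruby sketch of shipping a Bloom filter instead of the raw id set.  It is only an illustration under my own assumptions: the BloomFilter class, the intersect_via_bloom helper, the filter size (1024 bits) and the number of hash functions (3) are made up for this example and do not come from the paper or from any particular NoSQL product.&lt;br /&gt;

```ruby
require 'digest'

# Minimal Bloom filter sketch. The bit-array size and the number of hash
# functions are illustrative assumptions, not tuned values.
class BloomFilter
  def initialize(num_bits, num_hashes)
    @num_bits = num_bits
    @num_hashes = num_hashes
    @bits = Array.new(num_bits, false)
  end

  # Set the bit at every hash position of the item.
  def add(item)
    positions(item).each { |pos| @bits[pos] = true }
  end

  # True if the item may be in the set; false means it is definitely not.
  def include?(item)
    positions(item).all? { |pos| @bits[pos] }
  end

  private

  # Derive num_hashes bit positions by salting a SHA-256 digest.
  def positions(item)
    (0...@num_hashes).map do |seed|
      Digest::SHA256.hexdigest("#{seed}:#{item}").to_i(16) % @num_bits
    end
  end
end

# The machine holding set1 builds the filter and ships only the filter;
# the machine holding set2 keeps the ids that pass the membership test.
def intersect_via_bloom(set1, set2, num_bits = 1024, num_hashes = 3)
  filter = BloomFilter.new(num_bits, num_hashes)
  set1.each { |id| filter.add(id) }
  set2.select { |id| filter.include?(id) }
end

candidates = intersect_via_bloom([101, 205, 307, 411], [307, 411, 999])
puts candidates.inspect
```

&lt;br /&gt;In this sketch the candidate list always contains 307 and 411, because a Bloom filter never rejects an element that was added; 999 can only show up as a false positive.&lt;br /&gt;&lt;br /&gt;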
A membership test against the filter never produces a false negative, but it has a chance p of producing a false positive, which is controllable through the size of the filter and the number of hash functions.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SxNN7OJ4IDI/AAAAAAAAAYM/U01_ttgt0lE/s1600/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 270px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SxNN7OJ4IDI/AAAAAAAAAYM/U01_ttgt0lE/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5409753257209241650" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;To minimize bandwidth, we typically pick the machine with the larger set as the sending machine and perform the intersection on the receiving machine, which has the smaller set.&lt;br /&gt;&lt;br /&gt;Notice that the false positives can actually be completely eliminated by sending the matched result of Set2 back to the Set1 machine, which double checks the membership against set1 again.   In most cases, 100% precision is not needed and a small probability of false positives is often acceptable.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;2. Caching&lt;/span&gt;&lt;br /&gt;It is possible that certain search criteria are very popular and will be issued over and over again.  The corresponding bloom filters of these hot spots can be cached in the receiving machine.  Since a bloom filter has a small footprint, we can cache a lot of bloom filters of popular search criteria.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;3.  Incremental fetch&lt;/span&gt;&lt;br /&gt;In case the client doesn't need to get the full set of matched results, we can stream the data back to the client using a cursor mode.  Basically, at the sending side, set1 is sorted and broken into smaller chunks with a bloom filter computed and attached to each chunk.  
At the receiving side, every element of set2 is checked for every bloom filter per chunk.&lt;br /&gt;&lt;br /&gt;Notice that we save computation at the sending side (compute the bloom filter for the chunk rather than the whole set1) at the cost of doing more at the receiving side (since we need to repeat the checking of the whole set2 for each chunk of set1).  The assumption is that client only needs a small subset of all the matched data.&lt;br /&gt;&lt;br /&gt;An optimization we can do is to mark the range of each chunk in set1 and ask set2 to skip the objects that falls within the same range.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-5582612444398481723?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/5582612444398481723/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=5582612444398481723' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5582612444398481723'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/5582612444398481723'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/query-processing-for-nosql-db.html' title='Query Processing for NOSQL DB'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/_j6mB7TMmJJY/SxKcnXYD5GI/AAAAAAAAAX0/5uC0_Uo1fU8/s72-c/p1.png' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-4430087149556563351</id><published>2009-11-19T12:44:00.000-08:00</published><updated>2009-11-25T07:10:05.524-08:00</updated><title type='text'>Cloud Computing Patterns</title><content type='html'>I have attended a presentation by &lt;a href="http://simonguest.com/blogs/smguest/default.aspx"&gt;Simon Guest&lt;/a&gt; from Microsoft on their cloud computing architecture.  Although there was no new concept or idea introduced, Simon has provided an excellent summary on the major patterns of doing cloud computing.&lt;br /&gt;&lt;br /&gt;I have to admit that I am not familiar with Azure and this is my first time hearing a Microsoft cloud computing presentation.  I felt Microsoft has explained their Azure platform in a very comprehensible way.  I am quite impressed.&lt;br /&gt;&lt;br /&gt;Simon talked about 5 patterns of Cloud computing.  Let me summarize it (and mix-in a lot of my own thoughts) ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;1.   Use Cloud for Scaling&lt;/span&gt;&lt;br /&gt;The key idea is to spin up and down machine resources according to workload so the user only pay for the actual usage.  
There is two types of access patterns: passive listener model and active worker model.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Passive listener model&lt;/span&gt; uses a synchronous communication pattern where the client pushes request to the server and synchronously wait for the processing result.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYh8WQq4nI/AAAAAAAAAT8/1iVfT0Bu2Cg/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 204px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYh8WQq4nI/AAAAAAAAAT8/1iVfT0Bu2Cg/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5406045723356226162" border="0" /&gt;&lt;/a&gt;In the passive listener model, machine instances are typically sit behind a load balancer. To scale the resource according to the work load, we can use a monitor service that send NULL client request and use the measured response time to spin up and down the size of the machine resources.&lt;br /&gt;&lt;br /&gt;On the other hand, &lt;span style="font-weight: bold;"&gt;Active worker model&lt;/span&gt; uses an asynchronous communication patterns where the client put the request to a queue, which will be periodically polled by the server.  After queuing the request, the client will do some other work and come back later to pickup the result.  
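&lt;br /&gt;&lt;br /&gt;As a rough illustration of the active worker model, here is a small single-process Ruby sketch.  It is a simplification under my own assumptions: an in-memory Queue stands in for the cloud queue service, a blocking pop stands in for the server's periodic polling, and the names (requests, results, the :stop marker) are invented for this example.&lt;br /&gt;

```ruby
require 'thread'

# Active worker sketch: the client enqueues a request and moves on, while a
# worker pulls requests off the queue and stores results for later pickup.
requests = Queue.new
results = {}
lock = Mutex.new

worker = Thread.new do
  loop do
    job = requests.pop               # blocks until a request is available
    break if job == :stop            # hypothetical shutdown marker
    answer = job[:payload].upcase    # stand-in for the real processing
    lock.synchronize { results[job[:id]] = answer }
  end
end

# Client side: queue the request, do other work, then come back for the result.
requests.push({ id: 1, payload: "report" })
requests.push(:stop)
worker.join
puts lock.synchronize { results[1] }   # prints "REPORT"
```

&lt;br /&gt;The client never waits on the worker directly; it only touches the queue and, later, the result store.&lt;br /&gt;&lt;br /&gt;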
The client can also provide a callback address where the server can push the result into after the processing is done.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYiBY4GmII/AAAAAAAAAUE/HBkz9z8M-io/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 199px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYiBY4GmII/AAAAAAAAAUE/HBkz9z8M-io/s320/P2.png" alt="" id="BLOGGER_PHOTO_ID_5406045809957836930" border="0" /&gt;&lt;/a&gt;In the active worker model, the monitor can measure the number of requests sitting in the queue and use that to determine whether machine instances (at the consuming end) need to be spin up or down.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;2.  Use Cloud for Multi-tenancy&lt;/span&gt;&lt;br /&gt;&lt;a href="http://horicky.blogspot.com/2009/08/multi-tenancy-in-cloud-computing.html"&gt;Multi-tenancy&lt;/a&gt; is more a SaaS provider (rather than an enterprise) usage scenario.  The key idea is to use the same set of code / software to host the application for different customers (tenants)   who may have slightly different requirement in&lt;br /&gt;&lt;ul&gt;&lt;li&gt;UI branding&lt;/li&gt;&lt;li&gt;Business rules for decision criteria&lt;/li&gt;&lt;li&gt;Data schema&lt;/li&gt;&lt;/ul&gt;The approach is to provide sufficient "customization" capability for their customer.  The most challenging part is to determine which aspects should be opened for customization and which shouldn't.  After identifying these configurable parameters, it is straightforward to define configuration metadata to capture that.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;3.  Use Cloud for Batch processing&lt;br /&gt;&lt;/span&gt;This is about running things like statistics computation, report generation, machine learning, analytics ... 
etc.  These tasks are done in batch mode and so it is more economical to use the "pay as you go" model.  In addition, batch processing has a very high tolerance for latency and so is a perfect candidate for running in the cloud.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYiOU8WuPI/AAAAAAAAAUU/yc_ZTspSGZ8/s1600/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 237px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYiOU8WuPI/AAAAAAAAAUU/yc_ZTspSGZ8/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5406046032240228594" border="0" /&gt;&lt;/a&gt;Here is an example of how to run a &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;Map/Reduce framework&lt;/a&gt; in the cloud.  Microsoft hasn't provided a Map/Reduce solution at this moment but Simon mentioned that Dryad in Microsoft Research may be a future Microsoft solution.  Interestingly, Simon also recommended Hadoop.&lt;br /&gt;&lt;br /&gt;Of course, one challenge is how to move the data into the cloud in the first place.  In my earlier blog, I have described &lt;a href="http://horicky.blogspot.com/2009/08/skinny-straw-in-cloud-shake.html"&gt;some best practices &lt;/a&gt;on this.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;4.  Use Cloud for Storage&lt;/span&gt;&lt;br /&gt;The idea is to store data in the cloud so there is no need to worry about DBA tasks.  Most cloud vendors provide large scale key/value stores as well as RDBMS services.  Their data storage services will also take care of data partitioning, replication ... etc.  
Building &lt;a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;cloud storage is a big topic&lt;/a&gt; involving many distributed computing concepts and techniques, I have covered it in a &lt;a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html"&gt;separate blog&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;5.  Use Cloud for Communication&lt;/span&gt;&lt;br /&gt;A &lt;span style="font-weight: bold;"&gt;queue&lt;/span&gt; (or mailbox) service provide a mechanism for different machines to communicate in an asynchronous manner via message passing.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYsvJbTyOI/AAAAAAAAAUk/cFQGNjplDlM/s1600/P5.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 118px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYsvJbTyOI/AAAAAAAAAUk/cFQGNjplDlM/s320/P5.png" alt="" id="BLOGGER_PHOTO_ID_5406057591200794850" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Azure also provide a &lt;span style="font-weight: bold;"&gt;relay service&lt;/span&gt; in the cloud which is quite useful for machines behind different firewall to communicate.  In a typical firewall setup, incoming connection is not allowed so these machine cannot directly establish a socket to each other.  
In order for them to communicate, each need to open an on-going socket connection to the cloud relay, which will route traffic between these connections.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/SwYtJzwTCEI/AAAAAAAAAU0/L_i0ubRjVks/s1600/P6.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 208px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/SwYtJzwTCEI/AAAAAAAAAU0/L_i0ubRjVks/s400/P6.png" alt="" id="BLOGGER_PHOTO_ID_5406058049239713858" border="0" /&gt;&lt;/a&gt;I have used the same technique in a previous P2P project where user's PC behind their firewall need to communicate, and I know this relay approach works very well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-4430087149556563351?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/4430087149556563351/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=4430087149556563351' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4430087149556563351'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4430087149556563351'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/cloud-computing-patterns.html' title='Cloud Computing Patterns'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_j6mB7TMmJJY/SwYh8WQq4nI/AAAAAAAAAT8/1iVfT0Bu2Cg/s72-c/p1.png' height='72' width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-478364449624074493</id><published>2009-11-16T17:27:00.001-08:00</published><updated>2009-11-16T18:36:57.756-08:00</updated><title type='text'>Impression on Scala</title><content type='html'>I have been hearing quite a lot of good comments about the Scala programming language.  I personally used Java extensively in the past and have switched to Ruby (and some Erlang) in the last 2 years.  The following features that I heard about Scala really attract me ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Scala code is compiled to Java byte code and runs natively on the JVM.  Code written in Scala immediately enjoys the performance and robustness of Java VM technologies.&lt;/li&gt;&lt;li&gt;It is easy to integrate with Java code and libraries, so Scala code immediately enjoys the wide portfolio of existing Java libraries.&lt;/li&gt;&lt;li&gt;It has good support for the Actor model, which I believe is an important programming paradigm for multi-core machine architecture.&lt;/li&gt;&lt;/ul&gt;So I decided to take a Scala tutorial from Dean Wampler today at the &lt;a href="http://qconsf.com/sf2009/conference"&gt;QCon conference&lt;/a&gt;.  This is a summary of my impression of Scala after the class.&lt;br /&gt;&lt;br /&gt;First of all, Scala is a statically typed language.  However, it has a type inference mechanism so type declaration is often optional.  But in some places (like a method signature), type declaration is mandatory.  It is not very clear to me when I have to declare a type.&lt;br /&gt;&lt;br /&gt;Having the "val" and "var" declaration in variables is very nice because it makes immutability explicit.  
In Ruby, you can make an object immutable by sending it a freeze() method but Scala do this more explicitly.&lt;br /&gt;&lt;br /&gt;But I found it confusing to have a method define in 2 different ways&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;class A() {&lt;br /&gt;   def hello {&lt;br /&gt;       ...&lt;br /&gt;   }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;class A() {&lt;br /&gt;   def hello = {&lt;br /&gt;       ...&lt;br /&gt;   }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;The MyFunction[+A1, -A2] is really confusing to me.  
I feel a dynamically typed language is much easier.&lt;br /&gt;&lt;br /&gt;Being able to drop the opening and closing braces also causes me a lot of confusion.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;class Person(givenName: String) {&lt;br /&gt;   var myName = givenName&lt;br /&gt;   // Setter method; its name is "name_="&lt;br /&gt;   def name_=(anotherName: String) = {&lt;br /&gt;       myName = anotherName&lt;br /&gt;   }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;class Person(givenName: String) {&lt;br /&gt;   var myName = givenName&lt;br /&gt;   // Same setter with the braces dropped&lt;br /&gt;   def name_=(anotherName: String) = myName = anotherName&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;The special "implicit" conversion method provides a mechanism to develop a DSL (Domain Specific Language) in Scala but it also looks very odd to me.  Basically, you need to import a SINGLE implicit conversion method that needs to take care of all possible conversions.&lt;br /&gt;&lt;br /&gt;The rule that any method whose name ends with ":" is invoked in reverse order (on its right operand) also looks odd to me.&lt;br /&gt;&lt;br /&gt;Traits provide mixins for Scala but I feel the "Module" mechanism in Ruby has done a better job.&lt;br /&gt;&lt;br /&gt;Scala has the notion of "function" and can pass "function" as parameters.  
Again, I feel Ruby blocks has done a better job.&lt;br /&gt;&lt;br /&gt;Perhaps due to JVM's limitation of supporting a dynamic language, Scala is not very strong in doing meta-programming, Scala doesn't provide the "open class" property where you can modify an existing class (add methods, change method implementation, add class ... etc.) at run time&lt;br /&gt;&lt;br /&gt;Scala also emulate a number of Erlang features but I don't feel it is doing a very clean job.  For example, it emulate the pattern matching style of Erlang programming using the case Class and unapply() method but it seems a little bit odd to me.&lt;br /&gt;&lt;br /&gt;Erlang has 2 cool features which I couldn't find in Scala (maybe I am expecting too much)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The ability to run two version of class at the same time&lt;/li&gt;&lt;li&gt;Able to create and pass function objects to a remote process (kinda like a remote code loading)&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Overall impression&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I have to admit that my impression on Scala is not as good as before I attend the tutorial.  Scala tries to put different useful programming paradigm in the JVM but I have a feeling of force-fit.  Of course its close tie to JVM is still a good reason to use Scala.  
But from a pure programming perspective, I will prefer to use a combination of Ruby and Erlang, rather than Scala.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-478364449624074493?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/478364449624074493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=478364449624074493' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/478364449624074493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/478364449624074493'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/impression-on-scala.html' title='Impression on Scala'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1999212675899968006</id><published>2009-11-15T19:33:00.001-08:00</published><updated>2009-11-25T10:04:55.019-08:00</updated><title type='text'>NOSQL Patterns</title><content type='html'>Over the last couple years, we see an emerging data storage mechanism for storing large scale of data.  These storage solution differs quite significantly with the RDBMS model and is also known as the NOSQL.  
Some of the key players include ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Google BigTable, HBase, Hypertable&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html"&gt;Amazon Dynamo&lt;/a&gt;, Voldemort, Cassandra, Riak&lt;/li&gt;&lt;li&gt;Redis&lt;/li&gt;&lt;li&gt;&lt;a href="http://horicky.blogspot.com/2008/10/couchdb-implementation.html"&gt;CouchDB&lt;/a&gt;, MongoDB&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; These solutions have a number of characteristics in common&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Key value store&lt;/li&gt;&lt;li&gt;Run on a large number of commodity machines&lt;/li&gt;&lt;li&gt;Data are partitioned and replicated among these machines&lt;/li&gt;&lt;li&gt;Relax the data consistency requirement.  (because the &lt;a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem"&gt;CAP theorem&lt;/a&gt; proves that you cannot get Consistency, Availability and Partition tolerance at the same time)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The aim of this blog post is to extract the underlying technologies that these solutions have in common, and to get a deeper understanding of the implications for your application's design.  
I do not intend to compare the features of these solutions, nor to suggest which one to use.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;API model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The underlying data model can be considered as a large Hashtable (key/value store).&lt;br /&gt;&lt;br /&gt;The basic form of API access is&lt;br /&gt;&lt;ul&gt;&lt;li&gt;get(key)  -- Extract the value given a key&lt;br /&gt;&lt;/li&gt;&lt;li&gt;put(key, value)  -- Create or Update the value given its key&lt;/li&gt;&lt;li&gt;delete(key) -- Remove the key and its associated value&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;A more advanced form of the API allows executing user-defined functions in the server environment&lt;br /&gt;&lt;ul&gt;&lt;li&gt;execute(key, operation, parameters)  -- Invoke an operation on the value (given its key), where the value is a special data structure (e.g. List, Set, Map .... etc).&lt;/li&gt;&lt;li&gt;mapreduce(keyList, mapFunc, reduceFunc)  --  Invoke a &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;map/reduce function&lt;/a&gt; across a key range.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Machines layout&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The underlying infrastructure is composed of a large number (hundreds or thousands) of cheap, commoditized, unreliable machines connected through a network.  We call each machine a physical node (PN).  Each PN has the same software configuration but may have varying hardware capacity in terms of CPU, memory and disk storage.  
Within each PN, there will be a variable number of virtual nodes (VNs) running according to the available hardware capacity of the PN.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwogI5pkEEI/AAAAAAAAAXE/xhrfSf8dmI4/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 277px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwogI5pkEEI/AAAAAAAAAXE/xhrfSf8dmI4/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407169639897894978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Data partitioning (Consistent Hashing)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Since the overall hashtable is distributed across many VNs, we need a way to map each key to the corresponding VN.&lt;br /&gt;&lt;br /&gt;One way is to use&lt;br /&gt;partition = key mod (total_VNs)&lt;br /&gt;&lt;br /&gt;The disadvantage of this scheme is that when we alter the number of VNs, the ownership of existing keys changes dramatically, which requires full data redistribution.  Most large scale stores use a "consistent hashing" technique to minimize the amount of ownership changes.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/SwohQZ9HTAI/AAAAAAAAAXM/X9CAGfpnL2o/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 196px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/SwohQZ9HTAI/AAAAAAAAAXM/X9CAGfpnL2o/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407170868340542466" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In the consistent hashing scheme, the key space is finite and lies on the circumference of a ring.  The virtual node id is also allocated from the same key space.  
For any key, its owner node is defined as the first virtual node encountered when walking clockwise from that key.  If the owner node crashes, all the keys it owns will be adopted by its clockwise neighbor.  Therefore, key redistribution happens only at the neighbor of the crashed node; all other nodes retain the same set of keys.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Data replication&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To provide high reliability from individually unreliable resources, we need to replicate the data partitions.&lt;br /&gt;&lt;br /&gt;Replication not only improves the overall reliability of data, it also helps performance by spreading the workload across multiple replicas.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwohcYWqCDI/AAAAAAAAAXU/oH0pDuht4vo/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 248px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwohcYWqCDI/AAAAAAAAAXU/oH0pDuht4vo/s320/P2.png" alt="" id="BLOGGER_PHOTO_ID_5407171074069235762" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;While read-only requests can be dispatched to any replica, update requests are more challenging because we need to carefully coordinate the updates that happen at these replicas.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Membership Changes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Notice that virtual nodes can join and leave the network at any time without impacting the operation of the ring.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;When a new node joins the network&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The joining node announces its presence and its id to some well-known VNs (or just broadcasts)&lt;/li&gt;&lt;li&gt;All the neighbors (left and right side) will adjust to the 
change of key ownership as well as the change of replica memberships.  This is typically done synchronously.&lt;/li&gt;&lt;li&gt;The joining node starts to bulk copy data from its neighbors in parallel, asynchronously.&lt;/li&gt;&lt;li&gt;The membership change is asynchronously propagated to the other nodes.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/Sw1b9Sv0fmI/AAAAAAAAAXc/4-YNzhA3LCQ/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 233px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/Sw1b9Sv0fmI/AAAAAAAAAXc/4-YNzhA3LCQ/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5408079836104392290" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Notice that other nodes may not have their membership view updated yet, so they may still forward requests to the old nodes.  But since these old nodes (which are the neighbors of the newly joined node) have been updated (in step 2), they will forward the requests to the newly joined node.&lt;br /&gt;&lt;br /&gt;On the other hand, the newly joined node may still be in the process of downloading the data and not ready to serve yet.  We use the vector clock (described below) to determine whether the newly joined node is ready to serve the request and if not, the client can contact another replica.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;When an existing node leaves the network&lt;/span&gt; (e.g. 
crash)&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The crashed node no longer responds to gossip messages, so its neighbors find out about it.&lt;/li&gt;&lt;li&gt;The neighbors will update their membership views and copy data asynchronously&lt;/li&gt;&lt;/ol&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/Sw1jGP6y5tI/AAAAAAAAAXk/rM9k-jNcsKQ/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 249px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/Sw1jGP6y5tI/AAAAAAAAAXk/rM9k-jNcsKQ/s320/P2.png" alt="" id="BLOGGER_PHOTO_ID_5408087686545336018" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We haven't talked about how the virtual nodes are mapped to the physical nodes.  Many schemes are possible, with the main goal that virtual node replicas should not sit on the same physical node.  One simple scheme is to assign virtual nodes to physical nodes in a random manner, but check to make sure that a physical node doesn't contain replicas of the same key ranges.&lt;br /&gt;&lt;br /&gt;Notice that machine crashes happen at the physical node level, and each physical node has many virtual nodes running on it.  So when a single physical node crashes, the workload (of its multiple virtual nodes) is scattered across many physical machines.  
Therefore the increased workload due to physical node crashes is evenly balanced.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Client Consistency&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Once we have multiple copies of the same data, we need to worry about how to synchronize them such that the client can have a consistent view of the data.&lt;br /&gt;&lt;br /&gt;There are a number of client consistency models&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Strict Consistency (one copy serializability):&lt;/span&gt;  This provides the semantics as if there is only one copy of data.  Any update is observed instantaneously.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Read your write consistency:&lt;/span&gt;  This allows the client to see his own updates immediately (and the client can switch servers between requests), but not the updates made by other clients&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Session consistency:&lt;/span&gt;  Provides the read-your-write consistency only when the client is issuing requests under the same session scope (which is usually bound to the same server)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Monotonic Read Consistency:&lt;/span&gt;  This provides the time monotonicity guarantee that the client will only see more up-to-date versions of the data in future requests.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Eventual Consistency:&lt;/span&gt;  This provides the weakest form of guarantee.  The client can see an inconsistent view while updates are in progress.  
This model works when concurrent access to the same data is very unlikely, and the client needs to wait for some time if he needs to see his previous update.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Depending on which consistency model is provided, 2 mechanisms need to be arranged ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How the client request is dispatched to a replica&lt;/li&gt;&lt;li&gt;How the replicas propagate and apply the updates&lt;/li&gt;&lt;/ul&gt;There are various models for how these 2 aspects can be done, with different tradeoffs.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Master Slave (or Single Master) Model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Under this model, each data partition has a single master and multiple slaves. In the model above, B is the master of keyAB and C, D are the slaves. All update requests have to go to the master, where the update is applied and then asynchronously propagated to the slaves.  Notice that there is a time window of data loss if the master crashes before it propagates its update to any slave, so some systems will wait synchronously for the update to be propagated to at least one slave.&lt;br /&gt;&lt;br /&gt;Read requests can go to any replica if the client can tolerate some degree of data staleness.  This is where the read workload is distributed among many replicas.  If the client cannot tolerate staleness for certain data, its reads also need to go to the master.&lt;br /&gt;&lt;br /&gt;Note that this model doesn't mean there is one particular physical node that plays the role of the master.  The granularity of "mastership" is at the virtual node level.  Each physical node has some virtual nodes acting as masters of some partitions while other virtual nodes act as slaves of other partitions.  
Therefore, the write workload is also distributed across different physical nodes, although this is due to partitioning rather than replication.&lt;br /&gt;&lt;br /&gt;When a physical node crashes, the masters of certain partitions will be lost.  Usually, the most up-to-date slave will be nominated to become the new master.&lt;br /&gt;&lt;br /&gt;The Master Slave model works very well in general when the application has a high read/write ratio.  It also works very well when the updates are spread evenly across the key range.  So it is the predominant model of data replication.&lt;br /&gt;&lt;br /&gt;There are 2 ways for the master to propagate updates to the slaves: &lt;span style="font-weight: bold;"&gt;State transfer&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;Operation transfer&lt;/span&gt;.  In State transfer, the master passes its latest state to the slave, which then replaces its current state with the latest state.  In operation transfer, the master propagates a sequence of operations to the slave, which then applies the operations to its local state.&lt;br /&gt;&lt;br /&gt;The state transfer model is more robust against message loss because as long as a later, more up-to-date message arrives, the replica will still be able to advance to the latest state.&lt;br /&gt;&lt;br /&gt;Even in state transfer mode, we don't want to send the full object for updating other replicas because changes typically happen within a small portion of the object.  It will be a waste of network bandwidth if we send the unchanged portion of the object, so we need a mechanism to detect and send just the delta (the portion that has been changed).  One common approach is to break the object into chunks and compute a &lt;a href="http://en.wikipedia.org/wiki/Hash_tree"&gt;hash tree&lt;/a&gt; of the object. 
The replicas can then just compare their hash trees to figure out which chunks of the object have changed and only send those over.&lt;br /&gt;&lt;br /&gt;In operation transfer mode, usually much less data needs to be sent over the network.  However, it requires a reliable messaging mechanism with a delivery order guarantee.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Multi-Master (or No Master) Model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If there are hot spots in certain key ranges, and there are intensive write requests, the master slave model will be unable to spread the workload evenly.  The multi-master model allows updates to happen at any replica (I think calling it "No-Master" is more accurate).&lt;br /&gt;&lt;br /&gt;If any client can issue any update to any server, how do we synchronize the states such that we can retain client consistency and also have every replica eventually get to the same state ?  We describe a number of different approaches in the following sections ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Quorum Based 2PC&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To provide "strict consistency", we can use a traditional 2PC protocol to bring all replicas to the same state at every update.  Let's say there are N replicas of a piece of data.  When the data is updated, there is a "prepare" phase where the coordinator asks every replica to confirm whether it is ready to perform the update.  Each replica will then write the data to a log file and, on success, respond to the coordinator.&lt;br /&gt;&lt;br /&gt;After gathering positive responses from all replicas, the coordinator will initiate the second "commit" phase and ask every replica to commit; each replica then writes another log entry to confirm the update. 
Notice that there are some scalability issues, as the coordinator needs to "synchronously" wait for quite a lot of back and forth network roundtrips and disk I/O to complete.&lt;br /&gt;&lt;br /&gt;On the other hand, if any one of the replicas crashes, the update will be unsuccessful.  As there are more replicas, the chance of one of them crashing increases.  Therefore, replication is hurting the availability rather than helping.  This makes traditional 2PC an unpopular choice for high throughput transactional systems.&lt;br /&gt;&lt;br /&gt;A more efficient way is to use the quorum based 2PC (e.g. PAXOS).  In this model, the coordinator only needs to update W replicas (rather than all N replicas) synchronously.   The coordinator still writes to all the N replicas but only waits for positive acknowledgment from any W of the N.  This is much more efficient from a probabilistic standpoint.&lt;br /&gt;&lt;br /&gt;However, since not all replicas are updated, we need to be careful when reading the data to make sure the read can reach at least one replica that has previously been updated successfully.  When reading the data, we need to read R replicas and return the one with the latest timestamp.&lt;br /&gt;&lt;br /&gt;For "strict consistency", the important condition is to make sure the read set and the write set overlap.  ie:  W + R &amp;gt; N&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwOHybHlzHI/AAAAAAAAATE/-NaXjP_S2H8/s1600/P7.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 141px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwOHybHlzHI/AAAAAAAAATE/-NaXjP_S2H8/s400/P7.png" alt="" id="BLOGGER_PHOTO_ID_5405313278117530738" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;As you can see, the quorum based 2PC can be considered as a general 2PC protocol where the traditional 2PC is a special case where W = N and R = 1.  
The general quorum-based model allows us to pick W and R according to the tradeoff we want between read and write workloads.&lt;br /&gt;&lt;br /&gt;If the user cannot afford to pick W, R large enough, ie: W + R &amp;lt;= N, then the client is relaxing its consistency model to a weaker one.   &lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;If the client can tolerate a more relaxed consistency model, we don't need to use the 2PC commit or quorum based protocol as above. Here we describe a Gossip model where updates are propagated asynchronously via gossip message exchanges, together with an anti-entropy protocol to apply the updates such that every replica eventually gets to the latest state.&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;Vector Clock&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Vector Clock is a timestamp mechanism that lets us reason about causal relationships between updates.  First of all, each replica keeps a vector clock.  Let's say replica i has its clock Vi.  Vi[i] is replica i's own logical clock.  Every replica follows certain rules to update its vector clock.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Whenever an internal operation happens at replica i, it will advance its clock Vi[i]&lt;/li&gt;&lt;li&gt;Whenever replica i sends a message to replica j, it will first advance its clock Vi[i] and attach its vector clock Vi to the message&lt;/li&gt;&lt;li&gt;Whenever replica j receives a message from replica i, it will first advance its clock Vj[j] and then merge its clock with the clock Vm attached to the message.  
ie:  Vj[k] = max(Vj[k], Vm[k]) for every k&lt;/li&gt;&lt;/ul&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwoSGODJQuI/AAAAAAAAAWs/OefcWLxdsmI/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 219px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwoSGODJQuI/AAAAAAAAAWs/OefcWLxdsmI/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407154200671503074" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;A partial order relationship can be defined such that Vi &amp;gt; Vj iff for all k, Vi[k] &amp;gt;= Vj[k], with Vi[k] &amp;gt; Vj[k] for at least one k.  We can use this partial ordering to derive causal relationships between updates.  The reasoning behind it is&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The effect of an internal operation will be seen immediately at the same node&lt;/li&gt;&lt;li&gt;After receiving a message, the receiving node knows the situation of the sending node at the time when the message was sent.  The situation includes not only what is happening at the sending node, but also what the sending node knows about all the other nodes.&lt;/li&gt;&lt;li&gt;In other words, Vi[i] reflects the time of the latest internal operation that happened at node i. Vi[k] = 6 reflects that replica i knows the situation of replica k up to its logical clock 6.&lt;/li&gt;&lt;/ul&gt;Notice that the term "situation" is used here in an abstract sense.  Depending on what information is passed in the message, the situation can be different.  This will affect how the vector clock is advanced.  Below, we describe the "state transfer model" and the "operation transfer model", which pass different information in the message, so the advancement of their vector clocks will also be different.&lt;br /&gt;&lt;br /&gt;Because state always flows from the replica to the client but not the other way round, the client doesn't have an entry in the Vector clock. The vector clock contains only one entry for each replica.  
However, the client will also keep a vector clock from the last replica it contacted.  This is important for supporting the client consistency models we described above.  For example, to support monotonic read, the replica will make sure the vector clock attached to the data is &amp;gt;= the client's submitted vector clock in the request.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Gossip (State Transfer Model)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a &lt;span style="font-weight: bold; font-style: italic;"&gt;state transfer&lt;/span&gt; model, each replica maintains a vector clock as well as a state version tree in which no state is &amp;gt; or &amp;lt; any other (based on vector clock comparison).  In other words, the state version tree contains all the conflicting updates.&lt;br /&gt;&lt;br /&gt;At &lt;span style="font-weight: bold; font-style: italic;"&gt;query &lt;/span&gt;time, the client will attach its vector clock and the replica will send back the subset of the state tree that is not older than the client's vector clock (this will provide monotonic read consistency).  The client will then advance its vector clock by merging all the versions.  
This means the client is responsible for resolving the conflicts among all these versions, because when the client sends an update later, its vector clock will be newer than all of these versions.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwmXpttEVKI/AAAAAAAAAVc/BuDsgnTJoZM/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 391px; height: 400px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwmXpttEVKI/AAAAAAAAAVc/BuDsgnTJoZM/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407019570534044834" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;At &lt;span style="font-weight: bold; font-style: italic;"&gt;update&lt;/span&gt; time, the client will send its vector clock and the replica will check whether the client's state is older than any of its existing versions; if so, it will throw away the client's update.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwmX6waHqPI/AAAAAAAAAVk/48TsSr21pUU/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 375px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwmX6waHqPI/AAAAAAAAAVk/48TsSr21pUU/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5407019863317653746" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Replicas also &lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;gossip &lt;/span&gt;among each other in the background and try to merge their version trees together.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwmYWE4O5sI/AAAAAAAAAV0/2QDGlh-JAGA/s1600/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 290px; height: 320px;" 
src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwmYWE4O5sI/AAAAAAAAAV0/2QDGlh-JAGA/s320/P3.png" alt="" id="BLOGGER_PHOTO_ID_5407020332669134530" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Gossip (Operation Transfer Model)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In an &lt;span style="font-weight: bold; font-style: italic;"&gt;operation transfer&lt;/span&gt; approach, the sequence of applying the operations is very important.  At a minimum, causal order needs to be maintained.  Because of the ordering issue, each replica has to defer executing an operation until all the preceding operations have been executed.  Therefore replicas save the operation requests to a log file, exchange the logs among each other, and consolidate these operation logs to figure out the right sequence in which to apply the operations to their local store.&lt;br /&gt;&lt;br /&gt;"Causal order" means every replica will apply changes to the "causes" before applying changes to the "effects". "Total order" requires that every replica applies the operations in the same sequence.&lt;br /&gt;&lt;br /&gt;In this model, each replica i keeps a list of vector clocks: Vi is the vector clock of the replica itself, and Vj is the vector clock recorded when replica i received replica j's gossip message.  There is also a V-state that represents the vector clock of the last updated state.&lt;br /&gt;&lt;br /&gt;When a query is submitted by the client, it will also send along its vector clock, which reflects the client's view of the world.  
The replica will check if it has a view of the state that is later than the client's view.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/SwmlOzM4YuI/AAAAAAAAAV8/vXWT2gsQvNc/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 376px; height: 400px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/SwmlOzM4YuI/AAAAAAAAAV8/vXWT2gsQvNc/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407034501315977954" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;When an update operation is received, the replica will buffer the update operation until it can be applied to the local state.  Every submitted operation will be tagged with 2 timestamps: V-client indicates the client's view when he made the update request, and V-@receive is the replica's view when it received the submission.&lt;br /&gt;&lt;br /&gt;This update operation request will sit in the queue until the replica has received all the other updates that this one depends on.  This condition is reflected in the vector clock Vi when it is larger than V-client.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/Swmll8oYbqI/AAAAAAAAAWE/F_oI7WwWep0/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 353px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/Swmll8oYbqI/AAAAAAAAAWE/F_oI7WwWep0/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5407034898984234658" border="0" /&gt;&lt;/a&gt;&lt;span style="text-decoration: underline;"&gt;&lt;br /&gt;&lt;/span&gt;In the background, different replicas exchange their logs of queued updates and update each other's vector clocks.  
After the log exchange, each replica will check whether certain operations can be applied (when all the operations they depend on have been received) and apply them accordingly.  Notice that multiple operations may be ready to apply at the same time; in that case the replica will sort these operations in causal order (by using the Vector clock comparison) and apply them in the right order.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/Swml33Xp04I/AAAAAAAAAWM/yCHvTCgTzF8/s1600/P3.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 359px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/Swml33Xp04I/AAAAAAAAAWM/yCHvTCgTzF8/s400/P3.png" alt="" id="BLOGGER_PHOTO_ID_5407035206809539458" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Concurrent updates at different replicas can also happen, which means there can be multiple valid sequences of operations.   In order for different replicas to apply concurrent updates in the same order, we need a total ordering mechanism.&lt;br /&gt;&lt;br /&gt;One approach is that whoever does the update first acquires a monotonic sequence number and latecomers follow the sequence.   On the other hand, if the operation itself is commutative, then the order in which the operations are applied doesn't matter.&lt;br /&gt;&lt;br /&gt;After applying the update, the update operation cannot be immediately removed from the queue because the update may not have been fully exchanged with every replica yet.  We continuously check the Vector clock of each replica after log exchange, and after we confirm that everyone has received this update, we remove it from the queue.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Map Reduce Execution&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Notice that the distributed store architecture fits well into distributed processing as well.  
For example, consider processing a &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;Map/Reduce operation&lt;/a&gt; over an input key list.&lt;br /&gt;&lt;br /&gt;The system will push the map and reduce functions to all the nodes (ie: moving the processing logic towards the data).  The map function is executed at the replicas owning the input keys, and the map output is then forwarded to the reduce function, where the aggregation logic is executed.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwoeUzwoKrI/AAAAAAAAAW0/ch01mbMkRuk/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 266px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SwoeUzwoKrI/AAAAAAAAAW0/ch01mbMkRuk/s320/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407167645452085938" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Handling Deletes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a multi-master replication system where we use Vector clock timestamps to determine causal order, we need to handle "delete" very carefully so that we don't lose the associated timestamp information of the deleted object; otherwise we cannot even reason about when to apply the delete.&lt;br /&gt;&lt;br /&gt;Therefore, we typically handle delete as a special update by marking the object as "deleted" but still keeping its metadata / timestamp information around.  After a long enough time that we are confident that every replica has marked this object deleted, we garbage collect the deleted object to reclaim its space.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Storage Implementation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;One strategy is to make the storage implementation pluggable.  e.g. 
A local MySQL DB, Berkeley DB, Filesystem or even an in-memory Hashtable can be used as the storage mechanism.&lt;br /&gt;&lt;br /&gt;Another strategy is to implement the storage in a highly scalable way.  Here are some techniques that I learned from &lt;a href="http://horicky.blogspot.com/2008/10/couchdb-implementation.html"&gt;CouchDB&lt;/a&gt; and Google BigTable.&lt;br /&gt;&lt;br /&gt;CouchDB has an MVCC model that uses a copy-on-modify approach.  Any update causes a private copy of the data to be made, which in turn requires the index to be modified, causing a private copy of the index to be made as well, all the way up to the root pointer.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/SwjQAV_JShI/AAAAAAAAAU8/ndAucGpmwzI/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 281px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/SwjQAV_JShI/AAAAAAAAAU8/ndAucGpmwzI/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5406800056978852370" border="0" /&gt;&lt;/a&gt;Notice that the update happens in an append-only mode where the modified data is appended to the file and the old data becomes garbage.  Periodic garbage collection is done to compact the data.  Here is how the model is implemented in memory and on disk:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwjQd22AlMI/AAAAAAAAAVE/bUDkgpnPu5Q/s1600/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 303px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SwjQd22AlMI/AAAAAAAAAVE/bUDkgpnPu5Q/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5406800564015109314" border="0" /&gt;&lt;/a&gt;In the Google BigTable model, the data is broken down into multiple generations and memory is used to hold the newest generation.  
Any query searches the in-memory data as well as all the data sets on disk and merges the returned results.  Whether a generation contains a key can be detected quickly by checking a bloom filter.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/SwnJ4-X4GjI/AAAAAAAAAWk/Wy8cW8f8dwM/s1600/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 293px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/SwnJ4-X4GjI/AAAAAAAAAWk/Wy8cW8f8dwM/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5407074808287992370" border="0" /&gt;&lt;/a&gt;When an update happens, both the in-memory data and the commit log are written, so that if the machine crashes before the in-memory data is flushed to disk, it can be recovered from the commit log.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1999212675899968006?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1999212675899968006/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1999212675899968006' title='17 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1999212675899968006'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1999212675899968006'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/nosql-patterns.html' title='NOSQL Patterns'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_j6mB7TMmJJY/SwogI5pkEEI/AAAAAAAAAXE/xhrfSf8dmI4/s72-c/p1.png' height='72' width='72'/><thr:total>17</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1634162767426376135</id><published>2009-11-12T21:42:00.000-08:00</published><updated>2009-11-13T08:52:37.770-08:00</updated><title type='text'>Amazon Cloud Computing</title><content type='html'>Cloud computing is becoming a very hot area as it provides cost savings and time-to-market benefits to a wide spectrum of organizations.&lt;br /&gt;&lt;br /&gt;At the consumer end, small startup companies have found that Cloud computing can significantly reduce their initial setup cost.  Large enterprises have also found that Cloud computing allows them to improve resource utilization and cost effectiveness, although they also have security and control concerns.  &lt;a href="http://horicky.blogspot.com/2008/12/does-cloud-computing-make-sense-for.html"&gt;Here is a very common cloud deployment model&lt;/a&gt; across many large enterprises.&lt;br /&gt;&lt;br /&gt;Traditional software companies who distribute software on CD are also looking into the SaaS model as a new way of doing business.  However, a SaaS model typically requires the company to build some kind of web site, and these companies may not have the expertise to build and operate large scale web sites.  Cloud computing allows them to outsource the SaaS infrastructure.&lt;br /&gt;&lt;br /&gt;Here we look at the leader in the cloud computing provider space: AWS from Amazon.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Amazon Web Services&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Amazon is the current leading provider in the Cloud computing space.  
Its technology stack (known as Amazon Web Services) includes an IaaS stack, a PaaS stack and a SaaS stack.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Their &lt;span style="font-style: italic;font-size:130%;" &gt;&lt;span style="font-weight: bold;"&gt;IaaS stack&lt;/span&gt;&lt;/span&gt; includes infrastructure resources such as virtual machines, virtual mounted disks, virtual networks, load balancers, VPN and databases.&lt;/li&gt;&lt;li&gt;Their &lt;span style="font-weight: bold; font-style: italic;font-size:130%;" &gt;PaaS stack&lt;/span&gt; provides a set of distributed computing services including queuing, data storage, metadata and parallel batch processing.&lt;/li&gt;&lt;li&gt;Their &lt;span style="font-weight: bold; font-style: italic;font-size:130%;" &gt;SaaS stack&lt;/span&gt; provides a set of high level services such as a content delivery network, payment processing services and ecommerce fulfillment services.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Since we are focusing on the Cloud Computing aspects, we will describe their IaaS and PaaS stacks below but will skip their SaaS stack.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;EC2 – Elastic Computing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Amazon has procured a large number of commodity Intel boxes running the Xen virtualization software. On top of Xen, Linux or Windows can be run as the guest OS. The guest operating system can have many variations with different sets of software packages installed.&lt;br /&gt;&lt;br /&gt;Each configuration is bundled as a custom machine image (called an AMI). Amazon hosts a catalog of AMIs for users to choose from. Some AMIs are free while others require a usage charge. Users can also customize their own setup by starting from a standard AMI, making their special configuration changes and then creating a new AMI that is customized for their specific needs. 
The AMIs are stored in Amazon’s storage subsystem S3.&lt;br /&gt;&lt;br /&gt;Amazon also classifies its machines by capacity (number of cores, memory and disk size) and charges for their usage at different rates. These machines can be run in different network topologies specified by the users. There is an “availability zone” concept, which is basically a logical data center. Availability zones have no interdependencies and are therefore very unlikely to fail at the same time. To achieve high availability, users should consider putting their EC2 instances in different availability zones.&lt;br /&gt;&lt;br /&gt;A “security group” is the virtual firewall of the Amazon EC2 environment. EC2 instances can be grouped under a “security group”, which specifies which ports are open to which ranges of incoming IP addresses. So EC2 instances running applications with different levels of security requirements can be put into appropriate security groups and managed using ACLs (access control lists). This is very similar to how network administrators configure their firewalls.&lt;br /&gt;&lt;br /&gt;A user can start a virtual machine (called an EC2 instance) by specifying the AMI, the machine size, the security group, and its authentication key via the command line or an HTTP/XML message. So it is very easy to start up a virtual machine and begin running the user’s application. When the application completes, the user can also shut down the EC2 instance via the command line or an HTTP/XML message. The user is only charged for the time during which the EC2 instance is actually running.&lt;br /&gt;&lt;br /&gt;One issue with an extremely dynamic machine configuration (such as EC2) is that many configuration settings are transient and do not survive a reboot. For example, the node name and IP address may have changed, and all the data stored in local files is lost. Latency and network bandwidth between machines may also have changed. 
Fortunately, Amazon provides a number of ways to mitigate these issues.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;For a charge, a user can reserve a stable IP address, called an “elastic IP”, which can be attached to an EC2 instance after it boots up. External-facing machines are typically set up this way.&lt;/li&gt;&lt;li&gt;To deal with data persistence, Amazon also provides a logical network disk, called “elastic block storage” (EBS), to store the data. For a charge, an EBS volume is reserved for the user and survives across EC2 reboots. The user can attach the EBS volume to an EC2 instance after the reboot.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;EBS – Elastic Block Storage&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Based on RAID disks, EBS provides a persistent block storage device that the user can attach to a running EC2 instance within the same availability zone. EBS is typically used as a file system mounted on an EC2 instance, or as a raw device for a database.&lt;br /&gt;&lt;br /&gt;Although EBS is a network device from the EC2 instance's point of view, benchmarks from Amazon show that it has higher performance than local disk access. Unlike S3, which is based on an eventually consistent model, EBS provides strict consistency where the latest updates are immediately available.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;CloudWatch -- Monitoring Services&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;CloudWatch provides an API to extract system level metrics for each VM (e.g. CPU, network I/O and disk I/O) as well as for each load balancer service (e.g. response time, request rate). The collected metrics are modeled as a multi-dimensional data cube and can therefore be queried and aggregated (e.g. 
min/max/avg/sum/count) along different dimensions, such as by time, or by machine group (by AMI, by machine class, by particular machine instance id, by auto-scaling group).&lt;br /&gt;&lt;br /&gt;These metrics are also used to drive the auto-scaling services (described below). Note that the metrics are predefined by Amazon and custom metrics (application level metrics) are not supported at this moment.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Load Balancing Services&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A load balancer provides a way to group identical VMs into a pool. Amazon provides a way to create a software load balancer in a region and then attach EC2 instances (of the same region) to it. The EC2 instances under a particular load balancer can be in different availability zones but they have to be in the same region.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Auto-Scaling Services&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Auto-scaling allows the user to group a number of EC2 instances (typically behind the same load balancer) and specify a set of triggers to grow and shrink the group. A trigger defines a condition that compares metrics collected by CloudWatch against some threshold values. When the condition matches, the associated action grows or shrinks the group.&lt;br /&gt;&lt;br /&gt;Auto-scaling allows resource capacity (number of EC2 instances) to be automatically adjusted to the actual workload. 
This way the user can automatically spawn more VMs as the workload increases and shut down VMs as the load decreases.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Relational DB Services&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;RDS is basically MySQL running in EC2.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;S3 – Simple Storage Service&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Amazon S3 provides an HTTP/XML service to save and retrieve content. It provides a file system-like metaphor where “objects” are grouped under “buckets”. Based on a REST design, each object and bucket has its own URL.&lt;br /&gt;&lt;br /&gt;With HTTP verbs (PUT, GET, DELETE, POST), the user can create a bucket, list all the objects within a bucket, create an object within a bucket, retrieve an object, remove an object, remove a bucket … etc.&lt;br /&gt;&lt;br /&gt;Under S3, each object has a unique URI which serves as its key. There is no query mechanism in S3, so the user has to look up an object by its key. Each object is stored as an opaque byte array with a maximum size of 5GB. S3 also provides an interesting partial object retrieval mechanism: the request can specify the range of bytes to return (via the HTTP Range header).&lt;br /&gt;&lt;br /&gt;However, partial put is not currently supported, but it can be simulated by breaking a large object into multiple small objects and then doing the assembly at the app level. Breaking down the object also helps to speed up upload and download by doing the data transfer in parallel.&lt;br /&gt;&lt;br /&gt;Within Amazon S3, each object is replicated across 2 (or more) data centers and also cached at the edge for fast retrieval.&lt;br /&gt;&lt;br /&gt;Amazon S3 is based on an “eventually consistent” model, which means it is possible that an application won’t see the change it just made. Therefore, the application has to tolerate some degree of inconsistency in what it reads. 
Applications should avoid having two concurrent modifications to the same object. They should also wait for some time between updates, and should expect that any data they read is potentially stale by a few seconds.&lt;br /&gt;&lt;br /&gt;There is also no versioning concept in S3, but it is not hard to build one on top of S3.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;SimpleDB – queriable data storage&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Unlike S3, where data has to be looked up by key, SimpleDB provides a semi-structured data store with querying capability. Each object is stored as a number of attributes, and the user can search for objects by their attributes.&lt;br /&gt;&lt;br /&gt;Similar to the concepts of “buckets” and “objects” in S3, SimpleDB is organized as a set of “items” grouped by “domains”. However, each item can have a number of “attributes” (up to 256). Each attribute can store one or multiple values and each value must be a string (or a string array in the case of a multi-valued attribute). Each attribute can store up to 1K bytes, so it is not appropriate for storing binary content.&lt;br /&gt;&lt;br /&gt;SimpleDB is typically used as a metadata store in conjunction with S3, where the actual data is stored. SimpleDB is also schema-less. Each item can define its own set of attributes and is free to add or remove attributes at runtime.&lt;br /&gt;&lt;br /&gt;SimpleDB provides a query capability which is quite different from SQL. The “where” clause can only match an attribute value against a constant, not against other attributes. In addition, the query result only returns the names of the matched items, not their attributes, which means a subsequent lookup by item name is needed. Also, there is no equivalent of “order by”, so the returned query result is unsorted.&lt;br /&gt;&lt;br /&gt;All attributes are stored as strings (even numbers, dates … etc). 
All comparison operations are done in lexical order. Therefore, data types such as dates and numbers need a special string encoding to make sure comparisons are done correctly.&lt;br /&gt;&lt;br /&gt;SimpleDB is also based on an eventual consistency model like S3.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;SQS – Simple Queue Service&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Amazon provides a queue service for applications to communicate with each other asynchronously. Messages (up to 256KB in size) can be sent to queues. Each queue is replicated across multiple data centers.&lt;br /&gt;&lt;br /&gt;Enterprises use the HTTP protocol to send messages to a queue. “At least once” semantics is provided, which means that when the sender gets back a 200 OK response, SQS guarantees that the message will be received by at least one receiver.&lt;br /&gt;&lt;br /&gt;Receiving messages from a queue is done by polling rather than through an event-driven calling interface. Since messages are replicated across queues asynchronously, it is possible that a receiver only gets some (but not all) of the messages sent to the queue. But if the receiver keeps polling the queue, it will eventually get all the messages sent to the queue. On the other hand, a message can be delivered out of order or delivered more than once. So the message processing logic needs to be idempotent as well as independent of message arrival order.&lt;br /&gt;&lt;br /&gt;Once a message is taken by a receiver, it is invisible to other receivers for a period of time but it is not gone yet. The original receiver is supposed to process the message and make an explicit call to remove the message permanently from the queue. 
If such a “removal” request is not made within the timeout period, the message will become visible in the queue again and will be picked up by subsequent receivers.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Elastic Map/Reduce&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Amazon provides an easy way to run Hadoop Map/Reduce in the EC2 environment.  They provide a web UI to start/stop a Hadoop cluster and submit jobs to it.  For details of &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;how Hadoop works&lt;/a&gt;, see &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Under Elastic MR, both input and output data are stored in S3 rather than HDFS.  This means data needs to be loaded into S3 before the Hadoop processing can be started.  Elastic MR also provides a job flow definition so the user can chain multiple Map/Reduce jobs together.  Elastic MR supports programs written in Java (jar) or any other programming language (via Hadoop streaming), as well as PIG and Hive.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Virtual Private Cloud&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;VPC is a VPN solution that lets users extend their data center to include EC2 instances running in the Amazon cloud.  Notice that this is an "elastic data center" because its size can grow and shrink as the user starts / stops EC2 instances.&lt;br /&gt;&lt;br /&gt;The user can create a VPC object, which represents an isolated virtual network in the Amazon cloud environment, and can create multiple virtual subnets under a VPC.  
When starting an EC2 instance, the subnet id needs to be specified so that the EC2 instance is put into that subnet under the corresponding VPC.&lt;br /&gt;&lt;br /&gt;EC2 instances under the VPC are completely isolated from the rest of Amazon's infrastructure at the network packet routing level (of course it is software-implemented isolation).  Then a pair of gateway objects (a VPN gateway on the Amazon side and a customer gateway on the data center side) needs to be created.  Finally a connection object is created that binds these 2 gateway objects together and is then attached to the VPC object.&lt;br /&gt;&lt;br /&gt;After these steps, the two gateways will do the appropriate routing between your data center and the Amazon VPC, with VPN technologies used underneath to protect the network traffic.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Things to watch out for&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I personally think Amazon provides a very complete set of services that is sufficient for a wide spectrum of deployment scenarios.  Nevertheless, there are a number of limitations that one needs to pay attention to …&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There are no Cloud standards today.  Whichever provider is chosen, the choice implies some degree of lock-in to a vendor specific architecture.  Amazon is no exception.  One way to minimize such lock-in is to introduce an insulation layer that localizes all the provider-specific APIs.&lt;/li&gt;&lt;li&gt;Cloud providers typically run their infrastructure on low-cost commodity hardware inside data centers with network connections between them.  Amazon is not very transparent about its hosting environment, so it is not very clear how much reliability one can expect from it.  On the other hand, the SLA guarantee that Amazon is willing to provide is relatively low.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multicast communication is not supported between EC2 instances. 
This means applications have to communicate using point-to-point TCP. Cluster replication frameworks based on IP multicast simply don’t work in the EC2 environment.&lt;/li&gt;&lt;li&gt;An EBS volume currently cannot be attached to multiple EC2 instances at the same time. This means applications (e.g. Oracle cluster) that rely on multiple machines accessing a shared disk simply won’t work in the EC2 environment.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1634162767426376135?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1634162767426376135/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1634162767426376135' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1634162767426376135'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1634162767426376135'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/amazon-cloud-computing.html' title='Amazon Cloud Computing'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1161475109912842132</id><published>2009-11-09T10:42:00.000-08:00</published><updated>2009-11-09T12:02:52.246-08:00</updated><title type='text'>Support Vector Machine</title><content type='html'>Support vector machine is a 
very powerful classification technique.  Its theory is based on the linear model, but it can also handle non-linear models very well.  It is also immune to the curse of high dimensionality.&lt;br /&gt;&lt;br /&gt;In support vector machine (SVM), the inputs are numeric and the output is binary. Each data sample can be considered as an m-dimensional point labeled + or - depending on the output.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Optimal separating hyperplane&lt;/span&gt;&lt;br /&gt;Assuming there are m numeric input attributes, the key approach of SVM is to try to find an (m - 1)-dimensional hyperplane which separates the points in the best way (i.e. all the +ve points on one side of the plane and all the -ve points on the other side).&lt;br /&gt;&lt;br /&gt;There are many planes that divide the regions.  But we need to find the red line, which has the maximum margin.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SvhkRYqGe8I/AAAAAAAAARU/j53vVQcyg3Y/s1600-h/P10.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 317px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SvhkRYqGe8I/AAAAAAAAARU/j53vVQcyg3Y/s400/P10.png" alt="" id="BLOGGER_PHOTO_ID_5402178002870500290" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Sometimes there may be noise or variation such that not all points lie on the correct side of the plane. 
So we modify the equation to allow for some errors in the constraints, and we add minimizing the overall error to the optimization goal.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/SvhkyNoQcxI/AAAAAAAAARc/w3w7QO5rBks/s1600-h/P11.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 248px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/SvhkyNoQcxI/AAAAAAAAARc/w3w7QO5rBks/s400/P11.png" alt="" id="BLOGGER_PHOTO_ID_5402178566845657874" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;At first glance, it seems like the classification will be O(n) where n is the size of the training data.  This is not the case because the alpha values are zero for all points except the support vectors (points touching the margin band), and there are only a small number of those.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Non-Linearity&lt;/span&gt;&lt;br /&gt;So far, we have assumed that the data is "linearly separable".  What if this assumption is not true?  (e.g.  y = a1.x1.x1 + a2.x1.x2)&lt;br /&gt;&lt;br /&gt;The answer is to create additional attributes x3 = x1.x1 and x4 = x1.x2.&lt;br /&gt;In other words, we can always make a non-linear equation linear by introducing extra variables which are non-linear combinations of existing variables.  Notice that adding these extra variables effectively increases the dimension of the data, but the model stays linear in the data points.&lt;br /&gt;&lt;br /&gt;As an example, a quadratic equation  y = 3x.x + 2x + 5  is a one variable, non-linear equation.  But if we introduce z = x.x, then it becomes a 2 variable, linear equation with&lt;br /&gt;y = 3z + 2x + 5&lt;br /&gt;&lt;br /&gt;So by adding more attributes to increase the dimensionality of the data points, we can keep the model in linear form.  
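&lt;br /&gt;&lt;br /&gt;The quadratic example above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: it uses a plain least-squares fit rather than an SVM, simply to show the attribute transformation, and the helper names (add_square_feature, solve_linear_system, fit_linear) and the sample points are made up for this sketch.  It adds the extra attribute z = x.x and then runs a purely linear fit, which recovers the coefficients 3, 2 and 5.&lt;br /&gt;&lt;pre&gt;
```python
from fractions import Fraction


def add_square_feature(x):
    # Map the single input x to the attribute vector (z, x, 1) with z = x * x.
    return [x * x, x, 1]


def solve_linear_system(a, b):
    # Gauss-Jordan elimination with exact fractions; assumes a is invertible.
    n = len(b)
    rows = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[n] for row in rows]


def fit_linear(features, targets):
    # Ordinary least squares via the normal equations (F^T F) w = F^T y.
    n = len(features[0])
    ftf = [[sum(f[i] * f[j] for f in features) for j in range(n)] for i in range(n)]
    fty = [sum(f[i] * y for f, y in zip(features, targets)) for i in range(n)]
    return solve_linear_system(ftf, fty)


# Hypothetical sample points generated from y = 3x.x + 2x + 5.
xs = [-2, -1, 0, 1, 2, 3]
ys = [3 * x * x + 2 * x + 5 for x in xs]

# The fit itself is linear in (z, x, 1); only the attribute vector changed.
weights = fit_linear([add_square_feature(x) for x in xs], ys)
print([int(w) for w in weights])
```
&lt;/pre&gt;The fitting code never sees the square: all of the non-linearity lives in add_square_feature, which is the same idea as adding x3 and x4 above.&lt;br /&gt;&lt;br /&gt;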
So we can solve a non-linear model by transforming the current data into a higher dimension (adding extra attributes by combining existing attributes in a non-linear way), and then applying the hyperplane separation technique described above to build the model (figure out the alpha values) and using that to classify new data points.&lt;br /&gt;&lt;br /&gt;But how do we decide what extra attributes should be added, how they should be composed from existing attributes, and how many of them we need to recover linearity?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The Kernel Trick&lt;/span&gt;&lt;br /&gt;Examining the above algorithm, an interesting finding is that we only need to know the dot product between two data points, not the individual input attribute values.  In other words, we don't need to care about how to calculate the extra attributes as long as we know how to calculate the dot product in the new transformed space.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/SvhyQfNtFRI/AAAAAAAAARk/fTc1KOONbyE/s1600-h/P13.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 309px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/SvhyQfNtFRI/AAAAAAAAARk/fTc1KOONbyE/s400/P13.png" alt="" id="BLOGGER_PHOTO_ID_5402193380613362962" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The process of using SVM is the same as for other machine learning algorithms&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Pick a tool (such as libSVM)&lt;/li&gt;&lt;li&gt;Prepare the input data (convert it to numeric form, filter it, normalize its range)&lt;/li&gt;&lt;li&gt;Pick a Kernel function and its parameters&lt;/li&gt;&lt;li&gt;Run cross-validation against different combinations of parameters&lt;/li&gt;&lt;li&gt;Pick the best parameters and retrain the model with them.&lt;/li&gt;&lt;li&gt;Now we have the learned model, we can use 
this for classifying new data&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1161475109912842132?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1161475109912842132/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1161475109912842132' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1161475109912842132'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1161475109912842132'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/support-vector-machine.html' title='Support Vector Machine'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j6mB7TMmJJY/SvhkRYqGe8I/AAAAAAAAARU/j53vVQcyg3Y/s72-c/P10.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-7500304908818237398</id><published>2009-11-08T17:33:00.000-08:00</published><updated>2009-11-11T17:15:37.281-08:00</updated><title type='text'>Machine Learning with Linear Model</title><content type='html'>Linear Model is a family of model-based learning approaches that assume the output y can be expressed as a linear algebraic relation with the input attributes x1, x2 ...&lt;br /&gt;&lt;br /&gt;The input attributes x1, x2 ... 
are expected to be numeric, and the output is expected to be numeric as well.&lt;br /&gt;&lt;br /&gt;Here our goal is to learn the parameters of the underlying model, which are the coefficients.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;Linear Regression&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SvdykStfq6I/AAAAAAAAAO8/45vFciCQ8Xo/s1600-h/p1.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 194px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SvdykStfq6I/AAAAAAAAAO8/45vFciCQ8Xo/s200/p1.png" alt="" id="BLOGGER_PHOTO_ID_5401912245877713826" border="0" /&gt;&lt;/a&gt;Here the input and output are both numeric, related through a simple linear relationship. The learning goal is to figure out the hidden weight values (i.e. the W vector).&lt;br /&gt;&lt;br /&gt;Notice that a non-linear relationship is equivalent to a linear relationship in a higher dimension.  e.g. if we add an attribute x2 = x1 * x1, then a relationship that is linear in x1 and x2 is quadratic in x1.  
Because of this, polynomial regression can be done using the linear regression technique.&lt;br /&gt;&lt;br /&gt;Given a batch of training data, we want to figure out the weight vector W such that the total error (the sum of squared differences between the predicted output and the actual output) is minimized.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/SvdzmjxeQBI/AAAAAAAAAPM/TpIcG5c9kk0/s1600-h/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 366px; height: 400px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/SvdzmjxeQBI/AAAAAAAAAPM/TpIcG5c9kk0/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5401913384329166866" border="0" /&gt;&lt;/a&gt;Instead of solving for W over the whole batch, an often more practical approach is to learn incrementally (updating the weight vector for each input record) using gradient descent.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;Gradient Descent&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/Svd0RSNc_ZI/AAAAAAAAAPU/jg1lZTlrA1M/s1600-h/P3.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 142px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/Svd0RSNc_ZI/AAAAAAAAAPU/jg1lZTlrA1M/s200/P3.png" alt="" id="BLOGGER_PHOTO_ID_5401914118349061522" border="0" /&gt;&lt;/a&gt;Gradient descent is a very general technique that we can use to incrementally adjust the parameters of the linear model. The basic idea of "gradient descent" is to adjust each dimension (w0, w1, w2) of the W vector according to its contribution to the squared error. 
This contribution is measured by the gradient along each dimension, i.e. the partial derivative of the squared error with respect to w0, w1, w2.&lt;br /&gt;&lt;br /&gt;In the case of Linear Regression ...&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/Svd0v_D-KnI/AAAAAAAAAPc/zvZtfwSD4dI/s1600-h/P4.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 258px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/Svd0v_D-KnI/AAAAAAAAAPc/zvZtfwSD4dI/s320/P4.png" alt="" id="BLOGGER_PHOTO_ID_5401914645784963698" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;Logistic Regression&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/Svd1w1FvDWI/AAAAAAAAAP0/qEQsk1RuY1s/s1600-h/P5.png"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 172px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/Svd1w1FvDWI/AAAAAAAAAP0/qEQsk1RuY1s/s200/P5.png" alt="" id="BLOGGER_PHOTO_ID_5401915759799504226" border="0" /&gt;&lt;/a&gt;Logistic Regression is used when the output y is binary and not a real number. 
The first part is the same as linear regression, while in a second step a sigmoid function is applied to squash the output value into the range between 0 and 1.&lt;br /&gt;&lt;br /&gt;We use exactly the same gradient descent approach to determine the weight vector W.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/Svd2mPDj_mI/AAAAAAAAAQM/YEhvVvIMafg/s1600-h/P6.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 280px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/Svd2mPDj_mI/AAAAAAAAAQM/YEhvVvIMafg/s400/P6.png" alt="" id="BLOGGER_PHOTO_ID_5401916677302779490" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;Neural Network&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j6mB7TMmJJY/Svd3JCUAywI/AAAAAAAAAQU/-Qvs5g-p3tM/s1600-h/P7.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 118px;" src="http://2.bp.blogspot.com/_j6mB7TMmJJY/Svd3JCUAywI/AAAAAAAAAQU/-Qvs5g-p3tM/s200/P7.png" alt="" id="BLOGGER_PHOTO_ID_5401917275177536258" border="0" /&gt;&lt;/a&gt;Inspired by how our brain works, a neural network organizes many logistic regression units into layers of perceptrons (each unit has both inputs and outputs in binary form).&lt;br /&gt;&lt;br /&gt;Learning in a neural network means discovering all the hidden values of w. In general, we use the same technique as above to adjust the weights using gradient descent, layer by layer. We start from the output layer and move towards the input layer (this technique is called &lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;backpropagation&lt;/span&gt;). 
Except at the output layer, we don't know the error exactly, so we need a way to estimate the error at the hidden layers.&lt;br /&gt;&lt;br /&gt;Notice, however, that there is a symmetry between the weights and the inputs, so we can use the same technique we use to adjust the weights to estimate the error of the hidden layer.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/Svd3mF-DFWI/AAAAAAAAAQc/mdfCZeuOscs/s1600-h/P8.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 390px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/Svd3mF-DFWI/AAAAAAAAAQc/mdfCZeuOscs/s400/P8.png" alt="" id="BLOGGER_PHOTO_ID_5401917774375359842" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-7500304908818237398?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/7500304908818237398/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=7500304908818237398' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7500304908818237398'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7500304908818237398'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/machine-learning-with-linear-model.html' title='Machine Learning with Linear Model'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j6mB7TMmJJY/SvdykStfq6I/AAAAAAAAAO8/45vFciCQ8Xo/s72-c/p1.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-6891909195655424029</id><published>2009-11-05T08:25:00.000-08:00</published><updated>2010-12-15T15:21:23.309-08:00</updated><title type='text'>What Hadoop is good at</title><content type='html'>Hadoop is getting more popular these days.  Let's look at what it is good at and what it is not.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The Map/Reduce Programming model&lt;/span&gt;&lt;br /&gt;Map/Reduce offers a different programming model for handling concurrency than the traditional multi-thread model.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/SvNTgstDrzI/AAAAAAAAAOU/A7ExpEGe0x8/s1600-h/p1.png"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 172px; height: 200px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/SvNTgstDrzI/AAAAAAAAAOU/A7ExpEGe0x8/s200/p1.png" alt="" id="BLOGGER_PHOTO_ID_5400752199368421170" border="0" /&gt;&lt;/a&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Multi-thread programming model&lt;/span&gt; allows multiple processing units (with different execution logic) to access a shared set of data. To maintain data integrity, the processing units co-ordinate their access to the shared data by using locks and semaphores.  Problems such as "race conditions" and "deadlocks" can easily happen but are hard to debug.  This makes multi-thread programs difficult to write and hard to maintain.  
(Java provides a concurrency library package, java.util.concurrent, to ease the development of multi-thread programs.)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j6mB7TMmJJY/SvNanzFoX6I/AAAAAAAAAOs/RkSJygQj-KM/s1600-h/P2.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 152px;" src="http://1.bp.blogspot.com/_j6mB7TMmJJY/SvNanzFoX6I/AAAAAAAAAOs/RkSJygQj-KM/s200/P2.png" alt="" id="BLOGGER_PHOTO_ID_5400760017922580386" border="0" /&gt;&lt;/a&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Data-driven programming model&lt;/span&gt; feeds data into different processing units (with the same or different execution logic).  Execution is triggered by the arrival of data.  Since processing units can only access data piped to them, data sharing between processing units is prohibited upfront.  Because of this, there is no need for locks to co-ordinate access to data.&lt;br /&gt;&lt;br /&gt;This doesn't mean there is no co-ordination for data access.  Rather, the co-ordination is expressed explicitly by the graph, i.e. by defining how the nodes (processing units) are connected to each other via data pipes.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SvNYrNhGJ3I/AAAAAAAAAOk/EYgZYXqe3bw/s1600-h/p1.png"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 198px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SvNYrNhGJ3I/AAAAAAAAAOk/EYgZYXqe3bw/s200/p1.png" alt="" id="BLOGGER_PHOTO_ID_5400757877533452146" border="0" /&gt;&lt;/a&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Map-Reduce programming model&lt;/span&gt; is a specialized form of the data-driven programming model where the graph is defined as a "sequential" list of MapReduce jobs.  
Within each Map/Reduce job, execution is broken down into a "map" phase and a "reduce" phase.  In the map phase, each data split is processed and one or more outputs are produced, each with a key attached.  This key is used to route the outputs (of the map phase) to the second "reduce" phase, where data with the same key is collected and processed in an aggregated way.&lt;br /&gt;&lt;br /&gt;Note that in the Map/Reduce model, parallelism happens only within a job, and the jobs themselves are executed sequentially.  As different jobs may access the same set of data, knowing that jobs are executed serially eliminates the need to coordinate data access between jobs.&lt;br /&gt;&lt;br /&gt;Designing an application to run in Hadoop is a matter of breaking down the algorithm into a number of sequential jobs and then exploiting data parallelism within each job.  Not all algorithms fit into the Map/Reduce model.  For a more &lt;a href="http://horicky.blogspot.com/2008/11/design-for-parallelism.html"&gt;general approach to breaking down an algorithm into parallel parts&lt;/a&gt;, please visit &lt;a href="http://horicky.blogspot.com/2008/11/design-for-parallelism.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Characteristics of Hadoop Processing&lt;/span&gt;&lt;br /&gt;A detailed &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;explanation of the Hadoop&lt;/a&gt; implementation can be found &lt;a href="http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html"&gt;here&lt;/a&gt;.  Basically Hadoop has the following characteristics ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Hadoop is "&lt;span style="font-weight: bold; font-style: italic;"&gt;data-parallel&lt;/span&gt;", but "&lt;span style="font-weight: bold; font-style: italic;"&gt;process-sequential&lt;/span&gt;".  Within a job, parallelism happens within a map phase as well as a reduce phase.  
But these two phases cannot run in parallel: the reduce phase cannot start until the map phase is fully completed.&lt;/li&gt;&lt;li&gt;All data being accessed by the map process needs to be frozen (no updates can happen) until the whole job is completed.  This means Hadoop processes data in chunks in a batch-oriented fashion, making it not very suitable for stream-based processing where data flows in continuously and immediate processing is needed.&lt;/li&gt;&lt;li&gt;Data communication happens via a distributed file system (HDFS).  Latency is introduced as extensive network I/O is involved in moving data around (i.e. 3 copies of the data need to be written synchronously).  This latency is not an issue for batch-oriented processing where throughput is the primary factor.  But this means Hadoop is not suitable for online access where low latency is critical.&lt;/li&gt;&lt;/ul&gt;Given the above characteristics, Hadoop is NOT good at the following ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Performing online data access where low latency is critical (Hadoop can be used together with HBase or a NOSQL store to deliver low-latency query responses)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Performing random, ad-hoc processing of a small subset of data within a large data set (Hadoop is designed to scan all data in parallel)&lt;/li&gt;&lt;li&gt;Processing small data volumes (for data volumes below the hundred-GB range, many more mature solutions exist)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Performing real-time, stream-based processing where data arrives continuously and immediate processing is needed (to keep the overhead small enough, data typically needs to be batched for at least 30 minutes, so you won't be able to see the current data until 30 minutes have passed)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-6891909195655424029?l=horicky.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/6891909195655424029/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=6891909195655424029' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/6891909195655424029'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/6891909195655424029'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/what-hadoop-is-good-at.html' title='What Hadoop is good at'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j6mB7TMmJJY/SvNTgstDrzI/AAAAAAAAAOU/A7ExpEGe0x8/s72-c/p1.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1295189068752759145</id><published>2009-11-01T18:28:00.000-08:00</published><updated>2009-11-02T14:41:48.239-08:00</updated><title type='text'>Principal Component Analysis</title><content type='html'>One common problem of machine learning is the "&lt;span style="font-weight: bold;"&gt;curse of high dimensionality&lt;/span&gt;".  When there are too many attributes in the input data, many of the ML algorithms will be very inefficient or some of them will even be non-performing (e.g. 
in nearest neighbor computation, data points in a high-dimensional space are nearly equidistant from each other).&lt;br /&gt;&lt;br /&gt;It is quite possible that the attributes we selected are inter-dependent.  If so, we may be able to extract a smaller subset of independent attributes that is still very useful for describing the data characteristics.  In other words, we may be able to reduce the number of dimensions significantly without losing much fidelity of the data.&lt;br /&gt;&lt;br /&gt;"&lt;span style="font-weight: bold;"&gt;Dimension Reduction&lt;/span&gt;" is a technique to determine how we can reduce the number of dimensions while minimizing the loss of fidelity of data characteristics.  It is typically applied during the data cleansing stage before feeding the data into the machine learning algorithm.&lt;br /&gt;&lt;br /&gt;"&lt;span style="font-weight: bold;"&gt;Feature Selection&lt;/span&gt;" is a simple technique to select a subset of features that are more significant.  A very simple "filtering" approach looks at each attribute independently, ranks the attributes' significance using some measurement (e.g. info gain) and throws away those that have the least significance.  A more sophisticated "wrapper" approach evaluates different subsets of features.  There are two common models in the "wrapper" approach, "forward selection" and "backward elimination". &lt;br /&gt;&lt;br /&gt;In &lt;span style="font-weight: bold;"&gt;forward selection&lt;/span&gt;, we start with no attributes and pick the attribute with the highest statistical significance (i.e. prediction improves a lot in the cross-validation check).  After picking the first attribute, we pair it up with each remaining unselected attribute and find the one with the most significant improvement in the cross-validation check.  We keep growing the set of attributes until we don't find significant improvements.  
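&lt;br /&gt;&lt;br /&gt;A minimal sketch of this greedy loop is shown below, assuming a caller-supplied cv_score function that returns a cross-validation score for a set of attributes (higher is better). The names forward_selection, cv_score and min_gain, as well as the toy scoring table, are hypothetical and introduced only for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_selection(
    attributes: Sequence[str],
    cv_score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.01,
) -> List[str]:
    """Greedily add the attribute that improves the score the most."""
    selected: List[str] = []
    remaining = list(attributes)
    best_score = cv_score(frozenset())  # baseline score with no attributes
    while remaining:
        # Score every candidate set formed by adding one unselected attribute.
        scored = [(cv_score(frozenset(selected + [a])), a) for a in remaining]
        top_score, top_attr = max(scored)
        if min_gain > top_score - best_score:
            break  # no remaining attribute gives a significant improvement
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected


def toy_score(attrs: FrozenSet[str]) -> float:
    # Hypothetical stand-in for a real cross-validation check.
    gains = {"age": 0.30, "income": 0.20, "zip": 0.001}
    return sum(gains[a] for a in attrs)


print(forward_selection(["zip", "age", "income"], toy_score))
```

With the toy scores, the loop picks age, then income, and stops because adding zip improves the score by less than min_gain.&lt;br /&gt;&lt;br /&gt;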
One issue of the "forward selection" approach is that it may miss "grouped features".  For example, attribute1 and attribute2 may each be insignificant on their own, but combining them gives a very big improvement.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Backward elimination&lt;/span&gt; can be used to handle this problem.  It basically goes in the opposite direction: it starts with the full set of attributes and drops the attributes that have the least statistical significance (i.e. prediction degrades very little in the cross-validation check).  The downside of "backward elimination" is that it is much more expensive to run.&lt;br /&gt;&lt;br /&gt;A more powerful approach called "&lt;span style="font-weight: bold;"&gt;Feature Extraction&lt;/span&gt;" is more commonly used to extract a different set of attributes by linearly combining the existing set of attributes.  Principal Component Analysis ("PCA") is a very popular technique in this arena.  PCA analyzes the interdependency between pairs of attributes and identifies the significant ones.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The intuition of PCA&lt;/span&gt;&lt;br /&gt;The intuition is to linearly recombine the existing m attributes to form another set of m attributes.  The new set of attributes has the following characteristics&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The attributes are mutually uncorrelated (no linear dependency between any pair)&lt;/li&gt;&lt;li&gt;The set of attributes is ranked according to the range of variation&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Note that attributes with a narrow range of variation don't provide much information to describe the data samples and so can be ignored with minimal loss of fidelity.  So we remove them to reduce the dimensionality.&lt;br /&gt;&lt;br /&gt;The question is: how do we recompose the m attributes to exhibit the above 2 characteristics?  
Let's take a deeper look.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Underlying theory of PCA&lt;/span&gt;&lt;br /&gt;Assume there are N data points in the input data set and each data point is described by M attributes.  We use the statistical definitions of the "mean" and "variance" of each attribute and the "co-variance" of every pair of attributes.  Co-variance indicates the linear dependency between two attributes; a co-variance of zero means the two attributes are uncorrelated.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/Su8-xCLVcNI/AAAAAAAAANE/oQNPHFazsbA/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 376px; height: 400px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/Su8-xCLVcNI/AAAAAAAAANE/oQNPHFazsbA/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5399603490359439570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In an ideal situation, we want COV-x to be a diagonal matrix, which means COV(i, j) is zero whenever i and j differ. In other words, every pair of distinct attributes attribute-i and attribute-j is uncorrelated.  We also want the diagonal to be ranked in descending order.&lt;br /&gt;&lt;br /&gt;So the problem can be reduced to finding a different combination of the m attributes to form a new set of m attributes (Y = P. 
X) such that  COV-y is a ranked diagonal matrix.&lt;br /&gt;&lt;br /&gt;How do we determine P?&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/Su6OuLQbYsI/AAAAAAAAAMk/mJ-7YOjEsd8/s1600-h/P2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 250px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/Su6OuLQbYsI/AAAAAAAAAMk/mJ-7YOjEsd8/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5399409927210623682" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Some Matrix theory&lt;/span&gt;&lt;br /&gt;Here is a review of the matrix theory that will be used.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/Su86YCnmgMI/AAAAAAAAAM8/k-0qym9MWKc/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 339px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/Su86YCnmgMI/AAAAAAAAAM8/k-0qym9MWKc/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5399598662934757570" border="0" /&gt;&lt;/a&gt;Let's find the transformation matrix P&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j6mB7TMmJJY/Su9DHZYgiGI/AAAAAAAAANM/8EumXU4HEbY/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 212px;" src="http://3.bp.blogspot.com/_j6mB7TMmJJY/Su9DHZYgiGI/AAAAAAAAANM/8EumXU4HEbY/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5399608272592341090" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;So the PCA process can be summarized as follows ...&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Input:  X, a matrix of (m * n), representing a set of n sample data points, each with m attributes.&lt;/li&gt;&lt;li&gt;Compute Cov-X, a matrix of (m * m),  the 
covariance matrix of X&lt;/li&gt;&lt;li&gt;Compute the m eigenvectors and m eigenvalues of Cov-X&lt;/li&gt;&lt;li&gt;Order the eigenvectors by descending eigenvalue&lt;/li&gt;&lt;li&gt;This gives the transformation matrix P, which is a matrix of (m * m).  Note that each row vector of P corresponds to an eigenvector, which is effectively an axis of the new co-ordinate system.&lt;/li&gt;&lt;li&gt;Truncate P to just take the top k rows.  Now P' is a (k * m) matrix.&lt;/li&gt;&lt;li&gt;Apply P' . X to all input data to obtain a matrix of (k * n).  This effectively reduces each data point from m dimensions to k dimensions.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;References&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.snl.salk.edu/%7Eshlens/pub/notes/pca.pdf"&gt;A very good paper&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf"&gt;Some Matrix math review and step by step PCA calculation&lt;br /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1295189068752759145?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1295189068752759145/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1295189068752759145' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1295189068752759145'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1295189068752759145'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/11/principal-component-analysis.html' title='Principal Component 
Analysis'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j6mB7TMmJJY/Su8-xCLVcNI/AAAAAAAAANE/oQNPHFazsbA/s72-c/p1.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-6106791357890439883</id><published>2009-10-30T22:28:00.000-07:00</published><updated>2009-11-05T08:20:01.395-08:00</updated><title type='text'>Notes on Memcached</title><content type='html'>Some notes about Memcached.  Here is its architecture.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SuxV2X1QCVI/AAAAAAAAAL0/FWVw5tkU3FM/s1600-h/p1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 298px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SuxV2X1QCVI/AAAAAAAAAL0/FWVw5tkU3FM/s400/p1.png" alt="" id="BLOGGER_PHOTO_ID_5398784445909043538" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;How it works&lt;/span&gt;&lt;br /&gt;Memcached is organized as a farm of N servers.  The storage model can be thought of as a huge HashTable partitioned among these N servers.&lt;br /&gt;&lt;br /&gt;Every API request takes a "key" parameter.  There is a 2-step process at the client lib ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Given the key, locate the server&lt;/li&gt;&lt;li&gt;Forward the request to that server&lt;/li&gt;&lt;/ul&gt;The server receiving the request will do a local lookup for that key.  The servers within the farm don't communicate with each other at all.  
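&lt;br /&gt;&lt;br /&gt;The first step of the client-side lookup can be sketched roughly as below. This is only an illustration: the function name pick_server, the use of an MD5 digest and the server addresses are assumptions made for the example, and real client libraries use a variety of hash functions.

```python
import hashlib
from typing import List


def pick_server(key: str, servers: List[str]) -> str:
    """Map a key to one server with a simple hash-mod-N scheme."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Take the first 4 bytes of the digest as an integer, then mod N.
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


# Hypothetical server addresses, for illustration only.
servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
print(pick_server("user1:k1", servers))
```

Because the mapping depends only on the key and the server list, every client sharing the same list routes a given key to the same server without any server-side coordination.&lt;br /&gt;&lt;br /&gt;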
Each server uses asynchronous, non-blocking I/O, so one thread can handle a large number of incoming TCP sockets.  Actually a thread pool is used, but the number of threads is independent of the number of incoming sockets.  This architecture is highly scalable for a large number of incoming network connections.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;API&lt;/span&gt;&lt;br /&gt;Memcached provides a HashTable-like interface, so it has ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;get(key)&lt;/li&gt;&lt;li&gt;set(key, value)&lt;/li&gt;&lt;/ul&gt;Memcached also provides a richer "multi-get" so that one read request can retrieve values for multiple keys.  The client library will issue separate requests to multiple servers and do the lookups in parallel.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;get_multi(["k1", "k2", "k3"])&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Some client libs offer a "master-key" concept such that a key contains 2 parts, the prefix master-key and the suffix key.  In this model, the client lib only uses the prefix to locate the server (rather than looking at the whole key) and then passes the suffix key to that server.  So the user can group entries to be stored on the same server by using the same prefix key.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;get_multi(["user1:k1", "user1:k2", "user1:k3"])  -- This request goes only to the server hosting all keys of "user1:*"&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;For updating data, Memcached provides a number of variations.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;set(key, value, expiration)  -- Memcached guarantees the item will not be served from the cache once the expiration time is reached.  
(Note that the item may be evicted before its expiration if the cache is full)&lt;/li&gt;&lt;li&gt;add(key, value, expiration) -- Succeeds only when no entry for the key exists.&lt;/li&gt;&lt;li&gt;replace(key, value, expiration) -- Succeeds only when an entry for the key already exists.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Server Crashes&lt;/span&gt;&lt;br /&gt;When one of the servers crashes, all entries owned by that server are lost.  Higher resilience can be achieved by storing redundant copies of data on different servers.  Memcached has no support for data replication.  This has to be taken care of by the application (or client lib).&lt;br /&gt;&lt;br /&gt;Note that the default server hashing algorithm doesn't handle growth and shrinkage of the number of servers very well.   When the number of servers changes, the ownership equation (key mod N) gives a different answer for most keys.   In other words, if the crashed server is taken out of the pool, the total number of servers decreases by one and the existing entries need to be redistributed to different servers.  Effectively, nearly the whole cache (across all servers) is invalidated even when just one server crashes.&lt;br /&gt;&lt;br /&gt;So one approach to address this problem is to keep the number of Memcached servers constant across system crashes.  We can have a monitor server that checks the heartbeat of all Memcached servers and, in case a crash is detected, starts a new server with the same IP address as the dead server.   In this case, although the new server will still lose all the entries and has to repopulate the cache, the ownership of the keys is unchanged and data within the surviving nodes doesn't need to be redistributed.&lt;br /&gt;&lt;br /&gt;Another approach is to run logical servers within a farm of physical machines.  When a physical machine crashes, its logical servers will be restarted on the surviving physical machines.  
In other words, the number of logical servers is unchanged even when crashes happen.  This logical server approach is also good when the underlying physical machines have different memory capacities.  We can start more Memcached processes on the machines with more memory and spread the cache proportionally to memory capacity.&lt;br /&gt;&lt;br /&gt;We can also use a more sophisticated technique called "consistent hashing", which localizes the ownership changes to just the neighbor of the crashed server.  Under this scheme, each server is assigned an id in the same key space as the keys.  A key is owned by the server whose id is the first one encountered when walking from the key in the anti-clockwise direction.   When a server crashes, its immediate neighbor in the anti-clockwise direction adopts the key range of the dead server, while the key ranges owned by all other servers remain unchanged.&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j6mB7TMmJJY/SuzN7VoYbEI/AAAAAAAAAMM/o4eN3u39nwA/s1600-h/P2.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 223px;" src="http://4.bp.blogspot.com/_j6mB7TMmJJY/SuzN7VoYbEI/AAAAAAAAAMM/o4eN3u39nwA/s400/P2.png" alt="" id="BLOGGER_PHOTO_ID_5398916472612875330" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;Atomicity&lt;/span&gt;&lt;br /&gt;Each request to Memcached is atomic by itself.   But there is no direct support for atomicity across multiple requests.  
However, the application can implement its own locking mechanism using the "add()" operation provided by Memcached, as follows ...&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;# add() succeeds only if "lock" is absent; the expiry releases a lock left by a crashed holder&lt;br /&gt;success = add("lock", nil, 5.seconds)&lt;br /&gt;if success&lt;br /&gt;  set("key1", value1)&lt;br /&gt;  set("key2", value2)&lt;br /&gt;  delete("lock")&lt;br /&gt;else&lt;br /&gt;  raise UpdateException.new("fail to get lock")&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;Memcached also supports a "check-and-set" mechanism that can be used for optimistic concurrency control.  The basic idea is to get a version stamp when getting an object and pass that version stamp to the set method.  The server verifies the version stamp to make sure the entry hasn't been modified by another client in the meantime; otherwise, the update fails.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;data, stamp = get_stamp(key)&lt;br /&gt;...&lt;br /&gt;check_and_set(key, value1, stamp)&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;What doesn't Memcached do?&lt;/span&gt;&lt;br /&gt;Memcached's design goals are centered on performance and scalability.   
By design, it doesn't deal with the following concerns.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Authentication of client requests&lt;/li&gt;&lt;li&gt;Data replication between servers for fault resilience&lt;/li&gt;&lt;li&gt;Keys longer than 250 chars&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Objects larger than 1MB&lt;/li&gt;&lt;li&gt;Storing collection objects&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Data Replication DIY&lt;/span&gt;&lt;br /&gt;First of all, think carefully about whether you really need data replication at the cache level, given that cached data can always be recreated from the original source (although at a higher cost).&lt;br /&gt;&lt;br /&gt;The main purpose of using a cache is "performance".   If your system cannot tolerate data loss at the cache level, rethink your design!&lt;br /&gt;&lt;br /&gt;Although Memcached doesn't provide data replication, it can easily be done by the client lib or at the application level, based on the idea described below.&lt;br /&gt;&lt;br /&gt;On the client side, we can use multiple keys to represent different copies of the same data.  A monotonically increasing version number is also attached to the data.  
This version number is used to identify the most up-to-date copy and is incremented on each update.&lt;br /&gt;&lt;br /&gt;When updating, we write all the copies of the same data via their different keys.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;def reliable_set(key, versioned_value)&lt;br /&gt;  key_array = [key+':1', key+':2', key+':3']&lt;br /&gt;  new_value = versioned_value.value&lt;br /&gt;  new_version = versioned_value.version + 1&lt;br /&gt;  new_versioned_value =&lt;br /&gt;          combine(new_value, new_version)&lt;br /&gt;&lt;br /&gt;  # write the same versioned value under every copy's key&lt;br /&gt;  for k in key_array&lt;br /&gt;      set(k, new_versioned_value)&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;To read the data from the cache, use "multi-get" on the multiple keys (one for each copy) and return the copy which has the latest version.   
If any discrepancy is detected (ie: some copies have a lagging version, or some copies are missing), start a background thread to write the latest version back to all copies.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;def reliable_get(key)&lt;br /&gt;  key_array = [key+':1', key+':2', key+':3']&lt;br /&gt;  # assumes get_multi returns one entry per key, nil for a missing copy&lt;br /&gt;  value_array = get_multi(key_array)&lt;br /&gt;&lt;br /&gt;  latest_version = 0&lt;br /&gt;  latest_value = nil&lt;br /&gt;  need_fix = false&lt;br /&gt;&lt;br /&gt;  for v in value_array&lt;br /&gt;      if v.nil?&lt;br /&gt;          need_fix = true    # a copy is missing&lt;br /&gt;      elsif v.version &amp;gt; latest_version&lt;br /&gt;          # an earlier copy was lagging behind this one&lt;br /&gt;          need_fix = true if latest_version &amp;gt; 0&lt;br /&gt;          latest_version = v.version&lt;br /&gt;          latest_value = v.value&lt;br /&gt;      elsif v.version &amp;lt; latest_version&lt;br /&gt;          need_fix = true    # this copy is lagging&lt;br /&gt;      end&lt;br /&gt;  end&lt;br /&gt;  return nil if latest_version == 0    # every copy is missing&lt;br /&gt;&lt;br /&gt;  versioned_value =&lt;br /&gt;          combine(latest_value, latest_version)&lt;br /&gt;&lt;br /&gt;  if need_fix&lt;br /&gt;      Thread.new do&lt;br /&gt;          reliable_set(key, versioned_value)&lt;br /&gt;      end&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  return versioned_value&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;When we delete the data, we don't actually remove it.  Instead, we mark the data as deleted but keep it in the cache and let it expire.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;User Throttling&lt;/span&gt;&lt;br /&gt;An interesting use case other than caching is to throttle users who are too active.  
Basically, you want to reject requests from a user that arrive too frequently.&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;user = get(userId)&lt;br /&gt;if user != nil&lt;br /&gt;  disallow request and warn user&lt;br /&gt;else&lt;br /&gt;  add(userId, anything, inactive_period)&lt;br /&gt;  handle request&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/6106791357890439883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=6106791357890439883' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/6106791357890439883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/6106791357890439883'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/10/notes-on-memcached.html' title='Notes on Memcached'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_j6mB7TMmJJY/SuxV2X1QCVI/AAAAAAAAAL0/FWVw5tkU3FM/s72-c/p1.png' height='72' 
width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1511164719307703638</id><published>2009-10-26T07:04:00.000-07:00</published><updated>2010-02-22T09:08:56.497-08:00</updated><title type='text'>Machine Learning: Association Rule</title><content type='html'>A typical example is the "market-basket" problem.  Let's say a supermarket keeps track of all purchase transactions.  Each purchase transaction is a subset of all the items available in the store, e.g. {beer, milk, diaper, butter}.&lt;br /&gt;&lt;br /&gt;The problem is:  By analyzing a large set of transactions, can we discover correlations between subsets?  e.g. people buying milk and butter have a high tendency to buy diapers, or people buying diapers tend to buy soda and ice-cream.&lt;br /&gt;&lt;br /&gt;Such a correlation is called an association rule, which has the following form:&lt;br /&gt;A =&gt; B  where A, B are disjoint subsets of U (a universal set)&lt;br /&gt;&lt;br /&gt;This rule can be interpreted as:  From the transaction statistics, people buying all items in set A tend to also buy all items in set B.&lt;br /&gt;&lt;br /&gt;Note that a transaction containing both set A AND set B contains the itemset (A union B) rather than (A intersect B).&lt;br /&gt;&lt;br /&gt;Two concepts need to be defined here ...&lt;br /&gt;&lt;br /&gt;"&lt;span style="font-weight: bold; font-style: italic;font-size:130%;" &gt;Support&lt;/span&gt;" is defined with respect to a subset X as the percentage of all transactions that contain subset X.  This can be written as P(contains X).  e.g.  
The support of {beer, diaper} is P(contains {beer, diaper}), i.e. the probability that a randomly picked transaction contains both beer and diaper.&lt;br /&gt;&lt;br /&gt;The "support" of an association rule A =&gt; B is defined as the "support" of (A union B).&lt;br /&gt;&lt;br /&gt;"&lt;span style="font-weight: bold; font-style: italic;font-size:130%;" &gt;Confidence&lt;/span&gt;" is defined with respect to a rule (A =&gt; B): given that a transaction contains A, how likely is it to also contain B.&lt;br /&gt;&lt;br /&gt;P(contains B | contains A)  = P(contains B union A) / P(contains A),  which is the same as Support(A union B) / Support(A)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Mining Association Rules&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The problem is how to discover the association rules that have high enough "support" and "confidence".  First, thresholds for "support" and "confidence" are set according to domain-specific concerns.  Mining then proceeds in two phases.&lt;br /&gt;&lt;br /&gt;1) Extract all subsets X where support(X) &gt; thresholdOfSupport&lt;br /&gt;2) For each extracted subset X, discover rules A =&gt; B where A is a subset of X and B is (X - A)&lt;br /&gt;&lt;br /&gt;Phase 1) is also known as the "finding frequent subsets" problem.  A naive implementation can generate all possible subsets and check their support values.  This naive approach has exponential complexity, O(2^N) for N distinct items.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Apriori Algorithm&lt;/span&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt; to find frequent subsets&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This algorithm exploits the fact that if X is not a frequent subset, then (X union Anything) can never be a frequent subset.  So it starts by scanning small subsets and throws away those that don't have high enough support.  
In other words, it prunes the tree as it grows.&lt;br /&gt;&lt;br /&gt;Let's say the universal set is {I1, I2, I3, I4, I5, I6}&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;First round:  &lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Generate the candidate 1-item subsets.  ie: {I1}, {I2}, ... {I6}&lt;/li&gt;&lt;li&gt;Compute the support of each candidate.  ie: support({I1}),  support({I2}), .... support({I6})&lt;/li&gt;&lt;li&gt;Filter out those whose support &amp;lt; supportThreshold &lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Second round:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;From the surviving 1-item subsets, generate the candidate 2-item subsets.  ie: {I1, I2}, {I1, I4} ...  Note that if {I3} was filtered out in the first round, we can skip any subset that contains I3.&lt;/li&gt;&lt;li&gt;Compute the support of each 2-item candidate.&lt;/li&gt;&lt;li&gt;Filter out those whose support &amp;lt; supportThreshold &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;K-th round: (repeat until no k-1 item subset survives)&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;From the surviving k-1 item subsets, generate the candidate k-item subsets by adding one more item that is not already in the k-1 item subset.  
Skip any k-item subset that contains a k-1 item candidate thrown away in the last round.&lt;/li&gt;&lt;li&gt;Calculate the support of the k-item candidates&lt;/li&gt;&lt;li&gt;Throw away those whose support value &amp;lt; supportThreshold &lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Find association rules from frequent subsets&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Given a frequent k-item subset X, we want to find each subset A such that the confidence of (A =&gt; X-A) is higher than the confidence threshold.&lt;br /&gt;&lt;br /&gt;Note that confidence = support(X) / support(A)&lt;br /&gt;&lt;br /&gt;Since support(X) is fixed for a given X, the lower support(A) is, the higher the confidence.  Larger subsets have lower (or equal) support, so we do the reverse process: start from the largest subsets A within X and shrink A in each round.&lt;br /&gt;&lt;br /&gt;Notice that if support(A) is not low enough, there is no need to try any subset A' of A because support(A') can only be higher.  
Therefore we only spend effort on subsets of the surviving A.&lt;br /&gt;&lt;br /&gt;First round:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Within X, generate the candidate k-1 item subsets.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Compute the confidence of each candidate.&lt;/li&gt;&lt;li&gt;Filter out those whose confidence value &amp;lt; confidenceThreshold &lt;/li&gt;&lt;li&gt;For each surviving k-1 item subset A, mark the rule (A =&gt; X-A)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;J-th round:&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;(repeat until no j-1 item subset survives)&lt;/span&gt;&lt;ul&gt;&lt;li&gt;Within each surviving j-item subset A, generate the candidate j-1 item subsets.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Compute the confidence of each candidate.&lt;/li&gt;&lt;li&gt;Filter out those whose confidence value &amp;lt; confidenceThreshold &lt;/li&gt;&lt;li&gt;For each surviving j-1 item subset A', mark the rule (A' =&gt; X-A')&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;span style="font-weight: bold;font-size:130%;" &gt;Miscellaneous&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Note that confidence is an absolute measure, not a relative one.&lt;br /&gt;When (A =&gt; B) has confidence = 75%, it is also possible that  (!A =&gt; B) has confidence = 90%.  
In other words, some rules may contradict each other, and usually the one with higher support and confidence wins.</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1511164719307703638/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1511164719307703638' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1511164719307703638'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1511164719307703638'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/10/machine-learning-association-rule.html' title='Machine Learning: Association Rule'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-4724021543464565960</id><published>2009-10-07T11:40:00.001-07:00</published><updated>2009-10-07T17:38:39.221-07:00</updated><title type='text'>Re-engineer Legacy Architecture</title><content type='html'>A legacy system may sound like a negative term, but in fact it is the result of a successful system. Most legacy systems lie at the core of the business operations of many enterprises. 
In fact, it is because of its criticality to the business that no one dares to make changes, since a small bug can have a huge financial impact.&lt;br /&gt;&lt;br /&gt;Most of those legacy systems started with a clean design, as the original problem they tried to solve was smaller and well-defined. However, as business competition and organizational evolution continuously demand new enhancements/features within a relatively short timeframe, these new features are typically implemented in a completely separate module so that existing working code won't be touched. Common functionality coded in the original system gets copied into the new module.  This is a very common syndrome that I call &lt;span style="font-weight: bold; font-style: italic;"&gt;"reuse via copy and paste"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here is &lt;a href="http://horicky.blogspot.com/2007/10/how-bad-code-is-formed.html"&gt;an earlier blog&lt;/a&gt; about some common mindsets that cause &lt;a href="http://horicky.blogspot.com/2007/10/how-bad-code-is-formed.html"&gt;the formation of bad code&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Basically, the idea of "not touching existing working code" encourages a "copy and paste code" culture which, over time, causes a lot of code duplication across many places. Once a bugfix is developed, you need to make sure the fix is put into all the copied code. Once you enhance a feature, you need to make sure to roll it into all the copies. When there are 20 different places in the code doing the same thing (but slightly differently), you start losing visibility into which one takes the ultimate responsibility for a certain piece of logic. Now the code is very hard to maintain and also hard to understand.&lt;br /&gt;&lt;br /&gt;Because you cannot understand it, you are more scared of making changes to existing code (since it is working). 
This further encourages you to put new features in a completely separate module, which further worsens the situation.  The cycle repeats.&lt;br /&gt;&lt;br /&gt;Over a period of time, the code becomes so unmaintainable that adding any new feature takes a long time and usually breaks existing code in many places. The development team doesn't feel productive and works in a low-morale condition. In my past career, I was brought in to help with this situation.&lt;br /&gt;&lt;br /&gt;At a high level, here are the key steps ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;1.  Identify your target architecture&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Define a "to-be" architecture that can support the business objectives of the next 5 years. It is important to purposely ignore the current legacy system at this stage because otherwise you won't be able to think "outside the box".&lt;br /&gt;&lt;br /&gt;Be careful not to give the impression that this exercise is going to suggest throwing the existing system away and starting everything from scratch. It is important to understand that the "to-be architecture" is primarily a thought exercise for us to define our target. And we should clearly separate our "vision" from the "execution", which we shouldn't be worrying about at this stage.&lt;br /&gt;&lt;br /&gt;The long-term architecture establishes a vision of where we want the ultimate architecture to be and serves as our long-term target.  A core vs non-core analysis is also necessary to decide which components should be built vs bought.&lt;br /&gt;&lt;br /&gt;It is also important to get a sense of possible future changes and build enough flexibility into the architecture that it can adapt to those changes when they happen. Knowing what you don't know is very important.&lt;br /&gt;&lt;br /&gt;A top-down approach is typically used to design the to-be architecture. 
The level of detail is determined by how well the requirements are known and how likely they are to change in future. Since the to-be architecture mainly serves as a guiding target, I usually won't get too deep into implementation details at this stage.&lt;br /&gt;&lt;br /&gt;The next step is to get down to the ground and understand where you are now.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;2.  Understand your existing system&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To get up to speed quickly, my first step is to talk to people who understand the current code base as well as the pain points. I also skim through existing documents and presentation slides to get a basic understanding of the existing architecture.&lt;br /&gt;&lt;br /&gt;If people who are knowledgeable about how the legacy system works are still available, conducting a &lt;a href="http://horicky.blogspot.com/2009/10/architecture-review-process.html"&gt;formal architecture review process&lt;/a&gt; can be a very efficient way to get started on understanding the legacy system.&lt;br /&gt;&lt;br /&gt;If these people have already left, a different &lt;a href="http://horicky.blogspot.com/2007/10/understand-legacy-code.html"&gt;reverse engineering process&lt;/a&gt; is needed.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;3.  Define your action plan&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At this point, you have a clear picture of where you are and where you want to be.  The next step is to figure out how to move from here to there. 
In fact this stage is the hardest because many factors need to be taken into consideration.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Business priorities and important milestone dates&lt;/li&gt;&lt;li&gt;Risk factors and opportunity costs&lt;/li&gt;&lt;li&gt;Organization skill set distribution and culture&lt;/li&gt;&lt;/ul&gt;The next step is to construct an execution plan that optimizes business opportunities and minimizes cost and risks.  Each risk also needs an associated contingency plan (plan B).  In my experience, the action plan usually takes one of the following forms.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:100%;" &gt;&lt;span style="font-weight: bold;"&gt;Parallel development of a green-field project&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;A small team of the best developers works in parallel (alongside the legacy system) to create the architecture from scratch.  The latest, best-of-breed technologies are typically used, so that most of the infrastructure pieces can either be bought or fulfilled by open source technologies.  The team focuses on rewriting just the core business logic.  The green-field system is typically easier to understand and more efficient.&lt;br /&gt;&lt;br /&gt;After the green-field system is sufficiently tested, the existing legacy system is swapped out (or kept as a contingency backup).  Careful planning of data migration and traffic migration is important to make sure the transition is smooth.&lt;br /&gt;&lt;br /&gt;One problem with this approach is development cost, because (during the transition period) you need to maintain two teams of developers working on two systems.  New feature requirements may come in continuously and you may need to implement the same thing twice, once in each system.&lt;br /&gt;&lt;br /&gt;Another problem is the morale of the developers who maintain the legacy system.  They know the system is going away in future and so they may need to find another job.  
You may end up losing the people who are knowledgeable about your legacy system even faster.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;Refactoring&lt;/span&gt;&lt;br /&gt;Another approach is to refactor the current code base and incrementally bring it back into good shape.  This involves repartitioning the responsibilities of existing components and breaking down complex components or long methods.&lt;br /&gt;&lt;br /&gt;Sometimes I run into code that I am not able to understand, which may be dead code that never gets exercised, or logic that is hidden behind many levels of indirection.  What I typically do is add trace statements into the code and rerun the system to see when this code is executed and who is calling it (by observing the stack trace).  I also put a wrapper around the code that I don't understand and shrink the wrapper's perimeter to the point where I can safely swap out the component I don't understand.&lt;br /&gt;&lt;br /&gt;It is also quite common that a legacy system lacks unit tests, so a fair amount of effort may need to be spent writing unit tests around the components.&lt;br /&gt;&lt;br /&gt;One problem with the refactoring approach is that it is not easy to get management buy-in, because they don't see any new features coming out of the engineering effort being spent.</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/4724021543464565960/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=4724021543464565960' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4724021543464565960'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/4724021543464565960'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/10/re-engineer-legacy-architecture.html' title='Re-engineer Legacy Architecture'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-154977010425538201</id><published>2009-10-07T09:22:00.000-07:00</published><updated>2009-10-07T16:29:22.704-07:00</updated><title type='text'>Architecture Review Process</title><content type='html'>If you are lucky enough to still have the engineers who are knowledgeable about the current system around, then conducting a formal "Architecture Review Process" is probably the most efficient way to understand how the existing system works.&lt;br /&gt;&lt;br /&gt;However, if these people have already left the company, then a &lt;a href="http://horicky.blogspot.com/2007/10/understand-legacy-code.html"&gt;different reverse engineering effort&lt;/a&gt; needs to be undertaken instead.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Participants&lt;/span&gt;&lt;br /&gt;The review usually involves a number of key persons&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A &lt;span style="font-weight: bold; font-style: italic;"&gt;facilitator&lt;/span&gt; who orchestrates the whole review process and controls the level of depth at different stages&lt;/li&gt;&lt;li&gt;A &lt;span style="font-weight: bold; font-style: italic;"&gt;recorder&lt;/span&gt; who documents key points, ideas, observations and outstanding 
issues throughout the review process&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Key architects and developers&lt;/span&gt; who collectively understand the details of the legacy systems, and are able to get down to any level of detail (even code walk-through) if necessary.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Domain expert&lt;/span&gt; who understands every detail of how people use the system, what their current pain points are and what features will help them most.  The domain expert also helps to set business priorities between conflicting goals.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Process&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The architecture review process has the following steps.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Use Cases and Actors&lt;/span&gt;&lt;br /&gt;With the business domain expert, we focus on the "actors" (who uses the system) as well as the "use cases" (how they use it).  We also look at the key metrics that measure the efficiency of their tasks (e.g. how long it takes to complete a use case)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Activity Analysis&lt;/span&gt;&lt;br /&gt;Drilling down from each use case, we identify the activities that each actor performs.  We look at what data needs to be captured at each activity and how actors interact with each other as well as with the system.&lt;br /&gt;&lt;br /&gt;At this point, we should have established a good external view of the system.  Now we dig into its internals ...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Technology Stack&lt;/span&gt;&lt;br /&gt;The purpose is to understand the building blocks that underlie the system and get a good sense of whether the build vs buy combination is correct.  
Things like which programming language, Java vs .NET, which App Server, what DB, any ORM vs direct SQL, XML vs JSON, which IoC or AOP container, Messaging framework ... etc need to be discussed.  We also need to distinguish the core features (which you mostly want to build) from the non-core features (for which you mostly want to leverage 3rd party code).  By the end of this exercise, we'll have a very good understanding of the foundation on which we write our code, and perhaps we can also identify certain areas where we can swap in 3rd party code.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Component Analysis&lt;/span&gt;&lt;br /&gt;This is the portion where most of the time is spent.  Here we dissect the whole system into components.  It starts with the architect highlighting a list of major components of the current system.  For each component, we look at&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The responsibility of the component&lt;/li&gt;&lt;li&gt;The persistent data owned by the component and the life cycle of maintaining this data&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The interface of the component&lt;/li&gt;&lt;li&gt;The thread model executing the logic of the component (ie:  caller thread vs listening thread vs a newly spawned thread) as well as any concurrent access implications&lt;/li&gt;&lt;li&gt;What are the potential bottlenecks of this component and how can we remove them when they occur?&lt;/li&gt;&lt;li&gt;How does the component scale up along different dimensions of growth (e.g. more users, more data, higher traffic rate ... etc)?&lt;br /&gt;&lt;/li&gt;&lt;li&gt;What is the impact if this component crashes?  How does the recovery happen?&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;It is important to know whether the component communicates across VM boundaries.  
If so,&lt;br /&gt;&lt;ul&gt;&lt;li&gt;What is the authentication and authorization mechanism?&lt;/li&gt;&lt;li&gt;What is the format of the messages being communicated?&lt;/li&gt;&lt;li&gt;Is the data transferred in clear text or encrypted?  And where is the secret key stored?&lt;/li&gt;&lt;li&gt;What is the handshaking sequence (protocol)?&lt;/li&gt;&lt;li&gt;Is the component stateful or stateless?&lt;/li&gt;&lt;/ul&gt;Since we have already dived deep into the architecture, I usually take a further step and drill into the code by asking the developers to walk me through the code of some key components.  This usually gives me a good sense of the code quality: whether the code is easy to read, whether the logic is easy to follow, whether methods are too long, whether there is duplicated logic scattered around ... etc, and more importantly, whether there are sufficient unit tests around.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Maintainability Analysis&lt;/span&gt;&lt;br /&gt;This focuses on the ongoing maintenance of the architecture, things like ...&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Is the system sufficiently instrumented for monitoring purposes?&lt;/li&gt;&lt;li&gt;When a problem happens, is there enough trace around to quickly identify what went wrong?&lt;/li&gt;&lt;li&gt;Can the system continue working (with some tolerable degradation) when some components fail?&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Extensibility Analysis&lt;/span&gt;&lt;br /&gt;Understand the parameters that affect the behavior of the system. When different kinds of changes happen, how much code needs to be changed to accommodate them?  Or can the system still serve by just changing configurable parameters?   For example, does the system hard code business rules or use some kind of rule engine?  Does the system hard code the business flow or use some kind of workflow system?  
What if the system needs to serve a different UI device (like mobile devices) ?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Output&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The output of an architecture review process is typically&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A set of documents/diagrams of the key abstractions that give a better view of how the overall system works.  These documents should help a newcomer get up to speed more quickly, as well as communicate the architecture to a broader audience.&lt;/li&gt;&lt;li&gt;A set of recommended action plans on what can be done to improve the current architecture.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-154977010425538201?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/154977010425538201/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=154977010425538201' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/154977010425538201'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/154977010425538201'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/10/architecture-review-process.html' title='Architecture Review Process'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-7754102968628589796</id><published>2009-09-27T08:09:00.001-07:00</published><updated>2009-10-09T10:30:13.377-07:00</updated><title type='text'>Reinforcement Learning</title><content type='html'>Reinforcement Learning (RL) is a type of Machine Learning distinct from "supervised learning" (which has a teaching phase that shows the mapping between inputs and correct answers) and "unsupervised learning" (which discovers clusters and outliers in a set of input samples).&lt;br /&gt;&lt;br /&gt;In RL, consider a set of "states" (from the environment) in which the agent decides which action to take; this action causes it to move to a different state.  A reward is assigned to the agent after this state transition.  During the RL process, the agent's goal is to learn, through trial and error, the optimal decision at each state such that the total reward is maximized.&lt;br /&gt;&lt;br /&gt;The hard part of RL is knowing which actions have a long term effect on the final outcome.  For example, a wrong decision may not have an immediately bad result and may therefore stay hidden.  RL is about how to assign blame to previous decisions when a bad outcome is detected.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Basic Iteration Approach&lt;/span&gt;&lt;br /&gt;There is a reward matrix in which each row represents the from-state and each column represents the to-state.  
The cell (i, j) represents the "immediate reward" obtained when moving from state i to state j.&lt;br /&gt;&lt;br /&gt;The goal is to find an &lt;span style="font-weight: bold;"&gt;optimal policy&lt;/span&gt; which recommends the action that should be taken at each state in order to maximize the sum of rewards.&lt;br /&gt;&lt;br /&gt;We can use a value vector whose element (i) represents the agent's perception of the overall reward it will gain from state (i).  At the beginning, the value vector is set to an initial guess (all zeros in the code below).  We use the following iterative approach to modify the value vector until it converges.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;def learn_value_vector&lt;br /&gt; current_state = initial_state&lt;br /&gt; set value_vector to all zeros&lt;br /&gt; repeat until value_vector.converges&lt;br /&gt;   # Need to enumerate all reachable next states&lt;br /&gt;   for each state(j) reachable by current state(i)&lt;br /&gt;     Take action to reach next state(j)&lt;br /&gt;     Collect reward(i, j)&lt;br /&gt;     action_value(j) =&lt;br /&gt;       reward(i, j) + discount * value_vector(j)&lt;br /&gt;   end&lt;br /&gt;   # Since you will always take the path of max overall reward&lt;br /&gt;   value_vector(i) = max_over_j(action_value(j))&lt;br /&gt;   # maxj is the j that achieved the max above&lt;br /&gt;   current_state = state(maxj)&lt;br /&gt; end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;After we figure out this value vector, deriving the policy is straightforward.  
We just need to look across all reachable next states and pick the one with the highest action value, i.e. the immediate reward plus the discounted value of that next state.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;def learn_policy1&lt;br /&gt; for each state(i)&lt;br /&gt;   best_transition = nil&lt;br /&gt;   max_value = -infinity&lt;br /&gt;   for each state(j) reachable from state(i)&lt;br /&gt;     # Same action value as used in learn_value_vector&lt;br /&gt;     action_value = reward(i, j) + discount * value(j)&lt;br /&gt;     if action_value &gt; max_value&lt;br /&gt;       best_transition = j&lt;br /&gt;       max_value = action_value&lt;br /&gt;     end&lt;br /&gt;   end&lt;br /&gt;   policy(i) = best_transition&lt;br /&gt; end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;One problem with this approach is that it requires us to try out all possible actions and evaluate the rewards to every next state.  So there is an improved iterative approach, described below.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Q-Learning&lt;/span&gt;&lt;br /&gt;In Q-Learning, we use a Q Matrix instead of the value vector.  Instead of estimating the value of each state, we estimate the value of each transition from the current state.  In other words, we associate the value with the (state, action) pair instead of just the state.&lt;br /&gt;&lt;br /&gt;Therefore, the cell(i, j) of the Q matrix represents the agent's perceived value of the transition from state(i) to state(j).  
We use the following iterative approach to modify the Q matrix until it converges.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;def learn_q_matrix&lt;br /&gt; current_state = initial_state&lt;br /&gt; set Q matrix to all zeros&lt;br /&gt; repeat until Q matrix converges&lt;br /&gt;   # state(i) is the current state&lt;br /&gt;   select the next state(j) randomly among those reachable from state(i)&lt;br /&gt;   collect reward (i, j)&lt;br /&gt;   value(j) = max Q(j, k) across k&lt;br /&gt;   Q(i, j) = reward(i, j) + discount * value(j)&lt;br /&gt;   current_state = state(j)&lt;br /&gt; end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;After figuring out the Q matrix, the policy at state (i) is simply to pick the state(j) which has the max Q(i, j) value.&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; color: rgb(0, 0, 0); background-color: rgb(238, 238, 238); font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;def learn_policy2&lt;br /&gt; for each state(i)&lt;br /&gt;   best_transition = nil&lt;br /&gt;   max_value = -infinity&lt;br /&gt;   for each state(j) reachable from state(i)&lt;br /&gt;     if Q(i,j) &amp;gt; max_value&lt;br /&gt;       best_transition = j&lt;br /&gt;       max_value = Q(i,j)&lt;br /&gt;     end&lt;br /&gt;   end&lt;br /&gt;   policy(i) = best_transition&lt;br /&gt; end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;The relationship between the action (what the agent does) and the state transition (which new state it ends up in) isn't necessarily deterministic.  In real life, an action and its effect are usually probabilistic rather than deterministic (e.g. if you leave your house early, you are more likely to reach your office earlier, but it is not guaranteed).  
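&lt;br /&gt;&lt;br /&gt;Before going further into the probabilistic case, the deterministic Q-learning loop above can be sketched in runnable form.  This is only a minimal Python illustration, not the author's implementation: the 4-state reward table REWARD, the discount of 0.5, the fixed step budget (standing in for "repeat until Q matrix converges") and the random seed are all assumptions made up for the example.

```python
import random

# Hypothetical 4-state example (not from the post): REWARD[i][j] is the
# immediate reward for moving from state i to a reachable state j.
REWARD = {
    0: {1: 0.0},
    1: {0: 0.0, 2: 0.0},
    2: {1: 0.0, 3: 100.0},
    3: {2: 0.0},
}


def learn_q_matrix(reward, discount=0.5, steps=2000, seed=0):
    """Random-walk Q-learning for deterministic transitions."""
    rng = random.Random(seed)
    # Q starts at zero for every reachable transition.
    q = {i: {j: 0.0 for j in nexts} for i, nexts in reward.items()}
    current = rng.choice(list(reward))
    # A fixed step budget stands in for "repeat until Q matrix converges".
    for _ in range(steps):
        nxt = rng.choice(list(reward[current]))  # select the next state randomly
        best_next = max(q[nxt].values())         # value(j) = max Q(j, k) across k
        q[current][nxt] = reward[current][nxt] + discount * best_next
        current = nxt                            # continue the walk from state j
    return q


def learn_policy(q):
    """At each state, pick the next state with the largest Q value."""
    return {i: max(row, key=row.get) for i, row in q.items()}


q_matrix = learn_q_matrix(REWARD)
policy = learn_policy(q_matrix)
print(policy)
```

With this made-up reward table, the learned policy should steer every state toward the transition from state 2 to state 3, which is the only one that pays a reward.&lt;br /&gt;&lt;br /&gt;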
Imagine a probabilistic state transition diagram, where each action has multiple branches leading to each possible next state, with a probability assigned to each branch. Making decisions in this model is called a Markov Decision Process.&lt;br /&gt;&lt;br /&gt;The Q-learning approach described above also applies to a Markov Decision Process, provided each update blends the new estimate into the old Q value using a learning rate instead of overwriting it, so that random outcomes average out.&lt;br /&gt;&lt;br /&gt;Some good articles on RL:&lt;br /&gt;&lt;a href="http://www.nbu.bg/cogs/events/2000/Readings/Petrov/rltutorial.pdf"&gt;Reinforcement Learning: A tutorial&lt;/a&gt;&lt;br /&gt;&lt;a href="http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/index.html"&gt;Q-learning by examples&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-7754102968628589796?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/7754102968628589796/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=7754102968628589796' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7754102968628589796'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/7754102968628589796'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/09/reinforcement-learning.html' title='Reinforcement Learning'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-911461986397433055</id><published>2009-09-10T16:59:00.000-07:00</published><updated>2009-10-28T09:16:36.083-07:00</updated><title type='text'>Math Concepts for kids and teens</title><content type='html'>Summarizing some key math concepts that I teach my kids.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Fundamentals&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A correct value system is the most important foundation (the goal to excel, the willingness to help)&lt;/li&gt;&lt;li&gt;How to make decisions ?  (differentiate emotional decisions from strategic decisions)&lt;/li&gt;&lt;li&gt;How to do planning ?&lt;/li&gt;&lt;li&gt;How to reason, analyze and draw conclusions ?&lt;/li&gt;&lt;li&gt;How to be open minded and humble without blindly following conventional wisdom ? (why humans walk on 2 legs, why we have supermarkets, how we decide where to put a bus station, why an apple is more expensive than an orange)&lt;/li&gt;&lt;li&gt;How to be patient and control emotions ?&lt;/li&gt;&lt;li&gt;Develop a good sense of numbers and the ability to read different charts and graphs, observing relationships between variables and their trends.&lt;/li&gt;&lt;li&gt;Appreciation of doing things in a smart way&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Basic Math Concepts&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Numbers, counting (Integer) and quantities (Real)&lt;/li&gt;&lt;li&gt;Cause and effect&lt;/li&gt;&lt;li&gt;Set (belongs, union, intersect, subset)&lt;/li&gt;&lt;li&gt;Function (dependent and independent variable, continuous vs discrete).  
Various graphs (histogram, line graph, plot), 2D curves and 3D planes&lt;/li&gt;&lt;li&gt;Linear equations, degrees of freedom, relationship between number of variables and number of equations.&lt;/li&gt;&lt;li&gt;Calculus (differentiation and integration), multi-variables and partial differentiation&lt;/li&gt;&lt;li&gt;Logic (if/then, necessary/sufficient conditions, equivalence) and establishing proofs&lt;/li&gt;&lt;li&gt;Debate and logical fallacies&lt;/li&gt;&lt;li&gt;Geometry and Vector (think 3D instead of 2D)&lt;/li&gt;&lt;li&gt;Probabilities (draw a tree of all outcomes and count)&lt;/li&gt;&lt;li&gt;Probability distribution function and expected gains&lt;/li&gt;&lt;li&gt;Permutations and Combinations (how to find out "all possibilities")&lt;/li&gt;&lt;li&gt;Mathematical induction, recursion in proofs.&lt;/li&gt;&lt;li&gt;Digits with different bases (and their relationship with Polynomials)&lt;/li&gt;&lt;li&gt;Making predictions: False positives, False negatives and how trade-off decisions should be made&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Math Models&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Decision tree (decision and outcome alternations, min/max strategy).  Expected gain and optimization&lt;/li&gt;&lt;li&gt;Game theory (Nash equilibrium).  Outcome prediction within a social group.  
Win/win and win/lose and lose/lose situations.&lt;/li&gt;&lt;li&gt;Finding solutions using a search tree, exhaustive search of all possibilities in a systematic way (tree traversal, breadth-first vs depth-first vs heuristic)&lt;/li&gt;&lt;li&gt;Linear programming for constraint satisfaction and optimization&lt;/li&gt;&lt;li&gt;Deterministic vs Stochastic process (Markov chains), Queuing theory&lt;/li&gt;&lt;li&gt;Control system (equilibrium, stability and feedback loop)&lt;/li&gt;&lt;li&gt;Graph model (nodes and arcs, path finding, shortest path, minimum spanning tree)&lt;/li&gt;&lt;li&gt;Finite State Machines (everything happens in a cycle)&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-911461986397433055?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/911461986397433055/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=911461986397433055' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/911461986397433055'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/911461986397433055'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/09/math-concepts-for-kids-and-teens.html' title='Math Concepts for kids and teens'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-1053362063632195376</id><published>2009-08-26T23:30:00.000-07:00</published><updated>2009-08-26T23:45:54.025-07:00</updated><title type='text'>Traditional SaaS vs Cloud enabled SaaS</title><content type='html'>Inspired by Gilad's &lt;a href="http://cloud-silver-lining.blogspot.com/2009/08/cloud-computing-programming-model-draft.html"&gt;great summary on the Cloud Programming model&lt;/a&gt;, I try to summarize the differences that I observe between the traditional SaaS model and the "cloud-enabled SaaS model".  Although cloud providers advocate that zero effort is needed to migrate existing applications into the cloud, it is my belief that this "strict-port" approach doesn't exploit the full power of cloud computing.  The cloud differs from a traditional data center environment in a number of characteristics, and applications designed along these characteristics will take more advantage of the cloud.&lt;br /&gt;&lt;br /&gt;I believe a Cloud-enabled-Application should have the following characteristics in its fundamental design.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Latency Awareness&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A traditional SaaS app typically runs within a single data center and assumes low latency among server components.  In a cloud environment that spans many distant geographic locations, the assumption of low latency no longer holds.  We need to be “smarter” when choosing where to deploy, to avoid putting frequently communicating components in far-distant locations.  
A “Cloud-enabled SaaS app” needs to be aware of latency differences and have built-in self-configuring and self-tuning mechanisms to cope with them.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Cost Awareness&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A traditional SaaS app typically runs on already purchased hardware where utilization efficiency is not a concern.  Now, with the “pay as you go” model, the application needs to pay more attention to its usage pattern and to how efficiently it uses the underlying resources, because these affect the operation cost.  A Cloud-enabled SaaS application needs to understand the cost model of different kinds of resource utilization (e.g. CPU cost may be very different from bandwidth cost) and adjust its usage strategy to minimize the operation cost.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Security Awareness&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A traditional SaaS app typically runs in a fully trusted data center based on perimeter security.  But in the Hybrid cloud model, the perimeter is drawn very differently.  The application needs to carefully select where to store its data so that sensitive information will not leak.  This involves carefully choosing the storage provider or using encryption for protection.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Capitalize on Elasticity&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A traditional SaaS app is not used to large-scale growth / shrink of compute resources and typically isn’t designed well to handle how data gets distributed to newly joined machines (in a growth scenario) or redistributed among remaining machines (in a shrink scenario).  This ends up making very inefficient use of network bandwidth and results in high cost and low performance.  
A more sophisticated data distribution protocol that aligns with the growth and shrink dimensions is needed for a “Cloud-enabled SaaS app”.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-1053362063632195376?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/1053362063632195376/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=1053362063632195376' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1053362063632195376'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/1053362063632195376'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/08/traditional-saas-vs-cloud-enabled-saas.html' title='Traditional SaaS vs Cloud enabled SaaS'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-8461415464199041880</id><published>2009-08-17T17:17:00.000-07:00</published><updated>2009-08-17T17:27:23.306-07:00</updated><title type='text'>Multi-tenancy in cloud computing</title><content type='html'>Following up on an interesting discussion in the Cloud Computing discussion group.  What is a tenant ?  Is multi-tenancy an important feature of cloud ?  
Who are the participants and what are their roles in the cloud ecosystem ?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Participants in the cloud&lt;/span&gt;&lt;br /&gt;In my model, a "&lt;span style="font-weight: bold; font-style: italic; color: rgb(255, 0, 0);"&gt;SaaS provider&lt;/span&gt;" is the organization that provides a domain specific SaaS App to its users (e.g. SmugMug for photo sharing).  In this case, the &lt;span style="font-weight: bold; font-style: italic; color: rgb(255, 0, 0);"&gt;SaaS consumer&lt;/span&gt; is just any individual who has a SmugMug account.  The SaaS provider may choose an &lt;span style="font-weight: bold; font-style: italic; color: rgb(255, 0, 0);"&gt;infrastructure provider&lt;/span&gt; (e.g. Amazon) to host its SaaS App.  In this example, SmugMug is a SaaS provider and an &lt;span style="font-weight: bold; font-style: italic; color: rgb(255, 0, 0);"&gt;Infrastructure consumer&lt;/span&gt; at the same time.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Definition of a Tenant&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Now, who is the "tenant" in this picture ?  I think Amazon will consider SmugMug a tenant.  But I doubt SmugMug will consider its individual users tenants.&lt;br /&gt;&lt;br /&gt;But what if SmugMug offers a service to car manufacturers so they can store, organize and image process their photos, which will show up on the car manufacturer's website ?  Will SmugMug consider BMW a tenant ?  I think the answer is "yes".  So maybe the definition of a tenant is "&lt;span style="font-weight: bold; font-style: italic; color: rgb(255, 0, 0);"&gt;my user who has her own users&lt;/span&gt;".&lt;br /&gt;&lt;br /&gt;You can see that a value chain can be built up.  
So, except for the start and end points of this value chain, everyone is a "tenant" to its service provider.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Multi-tenancy&lt;/span&gt;&lt;br /&gt;After we define what a "tenant" is, what does "multi-tenancy" mean ?  In my opinion, "multi-tenancy" is for the benefit of the service provider, so they can manage resource utilization more efficiently, but multi-tenancy is not to the tenant's advantage at all.  In the hypothetical example I gave above, would BMW prefer a multi-tenancy environment from SmugMug ?  My guess is that BMW would in fact worry if their data is sitting together with their competitors' data in a shared infrastructure.  I bet they would prefer an environment which is isolated as much as possible.&lt;br /&gt;&lt;br /&gt;While "multi-tenancy" indicates that some infrastructure is shared, the layer at which things are shared can make a big difference.  For example, Amazon AWS is multi-tenant at the hardware level in that its users may be sharing a physical machine.  On the other hand, Force.com is multi-tenant at the DB level in that its users are sharing data in the same DB tables.  Amazon relies on the hypervisor to provide the isolation between tenants, while Force.com relies on a query rewriter to do the same.&lt;br /&gt;&lt;br /&gt;While "multi-tenancy" at the highest layer basically advocates a shared-DB approach, does it enable better collaboration or sharing between tenants ?  I don't think so.  I think all we need is an authentication model such that spontaneous workgroups can be formed and membership can be identified.  Then it is just a matter of the requesting tenant presenting its membership to another tenant when making a SaaS service call.  
What I mean is they are using an SOA approach to access data, rather than directly accessing a shared DB.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7994087232040033267-8461415464199041880?l=horicky.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://horicky.blogspot.com/feeds/8461415464199041880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7994087232040033267&amp;postID=8461415464199041880' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/8461415464199041880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7994087232040033267/posts/default/8461415464199041880'/><link rel='alternate' type='text/html' href='http://horicky.blogspot.com/2009/08/multi-tenancy-in-cloud-computing.html' title='Multi-tenancy in cloud computing'/><author><name>Ricky Ho</name><uri>http://www.blogger.com/profile/03793674536997651667</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://farm2.static.flickr.com/1148/buddyicons/10062317@N02.jpg?1184418980'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7994087232040033267.post-9025005265012849228</id><published>2009-08-09T09:53:00.000-07:00</published><updated>2009-09-09T15:54:21.259-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Network latency'/><category scheme='http://www.blogger.com/atom/ns#' term='Cloud computing'/><category scheme='http://www.blogger.com/atom/ns#' term='Cost optimization'/><title type='text'>Skinny Straw in the Cloud Shake</title><content type='html'>Recently there was&lt;a href="http://tinyurl.com/lrvctl"&gt; an article by Bernard Golden&lt;/a&gt; discussing how network 
constraints (bandwidth and latency), as well as the associated bandwidth usage cost, continue to be one of the main obstacles in cloud computing.&lt;br /&gt;&lt;br /&gt;There are two concerns here.  One is about not meeting the application's performance goal (throughput and response time).  The other is about the cost of running in the cloud (i.e. receiving a large bill from your cloud provider).&lt;br /&gt;&lt;br /&gt;The goal is to reduce the total amount of data transfer.  A number of cloud app design patterns can be used ...&lt;br /&gt;&lt;br /&gt;How do you put the code and data together before the processing can start ?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Try to be as stateless as possible&lt;/span&gt;&lt;br /&gt;There is zero data to be transferred if your component is stateless by nature.  The following techniques assume that there are some unavoidable stateful components involved.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Move your data creation process into the cloud first&lt;br /&gt;&lt;/span&gt;Instead of uploading a huge volume of data from your data center into the cloud so processing can be started, can you move the data creation process into the cloud ?  Of course, you need to carefully evaluate the security implications here.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Distribute the architecture of your data creation&lt;/span&gt;&lt;br /&gt;If the subsequent processing is based on a parallel execution architecture, why not distribute the data creation processing as well ?  This will save a data repartition step.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Move the code to the data&lt;/span&gt;&lt;br /&gt;Code usually has a much smaller footprint than the data it processes.  Therefore it is more economical to move processing logic to the data rather than downloading the data to process.  
Of course, we need to check to make sure the machine hosting the data has enough CPU power to execute the processing logic.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Do as much as possible along current partition&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;A typical parallel processing architecture partitions data along some dimensions, conducts the processing in parallel, then repartitions the data along other dimensions, conducts the next stage of processing, and so on ...&lt;br /&gt;&lt;br /&gt;See if you can rearrange the order of processing such that you can do as much as possible within the current partition.  The goal is to minimize the number of repartitions, where a lot of data transfer is needed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Minimize data redistribution at grow/shrink&lt;br /&gt;&lt;/span&gt;How do you redistribute data to a newly joined VM such that the overall data transfer is minimized ?  For example, a "consistent hashing" algorithm can be used so that data redistribution only involves the neighbors of the newly joined VM rather than every existing VM.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Conduct data redistribution in the background&lt;/span&gt;&lt;br /&gt;Data redistribution should have an impact on performance but not accuracy.  In other words, the newly joined VMs should be able to serve immediately while data redistribution proceeds in the background.  The data redistribution algorithm (which may take a longer time to finish) also needs to adapt to VMs joining continuously.  
In other words, data redistribution can be just an ongoing performance improvement process in a highly dynamic workload environment.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Place components with bandwidth cost in mind&lt;/span&gt;&lt;br /&gt;Other than the amount of data being transferred (which should be minimized anyway), it is equally important to look into bandwidth cost.  Typically the cloud provider will charge a substantial amount for bandwidth usage across the cloud boundary.  Therefore, it is important to place the components such that if data transfer does need to occur, it will occur within the cloud rather than across the cloud boundary.  This requires carefully analyzing the communication pattern among application components and grouping frequently communicating components so they will be deployed within the same cloud.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Migrate data as communication pattern changes&lt;/span&gt;&lt;br /&gt;Communication patterns may change after the system is deployed.  It is important to continuously monitor the actual communication patterns and determine if a migration is needed to minimize the bandwidth cost.  It is important to consider the gain versus the cost of migration.  Gain is estimated by multiplying the communication frequency by the time that the new communication pattern is going to persist.  Cost is estimated by the total amount of data redistribution traffic caused by component migration.  
The migration should take place only when its cost is smaller than the gain.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Exploit Caching&lt;/span&gt;&lt;br /&gt;Use a local cache to reduce the need for data access, especially if the data is relatively static.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Allow direct access to data&lt;br /&gt;&lt;/span&gt;This is against the philosophy of SOA, where the internal state should be encapsulated behind an API interface.  In this model, when a client wants to extract the data, it first needs to make a request to 
