Saturday, May 30, 2009

RESTful Design Patterns

Here I summarize a set of RESTful design practices that I have used quite successfully.

Object Patterns

If there are many objects of the same type, the object URL should contain the id of the object.
http://www.xyz.com/library/books/668102
If this object is a singleton object of that type, the id is not needed.
http://www.xyz.com/library

Get the object representation
HTTP GET is used to obtain a representation of the object. By default the URI refers to the object's metadata but not the actual content. To get the actual content ...
http://www.xyz.com/library/books/668102.content
The HTTP header "Accept" is also used to indicate the expected format. Note also that the representation of the whole object is returned; there is no URL representation at the attribute level.
GET /library/books/668102 HTTP/1.1
Host: www.xyz.com
Accept: application/json

Modify an existing Object

HTTP PUT is used to modify the object; the request body contains the representation of the object after a successful modification.

Create a new Object
HTTP PUT is also used to create the object when the caller has complete control over assigning the object id; the request body contains the representation of the object after a successful creation.
PUT /library/books/668102 HTTP/1.1
Host: www.xyz.com
Content-Type: application/xml
Content-Length: nnn

<book>
<title>Restful design</title>
<author>Ricky</author>
</book>
HTTP/1.1 201 Created

If the caller has no control over the object id, an HTTP POST is made to the object's parent container, with the request body containing the representation of the object. The response should contain a reference to the URL of the created object.
POST /library/books HTTP/1.1
Host: www.xyz.com
Content-Type: application/xml
Content-Length: nnn

<book>
<title>Restful design</title>
<author>Ricky</author>
</book>
HTTP/1.1 201 Created
Location: /library/books/668102

Invoke synchronous operation of the Object
HTTP POST is used to invoke an operation of the object that has side effects. The operation is indicated in a mandatory parameter "action". The arguments of the operation can be encoded either in the URL (for primitive types) or in the request body (for complex types).
POST /library/books/668102?action=buy&user=ricky HTTP/1.1
Host: www.xyz.com
POST /library/books/668102?action=buy HTTP/1.1
Host: www.xyz.com
Content-Type: application/xml; charset=utf-8
Content-Length: nnn

<user>
<id>ricky</id>
<addr>175, Westin St. CA 12345</addr>
</user>

Invoke asynchronous operation of the Object

When the operation takes a long time to complete, an asynchronous mode should be used. In a polling approach, a transient transaction object is returned immediately to the caller. The caller can then use a GET request to poll for the result of the operation.

We can also use a notification approach. In this case, the caller passes along a callback URI when making the request; the server will POST the result to that callback URI when it is done.
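For example, a polling interaction might look like the following sketch (the "transactions" container, the transaction id, the "reindex" action and the 202 response are illustrative assumptions, not part of the pattern itself):
POST /library/books/668102?action=reindex HTTP/1.1
Host: www.xyz.com

HTTP/1.1 202 Accepted
Location: /library/transactions/900123

GET /library/transactions/900123 HTTP/1.1
Host: www.xyz.com

HTTP/1.1 200 OK
Content-Type: application/xml

<transaction>
<status>completed</status>
<result>...</result>
</transaction>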

Destroy an existing Object
HTTP DELETE is used to destroy the object. This releases all the resources associated with the object.
DELETE /library/books/668102 HTTP/1.1
Host: www.xyz.com


Container Patterns

The immediate parent of a container must be an object (either a singleton object without an id or an object with an id). A container "contains" other objects or containers. If a container is destroyed, everything underneath is destroyed automatically in a recursive manner.
http://www.xyz.com/library/books
http://www.xyz.com/library/dvds
http://www.xyz.com/library/books/668102/chapters

In a GET operation, by default the container returns only the URL references of its immediate children. An optional parameter "expand" can be used to request the actual representation of all children and descendants.
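For example (the exact parameter name and value syntax are up to the API designer):
GET http://www.xyz.com/library/books?expand=true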

A more sophisticated GET operation can include a "criteria" parameter to return only the children that fulfill certain criteria.
GET http://www.xyz.com/library/books?criteria=[author likes 'ricky']


Reference Patterns

In many cases, objects refer to each other. The reference is embedded inside the representation of the object that holds it. If the referenced object is deleted, all these references need to be fixed in an application-specific way.

Friday, May 15, 2009

Cloud Security Considerations

"Security concern" is perhaps one of the biggest hurdle for enterprises to move their application from in-house data center to the public cloud. While a number of these concerns are real, I've found most of them are just fears from the lack of understanding of the cloud operating environment.

In this post, I will dig into each common security concern to look at its implications and suggest a set of practices and techniques to tackle the associated risks. I argue that with proper use of existing security technology, running an application in the public cloud can often be more secure than running it within your own data center.

Legal and Compliance

Country laws, auditing practices, and compliance requirements typically evolve slowly compared to technology. For example, if the law requires you to store your data within a certain geographic location, there is not much technology can do to help. You need to identify that set of data as well as the corresponding applications that manipulate it, because you have to run them in a data center residing in that particular geographic location.

Other than acknowledging that legal practices need to evolve before enterprises can legally run their applications in the cloud, I will not go further into this area, because there is not much we can do at the technical level.


Trust the Cloud Provider

First of all, you have to establish some level of trust in the cloud provider. At the very least, you need to believe your cloud provider has no incentive to steal your information. You also need confidence that they have sufficient security expertise to handle different kinds of attacks. Typically, extensive due diligence is necessary to get a deep understanding of the cloud provider's operating procedures, the underlying technologies being used, and their expertise/experience in handling various kinds of security attacks ... etc.

In the due diligence session, here are some typical questions ...
  • What is the level of physical security that their data centers run on ?
  • How much power do their system administrators have ? Can they see all your sensitive information or modify your environment ? E.g. can the system admin modify a machine image after creation ?
  • Does the cloud provider subcontract certain functionality to other 3rd-party providers ? If so, you may need to expand the evaluation to those 3rd parties.
  • Do they face bigger consequences (legal, financial, or reputational) in case a security compromise happens ?
  • Do they have sufficient security expertise and tooling for detecting and isolating attacks ? A good probe is to understand how they tackle DDoS attacks.
  • What guarantees do they provide in the SLA, and are they willing to share your loss in case a security breach happens ? (Almost no cloud provider today provides risk sharing in their SLA.)
  • If the cloud provider goes out of business, is there a process to get your data back ? How is your existing data disposed of ?
  • In case any government agencies or legal entities need to examine the cloud provider, how will your sensitive data be protected ?
Having a comprehensive set of security operating procedures is also very important.
  • How do they control their personnel's access to the data center environment ?
  • In case of a security attack that has caused damage, how and when will you be notified ?
  • How do they isolate different tenants running in their shared physical environment ?
  • What tools do they use to monitor their operating environment ? How do they diagnose a security attack incident ?
  • Do all sensitive operations leave an audit trail ?
In order to stay competitive, public cloud providers have generally acquired top system administration talent, usually better than your own data center administrators (particularly if you are an SME). In this regard, running your application in the public cloud can actually be more secure.

On the other hand, cloud providers are more likely to be the target of attacks.


Trust the Underlying Technologies

Since your application will be running in a shared environment, you depend heavily on the virtualization technology (e.g. the hypervisor) to provide fundamental isolation from other tenants. Any defect or bug in the virtualization technology can leak your sensitive information to other tenants.

Talk to your cloud provider to get a deep understanding of their selected technology and make sure it is proven and robust.

Computation / Execution environment
Most likely your cloud provider runs a hypervisor on top of physical machines, so the isolation mechanism is especially important. You need to be sure that once your virtual machine instance (guest VM) is started, it is technically impossible even for the cloud provider's administrator to gain access to any information residing in memory or on local disk, and that after you shut down your VM, all data in memory, on the swap device, and on local disk is completely wiped out and unrecoverable.

Data Storage / Persistence
Make sure your data is stored in a highly reliable environment such that the chance of data loss or corruption is practically zero, even in the event of a disaster. To achieve high data resilience, cloud providers typically create additional copies of your data and put them in geographically distributed locations. An auto-sync mechanism keeps the copies up to date.
  • How many copies of your data are made ?
  • Are these copies being placed in independent availability zones ?
  • How do they keep these copies in sync ? Is there any window that the data is inconsistent ?
  • In case some copies are lost, what is the mechanism to figure out which is the valid copy ? How is the recovery done and what is the worst possible damage ?
  • Is my data recoverable after deletion ? If so, what is their data retention policy ?
For scalability reasons, the cloud provider usually offers an "eventual consistency" model, which relaxes the data integrity guarantee. This model is quite different from traditional DB semantics, so you need to reevaluate what assumptions your application makes about the underlying DB and adjust your application logic accordingly.

Network communication
Cloud providers in general do a pretty good job of isolating network communication between guest VMs. For example, Amazon EC2 allows the setup of security groups, which are basically software firewalls that control who can talk to whom. Nevertheless, a "security group" is not equivalent to a shared LAN segment. Guest VMs (even within the same security group) can only see the network traffic directed at themselves, not communication among other VMs. By disallowing access to the physical network, the provider makes it much harder for other tenants to sniff your network traffic. On the other hand, running any low-level network intrusion detection software becomes unnecessary or even impossible. Also note that multicast communication is typically not supported in the cloud environment; your application needs some changes if it relies on a multicast-enabled network.

Nevertheless, the cloud provider's network admins can access the physical network and see everything flowing across it. For highly sensitive data, you should encrypt the communication.

Best Practices

There are some common practices that you can use to improve cloud security.

Encrypt all sensitive information
If your application is running in an environment that you have no control over, cryptography is your best friend. In general, you should encrypt your data whenever there is a possibility of it being exposed to an environment accessible by someone you don't trust (see the sketch after the list below).
  • Network Communication: For all communication channels that have sensitive information flowing across them, you should encrypt the network traffic.
  • Data Storage: For all sensitive data that you plan to store into the cloud, encrypt them first.
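As a minimal sketch of the data storage case (assuming Python with the third-party cryptography package; the upload call is a hypothetical placeholder):

from cryptography.fernet import Fernet

# Generate a symmetric key once and keep it OUTSIDE the cloud provider
# that stores the data (see "Secure your private key" below).
key = Fernet.generate_key()
cipher = Fernet(key)

record = b"ricky,175 Westin St. CA 12345"     # sensitive data
ciphertext = cipher.encrypt(record)           # encrypt before it leaves your environment

# upload_to_cloud(ciphertext)                 # hypothetical storage call
assert cipher.decrypt(ciphertext) == record   # only the key holder can read it back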
Secure your private key
The effectiveness of any cryptographic technique depends largely on how well you secure your private key. Make sure you put your private key in a separate safe environment (e.g. in a different cloud provider's environment). You can also encrypt your private key with another secret key, or split the secret key into multiple pieces and put each piece in a different place. The basic idea is to increase the number of independent systems that the hacker needs to compromise before gaining access to your sensitive information.
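To illustrate the splitting idea, here is a minimal sketch of a simple n-of-n XOR split (all pieces are required to reconstruct the key; this is not the m-of-n erasure-coded variant discussed below):

import os

def split_key(key: bytes, n: int) -> list[bytes]:
    """Split key into n pieces; all n pieces are required to reconstruct it."""
    pieces = [os.urandom(len(key)) for _ in range(n - 1)]
    last = key
    for p in pieces:
        last = bytes(a ^ b for a, b in zip(last, p))   # XOR the random pieces into the last piece
    return pieces + [last]

def join_key(pieces: list[bytes]) -> bytes:
    key = pieces[0]
    for p in pieces[1:]:
        key = bytes(a ^ b for a, b in zip(key, p))
    return key

pieces = split_key(b"my-32-byte-secret-key-material!!", 3)
assert join_key(pieces) == b"my-32-byte-secret-key-material!!"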

On the other hand, you also need to consider how to recover your data if you lose your private key (e.g. the system where your private key is stored has been irrecoverably damaged). One way is to store additional copies of the key, but this increases the chance of it being stolen. Another way is to apply an erasure coding technique to break your private key into n pieces such that the key can be recovered from any m pieces, where m is less than n.

Build your image carefully
If the underlying OS is malicious, then all cryptographic techniques are useless. Therefore, it is extremely important that your application runs on top of a trustworthy OS image. If you are building your own image, make sure you start from a clean and trustworthy OS source.

The base OS may come with a configuration containing software components that you don't need. You should harden the OS by removing this unused software, as well as the services, ports, and user accounts that you don't need. It is good practice to take a default-closed policy: every service is disabled by default, and an explicit configuration change is needed to turn it on.

Afterwards, you need to make sure each piece of software that you install on the clean image is also trustworthy. The effective trustworthiness of your final OS image is its weakest link: the minimum trustworthiness across all the installed software and the base OS.

Never put any sensitive information (e.g. your private key) into your customized image. Even if you only intend to use the image in a private environment, you may share it with the public in the future and forget to remove the sensitive information. If you need a password or secret key to start your VM, such sensitive information should be pushed in (via startup user parameters, or a data upload) by the initiator when it starts the VM.

Over the life cycle of your customized image, you need to monitor the relevant security patches and apply them in a timely manner.

Anomaly detection
Continuous health monitoring is important for detecting security attacks or abnormal workloads on your application. One effective way is to take a baseline of your application by observing its traffic patterns over a period of time. Novelty detection techniques can then be used to detect a sudden change in the traffic pattern, which usually indicates a DDoS attack.
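A minimal sketch of this idea, assuming we sample requests per minute and flag anything more than a few standard deviations away from the baseline (the sample values and the threshold of 3 are arbitrary illustrations):

from statistics import mean, stdev

baseline = [120, 130, 118, 125, 140, 135, 122, 128]   # requests/minute during normal operation
mu, sigma = mean(baseline), stdev(baseline)

def is_anomalous(requests_per_minute: float, threshold: float = 3.0) -> bool:
    """Flag traffic that deviates too far from the learned baseline."""
    return abs(requests_per_minute - mu) > threshold * sigma

print(is_anomalous(126))    # False: within the normal range
print(is_anomalous(5000))   # True: possible DDoS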

Keep sensitive data within your data center
The bottom line is ... you don't have to put your sensitive data into the public cloud. For any sensitive information that you have concerns about, keep it and the associated operations within your data center, and carefully select a set of sensitive operations to be exposed as a service. Instead of worrying about data storage protection, you concentrate the protection at the service interface level and apply proven authentication/authorization techniques.

E.g. you can run your user database within your data center and expose an authentication service to applications running in the public cloud.

Backup cloud data periodically
To make sure your data is not lost even if the cloud provider goes out of business, you should set up a periodic backup process. The backup should be in a format restorable in a different environment, and it should be transferred to and stored in a different location (e.g. a different cloud provider).

A common approach is to utilize a DB replication mechanism where data updates are propagated from the master to the slaves in an asynchronous manner. One of the slave replicas is taken offline regularly for backup purposes. The backup data is transferred to a separate environment, restored, and tested/validated. The whole process should be automated and executed periodically, based on how much data loss you can tolerate.

Run on Multiple Clouds simultaneously
An even better approach is to prepare your application to run on multiple clouds simultaneously. This forces you to isolate your application from vendor-specific features upfront, and the vendor lock-in issue disappears.

Related to security, since each cloud provider has its own set of physical resources, it is extremely unlikely that hackers or a DDoS attack can bring all of them down at the same time. Therefore you gain additional reliability and resiliency when running your application across different cloud providers.

Since your application is already sitting in multiple cloud providers' environments, you no longer need to worry about migrating your application to a different provider if the primary one goes out of business. Also, storing encrypted data with one cloud provider and your key with another is a very good idea, as the hacker now needs to compromise two cloud providers' security measures (which is very difficult) before gaining access to your sensitive data.

You can also implement your data backup strategy by asynchronously replicating data from one cloud provider to another and serving your application from the replicated data on the different providers. In this model, no data restoration is necessary when one cloud provider is inaccessible.

Friday, May 8, 2009

Machine Learning: Linear Model

The linear model is a family of model-based learning approaches that assume the output y can be expressed as a linear algebraic relation of the input attributes x1, x2 ...

Here our goal is to learn the parameters of the underlying model, which are the coefficients.

Linear Regression

Here the input and output are both real numbers, related through a simple linear relationship. The learning goal is to figure out the hidden weight value (ie: the W vector).

Given a batch of training data, we want to figure out the weight vector W such that the total sum of squared error (the difference between the predicted output and the actual output) is minimized.


Instead of using the batch processing approach, a more effective approach is to learn incrementally (updating the weight vector for each input data point) using gradient descent.

Gradient Descent

Gradient descent is a very general technique that we can use to incrementally adjust the parameters of the linear model. The basic idea of "gradient descent" is to adjust each dimension (w0, w1, w2) of the W vector according to its contribution to the squared error. The contribution is measured by the gradient along that dimension, i.e. the derivative of the squared error with respect to w0, w1, w2.

In the case of Linear Regression ...
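Since the original figure is not reproduced here, the following is a minimal sketch of the per-sample gradient descent update for linear regression (the learning rate and toy data are illustrative):

# Stochastic gradient descent: w_i := w_i + rate * (y - w.x) * x_i
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_update(w, x, y, rate=0.01):
    error = y - predict(w, x)                  # actual output minus predicted output
    return [wi + rate * error * xi for wi, xi in zip(w, x)]

# Toy data generated by y = 2*x1 + 1 (x0 is a constant 1, so w0 acts as the intercept)
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
w = [0.0, 0.0]
for _ in range(2000):
    for x, y in data:
        w = sgd_update(w, x, y)
print(w)   # approaches [1.0, 2.0]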


Logistic Regression

Logistic regression is used when the output y is binary rather than a real number. The first part is the same as linear regression, while in a second step a sigmoid function is applied to clamp the output value between 0 and 1.

We use the exact same gradient descent approach to determine the weight vector W.
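A minimal sketch, reusing the same update form with the sigmoid applied to the prediction (the toy data and learning rate are illustrative):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    # linear combination followed by the sigmoid "clamp" into (0, 1)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_update(w, x, y, rate=0.1):
    # same gradient descent update form as linear regression
    error = y - predict(w, x)
    return [wi + rate * error * xi for wi, xi in zip(w, x)]

# Toy data: the class is 1 when x1 > 2 (x0 = 1 acts as the bias term)
data = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
w = [0.0, 0.0]
for _ in range(5000):
    for x, y in data:
        w = sgd_update(w, x, y)
print(predict(w, [1.0, 0.8]), predict(w, [1.0, 3.5]))   # low probability vs. high probability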

Neural Network

Inspired by how our brain works, a neural network organizes many logistic regression units into layers of perceptrons (each unit has both inputs and outputs in binary form).

Learning in a neural network means discovering all the hidden values of w. In general, we use the same technique as above to adjust the weights with gradient descent, layer by layer. We start from the output layer and move towards the input layer (this technique is called backpropagation). Except at the output layer, we don't know the error exactly at the hidden layers, so we need a way to estimate it.

But notice there is a symmetry between the weights and the inputs: we can use the same technique we use to adjust the weights to estimate the error at the hidden layer.
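A compact sketch of a one-hidden-layer network trained with backpropagation (assuming numpy; the XOR data set, layer sizes and learning rate are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy problem: XOR, which a single logistic regression unit cannot learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
rate = 0.5

for _ in range(10000):
    # forward pass
    H = sigmoid(X @ W1 + b1)                     # hidden layer activations
    out = sigmoid(H @ W2 + b2)                   # output layer

    # backward pass: exact error at the output, estimated error at the hidden layer
    out_delta = (Y - out) * out * (1 - out)
    hidden_delta = (out_delta @ W2.T) * H * (1 - H)

    # gradient descent updates, layer by layer
    W2 += rate * H.T @ out_delta;  b2 += rate * out_delta.sum(axis=0)
    W1 += rate * X.T @ hidden_delta;  b1 += rate * hidden_delta.sum(axis=0)

print(np.round(out, 2))   # should approach [[0], [1], [1], [0]] once training converges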



Support Vector Machine

Tuesday, May 5, 2009

Machine Learning: Nearest Neighbor

This is the simplest technique for instance-based learning. Basically, find a previously seen data point that is "closest" to the query data point, and then use its output for the prediction.

The concept of "close" is defined by a distance function: dist(A, B) gives a quantity which needs to observe the triangle inequality,
ie: dist(A, B) + dist(B, C) >= dist(A, C)

Defining the distance function can be domain-specific. One popular generic distance function is the Euclidean distance.
dist(A, B) = square_root(sum_over_i(square(xai - xbi)))

In order to give each attribute the same degree of influence, you need to normalize their scales to the same range. You also need to figure out a way to compute the difference between categorical values (ie: whether "red" is more similar to "blue" or to "green"). A common approach is to see whether "red" and "blue" affect the output value in a similar way: if both colors have a similar probability distribution across the output values, then we consider the two colors similar.

Therefore you need to transform the attributes xi.
  • Normalize their scale: transform xi = (xi - mean) / std-deviation
  • Quantify categorical data: If xi is categorical, then (xai - xbi) = sum_over_k(P(class[k] | xai) – P(class[k] | xbi))
Nearest neighbor is sensitive to outliers: if you have a few abnormal data points, query points around these outliers will be wrongly estimated. One solution is to use multiple nearest neighbors and combine their outputs in a certain way. This is known as KNN (k-nearest-neighbor). If the problem is classification, every neighbor casts a vote with a weight inversely proportional to its "distance" from the query point, and the majority wins. If the problem is regression, the weighted average of their outputs is used instead.
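A minimal sketch of the weighted-vote classification version (the toy training data is illustrative):

from math import sqrt
from collections import defaultdict

def dist(a, b):
    # Euclidean distance, as defined above
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=3):
    """training: list of (attribute_vector, label); weighted vote of the k nearest points."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        votes[label] += 1.0 / (dist(x, query) + 1e-9)   # weight inversely proportional to distance
    return max(votes, key=votes.get)

training = [([1.0, 1.0], 'red'), ([1.2, 0.9], 'red'), ([5.0, 5.2], 'blue'), ([5.1, 4.8], 'blue')]
print(knn_classify(training, [1.1, 1.0]))   # 'red'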

Execution Optimization

One problem of instance-based learning is that you need to store all previously seen data and compute the distance from the query point to each of them. Both the time and space complexity of serving a single query are O(M * N), where M is the number of dimensions and N is the number of previous data points.

Instead of computing the distance between the query point and each of the existing data points, you can organize the existing points into a KD Tree based on the distance function. The KD Tree has the property that the maximum distance between two nodes is bounded by the level of their common parent.

Using the KD Tree, you navigate the tree starting at the root node. Basically, you calculate the dist(current_node, query_point)

and each of the child nodes of the current_node
dist(child_j, query_point)

Then you find the minimum of these; if the minimum is one of the children, you navigate down the tree by setting current_node to that child and repeat the process. You terminate when current_node itself is the minimum, or when there are no more child nodes.

After terminating at a particular node, that node is pretty close to the query point. You then need to explore the surrounding nodes (its siblings, the siblings' children, the parent's siblings) to locate the K nearest neighbors.

By using a KD Tree, the time complexity depends on the depth of the tree and hence is of order O(M * log N).

Note that KD Tree is not effective when the data has high dimensions (> 6).
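In practice you rarely hand-roll the KD Tree; a minimal sketch using scipy's KD Tree implementation (assuming numpy and scipy are available):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10000, 3)        # N previously seen data points, M = 3 dimensions
tree = cKDTree(points)                   # build the tree once

query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=5)   # the 5 nearest neighbors of the query point
print(indices, distances)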

Another way is to throw away some of the previously seen data if it won't affect the prediction result (especially effective for classification problems: you can keep just the data at the boundary between two different output values and throw away the interior points of a cluster whose points all have the same output value). However, if you are using KNN, throwing away some points may change the result, so a general approach is to verify that the previously seen data is still correctly predicted after throwing out various combinations of points.

Recommendation Engine

A very popular application of KNN is the recommendation engine of many e-commerce web sites, using a technique called "collaborative filtering". E.g. when an online user has purchased a book, the web site looks at other "similar" users to see what other books they have seen, and recommends those to the current user.

First of all, how do we determine which attributes of the users to capture ? This is a domain-specific question because we want to identify the attributes that are most influential. Maybe we can use static information such as the user's age, gender, city ... etc. But here let's use something more direct ... the implicit transaction information (e.g. if the user has purchased a book online, we know that he likes that book) as well as explicit rating information (e.g. the user rates a book he bought previously, so we know whether he/she likes the book or not).

Let's use a simple example to illustrate the idea. Here we have a number of users who rate a set of movies. The ratings range from 0 to 10, where 0 means the user hates it and 10 means the user likes it a lot.


The next important thing is to define the distance function. We don't want to use the ratings directly, for the following reasons.

Some nice users give an average rating of 7 while some tough users give an average rating of 5. Also, some users' rating ranges are wide while others' are narrow. We don't want these factors to affect our calculation of user similarity. We consider two users to have the same taste as long as they rate the same movies above or below their own averages by the same proportion of their rating range. Two users have different tastes if they rate the movies in opposite directions.

Let rating_i_x denote user_i's rating on movie_x.

We can use the correlation coefficient to capture this.

sum_of_product =
sum_over_x( (rating_i_x - avg(rating_i)) * (rating_j_x - avg(rating_j)) )

If this number is +ve, then user_i and user_j are moving in the same direction. If this number is -ve, then they are moving in opposite direction (negatively correlated). If this number is zero, then they are uncorrelated.

We also need to normalize by the range of each user's ratings, so we compute
root_of_product_square_sum =
square_root(sum_over_x((rating_i_x - avg(rating_i)) ** 2)) * square_root(sum_over_x((rating_j_x - avg(rating_j)) ** 2))

Define Pearson Coefficient = sum_of_product / root_of_product_square_sum

We use the Pearson coefficient to quantify the "similarity" between two users.
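A minimal sketch of this similarity computation (assuming each user's ratings are kept in a dict keyed by movie, and only movies rated by both users are compared):

from math import sqrt

def pearson(ratings_i, ratings_j):
    """ratings_i, ratings_j: dicts mapping movie -> rating for two users."""
    common = set(ratings_i) & set(ratings_j)          # movies rated by both users
    if not common:
        return 0.0
    avg_i = sum(ratings_i[m] for m in common) / len(common)
    avg_j = sum(ratings_j[m] for m in common) / len(common)
    sum_of_product = sum((ratings_i[m] - avg_i) * (ratings_j[m] - avg_j) for m in common)
    norm = sqrt(sum((ratings_i[m] - avg_i) ** 2 for m in common)) * \
           sqrt(sum((ratings_j[m] - avg_j) ** 2 for m in common))
    return sum_of_product / norm if norm else 0.0

user_a = {'movie1': 9, 'movie2': 8, 'movie3': 2}
user_b = {'movie1': 7, 'movie2': 6, 'movie3': 1}      # same taste, but a tougher rater
print(pearson(user_a, user_b))                        # close to +1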

We may also use negatively correlated users to make recommendations. For example, if user_i and user_j are negatively correlated, we can recommend the movies that user_j hates to user_i. However, this seems a bit risky, so we are not doing it here.

Monday, May 4, 2009

Machine Learning: Probabilistic Model

The probabilistic model is a very popular approach to “model-based learning”, based on Bayesian theory. Under this approach, all input attributes are binary (for now) and the output is categorical.

Here we are given a set of data with structure [x1, x2 …, y] (in this case y is the output). The learning algorithm will learn (from the training set) how to predict the output y for future unseen data.

We assume there exists a hidden probability distribution from the input attributes to the output. The goal is to learn this hidden distribution and apply it to the input attributes of a later query point to pick the class that has the maximum probability.

Making Prediction

Let's say the possible values of the output y are {class_1, class_2, class_3}. Given input [x1, x2, x3, x4], we need to compute the probability of each output class_j and predict the one which has the highest value.
max {P(class_j | observed_attributes)}

According to Bayes' theorem, this value is equal to …
max { P(observed_attributes | class_j) * P(class_j) / P(observed_attributes) }

The denominator is the same for all class_j, so we can ignore it; we just need to find
max { P(observed_attributes | class_j) * P(class_j) }

P(class_j) is easy to find; we just calculate
P(class_j) = (samples_of_class_j / total samples)

Now let's look at the other term, P(observed_attributes | class_j). Using the chain rule of probability,
P(x1 ^ x2 ^ x3 ^ x4 | class_j) =
P(x1 | class_j) *
P(x2 | class_j ^ x1) *
P(x3 | class_j ^ x1 ^ x2) *
P(x4 | class_j ^ x1 ^ x2 ^ x3)

Learning the probability distribution

In order to provide all the above terms for the prediction, we need to build the probability distribution model by observing the training data set.

Notice that finding the last term is difficult. Assuming x1, x2 ... are binary, there can be 2 ** (m - 1) possible combinations to observe. It is very likely that we haven't seen enough of these combinations in the training data, in which case this term will be zero for some class_j. A common solution is to start with count = 1 for every possible combination and increase the count whenever we see an occurrence in the training data.


Bayesian Network

A Bayesian network is based on the fact that we know certain attributes are clearly independent. By applying this domain knowledge, we draw a dependency graph between attributes. The probability of occurrence of a particular node depends only on the occurrence of its parent nodes and nothing else. To be more precise, nodeA and nodeB (which are not related by a parent-child relationship) don't need to be completely independent; they just need to be independent given their parents.

In other words, we don't mean P(A|B) = P(A),
we just need P(A | parentsOfA ^ B) = P(A | parentsOfA)
Therefore P(x4 | class_j ^ x1 ^ x2 ^ x3) = P(x4 | class_j ^ parentsOf_x4)

We only need to observe 2 ** p combinations of occurrences of the parent nodes, where p is the number of parent nodes.


Naive Bayes

Naive Bayes takes this a step further by assuming every node is completely independent given the class.


Note that x1, x2, x3, x4 can be categorical as well. For example, if x1 represents zip code, then P('95110' | class_j) is the same as P(x1 = '95110' | class_j).

What if x1 is a continuous variable ? (Say the height of a person.) The challenge is that a continuous variable has infinitely many possible values, so the chance of seeing any particular one in the training data is almost zero.

One way to deal with a continuous variable is to discretize it. In other words, we can partition x1 into buckets, each with an associated range, and assign x1 to the corresponding bucket when it falls into that range.

Another way is, for each output class_j, to presume a distribution function for x1 (let's say we pick the normal distribution). The purpose of the training phase is then to figure out the parameters of this distribution function (in this case, the mean and standard deviation).

In other words, we use the subset of training data whose output class is class_j. Within this subset, we compute the mean and standard deviation of x1 (the height of the person). Later on, when we want to compute P(x1 = 6.1 feet | class_j), we just apply the distribution function (plugging in the learned mean and standard deviation).

Note that the choice of the form of the distribution function is quite arbitrary here and it may be simply wrong. So it is important to analyze how x1 affects the output class in order to pick the right distribution function.
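A minimal sketch of the normal-distribution treatment of a continuous attribute (the height values are made-up illustrative data):

from statistics import mean, stdev
from math import exp, pi, sqrt

# Heights (in feet) of the training samples whose output class is class_j
heights_in_class_j = [5.8, 6.0, 6.2, 5.9, 6.1]
mu, sigma = mean(heights_in_class_j), stdev(heights_in_class_j)

def gaussian_likelihood(x, mu, sigma):
    # normal probability density, used as P(x1 = x | class_j)
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(gaussian_likelihood(6.1, mu, sigma))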


Spam Filter Application

Let's walk through an application of the naive Bayes approach. Here we want to classify a particular email to determine whether it is spam or not.

The possible values of the output y are {spam, nonspam}.

Given an email: "Hi, how are you doing ?", we need to find ...
max of P(spam | mail) and P(nonspam | mail), which is the same as ...
max of P(mail | spam) * P(spam) and P(mail | nonspam) * P(nonspam)

Let's focus on P(mail | spam) * P(spam).

We can view mail as an array of words [w1, w2, w3, w4, w5]
P(mail | spam) =
P(w1='hi' | spam) *
P(w2='how' | spam ^ w1='hi') *
P(w3='are' | spam ^ w1='hi' ^ w2='how') *
...
We make some naive assumptions here
  • Chance of occurrence is independent of preceding words. In other words, P(w2='how' | spam ^ w1='hi') is the same as P(w2='how' | spam)
  • Chance of occurrence is independent of word position. P(w2='how' | spam) is the same as (number of 'how' in spam mail) / (number of all words in spam mail)
With these assumptions ...
P(mail | spam) =
(hi_count_in_spam / total_words_in_spam) *
(how_count_in_spam / total_words_in_spam) *
(are_count_in_spam / total_words_in_spam) *
...
What if we haven't seen the word "how" in the training data ? Then the probability becomes zero. Here we adjust the terms so that a reasonable probability is assigned to unseen words (this is add-one, or Laplace, smoothing).
P(mail | spam) =
((hi_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
((how_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
((are_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
...

What we need is the word count per word/class combination as well as the total word count per class. This can be done by feeding a large number of training sample mails labeled "spam" or "nonspam" into a learning process.
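A minimal sketch of this counting and scoring, using logs to avoid underflow (the tiny training corpus is made up for illustration):

from collections import Counter
from math import log

spam_mails = ["win money now", "free money offer", "win a free prize"]
nonspam_mails = ["hi how are you doing", "are you coming to the meeting"]

def word_counts(mails):
    counts = Counter()
    for mail in mails:
        counts.update(mail.split())
    return counts

spam_counts, nonspam_counts = word_counts(spam_mails), word_counts(nonspam_mails)
vocabulary = set(spam_counts) | set(nonspam_counts)

def log_score(mail, counts, class_prior):
    # log(P(mail | class) * P(class)) with add-one (Laplace) smoothing for unseen words
    total_words = sum(counts.values())
    score = log(class_prior)
    for word in mail.split():
        score += log((counts[word] + 1) / (total_words + len(vocabulary)))
    return score

mail = "hi how are you"
p_spam = len(spam_mails) / (len(spam_mails) + len(nonspam_mails))
spam_score = log_score(mail, spam_counts, p_spam)
nonspam_score = log_score(mail, nonspam_counts, 1 - p_spam)
print("spam" if spam_score > nonspam_score else "nonspam")   # nonspam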

The learning process can also be done in parallel using 2 rounds of Map/Reduce.


Alternatively, we can also update the counts incrementally as new mail arrives.

Saturday, May 2, 2009

Machine Learning Intuition

As more and more user data is gathered on different web sites (such as e-commerce and social networks), data mining / machine learning techniques become an increasingly important tool to analyze it and extract useful information out of it.

There is a wide variety of machine learning applications, such as …
  • Recommendation: After you buy a book at Amazon or rent a movie from Netflix, they recommend other items that you may be interested in.
  • Fraud detection: To protect its buyers and sellers, an auction site like eBay detects abnormal patterns to identify fraudulent transactions.
  • Market segmentation: Product companies divide their market into segments of similar potential customers and design a specific marketing campaign for each segment.
  • Social network analysis: By analyzing users' social network profile data, a social networking site like Facebook can categorize its users and personalize their experience.
  • Medical research: Analyzing DNA patterns, Cancer research, Diagnose problem from symptoms
However, machine learning theory involves a lot of math, which is non-trivial for people who don’t have a rigorous math background. Therefore, I am trying to provide the intuition behind the math.

General Problem

Each piece of data can be represented as a vector [x1, x2, …] where xi are the attributes of the data.

Such attributes can be numeric or categorical. (e.g. age is a numeric attribute and gender is a categorical attribute)

There are basically 3 branches of machine learning ...

Supervised learning
  • The main use of supervised learning is to predict an output based on a set of training data. A set of data with structure [x1, x2 …, y] is presented (in this case y is the output). The learning algorithm will learn (from the training set) how to predict the output y for future unseen data.
  • When y is numeric, the prediction is called regression. When y is categorical, the prediction is called classification.

Unsupervised learning
  • The main use of unsupervised learning is to discover unknown patterns within data. (e.g. grouping similar data, or detecting outliers).
  • Identifying clusters is a classical scenario of unsupervised learning

Reinforcement learning
  • This is also known as “continuous learning”, where the final output is not given. The agent chooses an action based on its current state and is then presented with a reward. The agent learns how to maximize its reward and comes up with a model called the “optimal policy”. A policy is a mapping from “state” to “action” (given that I am in a particular state, what action should I take).

Data Warehouse

A data warehouse is not “machine learning”; it is basically a special way to store your data so that it can easily be grouped in many ways for manual analysis.

Typically, data is created in OLTP systems, which run the company’s business operations. OLTP captures the “latest state” of the company. Data is periodically snapshotted into the data warehouse for OLAP; in other words, the data warehouse adds a time dimension to the data.

There is an ETL process that extracts data from various sources, cleanses the data, transforms it into the form needed by the data warehouse, and then loads it into the data cube.

The data warehouse typically organizes the data as a multi-dimensional data cube based on a "star schema" (1 fact table + N dimension tables). Each cell contains data aggregated along a different combination of dimensions.


OLAP processing involves the following operations
  • Rollup: Aggregate data within a particular dimension. (e.g. For the “time” dimension, you can “rollup” the aggregation from “month” into “quarter”)
  • Drilldown: Break down the data within a particular dimension (e.g. For the “time” dimension, you can “drilldown” from “months” into “days”)
  • Slice: Cut a layer out of a particular dimension (e.g. Look at all data at “Feb”)
  • Dice: Select a sub data cube (e.g. Look at all data at “Jan” and “Feb” as well as product “laptop” and “hard disk”)
The data warehouse can be further diced into specific “data marts” that focus on different areas for further drilldown analysis.

Some Philosophy

To determine the output from a set of input attributes, one way is to study the physics behind them and write a function that transforms the input attributes into the output. However, what if the relationship is unknown ? Or the relationship hasn’t been formally specified ?

Instead of relying on a sound theoretical model, machine learning tries to make predictions based on previously observed data. There are 2 broad types of learning strategies.

Instance-based learning
  • Also known as lazy learning: the learner remembers all the previously seen examples. When a new piece of input data arrives, it tries to find the best-matching data it has previously seen and uses its output to predict the output of the new data. The underlying assumption is that if two pieces of data are “similar” in their input attributes, their outputs are also similar.
  • Nearest neighbor is a classical approach for instance-based learning

Model-based learning
  • Also known as eager learning: the learner builds a generalized model upfront, whereas lazy learning learns from seen examples at query time. Instead of learning a generic model that fits all observed data, lazy learning can focus its analysis close to the query point. However, getting a comprehensible model with lazy learning is harder, and it also requires a large amount of memory to store all the seen data.