Pragmatic Programming Techniques: Machine Learning: Nearest Neighbor

This is the simplest technique for instance-based learning. Basically, find the previously seen data point that is "closest" to the query data point, and then use its known output as the prediction.

The concept of "close" is defined by a distance function: dist(A, B) gives a quantity that needs to observe the triangle inequality,
i.e.: dist(A, B) + dist(B, C) >= dist(A, C)

Defining the distance function can be domain-specific. One popular generic choice is the Euclidean distance, where xai and xbi are the i-th attributes of A and B.
dist(A, B) = square_root(sum_over_i(square(xai - xbi)))
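
As a minimal sketch of the idea so far, the following Python computes the Euclidean distance and returns the output of the single closest stored point. The function names and the toy data are illustrative assumptions, not part of any library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor(training, query, dist=euclidean):
    """Return the output of the stored point closest to `query`.

    `training` is a list of (features, output) pairs.
    """
    _, output = min(training, key=lambda pair: dist(pair[0], query))
    return output

# Toy data (made up for illustration): three previously seen points.
training = [((1.0, 1.0), "red"), ((5.0, 5.0), "blue"), ((9.0, 1.0), "green")]
print(nearest_neighbor(training, (4.0, 6.0)))  # closest stored point is (5.0, 5.0)
```

Passing the distance function as a parameter keeps the search independent of how "close" is defined for a given domain.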

In order to give each attribute the same degree of influence, you need to normalize their scales to the same range. You also need a way to compute the difference between categorical values (i.e., whether "red" is more similar to "blue" or to "green"). A common approach is to see whether "red" and "blue" affect the output value in a similar way. If both colors have a similar probability distribution across the output values, then we consider the two colors similar.

Therefore you need to transform the attributes xi.

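Normalize their scale: transform xi = (xi - mean) / std-deviation
Quantify categorical data: If xi is categorical, then (xai - xbi) = sum_over_k(abs(P(class[k] | xai) - P(class[k] | xbi)))
The absolute value matters here: each conditional distribution sums to 1, so the signed differences would always cancel out to zero.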
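
The two transformations can be sketched in Python as follows. This is only an illustration under stated assumptions: the probabilities P(class | value) are estimated from simple counts, the population standard deviation is used, each per-class difference is taken in absolute value (signed differences would cancel to zero), and all names and data are made up.

```python
from collections import Counter, defaultdict
import statistics

def zscore(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population std-dev, must be non-zero
    return [(v - mean) / std for v in values]

def class_distribution(values, labels):
    """Estimate P(class | value) for each categorical value from counts."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    return {
        value: {label: n / sum(c.values()) for label, n in c.items()}
        for value, c in counts.items()
    }

def categorical_diff(p_class, a, b, classes):
    """Sum over classes of |P(class | a) - P(class | b)|."""
    return sum(abs(p_class[a].get(k, 0.0) - p_class[b].get(k, 0.0)) for k in classes)

# Toy data: a color attribute and a yes/no output.
colors = ["red", "red", "blue", "blue", "green", "green"]
labels = ["yes", "yes", "yes", "no", "no", "no"]
p_class = class_distribution(colors, labels)
classes = sorted(set(labels))
print(categorical_diff(p_class, "red", "blue", classes))   # red vs. blue
print(categorical_diff(p_class, "red", "green", classes))  # red vs. green
print(zscore([2.0, 4.0, 6.0]))
```

In this toy data "red" always leads to "yes" and "green" always leads to "no", so "red" comes out closer to "blue" (which is split evenly) than to "green".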

Nearest neighbor is sensitive to outliers: if you have a few abnormal data points, query points around these outliers will be wrongly estimated. One solution is to use multiple nearest neighbors and combine their outputs in a certain way. This is known as KNN (k-nearest-neighbor). If the problem is classification, every neighbor casts a vote with a weight inversely proportional to its "distance" from the query point, and the output value with the largest total weight wins. If the problem is regression, the weighted average of their outputs is used instead.
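
A rough sketch of distance-weighted KNN for both cases is shown here. The small constant eps is an assumption added to avoid dividing by zero when a stored point coincides with the query; the function names and toy data are likewise illustrative.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(training, query, k):
    """Return the k (distance, output) pairs closest to `query`."""
    scored = [(euclidean(features, query), output) for features, output in training]
    return sorted(scored, key=lambda pair: pair[0])[:k]

def knn_classify(training, query, k=3, eps=1e-9):
    """Each neighbor votes with weight 1 / distance; the heaviest label wins."""
    votes = defaultdict(float)
    for d, label in k_nearest(training, query, k):
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)

def knn_regress(training, query, k=3, eps=1e-9):
    """Weighted average of the neighbors' outputs, weights 1 / distance."""
    neighbors = k_nearest(training, query, k)
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Toy data: the "b" point at (1.0, 0.0) plays the outlier near two "a" points.
training = [((0.0, 1.5), "a"), ((-1.5, 0.0), "a"), ((1.0, 0.0), "b"), ((10.0, 10.0), "b")]
print(knn_classify(training, (0.0, 0.0), k=1))  # follows the single closest point
print(knn_classify(training, (0.0, 0.0), k=3))  # the two "a" neighbors outweigh it

training_r = [((0.0,), 0.0), ((2.0,), 10.0), ((10.0,), 100.0)]
print(knn_regress(training_r, (1.0,), k=2))
```

With k=1 the query follows the outlier, while with k=3 the combined weight of the two slightly farther "a" points wins.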

Execution Optimization

One problem of instance-based learning is that you need to store all previously seen data and also compute the distance from the query point to each of them. Both the time and space complexity of serving a single query are O(M * N), where M is the number of dimensions and N is the number of previous data points.

Instead of computing the distance from the query point to each of the existing data points, you can organize the existing points into a KD Tree based on the distance function. The KD Tree has the property that the max distance between two nodes is bounded by the level of their common parent.

Using the KD Tree, you navigate the tree starting at the root node. Basically, you calculate the distance from the query point to the current node
dist(current_node, query_point)

and to each of the child nodes of the current_node
dist(child_j, query_point)

and then find the minimum of them. If the minimum is one of the children, you navigate down the tree by setting current_node to this child and repeat the process. You terminate if the current_node is the minimum, or when there are no more child nodes.

After terminating at a particular node, this node is pretty close to the query point. You then need to explore the nodes surrounding it (its siblings, its siblings' children, its parent's siblings) to locate the K nearest neighbors.

By using a KD Tree, the time complexity depends on the depth of the tree and hence is of order O(M * log N).

Note that a KD Tree is not effective when the data has many dimensions (> 6).
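
For illustration, here is a small KD Tree in Python. It uses the common textbook construction, which is an assumption on my part and not exactly the descent described above: each level splits the points at the median of one coordinate, and the search backtracks into the far branch only when the splitting plane is closer than the best point found so far. It finds a single nearest neighbor, and all names are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def build_kdtree(points, depth=0):
    """Recursively split `points` at the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = euclidean(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    # Signed offset of the query from this node's splitting plane.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the plane is closer than the best match so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # the closest stored point is (8, 1)
```

The pruning test is what saves work: whole subtrees are skipped when they cannot contain anything closer than the current best. Extending the search to K neighbors would mean keeping the K best candidates instead of one.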

Another way is to throw away some of the previously seen data if they won't affect the resulting prediction. This is especially effective for classification problems: you can keep just the data at the boundary between two different output values and throw away the interior points of a cluster whose data points all have the same output value. However, if you are using KNN, then throwing away some points may change the result. So a general approach is to verify that the previously seen data are still correctly predicted after throwing out various combinations of points.
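
One simple way to sketch this pruning is shown here, in the style of Hart's condensed nearest neighbor rule. Note that this swaps in a different procedure from the one described above: it grows a kept set from one point instead of deleting from the full set, and it only verifies 1-NN predictions. It also assumes no two identical points carry different labels. All names and data are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_1nn(kept, query):
    """Output of the kept point closest to `query`."""
    return min(kept, key=lambda pair: euclidean(pair[0], query))[1]

def condense(training):
    """Keep a subset that still predicts every training point correctly with 1-NN."""
    kept = [training[0]]
    changed = True
    while changed:
        changed = False
        for features, label in training:
            # Add any point the current subset gets wrong, then re-verify everything.
            if (features, label) not in kept and predict_1nn(kept, features) != label:
                kept.append((features, label))
                changed = True
    return kept

# Toy data: two well-separated one-dimensional clusters.
training = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"),
            ((7.0,), "b"), ((8.0,), "b"), ((9.0,), "b"), ((10.0,), "b")]
kept = condense(training)
print(len(kept), "of", len(training), "points kept")
```

The loop repeats until a full pass adds nothing, which is the verification step: every previously seen point is still predicted correctly from the reduced set.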

Recommendation Engine

A very popular application of KNN is the recommendation engine of many e-commerce web sites, using a technique called "Collaborative Filtering". E.g., when an online user has purchased a book, the web site looks at other "similar" users to see what other books they have seen and recommends those to the current user.

First of all, how do we determine which attributes of the users to capture? This is a domain-specific question because we want to identify the attributes that are most influential. Maybe we can use static information such as the user's age, gender, city, etc. But here let's use something more direct: the implicit transaction information (e.g. if the user has purchased a book online, we know that he/she likes that book) as well as explicit rating information (e.g. the user rates a book he/she bought previously, so we know whether he/she likes the book or not).

Let's use a simple example to illustrate the idea. Here we have a number of users who rate a set of movies. The ratings range from 0 to 10, where 0 means the user hates the movie and 10 means the user likes it extremely.

The next important thing is to define the distance function. We don't want to use the ratings directly, for the following reasons.

Some nice users give an average rating of 7 while some tough users give an average rating of 5. Also, the range of ratings is wide for some users and narrow for others. We don't want these factors to affect our calculation of user similarity. We consider two users to have the same taste as long as they rate the same movie above or below their own average by the same percentage of their rating range. Two users have different tastes if they rate the movies in different directions.

Let rating_i_x denote user_i's rating of movie_x.

We can use the correlation coefficient to capture this.

sum_of_product =
sum_over_x( (rating_i_x - avg(rating_i)) * (rating_j_x - avg(rating_j)) )

If this number is positive, then user_i and user_j are moving in the same direction (positively correlated). If it is negative, then they are moving in opposite directions (negatively correlated). If it is zero, then they are uncorrelated.

We also need to normalize it by the spread of each user's ratings, so we compute
root_of_product_square_sum =
square_root( sum_over_x((rating_i_x - avg(rating_i)) **2) * sum_over_x((rating_j_x - avg(rating_j)) **2) )

Define Pearson Coefficient = sum_of_product / root_of_product_square_sum

We use the Pearson Coefficient to quantify the "similarity" between 2 users.
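
A possible Python sketch of this similarity measure follows. It uses the standard Pearson denominator (the square root of the product of the two users' sums of squared deviations). Two choices here are my assumptions and not from the text above: the averages are taken only over the movies both users have rated, and 0.0 is returned when there is too little overlap or one user's ratings do not vary. The user names and ratings are made up.

```python
import math

def pearson(ratings_i, ratings_j):
    """Pearson coefficient between two users; each argument maps movie -> rating."""
    common = sorted(set(ratings_i) & set(ratings_j))  # movies rated by both users
    if len(common) < 2:
        return 0.0  # assumption: too little overlap to say anything
    avg_i = sum(ratings_i[x] for x in common) / len(common)
    avg_j = sum(ratings_j[x] for x in common) / len(common)
    dev_i = [ratings_i[x] - avg_i for x in common]
    dev_j = [ratings_j[x] - avg_j for x in common]
    sum_of_product = sum(a * b for a, b in zip(dev_i, dev_j))
    root_of_product_square_sum = math.sqrt(
        sum(a * a for a in dev_i) * sum(b * b for b in dev_j)
    )
    if root_of_product_square_sum == 0:
        return 0.0  # assumption: a user with constant ratings shows no direction
    return sum_of_product / root_of_product_square_sum

# Toy ratings: a "nice" user, a "tough" user with the same taste, and a contrarian.
nice = {"A": 9, "B": 7, "C": 5}
tough = {"A": 6, "B": 5, "C": 4}
contrarian = {"A": 2, "B": 5, "C": 8}
print(pearson(nice, tough))       # same direction despite different averages and ranges
print(pearson(nice, contrarian))  # opposite direction
```

The nice and tough users differ in both average and range, yet they come out as perfectly correlated because they deviate from their own averages in the same direction and proportion.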

We may also use negatively correlated users to make recommendations. For example, if user_i and user_j are negatively correlated, then we can recommend the movies that user_j hates to user_i. However, this seems a bit risky, so we are not doing it here.

Tuesday, May 5, 2009
