In this post, I will discuss a recommendation algorithm that considers diversity in its list of recommendations.

### Topic Space

First of all, let's define a "topic space" onto which both the content and the user will be mapped. Having a "topic space" is a common approach in recommendation because it reduces dimensionality, resulting in better system performance and improved generality. The set of topics in the topic space can be extracted algorithmically using text-mining techniques such as LDA, but for simplicity we will define the topic space manually here (topics should be orthogonal to each other, as highly correlated topics can distort the measures). Let's say we have the following topics defined ...

- Romance
- Sports
- Politics
- Weather
- Horror
- ...

### Content as Vector of topic weights

Once the topic space is defined, content authors can assign topic weights to each piece of content. For example, a movie can be assigned genres, and a web page can be assigned topics as well. Notice that a single piece of content can be assigned multiple topics with different weights. In other words, each piece of content can be described as a vector of topic weights.
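As a concrete sketch, a piece of content can be stored as a sparse map of topic weights and expanded into a dense vector over the topic space. The topic names come from the list above; the movie and its weights are purely illustrative:

```python
# Topic space from the section above; the order fixes the vector layout.
TOPICS = ["Romance", "Sports", "Politics", "Weather", "Horror"]

def to_vector(topic_weights, topics=TOPICS):
    """Expand a sparse {topic: weight} map into a dense topic-weight vector."""
    return [topic_weights.get(t, 0.0) for t in topics]

# A romantic sports movie: one piece of content, multiple weighted topics.
movie = {"Romance": 0.8, "Sports": 0.2}
print(to_vector(movie))  # [0.8, 0.2, 0.0, 0.0, 0.0]
```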

### User as Vector of topic weights

On the other hand, a user can also be represented as a vector of topic weights, based on their interactions with content, such as viewing a movie, visiting a web page, buying a product, etc. Such an interaction can have a positive or negative effect depending on whether the user likes or dislikes the content. If the user likes the content, the corresponding topic weights in the user vector are increased by multiplying them by alpha (alpha > 1). If the user dislikes the content, the corresponding topic weights are divided by alpha. After each update, the user vector is normalized back to a unit vector.
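A minimal sketch of this update rule, assuming dense topic vectors and an illustrative alpha of 1.2 (any alpha > 1 works):

```python
import math

def update_user(user_vec, content_vec, liked, alpha=1.2):
    """Boost (like) or dampen (dislike) the user's weight for every topic
    the content carries, then renormalize to a unit vector."""
    updated = [
        (u * alpha if liked else u / alpha) if c > 0 else u
        for u, c in zip(user_vec, content_vec)
    ]
    norm = math.sqrt(sum(x * x for x in updated)) or 1.0
    return [x / norm for x in updated]

user = [0.5, 0.5, 0.5, 0.5, 0.0]      # unit vector over 5 topics
content = [0.8, 0.2, 0.0, 0.0, 0.0]   # Romance/Sports content
user = update_user(user, content, liked=True)
# The Romance and Sports weights now exceed the Politics and Weather
# weights, and the vector is back to unit length.
```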

### Diversifying the recommendation

We define a utility function over a set of documents A that models its diversity (as well as its relevance), and then maximize this utility function. In practice, A is not computed from the full set of documents, which is usually huge. Instead, the full set of documents is typically indexed with some kind of inverted-index technology, using the set of topics as keywords, with each topic weight c[j,k] represented as a tf-idf score.

The user is represented as a "query" and is sent to the inverted index as a search request. Relevant documents (ranked by cosine similarity with respect to the user vector) are returned as a candidate set D (e.g. the top 200 relevant documents).
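This retrieval step can be sketched as a brute-force cosine ranking over an in-memory corpus (a real system would push the query through the inverted index instead; the documents, weights, and `top_k` below are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def candidate_set(user_vec, docs, top_k=200):
    """Rank documents by cosine similarity to the user vector and keep
    the top_k doc ids as the candidate set D."""
    ranked = sorted(docs, key=lambda d: cosine(user_vec, docs[d]), reverse=True)
    return ranked[:top_k]

docs = {
    "d1": [0.9, 0.1, 0.0, 0.0, 0.0],  # mostly Romance
    "d2": [0.0, 0.0, 1.0, 0.0, 0.0],  # Politics
    "d3": [0.7, 0.0, 0.0, 0.0, 0.3],  # Romance with some Horror
}
user = [1.0, 0.0, 0.0, 0.0, 0.0]      # a pure Romance fan
print(candidate_set(user, docs, top_k=2))  # ['d1', 'd3']
```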

To pick the optimal set A out of D, we use a greedy approach as follows ...

- Start with an empty set A
- Repeat until |A| = H
  - Pick a doc i from D such that adding it to the set A maximizes the utility function
  - Add doc i to A and remove doc i from D
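The post leaves the exact utility function unspecified, so the sketch below assumes an MMR-style utility (relevance of the picked documents to the user, minus a penalty for pairwise redundancy among them), which is one common way to trade relevance off against diversity. The documents, `trade_off` weight, and H value are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def utility(selected, user_vec, docs, trade_off=0.7):
    """Assumed MMR-style utility: relevance of the picked docs to the user,
    penalized by their pairwise similarity (redundancy)."""
    relevance = sum(cosine(user_vec, docs[i]) for i in selected)
    redundancy = sum(cosine(docs[i], docs[j])
                     for i in selected for j in selected if i < j)
    return trade_off * relevance - (1.0 - trade_off) * redundancy

def greedy_select(user_vec, docs, H, trade_off=0.7):
    """Greedily grow A: at each step add the doc from D whose inclusion
    maximizes the utility of the enlarged set, until |A| = H."""
    A, D = [], set(docs)
    while len(A) < H and D:
        best = max(D, key=lambda i: utility(A + [i], user_vec, docs, trade_off))
        A.append(best)
        D.remove(best)
    return A

docs = {
    "d1": [0.9, 0.1, 0.0, 0.0, 0.0],
    "d2": [0.0, 0.0, 1.0, 0.0, 0.0],
    "d3": [0.7, 0.0, 0.0, 0.0, 0.3],
}
user = [1.0, 0.0, 0.0, 0.0, 0.0]
print(greedy_select(user, docs, H=2))
```

The first pick is always the single most relevant document; later picks balance relevance against overlap with what is already in A, which is where the diversity comes from.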