Saturday, February 23, 2013

Text processing (part 2): Inverted Index

This is the second part of my text processing series.  In this blog, we'll look into how text documents can be stored in a form that can be easily retrieved by a query.  I'll used the popular open source Apache Lucene index for illustration.

There are two main processing flow in the system ...
  • Document indexing: Given a document, add it into the index
  • Document retrieval: Given a query, retrieve the most relevant documents from the index.
The following diagram illustrate how this is done in Lucene.


Index Structure

Both documents and query is represented as a bag of words.  In Apache Lucene, "Document" is the basic unit for storage and retrieval.  A "Document" contains multiple "Fields" (also call zones).  Each "Field" contains multiple "Terms" (equivalent to words).

To control how the document will be indexed across its containing fields, a Field can be declared in multiple ways to specified whether it should be analyzed (a pre-processing step during index), indexed (participate in the index) or stored (in case it needs to be returned in query result).
  • Keyword (Not analyzed, Indexed, Stored)
  • Unindexed (Not analyzed, Not indexed, Stored)
  • Unstored (Analyzed, Indexed, Not stored)
  • Text (Analyzed, Indexed, Stored)
The inverted index is a core data structure of the storage.  It is organized as an inverted manner from terms to the list of documents (which contain the term).  The list (known as posting list) is ordered by a global ordering (typically by document id).  To enable faster retrieval, the list is not just a single list but a hierarchy of skip lists.  For simplicity, we ignore the skip list in subsequent discussion.

This data structure is illustration below based on Lucene's implementation.  It is stored on disk as segment files which will be brought to memory during the processing.


The above diagram only shows the inverted index.  The whole index contain an additional forward index as follows.



Document indexing

Document in its raw form is extracted from a data adaptor.  (this can be making an Web API to retrieve some text output, or crawl a web page, or receiving an HTTP document upload).  This can be done in a batch or online manner.

When the index processing start, it parses each raw document and analyze its text content.  The typical steps includes ...
  • Tokenize the document (breakdown into words)
  • Lowercase each word (to make it non-case-sensitive, but need to be careful with names or abbreviations)
  • Remove stop words (take out high frequency words like "the", "a", but need to careful with phrases)
  • Stemming (normalize different form of the same word, e.g. reduce "run", "running", "ran" into "run")
  • Synonym handling.  This can be done in two ways.  Either expand the term to include its synonyms (ie: if the term is "huge", add "gigantic" and "big"), or reduce the term to a normalized synonym (ie: if the term is "gigantic" or "huge", change it to "big")
At this point, the document is composed with multiple terms.  doc = [term1, term2 ...].  Optionally, terms can be further combined into n-grams.  After that we count the term frequency of this document.  For example, in a bi-gram expansion, the document will become ...
doc1 -> {term1: 5, term2: 8, term3: 4, term1_2: 3, term2_3:1}

We may also compute a "static score" based on some measure of quality of the document.  After that, we insert the document into the posting list (if it exist, otherwise create a new posting list) for each terms (all n-grams), this will create the inverted list structure as shown in previous diagram.

There is a boost factor that can be set to the document or field.  The boosting factor effectively multiply the term frequency which effectively affecting the importance of the document or field.

Document can be added to the index in one of the following ways; inserted, modified and deleted.
Typically the document will first added to the memory buffer, which is organized as an inverted index in RAM.
  • When this is a document insertion, it goes through the normal indexing process (as I described above) to analyze the document and build an inverted list in RAM.
  • When this is a document deletion (the client request only contains the doc id), it fetches the forward index to extract the document content, then goes through the normal indexing process to analyze the document and build the inverted list.  But in this case the doc object in the inverted list is labeled as "deleted".
  • When this is a document update (the client request contains the modified document), it is handled as a deletion followed by an insertion, which means the system first fetch the old document from the forward index to build an inverted list with nodes marked "deleted", and then build a new inverted list from the modified document.  (e.g. If doc1 = "A B" is update to "A C", then the posting list will be {A:doc1(deleted) -> doc1, B:doc1(deleted), C:doc1}.  After collapsing A, the posting list will be {A:doc1, B:doc1(deleted), C:doc1}

As more and more document are inserted into the memory buffer, it will become full and will be flushed to a segment file on disk.  In the background, when M segments files have been accumulated, Lucene merges them into bigger segment files.  Notice that the size of segment files at each level is exponentially increased (M, M^2, M^3).  This maintains the number of segment files that need to be search per query to be at the O(logN) complexity where N is the number of documents in the index.  Lucene also provide an explicit "optimize" call that merges all the segment files into one.




Here lets detail a bit on the merging process, since the posting list is already vertically ordered by terms and horizontally ordered by doc id, merging two segment files S1, S2 is basically as follows
  • Walk the posting list from both S1 and S2 together in sorted term order.  For those non-common terms (term that appears in one of S1 or S2 but not both), write out the posting list to a new segment S3.
  • Until we find a common term T, we merge the corresponding posting list from these 2 segments. Since both list are sorted by doc id, we just walk down both posting list to write out the doc object to a new posting list.  When both posting lists have the same doc (which is the case when the document is updated or deleted), we pick the latest doc based on time order.
  • Finally, the doc frequency of each posting list (of the corresponding term) will be computed.

Document retrieval

Consider a document is a vector (each term as the separated dimension and the corresponding value is the tf-idf value) and the query is also a vector.  The document retrieval problem can be defined as finding the top-k most similar document that match a query, where similarity is defined as the dot-product or cosine distance between the document vector and the query vector.

tf-idf is a normalized frequency.  TF (term frequency) represents how many time the term appears in the document (usually a compression function such as square root or logarithm is applied).  IDF is the inverse of document frequency which is used to discount the significance if that term appears in many other documents.  There are many variants of TF-IDF but generally it reflects the strength of association of the document (or query) with each term.

Given a query Q containing terms [t1, t2], here is how we fetch the corresponding documents.  A common approach is the "document at a time approach" where we traverse the posting list of t1, t2 concurrently (as opposed to the "term at a time" approach where we traverse the whole posting list of t1 before we start the posting list of t2).  The traversal process is described as follows ...
  • For each term t1, t2 in query, we identify all the corresponding posting lists.
  • We walk each posting list concurrently to return a sequence of documents (ordered by doc id).  Notice that each return document contains at least one term but can also also contain multiple terms.
  • We compute the dynamic score which is dot product of the query to document vector.  Notice that we typically don't concern the TF/IDF of the query (which is short and we don't care the frequency of each term).  Therefore we can just compute the sum up all the TF score of the posting list that has a match term after dividing the IDF score (at the head of each posting list).  Lucene also support query level boosting where a boost factor can be attached to the query terms.  The boost factor will multiply the term frequency correspondingly.
  • We also look up the static score which is purely based on the document (but not the query).  The total score is a linear combination of static and dynamic score.
  • Although the score we used in above calculation is based on computing the cosine distance between the query and document, we are not restricted to that.  We can plug in any similarity function that make sense to the domain.  (e.g. we can use machine learning to train a model to score the similarity between a query and a document).
  • After we compute a total score, we insert the document into a heap data structure where the topK scored document is maintained.
Here the whole posting list will be traversed.  In case of the posting list is very long, the response time latency will be long.  Is there a way that we don't have to traverse the whole list and still be able to find the approximate top K documents ?  There are a couple strategies we can consider.
  1. Static Score Posting Order: Notice that the posting list is sorted based on a global order, this global ordering provide a monotonic increasing document id during the traversal that is important to support the "document at a time" traversal because it is impossible to visit the same document again.  This global ordering, however, can be quite arbitrary and doesn't have to be the document id.  So we can pick the order to be based on the static score (e.g. quality indicator of the document) which is global.  The idea is that we traverse the posting list in decreasing magnitude of static score, so we are more likely to visit the document with the higher total score (static + dynamic score).
  2. Cut frequent terms: We do not traverse the posting list whose term has a low IDF value (ie: the term appears in many documents and therefore the posting list tends to be long).  This way we avoid to traverse the long posting list.
  3. TopR list: For each posting list, we create an extra posting list which contains the top R documents who has the highest TF (term frequency) in the original list.  When we perform the search, we perform our search in this topR list instead of the original posting list.

Since we have multiple inverted index (in memory buffer as well as the segment files at different levels), we need to combine the result them.  If termX appears in both segmentA and segmentB, then the fresher version will be picked.  The fresher version is determine as follows; the segment with a lower level (smaller size) will be considered more fresh.  If the two segment files are at the same level, then the one with a higher number is more fresh.  On the other hand, the IDF value will be the sum of the corresponding IDF of each posting list in the segment file (the value will be slightly off if the same document has been updated, but such discrepancy is negligible).  However, the processing of consolidating multiple segment files incur processing overhead in document retrieval.  Lucene provide an explicit "optimize" call to merge all segment files into one single file so there is no need to look at multiple segment files during document retrieval.

Distributed Index

For large corpus (like the web documents), the index is typically distributed across multiple machines.  There are two models of distribution: Term partitioning and Document partitioning.


 
In document partitioning, documents are randomly spread across different partitions where the index is built.  In term partitioning, the terms are spread across different partitions.  We'll discuss document partitioning as it is more commonly used.  Distributed index is provider by other technologies that is built on Lucene, such as ElasticSearch.  A typical setting is as follows ...

In this setting, machines are organized as columns and rows.  Each column represent a partition of documents while each row represent a replica of the whole corpus.



During the document indexing, first a row of the machines is randomly selected and will be allocated for building the index.  When a new document crawled, a column machine from the selected row is randomly picked to host the document.  The document will be sent to this machine where the index is build.  The updated index will be later propagated to the other rows of replicas.

During the document retrieval, first a row of replica machines is selected.  The client query will then be broadcast to every column machine of the selected row.  Each machine will perform the search in its local index and return the TopM elements to the query processor which will consolidate the results before sending back to client.  Notice that K/P < M < K, where K is the TopK documents the client expects and P is the number of columns of machines.  Notice that M is a parameter that need to be tuned.

One caveat of this distributed index is that as the posting list is split horizontally across partitions, we lost the global view of the IDF value without which the machine is unable to calculate the TF-IDF score.  There are two ways to mitigate that ...
  1. Do nothing: here we assume the document are evenly spread across different partitions so the local IDF represents a good ratio of the actual IDF.
  2. Extra round trip: In the first round, query is broadcasted to every column which returns its local IDF.  The query processor will collected all IDF response and compute the sum of the IDF.  In the second round, it broadcast the query along with the IDF sum to each column of machines, which will compute the local score based on the IDF sum.

Friday, February 15, 2013

Text Processing (part 1) : Entity Recognition

Entity recognition is commonly used to parse unstructured text document and extract useful entity information (like location, person, brand) to construct a more useful structured representation.  It is one of the most common text processing to understand a text document.

I am planning to write a blog series on text processing.  In this first blog of a series of basic text processing algorithm, I will introduce some basic algorithm for entity recognition.

Given the sentence: Ricky is living in CA USA.
Output: Ricky/SP is/NA living/NA in/NA CA/SL USA/CL

Basically we want to tag each word with the entity, whose definition is domain specific.  In this example, we define the following tags
  • NA - Not Applicable
  • SP - Start of a person name
  • CP - Continue of a person name
  • SL - Start of a location name
  • CL - Continue of a location name

Hidden Markov Model

Lets say there is a sequence of state, lets look at multiple probabilistic graph.


However, in our tagging example, we don't directly observe the tag.  Instead, we only observe the words.  In this case, we can use a hidden markov model (ie: HMM).


 
Now the tagging problem can be structured as follows.

Given a sequence of words, we want to predict the most likely tag sequence.

Find a tag sequence t1, t2, ... tn that maximize the probability of P(t1, t2, .... | w1, w2 ...)

Using Bayes rules,
P(t1, t2, .... | w1, w2 ...) = P(t1, t2, ... tn, w1, w2, ... wn) / P(w1, w2, ... wn)

Since the sequence w1, w2 ... wn is observed and constant among all tag sequence.  This is equivalent to maximize P(t1, t2, ... tn, w1, w2, ... wn) which is equal to P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…

Now, P(t1|S), P(t2|t1) ... can be estimated by counting the occurrence within the training data.
P(t2|t1) = count(t1, t2) / count(*, t2)

Similarly, P(w1|t1) = count(w1, t1) / count(*, t1)

Viterbi Algorithm

Now the problem is find a tag sequence t1, ... tn to maximize
P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…

A naive method is to find all possible combination of tag sequence and then evaluate the above probability.  The order of complexity will be O(|T|^n) where T is the number of possible tag values.  Notice that this is exponential complexity with respect to the length of the sentence.

However, there is a more effective Viterbi algorithm that leverage the Markov chain properties as well as dynamic programming technique.



The key element is M(k, L) which indicates the max probability of any length k sequence that ends at tk = L.  On the other hand, M(k, L) is computed by looking back different choices of S of the length k-1 sequence, and pick the one that gives the maximum M(k-1, S)*P(tk=L | tk-1=S)*P(wk|tk=L).  The complexity of this algorithm is O(n*|T|^2).

To find the actual tag sequence, we also maintain a back pointer from every cell to S which leads to the cell.  Then we can trace back the path from the max cell  M(n, STOP) where STOP is the end of sentence.

Notice that for some rare words that is not observed from the training data,  P(wk|tk=L) will be zero and cause the whole term to be zero.  Such words can be numbers, dates.  One way to solve this problem is to group these rare words into patterns (e.g. 3 digits, year2012 ... etc) and then compute P(group(wk) | tk=L).  However, such grouping is domain specific and has to be hand-tuned.

Reference

NLP course from Michael Collins of Columbia Unversity

Tuesday, February 12, 2013

Basic Planning Algorithm

You can think of planning as a graph search problem where each node in the graph represents a possible "state" of the reality. A directed edge from nodeA to nodeB representing an "action" is available to transition stateA to stateB.

Planning can be thought of as another form of a constraint optimization problem which is quite different from the one I described in my last blog. In this case, the constraint is the goal state we want to achieve, where a sequence of actions needs to be found to meet the constraint. The sequence of actions will incur cost and our objective is to minimize the cost associated with our chosen actions

Basic Concepts 

A "domain" defined the structure of the problem.
  • A set of object types.  e.g. ObjectTypeX, ObjectTypeY ... etc.
  • A set of relation types  e.g. [ObjectTypeX RelationTypeA ObjectTypeY] or [ObjectTypeX RelationTypeA ValueTypeY]
A "state" is composed of a set of relation instances,  It can either be a "reality state" or a "required state".

A reality state contains tuples of +ve atoms.  e.g. [(personX in locationA), (personX is male)].  Notice that -ve atoms will not exist in reality state.  e.g. If personX is NOT in locationB, such tuple will just not show up in the state.

A required state contains both +ve and -ve atoms.  e.g. [(personX in locationA), NOT(personX is male)]  The required state is used to check against the reality state.  The required state is reached if all of the following is true.
  • All +ve atoms in the required state is contained in the +ve atoms of the reality state
  • None of the -ve atoms in the required state is contained in the +ve atoms of the reality state
Notice that there can be huge (or even infinite) number of nodes and edges in the graph if we are to expand the whole graph (with all possible states and possible actions).  Normally we will expressed only a subset of nodes and edges in an analytical way.  Instead of enumerating all possible states, we describe the state as a set of relations that we care, in particular we describe the initial state of the environment with all the things we observed and the goal state as what we want to reach.  Similarity, we are not enumerate every possible edges, instead we describe actions with variables such that it describe rules that can transition multiple situations of states.


An "action" causes transition from one state to the other.  It is defined as action(variable1, variable2 ...) and contains the following components.
  • Pre-conditions: a required state containing a set of tuples (expressed by variables).  The action is feasible if the current reality state contains all the +ve atoms but not any -ve atoms specified in the pre-conditions.
  • Effects: A set of +ve atoms and -ve atoms (also expressed by variables).  After the action is taken, it removes all the -ve atoms from the current reality state and then insert all the +ve atoms into the current reality state.
  • Cost of executing this actio.
Notice that since actions contains variables but the reality state does not, therefore before an action can be execution, we need to bind the variables in the pre-conditions to a specific value such that it will match the current reality state.  This binding will propagate to the variable in the effects of the actions and new atoms will be insert / removed from the reality state.

Planning Algorithm

This can be think of a Search problem.  Given an initial state and a goal state, our objective is to search for a sequence of actions such that the goal state is reached.



We can perform the search from the initial state, expand all the possible states that can be reached by taking some actions, and check during this process whether the goal state has been reached.  If so, terminate the process and return the path.

Forward planning build the plan from the initial state.  It works as follows ...
  1. Put the initial state into the exploration queue, with an empty path.
  2. Pick a state (together with its path from initial state) from the exploration queue as the current state according to some heuristics.
  3. If this current state is the goal state, then return its path that contains the sequence of action and we are done.  Else move on.
  4. For this current state, explore which action is possible by seeing whether the current state meet the pre-conditions (ie: contains all +ve and no -ve state specified in the action pre-conditions).
  5. If the action is feasible, compute the next reachable state, and the path (by adding this action to the original path), insert the next state into the exploration queue.
  6. Repeat 5 for all feasible actions of current state.

Alternatively, we can perform the search from the goal state.  We looked at what need to be accomplished and identify what possible actions can accomplish that (ie: the effect of the action meets the goal state).  Then we looked at whether those actions are feasible (ie: the initial state meets the action's pre-conditions).  If so we can execute the action, otherwise we take the action's pre-conditions as our sub-goal and expand our over goal state.

Backward planning build the plan from the goal state.  It works as follows ...
  1. Put the goal state into the exploration queue, with an empty path.
  2. Pick a regression state (a state that can reach the goal state, can be considered as a sub-goal) from the exploration queue according to some heuristics.
  3. If the regression state is contained in initial state, then we are done and return the path as the plan.  Else move on.
  4. From this regression state, identify all "relevant actions"; those actions who has some +ve effect is contained in the regression state; and all of its +ve effect is not overlap with the -ve regression state; and all of its -ve effect is not overlap with the +ve regression state.
  5. If the action is relevant, compute the next regression state by removing the action effect from the current regression state and adding the action pre-conditions into the current regression state, insert the next regression state into the exploration queue.
  6. Repeat 5 from all relevant actions of current regression state.

Heuristic Function

In above algorithms, to pick the next candidate from the exploration queue.  We can employ many strategies.
  • If we pick the oldest element in the queue, this is a breathe-first search
  • If we pick the youngest element in the queue, this is a depth-first search
  • We can pick the best element in the queue based on some value function.
Notice that what is "best" is very subjective and is also domain specific. A very popular approach is using the A* search whose value function = g(thisState) + h(thisState).

Notice that g(thisState) is the accumulative cost to move from initial state to "thisState", while h(thisState) is a domain-specific function that estimate the cost from "thisState" to the goal state.  It can be proved that in order for A* search to return an optimal solution (ie: the least cost path), the chosen h(state) function must not over-estimate (ie: need to underestimate) the actual cost to move from "thisState" to the goal state.

Here is some detail of A* search.

Monday, January 14, 2013

Optimization in R

Optimization is a very common problem in data analytics.  Given a set of variables (which one has control), how to pick the right value such that the benefit is maximized.  More formally, optimization is about determining a set of variables x1, x2, … that maximize or minimize an objective function f(x1, x2, …).

Unconstrained optimization

In an unconstraint case, the variable can be freely selected within its full range.

A typical solution is to compute the gradient vector of the objective function [∂f/∂x1, ∂f/∂x2, …] and set it to [0, 0, …].  Solve this equation and output the result x1, x2 … which will give the local maximum.

In R, this can be done by a numerical analysis method.

> f <- function(x){(x[1] - 5)^2 + (x[2] - 6)^2}
> initial_x <- c(10, 11)
> x_optimal <- optim(initial_x, f, method="CG")
> x_min <- x_optimal$par
> x_min
[1] 5 6

Equality constraint optimization

Moving onto the constrained case, lets say x1, x2 … are not independent and then have to related to each other in some particular way: g1(x1, x2, …) = 0,  g2(x1, x2, …) = 0.

The optimization problem can be expressed as …
Maximize objective function: f(x1, x2, …)
Subjected to equality constraints:
  • g1(x1, x2, …) = 0
  • g2(x1, x2, …) = 0
A typical solution is to turn the constraint optimization problem into an unconstrained optimization problem using Lagrange multipliers.

Define a new function F as follows ...
F(x1, x2, …, λ1, λ2, …) = f(x1, x2, …) + λ1.g1(x1, x2, …) + λ2.g2(x1, x2, …) + …

Then solve for ...
[∂F/∂x1, ∂F/∂x2, …, ∂F/∂λ1, ∂F/∂λ2, …] = [0, 0, ….]

Inequality constraint optimization

In this case, the constraint is inequality.  We cannot use the Lagrange multiplier technique because it requires equality constraint.  There is no general solution for arbitrary inequality constraints.

However, we can put some restriction in the form of constraint.  In the following, we study two models where constraint is restricted to be a linear function of the variables:
w1.x1 + w2.x2 + … >= 0

Linear Programming

Linear programming is a model where both the objective function and constraint function is restricted as linear combination of variables.  The Linear Programming problem can be defined as follows ...

Maximize objective function: f(x1, x2) = c1.x1 + c2.x2

Subjected to inequality constraints:
  • a11.x1 + a12.x2 <= b1
  • a21.x1 + a22.x2 <= b2
  • a31.x1 + a32.x2 <= b3
  • x1 >= 0, x2 >=0
As an illustrative example, lets walkthrough a portfolio investment problem.  In the example, we want to find an optimal way to allocate the proportion of asset in our investment portfolio.
  • StockA gives 5% return on average
  • StockB gives 4% return on average
  • StockC gives 6% return on average
To set some constraints, lets say my investment in C must be less than sum of A, B.  Also, investment in A cannot be more than twice of B.  Finally, at least 10% of investment in each stock.

To formulate this problem:

Variable: x1 = % investment in A, x2 = % in B, x3 = % in C

Maximize expected return: f(x1, x2, x3) = x1*5% + x2*4% + x3*6%

Subjected to constraints:
  • 10% < x1, x2, x3 < 100%
  • x1 + x2 + x3 = 1
  • x3 < x1 + x2
  • x1 < 2 * x2

> library(lpSolve)
> library(lpSolveAPI)
> # Set the number of vars
> model <- make.lp(0, 3)
> # Define the object function: for Minimize, use -ve
> set.objfn(model, c(-0.05, -0.04, -0.06))
> # Add the constraints
> add.constraint(model, c(1, 1, 1), "=", 1)
> add.constraint(model, c(1, 1, -1), ">", 0)
> add.constraint(model, c(1, -2, 0), "<", 0)
> # Set the upper and lower bounds
> set.bounds(model, lower=c(0.1, 0.1, 0.1), upper=c(1, 1, 1))
> # Compute the optimized model
> solve(model)
[1] 0
> # Get the value of the optimized parameters
> get.variables(model)
[1] 0.3333333 0.1666667 0.5000000
> # Get the value of the objective function
> get.objective(model)
[1] -0.05333333
> # Get the value of the constraint
> get.constraints(model)
[1] 1 0 0


Quadratic Programming

Quadratic programming is a model where both the objective function is a quadratic function (contains up to two term products) while constraint function is restricted as linear combination of variables.

The Quadratic Programming problem can be defined as follows ...

Minimize quadratic objective function:
f(x1, x2) = c1.x12 + c2. x1x2 + c2.x22 - (d1. x1 + d2.x2)
Subject to constraints
  • a11.x1 + a12.x2 == b1
  • a21.x1 + a22.x2 == b2
  • a31.x1 + a32.x2 >= b3
  • a41.x1 + a42.x2 >= b4
  • a51.x1 + a52.x2 >= b5
Express the problem in Matrix form:

Minimize objective ½(DTx) - dTx
Subject to constraint ATx >= b
First k columns of A is equality constraint

As an illustrative example, lets continue on the portfolio investment problem.  In the example, we want to find an optimal way to allocate the proportion of asset in our investment portfolio.
  • StockA gives 5% return on average
  • StockB gives 4% return on average
  • StockC gives 6% return on average
We also look into the variance of each stock (known as risk) as well as the covariance between stocks.

For investment, we not only want to have a high expected return, but also a low variance.  In other words, stocks with high return but also high variance is not very attractive.  Therefore, maximize the expected return and minimize the variance is the typical investment strategy.

One way to minimize variance is to pick multiple stocks (in a portfolio) to diversify the investment.  Diversification happens when the stock components within the portfolio moves from their average in different direction (hence the variance is reduced).

The Covariance matrix ∑ (between each pairs of stocks) is given as follows:
1%, 0.2%, 0.5%
0.2%, 0.8%, 0.6%
0.5%, 0.6%, 1.2%

In this example, we want to achieve a overall return of at least 5.2% of return while minimizing the variance of return

To formulate the problem:

Variable: x1 = % investment in A, x2 = % in B, x3 = % in C

Minimize variance: xT∑x

Subjected to constraint:
  • x1 + x2 + x3 == 1
  • X1*5% + x2*4% + x3*6% >= 5.2%
  • 0% < x1, x2, x3 < 100%
> library(quadprog)
> mu_return_vector <- c(0.05, 0.04, 0.06) 
> sigma <- matrix(c(0.01, 0.002, 0.005, 
+                   0.002, 0.008, 0.006, 
+                   0.005, 0.006, 0.012), 
+                  nrow=3, ncol=3)
> D.Matrix <- 2*sigma
> d.Vector <- rep(0, 3)
> A.Equality <- matrix(c(1,1,1), ncol=1)
> A.Matrix <- cbind(A.Equality, mu_return_vector, 
                    diag(3))
> b.Vector <- c(1, 0.052, rep(0, 3))
> out <- solve.QP(Dmat=D.Matrix, dvec=d.Vector, 
                  Amat=A.Matrix, bvec=b.Vector, 
                  meq=1)
> out$solution
[1] 0.4 0.2 0.4
> out$value
[1] 0.00672
>

Integration with Predictive Analytics

Optimization is usually integrated with predictive analytics, which is another important part of data analytics.  For an overview of predictive analytics, here is my previous blog about it.

The overall processing can be depicted as follows:


In this diagram, we use machine learning to train a predictive model in batch mode.  Once the predictive model is available, observation data is fed into it at real time and a set of output variables is predicted.  Such output variable will be fed into the optimization model as the environment parameters (e.g. return of each stock, covariance ... etc.) from which a set of optimal decision variable is recommended.

Wednesday, January 2, 2013

MapReduce: Detecting Cycles in Network Graph

I recently received an email from an audience of my blog on Map/Reduce algorithm design regarding how to detect whether a graph is acyclic using Map/Reduce.  I think this is an interesting problem and can imagine there can be wide range of application to it.

Although I haven't solved this exact problem in the past, I'd like to sketch out my thoughts on a straightforward approach, which may not be highly optimized.  My goal is to invite other audience who has solved this problem to share their tricks.

To define the problem:  Given a simple directed graph, we want to tell whether it contains any cycles.

Lets say the graph is represented in incidence edge format where each line represent an edge from one node to another node.  The graph can be split into multiple files.

node1, node2
node4, node7
node7, node1
....

Here is a general description of the algorithm.

We'll maintain two types of graph data.
  1. A set of link files where each line represent an edge in the graph.  This file will be static.
  2. A set of path files where each line represent a path from one node to another node.  This file will be updated in each round of map/reduce.
The general idea is to grow the path file until either the path cannot grow further, or the path contain a cycle, we'll use two global flags to keep track of that: "change" flag to keep track of whether new path is discovered, and "cycle" flag to keep track whether a cycle is detected.

The main driver program will create the initial path file by copying the link file.  And it will set the initial flag condition:  Set cycle and change flag to true.  In each round, the driver will
  • Check if the cycle is still false and change is true, exit if this is not the case
  • Otherwise, it will clear the change flag and invoke the map/reduce
Within each round, the map function will do the following
  • Emit the tuple [key=toNode, value=fromNode] if it is a path
  • Emit the tuple [key=fromNode, value=toNode] if it is a link
This ensure a path and all extended link reaches the same reducer, which will do the following
  • Check to see if the beginning point of the path is the same as the endpoint of the link.  If so, it detects a cycle and mark the cycle flag to be true.
  • Otherwise, it compute the product of every path to every link, and form the new path which effectively extend the length by one.

The following diagram illustrates the algorithm ...




Monday, November 26, 2012

Detecting Communities in Social Graph

In analyzing social network, one common problem is how to detecting communities, such as groups of people who knows or interacting frequently with each other.  Community is a subgraph of a graph where the connectivity are unusually dense.

In this blog, I will enumerate some common algorithms on finding communities.

First of all, community detection can be think of graph partitioning problem.  In this case, a single node will belong to no more than one community.  In other words, community does not overlap with each other.

High Betweenness Edge Removal

The intuition is that members within a community are densely connected and have many paths to reach each other.  On the other hand, nodes from different communities requires inter-community links to reach each other, and these inter-community links tends to have high betweenness score.

Therefore, by removing these high-betweenness links, the graph will be segregated into communities.

Algorithm:
  1. For each edge, compute the edge-betweenness score
  2. Remove the edge with the highest betweenness score
  3. Until there are enough segregation
However, while this method achieve good result, it is very slow and not work effectively when there are more than couple thousand nodes with dense edges.

Hierarchical Clustering

This is a very general approach of detecting communities.  Some measure of distance (or similarities) is defined and computed between every pair of nodes first.  Then classical hierarchical cluster technique can be applied.  The distance should be chosen such that it is small between members within a community and big between members of different community.

Random Walk

Random walk can be used to compute the distance between every pair of nodes node-B and node-C.  Lets focus on undirected graph for now.  A random walker starts at node-B, throw a dice and has beta probability that it randomly pick a neighbor to visit based on the weight of links, and with 1 - beta probability that it will teleport back to the original node-v.  At an infinite number of steps, the probability distribution of landing on node-w will be high if node-B and node-C belongs to the same community.  The intuition here is that the random walker tends to be trapped within the community so all nodes that has high probability distribution tends to be within the same community as node-B (where the random walker is started).

Notice that the pick of beta is important.  If it is too big (close to 1), then the probability after converging is independent of the starting node (ie: the probability distribution only reflect the centrality of each node but not the community of the starting node).  If beta is too small (close to zero), then the walker will die down too quick before it fully explore the community's connectivity.

There is an analytical solution to this problem.


Lets M be the transition matrix before every pair of nodes. V represents the probability distribution of where the random walkers is.


 The "distance" between node-B and every other nodes is the eigenvector of M.  We can repeat the same to find out distance of all pairs of nodes, and then feed the result to a hierarchical clustering algorithm.

Label Propagation

The basic idea is that nodes observe its neighbors and set its own label to be the majority of its neighbors.
  1. Nodes are initially assigned with a unique label.
  2. In each round, each node examine the label of all its neighbors are set its own label to be the majority of its neighbors, when there is a tie, the winner is picked randomly.
  3. Until there is no more change of label assignments

Modularity Optimization

Within a community, the probability of 2 nodes having a link should be higher than if the link is just formed randomly within the whole graph.

probability of random link = deg(node-B) * deg(node-C) / (N * (N-1))
The actual link = Adjacency matrix[B, C]

Define com(B) to be community of node-B, com(C) to be community of node-C

So a utility function "Modularity" is created as follows ...
sum_over_v_w((actual - random) * delta(com(B), com(C)))


Now we examine communities that can be overlapping.  ie: A single node can belong to more than one community.

Finding Clique

Simple community detection usually starts with cliques.  Clique is a subgraph whether every node is connected to any other node.  In a K-Clique, there are K nodes and K^2 links between them.

However, communities has a looser definition, we don't require everyone to know every other people within the community, but we need them to know "enough" (maybe a certain percentage) of other people in the community.  K-core is more relaxed definition, it requires the nodes of the K-core to have connectivity to at least K other members.  There are some less popular relaxation, K-Clan requires every node to connect with every other members within K steps (path length less than K).  K-plex requires the node to connect to (N - K) members in the node where N total number of members within the K-plex.

The community is defined as the found K-core, or K-clan, or K-plex.

K-Clique Percolation

Another popular way of finding community is by rolling across adjacent K-Clique.  Two K-Clique is adjacent if they share K-1 nodes.  K is a parameter that we need to pick based on how dense we expect the community should have.

The algorithm is illustrated in following diagram.





K-Clique percolation is a popular way to identify communities which can potentially be overlapping with each other.

Thursday, October 4, 2012

Machine Learning in Gradient Descent

In Machine Learning, gradient descent is a very popular learning mechanism that is based on a greedy, hill-climbing approach.

Gradient Descent

The basic idea of Gradient Descent is to use a feedback loop to adjust the model based on the error it observes (between its predicted output and the actual output).  The adjustment (notice that there are multiple model parameters and therefore should be considered as a vector) is pointing to a direction where the error is decreasing in the steepest sense (hence the term "gradient").



Notice that we intentionally leave the following items vaguely defined so this approach can be applicable in a wide range of machine learning scenarios.
  • The Model
  • The loss function
  • The learning rate
Gradient Descent is very popular method because of the following reasons ...
  • Intuitive and easy to understand
  • Easy to run in parallel processing architecture
  • Easy to run incrementally with additional data
On the other hand, the greedy approach in Gradient Descent can be trapped in local optimum.  This can be mitigated by choosing a convex LOSS function (which has a single optimum), or multiple starting points can be picked randomly (in the case, we hope the best local optimum is close to the global optimum).

Batch vs Online Learning

While some other Machine Learning model (e.g. decision tree) requires a batch of data points before the learning can start, Gradient Descent is able to learn each data point independently and hence can support both batch learning and online learning easily.  The difference lies in how the training data is fed into the model and how the loss function computes its error.

In batch learning, all training will be fed to the model, who estimates the output for all data points.  Error will then be summed to compute the loss and then update the model.  Model in this case will be updated after predicting the whole batch of data points.

In online learning mode (also called stochastic gradient descent), data is fed to the model one at a time while the adjustment of the model is immediately made after evaluating the error of this single data point.  Notice that the final result of incremental learning can be different from batch learning, but it can be proved that the difference is bound and inversely proportional to the square root of the number of data points.

The learning rate can be adjusted as well to achieve a better stability in convergence.  In general, the learning rate is higher initially and decrease over the iteration of training (in batch learning it decreases in next round, in online learning it decreases at every data point).  This is quite intuitive as you paid less attention to the error as you have learn more and more.  Because of that online learning is sensitive to the arrival order of data.

One way to adjust the learning rate is to have a constant divide by the square root of N (where N is the number of data point seen so far).

 ɳ = ɳ_initial / (t ^ 0.5).

By using different decay factor, we can control how much attention we should pay for the late coming data.  In online learning, as data comes in time of occurrence, we can play around with this decay factor to guide how much attention the learning mechanism should be paying to latest arrival data.  Online learning automatically adapt to change of trends over time.

Most real world machine learning scenario relies on stationality of the model.  By the way, learning is about "learning from past experience".  If the environment changes too rapidly that the past experience is invalid, there is little value to learn.  Because of this reason, most machine learning project are satisfied by using batch learning (daily or weekly) and the demand of online learning is not very high.  A very common batch learning model is described in my previous blog here.

Parallel Learning


Because of no dependency in data processing, Gradient Descent is very easy to put into a parallel processing environment such as Map/Reduce.  Here we illustrate how to parallelize the execution of batch learning.



Notice that there are multiple rounds of Map/Reduce until the model converges.  On the other hand, online learning is not possible for Hadoop Map/Reduce which doesn't support real-time at this moment.

In summary, gradient descent is a very powerful approach of machine learning and works well in a wide spectrum of scenarios.