Wednesday, March 12, 2014

Common Text Mining workflow

In this post, I want to summarize a common pattern that I have used in my previous text mining projects.

Text mining is different in that it uses vocabulary term as a key elements in feature engineering, otherwise it is quite similar to a statistical data mining project.  Following are the key steps ...
  1. Determine the "object" that we are interested to analyze.  In some cases, the text document itself is the object (e.g. an  email).  In other cases, the text document is providing information about the object (e.g. user comment of a product, tweaks about a company)
  2. Determine the features of the object we are interested, and create the corresponding feature vector of the object.
  3. Feed the data (each object and its corresponding set of features) to standard descriptive analytics and predictive analytics techniques.
The overall process of text mining can be described in the following flow ...

Extract docs

In this phase, we are extracting text document from various types of external sources into a text index (for subsequent search) as well as a text corpus (for text mining).

Document source can be a public web site, an internal file system, or a SaaS offerings.  Extracting documents typically involves one of the following ...
  • Perform a google search or crawl a predefined list of web sites, then download the web page from the list of URL, parse the DOM to extract text data from its sub-elements, and eventually creating one or multiple documents, store them into the text index as well as text Corpus.
  • Invoke the Twitter API to search for tweets (or monitor a particular topic stream of tweets), store them into the text index and text Corpus.
  • There is no limit in where to download the text data.  In an intranet environment, this can be downloading text document from a share drive.  On the other hand, in a compromised computer, user's email or IM can also be download from the virus agent.
  • If the text is in a different language, we may also invoke some machine translation service (e.g. Google translate) to convert the language to English.
Once the document is stored in the text index (e.g. Lucene index), it is available for search.  Also, once the document is stored in the text corpus, further text processing will be involved.


After the document is stored in the Corpus, here are some typical transformations ...
  • If we want to extract information about some entities mentioned in the document, we need to conduct sentence segmentation, paragraph segmentation in order to provide some local context from which we can analyze the entity with respect to its relationship with other entities.
  • Attach Part-Of-Speech tagging, or Entity tagging (person, place, company) to each word.
  • Apply standard text processing such as lower case, removing punctuation, removing numbers, removing stopword, stemming.
  • Perform domain specific conversion such as replace dddd-dd-dd with , (ddd)ddd-dddd to , remove header and footer template text, remove terms according to domain-specific stop-word dictionary.
  • Optionally, normalize the words to its synonyms using Wordnet or domain specific dictionary.

Extract Features

For text mining, the "bag-of-words model" is commonly used as the feature set.  In this model, each document is represented as a word vector (a high dimensional vector with magnitude represents the importance of that word in the document).  Hence all documents within the corpus is represented as a giant document/term matrix.  The "term" can be generalized as uni-gram, bi-gram, tri-gram or n-gram, while the cell value in the matrix represents the frequency of the term appears in the document.  We can also use TF/IDF as the cell value to dampen the importance of those terms if it appears in many documents.  If we just want to represent whether the term appears in the document, we can binarize the cell value into 0 or 1.

After this phase, the Corpus will turn into a large and sparse document term matrix.

Reduce Dimensions

Since each row in the document/term matrix represents each document as a high dimension vector (with each dimension represents the occurrence of each term), there are two reasons we want to reduce its dimension ...
  1. For efficiency reason, we want to reduce the memory footprint for storing the corpus
  2. We want to transform the vector from the "term" space to a "topic" space, which allows document of similar topics to situate close by each other even they use different terms.  (e.g. document using the word "pet" and "cat" are map to the same topic based on their co-occurrence)
SVD (Singular Value Decomposition) is a common matrix factorization technique to convert a "term" vector into a "concept" vector.  SVD can be used to factor a large sparse matrix of M by N into the multiplication of three smaller dense matrix M*K, K*K, K*N.  Latent Semantic Indexing (LSI) is applying the SVD in the document term matrix.

Another popular technique call topic modeling, based on LDA (Latent Dirichlet Allocation) is also commonly used to transform the document into a smaller set of topic dimensions.

Apply standard data mining

At this point, each document is represented as a topic vector.  We can also add more domain specific features (such as for spam detection, whether the document contains certain word or character patterns such as '$', '!').  After that we can feed the each vector into the regular machine learning process.

Tools and Library

I have used Python's NLTK as well as R's TM, topicmodel library for performing the text mining work that I described above.  Both of these library provide a good set of features for mining text documents.

Monday, March 3, 2014

Estimating statistics via Bootstrapping and Monte Carlo simulation

We want to estimate some "statistics" (e.g. average income, 95 percentile height, variance of weight ... etc.) from a population.

It will be too tedious to enumerate all members of the whole population.  For efficiency reason, we randomly pick a number samples from the population, compute the statistics of the sample set to estimate the corresponding statistics of the population.  We understand the estimation done this way (via random sampling) can deviate from the population.  Therefore, in additional to our estimated statistics, we also include a "standard error" (how big our estimation may be deviated from the actual population statistics) or a "confidence interval" (a lower and upper bound of the statistics which we are confident about containing the true statistics).

The challenge is how do we estimate the "standard error" or the "confidence interval".  A straightforward way is to repeat the sampling exercise many times, each time we create a different sample set from which we compute one estimation.  Then we look across all estimations from different sample sets to estimate the standard error and confidence interval of the estimation.

But what if collecting data from a different sample set is expensive, or for any reason the population is no longer assessable after we collected our first sample set.  Bootstrapping provides a way to address this ...


Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set.  Each of the created sample set will follow the same data distribution of the original sample set, which in turns, follow the population.

R provides a nice "bootstrap" library to do this.

> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Lets say we care about the ninety percentile
> quantile(population.weight, 0.9)
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of ninety percentile
> quantile(sample_set1, 0.9)
> # Notice that the sample statistics deviates from the population statistics
> # We want to estimate how big is this deviation by using bootstrapping
> # I need to define my function to compute the statistics
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result


boot(data = sample_set1, statistic = ninety_percentile, R = 1000)

Bootstrap Statistics :
    original   bias    std. error
t1* 232.3641 2.379859     5.43342
> plot(boot_result)
>, type="bca")
Based on 1000 bootstrap replicates

CALL : = boot_result, type = "bca")

Intervals : 
Level       BCa          
95%   (227.2, 248.1 )  
Calculations and Intervals on Original Scale

Here is the visual output of the bootstrap plot

Bootstrapping is a powerful simulation technique for estimate any statistics in an empirical way.  It is also non-parametric because it doesn't assume any model as well as parameters and just use the original sample set to estimate the statistics. 

If we assume certain distribution model want to see the distribution of certain statistics.  Monte Carlo simulation provides a powerful way for this.

Monte Carlo Simulation

The idea is pretty simple, based on a particular distribution function (defined by a specific model parameters), we generate many sets of samples.  We compute the statistics of each sample set and see how the statistics distributed across different sample sets.

For example, given a normal distribution population, what is the probability distribution of the max value of 5 randomly chosen samples.

> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+     sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)

Here is the distribution of the "max(5)" statistics, which shows some right skewness

Bootstrapping and Monte Carlo simulation is a powerful tool to estimate statistics in an empirical manner, especially when we don't have an analytic form of solution.