Wednesday, March 12, 2014

Common Text Mining workflow

In this post, I want to summarize a common pattern that I have used in my previous text mining projects.

Text mining differs in that it uses vocabulary terms as key elements in feature engineering; otherwise it is quite similar to a statistical data mining project.  Following are the key steps ...
  1. Determine the "object" that we are interested in analyzing.  In some cases, the text document itself is the object (e.g. an email).  In other cases, the text document provides information about the object (e.g. user comments about a product, tweets about a company).
  2. Determine the features of the object that we are interested in, and create the corresponding feature vector for the object.
  3. Feed the data (each object and its corresponding set of features) to standard descriptive analytics and predictive analytics techniques.
The overall process of text mining can be described in the following flow ...

Extract docs

In this phase, we extract text documents from various types of external sources into a text index (for subsequent search) as well as a text corpus (for text mining).

The document source can be a public web site, an internal file system, or a SaaS offering.  Extracting documents typically involves one of the following ...
  • Perform a Google search or crawl a predefined list of web sites, download the web pages from the resulting list of URLs, parse the DOM to extract text data from its sub-elements, and eventually create one or more documents, storing them in the text index as well as the text corpus.
  • Invoke the Twitter API to search for tweets (or monitor a particular topic stream of tweets), and store them in the text index and text corpus.
  • There is no limit on where the text data can come from.  In an intranet environment, this can mean downloading text documents from a shared drive.  On the other hand, on a compromised computer, the user's email or IM can also be downloaded by the virus agent.
  • If the text is in a different language, we may also invoke a machine translation service (e.g. Google Translate) to convert it to English.
Once the document is stored in the text index (e.g. Lucene index), it is available for search.  Also, once the document is stored in the text corpus, further text processing will be involved.
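
As a rough illustration of the "parse the DOM to extract text" step, here is a minimal sketch using only Python's standard library.  The class name TextExtractor, the helper html_to_text and the sample page are my own illustrative assumptions, and a real pipeline would first download the page and then write the result to the index and corpus.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; a real pipeline would download it first.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Acme Corp</h1><p>Great <b>product</b>!</p></body></html>"
)
corpus = [html_to_text(page)]   # stand-in for the text corpus
```

The in-memory list stands in for the text corpus only; storing into a search index such as Lucene is outside the scope of this sketch.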


After the document is stored in the Corpus, here are some typical transformations ...
  • If we want to extract information about some entities mentioned in the document, we need to conduct sentence segmentation and paragraph segmentation in order to provide local context in which we can analyze each entity with respect to its relationships with other entities.
  • Attach Part-Of-Speech tags or entity tags (person, place, company) to each word.
  • Apply standard text processing such as lowercasing, removing punctuation, removing numbers, removing stopwords, and stemming.
  • Perform domain-specific conversions such as replacing dddd-dd-dd with a date placeholder token and (ddd)ddd-dddd with a phone-number placeholder token, removing header and footer template text, and removing terms according to a domain-specific stop-word dictionary.
  • Optionally, normalize words to their synonyms using WordNet or a domain-specific dictionary.
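
Some of the transformations above can be sketched with plain regular expressions.  This is only a minimal sketch under a few assumptions of mine: the placeholder names "datetoken" and "phonetoken", the tiny stop-word list and the sample sentence are all made up for illustration.  Stemming is left out here; in practice a stemmer such as NLTK's PorterStemmer could be applied to each token at the end.

```python
import re

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "at", "to", "of", "and", "or", "me"}

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")      # matches dddd-dd-dd
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")   # matches (ddd)ddd-dddd


def normalize(text):
    """Lowercase, replace dates/phones, drop punctuation, numbers and stopwords."""
    text = text.lower()
    # Domain-specific conversion first, before digits and punctuation are stripped.
    text = DATE_RE.sub(" datetoken ", text)
    text = PHONE_RE.sub(" phonetoken ", text)
    # Anything that is not a letter or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in STOPWORDS]


tokens = normalize("Call me at (415)555-1234 on 2014-03-12: the 2 invoices are READY!")
# tokens is ['call', 'phonetoken', 'datetoken', 'invoices', 'ready']
```

The order matters: the date and phone patterns have to be replaced before numbers and punctuation are removed, otherwise there is nothing left to match.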

Extract Features

For text mining, the "bag-of-words model" is commonly used as the feature set.  In this model, each document is represented as a word vector (a high-dimensional vector in which the magnitude along each dimension represents the importance of that word in the document).  Hence all documents within the corpus are represented as a giant document/term matrix.  The "term" can be generalized to a uni-gram, bi-gram, tri-gram or n-gram, while the cell value in the matrix represents the frequency with which the term appears in the document.  We can also use TF/IDF as the cell value to dampen the importance of terms that appear in many documents.  If we just want to represent whether the term appears in the document, we can binarize the cell value into 0 or 1.

After this phase, the corpus will have turned into a large and sparse document/term matrix.
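
To make the three cell-value choices concrete, here is a minimal sketch in plain Python that builds a small document/term matrix with raw counts, a TF/IDF weighting and a binary weighting.  The three toy documents are made up, and tf * log(N / df) is just one common TF/IDF variant that I am assuming here; libraries differ in the exact formula.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data).
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "pet"],
    ["dog", "pet", "mat"],
]

# One column per distinct term, in a fixed (sorted) order.
vocab = sorted({term for doc in docs for term in doc})

# Raw term-frequency matrix: one row per document, one column per term.
tf = []
for doc in docs:
    counts = Counter(doc)
    tf.append([counts[term] for term in vocab])

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# TF/IDF variant assumed here: tf * log(N / df).
tfidf = [
    [row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
    for row in tf
]

# Binary variant: 1 if the term occurs at all in the document, else 0.
binary = [[1 if count > 0 else 0 for count in row] for row in tf]
```

In the first document, "sat" ends up with a larger TF/IDF weight than "cat", because "sat" appears in only one document while "cat" appears in two.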

Reduce Dimensions

Since each row in the document/term matrix represents a document as a high-dimensional vector (with each dimension representing the occurrence of one term), there are two reasons we want to reduce its dimensions ...
  1. For efficiency, we want to reduce the memory footprint for storing the corpus.
  2. We want to transform the vector from the "term" space to a "topic" space, which allows documents on similar topics to sit close to each other even if they use different terms (e.g. documents using the words "pet" and "cat" are mapped to the same topic based on their co-occurrence).
SVD (Singular Value Decomposition) is a common matrix factorization technique to convert a "term" vector into a "concept" vector.  SVD can be used to approximate a large sparse M by N matrix as the product of three smaller dense matrices of size M*K, K*K and K*N.  Latent Semantic Indexing (LSI) is the application of SVD to the document/term matrix.
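
Here is a minimal sketch of the factorization using NumPy, assuming a tiny dense matrix with made-up counts and K = 2.  For a genuinely large sparse matrix, a truncated solver such as scipy.sparse.linalg.svds would be the more realistic choice; the dense numpy.linalg.svd is used here only to keep the example short.

```python
import numpy as np

# Rows are documents, columns are terms (hypothetical counts).
# Columns:     cat  pet  dog  car  engine
A = np.array([[2, 1, 0, 0, 0],
              [1, 2, 1, 0, 0],
              [0, 1, 2, 0, 0],
              [0, 0, 0, 2, 1],
              [0, 0, 0, 1, 2]], dtype=float)

k = 2  # number of concepts to keep

# Full (thin) SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # M*K, K (diagonal), K*N

# Each document expressed as a k-dimensional concept vector.
doc_concepts = U_k * s_k

# Rank-k approximation of the original document/term matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In this toy matrix the first document only mentions "cat" and "pet" while the third only mentions "pet" and "dog", yet cosine(doc_concepts[0], doc_concepts[2]) comes out close to 1 because both load on the same concept, whereas the "car"/"engine" documents land on the other one.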

Another popular technique called topic modeling, based on LDA (Latent Dirichlet Allocation), is also commonly used to transform the document into a smaller set of topic dimensions.

Apply standard data mining

At this point, each document is represented as a topic vector.  We can also add more domain-specific features (for spam detection, for example, whether the document contains certain words or character patterns such as '$' or '!').  After that we can feed each vector into the regular machine learning process.
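
As a minimal sketch of combining the two kinds of features, the snippet below appends a few hand-crafted spam indicators to a topic vector.  The function names, the choice of indicators (including the word "free") and the sample email are my own illustrative assumptions.

```python
def spam_features(text):
    """Hand-crafted features; the specific choices here are illustrative."""
    return [
        float(text.count("$")),                   # number of dollar signs
        float(text.count("!")),                   # number of exclamation marks
        1.0 if "free" in text.lower() else 0.0,   # mentions "free" anywhere
    ]


def build_feature_vector(topic_vector, text):
    """Concatenate the topic dimensions with the domain-specific features."""
    return list(topic_vector) + spam_features(text)


# Hypothetical 2-topic vector for one email.
row = build_feature_vector([0.8, 0.2], "FREE money!!! Win $1000 now!")
# row is [0.8, 0.2, 1.0, 4.0, 1.0]
```

Each such row, together with its label, is what gets handed to the downstream classifier or clustering algorithm.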

Tools and Library

I have used Python's NLTK as well as R's tm and topicmodels packages to perform the text mining work that I described above.  Both of these toolsets provide a good set of features for mining text documents.
