## Thursday, January 22, 2009

### Solving TF-IDF using Map-Reduce

TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique to compute the relevancy of a document with respect to a particular term.

"Term" is a generalized element contains within a document. A "term" is a generalized idea of what a document contains. (e.g. a term can be a word, a phrase, or a concept).

Intuitively, the relevancy of a document to a term can be calculated from how often the term shows up in the document (i.e. the count of the term in that document divided by the total number of terms in it). We call this the "term frequency".

On the other hand, if the term is very common and appears in many other documents, its relevancy should be reduced. The fraction of documents containing the term (i.e. the count of documents having this term divided by the total number of documents) is called the "document frequency".

The overall relevancy of a document with respect to a term can then be computed by combining the term frequency and the document frequency:

relevancy = term frequency * log (1 / document frequency)

This is called tf-idf. A "document" can then be represented as a multi-dimensional vector, where each dimension corresponds to a term and holds its tf-idf value.
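For example (with made-up numbers): if a term appears 3 times in a 100-word document, its term frequency is 3/100 = 0.03; if 10 out of 1,000 documents contain the term, its document frequency is 10/1000 = 0.01, so the relevancy is 0.03 * log(1/0.01) = 0.03 * log(100) ≈ 0.14 (using the natural log, as the Java UDF below does).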

#### Compute TF-IDF using Map/Reduce

To extract the terms from a document, the following process is common:
• Extract words by tokenizing the input stream
• Make the words case-insensitive (e.g. transform them all to lower case)
• Apply n-gram analysis to extract phrases (e.g. a statistically frequent n-gram is likely a phrase)
• Filter out stop words
• Apply stemming (e.g. transform "cat", "cats", "kittens" to "cat")
To keep things simple, each word itself is a term in our example below; a rough sketch of such an extraction pipeline is shown next.
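As an illustration only (this helper class is mine, not part of the original pipeline), here is a minimal Java sketch of the first few steps above, covering tokenization, lower-casing, and stop-word filtering; the stop-word list is a placeholder, and n-gram extraction and stemming are omitted:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermExtractor {

    // A tiny illustrative stop-word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "is", "of"));

    public static List<String> extractTerms(String document) {
        List<String> terms = new ArrayList<String>();
        for (String token : document.split("\\W+")) { // tokenize on non-word characters
            String word = token.toLowerCase();        // make case-insensitive
            if (word.length() > 0 && !STOP_WORDS.contains(word)) { // drop stop words
                terms.add(word);
            }
        }
        return terms; // n-gram extraction and stemming omitted for brevity
    }
}
```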

We use multiple rounds of Map/Reduce to gradually compute:
1. the word count per word/doc combination
2. the total number of words per doc
3. the total number of docs per word
From these counts, the TF-IDF can finally be computed; a plain-Java sketch of the same three aggregations is shown below.
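Before turning to PIG, it may help to see the three aggregations spelled out sequentially. This is a single-machine Java sketch of my own (names like wordDocCount are mine, not from the PIG script), not an actual Map/Reduce job:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfRounds {

    // docs maps docId -> list of terms extracted from that document
    public static void compute(Map<String, List<String>> docs) {
        Map<String, Integer> wordDocCount = new HashMap<String, Integer>(); // round 1: count per word/doc
        Map<String, Integer> wordsPerDoc  = new HashMap<String, Integer>(); // round 2: total words per doc
        Map<String, Integer> docsPerWord  = new HashMap<String, Integer>(); // round 3: docs containing word

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            wordsPerDoc.put(doc.getKey(), doc.getValue().size());
            for (String word : doc.getValue()) {
                String key = word + "/" + doc.getKey(); // composite key; assumes words contain no '/'
                Integer n = wordDocCount.get(key);
                wordDocCount.put(key, n == null ? 1 : n + 1);
            }
        }
        for (String key : wordDocCount.keySet()) {
            String word = key.substring(0, key.indexOf('/'));
            Integer n = docsPerWord.get(word);
            docsPerWord.put(word, n == null ? 1 : n + 1);
        }

        // Finally, combine the three counts into a TF-IDF score per (word, doc)
        int totalDocs = docs.size();
        for (Map.Entry<String, Integer> e : wordDocCount.entrySet()) {
            String word  = e.getKey().substring(0, e.getKey().indexOf('/'));
            String docId = e.getKey().substring(e.getKey().indexOf('/') + 1);
            double tf  = (double) e.getValue() / wordsPerDoc.get(docId);
            double idf = Math.log((double) totalDocs / docsPerWord.get(word));
            System.out.println(word + "\t" + docId + "\t" + (tf * idf));
        }
    }
}
```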

#### Implementation in Apache PIG

There are many ways to implement the Map/Reduce paradigm above. Apache Hadoop is a popular approach using Java or other programming languages (i.e. via Hadoop Streaming).

Apache PIG is another approach, based on a higher-level language with parallel processing constructs built in. Here are the three rounds of map/reduce logic implemented as a PIG script:
```
REGISTER rickyudf.jar;

/* Build up the input data stream */
A1 = LOAD 'dirdir/data.txt' AS (words:chararray);
DocWordStream1 =
    FOREACH A1 GENERATE
        'data.txt' AS docId,
        FLATTEN(TOKENIZE(words)) AS word;

A2 = LOAD 'dirdir/data2.txt' AS (words:chararray);
DocWordStream2 =
    FOREACH A2 GENERATE
        'data2.txt' AS docId,
        FLATTEN(TOKENIZE(words)) AS word;

A3 = LOAD 'dirdir/data3.txt' AS (words:chararray);
DocWordStream3 =
    FOREACH A3 GENERATE
        'data3.txt' AS docId,
        FLATTEN(TOKENIZE(words)) AS word;

InStream = UNION DocWordStream1,
                 DocWordStream2,
                 DocWordStream3;

/* Round 1: word count per word/doc combination */
B = GROUP InStream BY (word, docId);
Round1 = FOREACH B GENERATE
             group AS wordDoc,
             COUNT(InStream) AS wordCount;

/* Round 2: total word count per doc */
C = GROUP Round1 BY wordDoc.docId;
WW = GROUP C ALL;
C2 = FOREACH WW GENERATE
         FLATTEN(C),
         COUNT(C) AS totalDocs;
Round2 = FOREACH C2 GENERATE
             FLATTEN(Round1),
             SUM(Round1.wordCount) AS wordCountPerDoc,
             totalDocs;

/* Round 3: compute the total doc count per word */
D = GROUP Round2 BY wordDoc.word;
D2 = FOREACH D GENERATE
         FLATTEN(Round2),
         COUNT(Round2) AS docCountPerWord;
Round3 = FOREACH D2 GENERATE
             $0.word AS word,
             $0.docId AS docId,
             com.ricky.TFIDF(wordCount,
                             wordCountPerDoc,
                             totalDocs,
                             docCountPerWord) AS tfidf;

/* Order the output by relevancy */
ORDERRound3 = ORDER Round3 BY word ASC,
                             tfidf DESC;

DUMP ORDERRound3;
```
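Assuming the script above is saved as tfidf.pig (my file name, not from the original post) and the three input files exist under dirdir/, it can be run in Pig's local mode:

```
pig -x local tfidf.pig
```

The final DUMP then prints tuples of the form (word, docId, tfidf), with each word's documents listed from most to least relevant.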

Here is the corresponding User Defined Function in Java (contained in rickyudf.jar):
```java
package com.ricky;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TFIDF extends EvalFunc<Double> {

    @Override
    public Double exec(Tuple input) throws IOException {
        long wordCount = (Long) input.get(0);        // count of this word in this doc
        long wordCountPerDoc = (Long) input.get(1);  // total words in this doc
        long totalDocs = (Long) input.get(2);        // total docs in the corpus
        long docCountPerWord = (Long) input.get(3);  // docs containing this word

        double tf = (wordCount * 1.0) / wordCountPerDoc;
        double idf = Math.log((totalDocs * 1.0) / docCountPerWord);

        return tf * idf;
    }
}
```
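To sanity-check the UDF outside of PIG, one can construct a Tuple by hand with Pig's TupleFactory. This is a quick test sketch of my own, not from the original post; the input values mirror the worked example earlier:

```java
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TFIDFTest {
    public static void main(String[] args) throws Exception {
        Tuple input = TupleFactory.getInstance().newTuple(4);
        input.set(0, 3L);    // wordCount: the term appears 3 times in the doc
        input.set(1, 100L);  // wordCountPerDoc: the doc has 100 words
        input.set(2, 1000L); // totalDocs: 1000 docs in the corpus
        input.set(3, 10L);   // docCountPerWord: 10 docs contain the term
        // Expected: (3/100) * ln(1000/10) = 0.03 * ln(100), roughly 0.138
        System.out.println(new com.ricky.TFIDF().exec(input));
    }
}
```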