## Sunday, April 27, 2008

### Bayesian Classifier

Classification Problem

Observing an instance x, determine its class. (e.g. Given a Email, determine if this is a spam).

Solution Approach

Based on probability theory, the solution is class[j] which has maximum chance to produce x.

Also, x can be represents by a set of observed features (a set of predicates). ie: x = a0 ^ a1 ^ a2 ... ^ ak

For each class[j], calculate j which maximize P(class[j] | x).

Also assume we have already gone through a learning stage where a lot of (x, class) has been taught

Note that X is a huge space, it is unlikely that we have seen x during training. Therefore, apply Bayes theorem:

P(class[j] | x) = P(x | class[j]) * P(class[j]) / P(x)

Since P(x) is the same for all j, we can remove P(x).

Find j to maximize: P(a0 ^ a1 ... ^ ak | class[j]) * P(class[j])

P(class[j]) = no_of_training_instances_whose_class_equals_classJ / total_no_of_training_instances. (This is easy to find).

Now P(a0 ^ a1 ... ^ ak | class[j]) is very hard to find because you probably have not met this combination during the training.

Lets say we have some domain knowledge and we understand the dependency relationship between a0, a1 ... We can make some assumptions.

Naive Bayes

So if we know a0, a1 ... ak are "independent of each other given knowing class == class[j], then

P(a0 ^ a1 ... ^ ak | class[j]) is same as P(a0 | class[j]) x P(a1 | class[j]) x .... x P(ak | class[j])

Now P(ak | class[j]) = no_of_instances_has_ak_and_classJ / no_of_instances_has_classJ (This is easy to find)

Spam Filtering Example

x is an Email. class = spam, class = non-spam

Lets break down the observed instance x as a vector of words.

• x = ["Hello", "there", "welcome", .....]
• a0 is position == "Hello"
• a1 is position == "there"
• a2 is position == "welcome"

We assume position[k] == "Hello" is the same for all k and the occurrence of words are independent of each other given a particular class

Therefore, we try to compare between ...

• P(a0 ^ a1 ... ^ ak | spam) * P(spam)
• P(a0 ^ a1 ... ^ ak | nonspam) * P(nonspam)

P(a0 ^ a1 ... ^ak | spam) is the same as:

P(pos == "Hello" | spam) x P(pos == "there" | spam) x .... x P(ak | spam) * P(spam)

P(pos == "Hello" | spam) = no_of_hello_in_spam_email / total_words_in_spam_email

Algorithm

``Class NaiveBayes { def initialize(word_dictionary) {   @word_dictionary = word_dictionary } def learn(doc, class) {   @total_instances += 1   @class_count[class] += 1   for each word in doc {     @word_count_by_class[class][word] += 1     @total_word[class] += 1   } } def classify(doc) {   for each class in ["spam", "nonspam"] {     prob[class] = @class_count[class] / @total_instances     for k in 0 .. doc.length {       word = doc[k]       prob[class] *= (@word_count[_by_class[class][word] + 1) / (@total_word[class] + @word_dictionary.length)     }     if max_prob < prob[class] {       max_prob = prob[class]       max_class = class     }   }   return max_class }}``

Bayesian Network

Sometimes, assuming complete independence is too extreme. We need to relax this assumption by letting some possible dependencies among a0, a1 ... ak.

We can draw a dependency graph (called Bayesian network) between features. For example, if we know ak depends on a2, then a node a2 will have an arc pointing to ak.

P(a0 ^ a1 ... ^ ak | class[j]) = P(a0 | class[j]) x P(a1 | class[j]) x .... x P(ak | a2 ^ class[j])

Now P(ak | a2 ^ class[j]) = no_of_instances_has_ak_a2_and_classJ / no_of_instances_has_a2_and_classJ (this is harder to find than Naive Bayes but still much better).