Classification Problem
Observing an instance x, determine its class (e.g. given an email, determine whether it is spam).
Solution Approach
Based on probability theory, the solution is the class[j] that is most probable given the observation x.
Also, x can be represented by a set of observed features (a set of predicates), i.e. x = a0 ^ a1 ^ a2 ... ^ ak.
For each class[j], we want the j that maximizes P(class[j] | x).
Also assume we have already gone through a learning stage where many (x, class) pairs have been observed.
Note that X is a huge space, so it is unlikely that we have seen this exact x during training. Therefore, apply Bayes' theorem:
P(class[j] | x) = P(x | class[j]) * P(class[j]) / P(x)
Since P(x) is the same for all j, we can remove P(x).
Find j to maximize: P(a0 ^ a1 ... ^ ak | class[j]) * P(class[j])
P(class[j]) = no_of_training_instances_whose_class_equals_classJ / total_no_of_training_instances. (This is easy to find).
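For example (with made-up numbers), if 40 of 100 training emails are labeled spam, then P(spam) = 40 / 100 = 0.4.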
Now P(a0 ^ a1 ... ^ ak | class[j]) is very hard to find because you probably have not encountered this exact combination during training.
Let's say we have some domain knowledge and we understand the dependency relationships among a0, a1 ... Then we can make some simplifying assumptions.
Naive Bayes
So if we know a0, a1 ... ak are conditionally independent of each other given class == class[j], then
P(a0 ^ a1 ... ^ ak | class[j]) is the same as P(a0 | class[j]) x P(a1 | class[j]) x .... x P(ak | class[j])
Now P(ak | class[j]) = no_of_instances_has_ak_and_classJ / no_of_instances_has_classJ (This is easy to find)
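As a minimal sketch of this estimate (the counts, and the feature "free", below are hypothetical), each factor can be read straight off the training counts:

instances_with_class = { "spam" => 40, "nonspam" => 60 }
instances_with_feature_and_class = Hash.new(0)
instances_with_feature_and_class[["free", "spam"]] = 25

def p_given_class(feature, klass, joint_counts, class_counts)
  # P(ak | class[j]) = no_of_instances_has_ak_and_classJ / no_of_instances_has_classJ
  joint_counts[[feature, klass]].to_f / class_counts[klass]
end

p_given_class("free", "spam",
              instances_with_feature_and_class,
              instances_with_class)   # => 0.625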
Spam Filtering Example
x is an email. class[0] = spam, class[1] = non-spam.
Let's break down the observed instance x into a vector of words (see the tokenization sketch after the list below).
- x = ["Hello", "there", "welcome", .....]
- a0 is position[0] == "Hello"
- a1 is position[1] == "there"
- a2 is position[2] == "welcome"
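A minimal tokenization sketch (the whitespace-based split below is an assumption; real spam filters use more careful preprocessing):

# Turn a raw email body into the word vector x (hypothetical split on whitespace).
body = "Hello there welcome to our meeting tomorrow"
x = body.split(/\s+/)
# => ["Hello", "there", "welcome", "to", "our", "meeting", "tomorrow"]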
We assume the word appearing at each position is independent of the words at the other positions, once the class is known.
Therefore, we compare:
- P(a0 ^ a1 ... ^ ak | spam) * P(spam)
- P(a0 ^ a1 ... ^ ak | nonspam) * P(nonspam)
P(a0 ^ a1 ... ^ ak | spam) is the same as:
P(pos[0] == "Hello" | spam) x P(pos[1] == "there" | spam) x .... x P(ak | spam)
P(pos[0] == "Hello" | spam) = no_of_hello_in_spam_email / total_words_in_spam_email
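For example (with made-up counts), if "Hello" occurs 50 times among 10,000 total words in the spam training emails, the estimate is P(pos[0] == "Hello" | spam) = 50 / 10000 = 0.005. In the code below, this estimate is smoothed by adding 1 to the word count and the dictionary size to the denominator (Laplace smoothing), so a word never seen in a class does not force the whole product to zero.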
Algorithm
class NaiveBayes
  def initialize(word_dictionary)
    @word_dictionary = word_dictionary
    @total_instances = 0
    @class_count = Hash.new(0)                                     # training instances per class
    @total_word = Hash.new(0)                                      # total words seen per class
    @word_count_by_class = Hash.new { |h, c| h[c] = Hash.new(0) }  # per-class word counts
  end

  def learn(doc, klass)
    @total_instances += 1
    @class_count[klass] += 1
    doc.each do |word|
      @word_count_by_class[klass][word] += 1
      @total_word[klass] += 1
    end
  end

  def classify(doc)
    max_prob = 0.0
    max_class = nil
    ["spam", "nonspam"].each do |klass|
      # Start from the prior P(class).
      prob = @class_count[klass].to_f / @total_instances
      doc.each do |word|
        # Laplace smoothing: +1 on the word count and +dictionary size on the
        # denominator, so an unseen word does not zero out the whole product.
        prob *= (@word_count_by_class[klass][word] + 1.0) /
                (@total_word[klass] + @word_dictionary.length)
      end
      # (For long documents, multiplying many small numbers underflows;
      # summing log probabilities instead is the usual fix.)
      if prob > max_prob
        max_prob = prob
        max_class = klass
      end
    end
    max_class
  end
end
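As a quick, hypothetical usage sketch (the toy dictionary and training documents below are made up for illustration):

dictionary = ["hello", "there", "welcome", "free", "viagra", "meeting", "tomorrow"]
nb = NaiveBayes.new(dictionary)

nb.learn(["free", "viagra", "hello"], "spam")
nb.learn(["hello", "there", "meeting", "tomorrow"], "nonspam")

puts nb.classify(["free", "hello"])         # prints "spam"
puts nb.classify(["meeting", "tomorrow"])   # prints "nonspam"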
Bayesian Network
Sometimes, assuming complete independence is too extreme. We can relax this assumption by allowing some dependencies among a0, a1 ... ak.
We can draw a dependency graph (called a Bayesian network) over the features. For example, if we know ak depends on a2, then the node a2 will have an arc pointing to ak.
P(a0 ^ a1 ... ^ ak | class[j]) = P(a0 | class[j]) x P(a1 | class[j]) x .... x P(ak | a2 ^ class[j])
Now P(ak | a2 ^ class[j]) = no_of_instances_has_ak_a2_and_classJ / no_of_instances_has_a2_and_classJ
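A minimal sketch of that estimate, again with hypothetical counts:

# Hypothetical counts for the case where ak depends on a2, within class[j].
no_of_instances_has_a2_and_classJ    = 30
no_of_instances_has_ak_a2_and_classJ = 18

p_ak_given_a2_and_class = no_of_instances_has_ak_a2_and_classJ.to_f /
                          no_of_instances_has_a2_and_classJ   # => 0.6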