Start from an HTML page, save the file, then crawl its links.
def collect_doc(doc)
-- return if doc was already collected
-- save doc
-- for each link in doc
---- collect_doc(link.href)
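The crawl step above could be sketched in Ruby roughly as follows. To keep the example self-contained, it walks a hypothetical in-memory hash of pages (`PAGES`) instead of fetching over the network, and it tracks visited URLs so that pages linking back to each other do not cause endless recursion. The names `PAGES`, `store` and `visited` are assumptions made for illustration.

```ruby
require "set"

# Hypothetical in-memory "web": url => { body:, links: }
PAGES = {
  "a.html" => { body: "alpha", links: ["b.html", "c.html"] },
  "b.html" => { body: "beta",  links: ["a.html"] },
  "c.html" => { body: "gamma", links: [] }
}

# Saves the page at url into store, then recursively follows its links.
def collect_doc(url, pages, store = {}, visited = Set.new)
  # Skip pages already collected and links that point nowhere.
  return store if visited.include?(url) || !pages.key?(url)
  visited << url
  store[url] = pages[url][:body]                 # "save doc"
  pages[url][:links].each do |href|              # "for each link in doc"
    collect_doc(href, pages, store, visited)
  end
  store
end

saved = collect_doc("a.html", PAGES)
```

A real crawler would replace the hash lookup with an HTTP fetch and an HTML link extractor; the traversal logic stays the same.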
Build Word Index
Build a map from each word to the docs that contain it, and to the positions where it occurs in each doc:
{
word1 => {doc1 => [pos1, pos2, pos3], doc2 => [pos4]}
word2 => {...}
}
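As a concrete illustration of this structure, here is a minimal Ruby sketch that builds such a positional index. It assumes each doc is a plain string keyed by a doc id, and that words are split on non-word characters after lowercasing; both choices are assumptions for the example, not something the structure requires.

```ruby
# Positional inverted index: word => { doc_id => [positions] }
def build_index(docs)
  # Nested default procs create the inner hash and array on first access.
  index = Hash.new { |h, word| h[word] = Hash.new { |d, doc| d[doc] = [] } }
  docs.each do |doc_id, text|
    text.downcase.scan(/\w+/).each_with_index do |word, pos|
      index[word][doc_id] << pos
    end
  end
  index
end

sample_docs = {
  "doc1" => "the quick fox",
  "doc2" => "the lazy dog the end"
}
sample_index = build_index(sample_docs)
sample_index["the"]  # => {"doc1"=>[0], "doc2"=>[0, 3]}
```

Note that with default procs, looking up a word that was never indexed creates an empty entry; use `key?` or `fetch` when you only want to read.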
def build_index(doc)
-- for each word in doc.extract_words
---- index[word][doc] << position of word
Search
Given a set of words, locate the relevant docs
A simple example ...
def search(query)
-- result = []
-- Find the most selective word within query (the one found in the fewest docs)
-- for each doc in index[most_selective_word].keys
---- if doc.has_all_words(query)
------ result << doc
-- return result
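The search above might look like this in Ruby, assuming the index has the word => {doc => [positions]} shape described earlier and the query is already split into words. "Most selective" is taken here to mean the word that appears in the fewest docs, and the `has_all_words` check is done against the index itself rather than against the doc.

```ruby
# index: word => { doc_id => [positions] }
def search(index, query_words)
  postings = query_words.map { |w| index.fetch(w, {}) }
  # No query words, or a word that appears nowhere: nothing can match.
  return [] if postings.empty? || postings.any?(&:empty?)
  # Start from the most selective word to keep the candidate list small.
  smallest = postings.min_by(&:size)
  smallest.keys.select { |doc| postings.all? { |p| p.key?(doc) } }
end

search_index = {
  "quick" => { "doc1" => [1], "doc3" => [0] },
  "fox"   => { "doc1" => [2] },
  "the"   => { "doc1" => [0], "doc2" => [0, 3], "doc3" => [1] }
}
hits = search(search_index, %w[the quick])
```

Starting from the shortest posting list means the loop touches at most as many docs as the rarest query word appears in.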
Ranking
There are many possible scoring functions, and the ranking needs to be able to combine them in a flexible way.
def scoringFunction(query, doc)
-- do something from query and doc
-- return score
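As one possible instance of such a function, the sketch below scores a doc by the fraction of its words that appear in the query. The name `word_frequency_score` and the representation of a doc as an array of lowercase words are assumptions made for this example.

```ruby
# Hypothetical scorer: share of the doc's words that are query words.
# Returns a value between 0.0 and 1.0.
def word_frequency_score(query_words, doc_words)
  return 0.0 if doc_words.empty?
  hits = doc_words.count { |w| query_words.include?(w) }
  hits.to_f / doc_words.size
end

word_frequency_score(%w[fox], %w[the quick fox])
```

Because the result is a fraction, this particular function is already confined to the 0 to 1 range.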
A scoring function can be based on word counts, PageRank, the location of the words within the doc, etc.
Scores need to be normalized so that every scoring function returns values within the same range.
Each weight is between 0 and 1, and the sum of all weights equals 1.
def score(query, weighted_functions, docs)
-- scored_docs = {} (every doc starts with a score of 0)
-- for each weighted_function in weighted_functions
---- for each doc in docs
------ scored_docs[doc] += (weighted_function[:func](query, doc) * weighted_function[:weight])
-- return scored_docs
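A runnable Ruby sketch of this weighted combination is shown below. It adds one step the pseudocode leaves open: each function's raw scores are rescaled to the 0 to 1 range with min-max normalization before the weights are applied. Min-max is only one way to normalize and is an assumption here, as are the two sample scoring lambdas `count` and `brevity`.

```ruby
# Rescale raw scores to the 0..1 range (min-max normalization).
def normalize(raw)
  min, max = raw.values.minmax
  return raw.transform_values { 0.0 } if max == min
  raw.transform_values { |v| (v - min).to_f / (max - min) }
end

# Returns [doc, score] pairs, best score first.
def score(query, weighted_functions, docs)
  totals = Hash.new(0.0)
  weighted_functions.each do |wf|
    raw = docs.to_h { |doc| [doc, wf[:func].call(query, doc)] }
    normalize(raw).each { |doc, s| totals[doc] += s * wf[:weight] }
  end
  totals.sort_by { |_, s| -s }
end

docs = [%w[the quick fox], %w[fox fox fox den], %w[lazy dog]]
count   = ->(q, d) { d.count { |w| q.include?(w) } }  # query word hits
brevity = ->(_q, d) { -d.size }                       # hypothetical: shorter is better
ranked = score(%w[fox],
               [{ func: count, weight: 0.7 }, { func: brevity, weight: 0.3 }],
               docs)
```

Keeping the functions and weights in a plain list makes it easy to add, remove or reweight scoring functions without touching `score` itself.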