Pragmatic Programming Techniques: Search Engine Basics

Receive the question of "how search works ?" couple times recently so try to document the whole process. This is intended to highlight the key concepts but not specific implementation details, which will be much more complicated and sophisticated than this one.

A very basic search engine includes a number of processing phases.

Crawling: to discover the web pages on the internet
Indexing: to build an index to facilitate query processing
Query Procesisng: Extract the most relevant page based on user's query terms
Ranking: Order the result based on relevancy

Notice that each element in the above diagram reflects a logical function unit but not its physical boundary. For example, the processing unit in each orange box is in fact executed across many machines in parallel. Similarly, each of the data store element is spread physically across many machines based on the key partitioning.

Vector Space Model

Here we use the "Vector Space Model" where each document is modeled as a multi-dimensional vector (each word represents a dimension). If we put all documents together, we form a matrix where the rows are documents and columns are words, and each cell contains the TF/IDF value of the word within the document.

To determine the similarity between 2 documents, we can apply the dot product between 2 documents and the result will represents the degree of similarity.

Crawler

Crawler's job is to collect web pages on the internet, it is typically done by a farm of crawlers, who do the following

Start from a set of seed URLs, repeat following ...

Pick the URL that has the highest traversal priority.
Download the page content from the URLs to the content repository (which can be a distributed file system, or DHT), as well as update the entry in the doc index
Discover new URL links from the download pages. Add the link relationship into the link index and add these links to the traversal candidates
Prioritize the traversal candidates

The content repository can be any distributed file system, here lets say it is a DHT.

There are a number of considerations.

How to make sure different Crawlers are working on different set of contents (rather than crawling the same page twice) ? When the crawler detects overlapping is happening (url is already exist in the page repository with pretty recent time), the crawler will skip the processing on this URL and pick up the next best URL to crawl.
How does the crawler determines which is the next candidate to crawl ? We can use a heuristic algorithm based on some utility function (e.g. we can pick the URL candidate which has the highest page rank score)
How frequent do we re-crawl ? We can track the rate of changes of the page to determine the frequency of crawling.

Indexer

The Indexer's job is to build the inverted index for the query processor to serve the online search requests.

First the indexer will build the "forward index"

The indexer will parse the documents from the content repository into a token stream.
Build up a "hit list" which describe each occurrence of the token within the document (e.g. position in the doc, font size, is it a title, archor text ... etc).
Apply various "filters" to the token stream (like stop word filters to remove words like "a", "the", or a stemming filter to normalize words "happy", "happily", "happier" into "happy")
Compute the term frequency within the document.

From the forward index, the indexer will proceed to build a reverse index (typically through a Map/Reduce mechanism). The result will be keyed by word and stored in a DHT.

Ranker

Ranker's job is to compute the rank of a document, based on how many in-links pointing to the document as well as the rank of the referrers (hence a recursive definition). Two popular ranking algorithms including the "Page Rank" and "HITs".

Page Rank Algorithm

Page rank is a global rank mechanism. It is precomputed upfront and is independent of the query

HITS Algorithm

In HITS, every page is playing a dual role: "hub" role and "authority" role. It has two corresponding ranks on these two roles. Hub rank measures the quality of the outlinks. A good hub is one that points to many good authorities. Authority ranks measures the quality of my content. A good authority is one that has many good hubs pointing to.

Notice that HITS doesn't pre-compute the hub and authority score. Instead it invoke a regular search engine (which only do TF/IDF matches but not ranking) to get a set of initial results (typically with a predefined fix size) and then expand this result set by tracing the outlinks into the expand result set. It also incorporate a fix size of inlinks (by sampling the inlinks into the initial result set) into the expanded result set. After this expansion, it runs an iterative algorithm to compute the authority ranks and hub ranks. And use the combination of these 2 ranks to calculate the ultimate rank of each page, usually pages with high hub rank will weight more than high authority rank.

Notice that the HITS algorithm is perform at query time and not pre-computed upfront. The advantage of HITS is that it is sensitive to the query (as compare to PageRank which is not). The disadvantage is that it perform ranking per query and hence expensive.

Query Processor

When user input a search query (containing multiple words), the query will be treated as a "query document". Relevancy is computed and combined with the rank of the document and return an ordered list of result.

There are many ways to compute the relevancy. We can consider only the documents that contains all the terms specified in the query. In this model, we search for each term (with the query) a list of document id and then do an intersection with them. If we order the document list by the document id, the intersection can be computed pretty efficiently.

Alternatively, we can return the union (instead of intersection) of all document and order them by a combination of the page rank TF/IDF score. Document that have more terms intersecting with the query will have a higher TF/IDF score.

In some cases, an automatic query result feedback loop can be used to improve the relevancy.

In first round, the search engine will perform a search (as described above) based on user query
Construct a second round query by expanding the original query with additional terms found in the return documents which has high rank in the first round result
Perform a second round of query and return the result.

Outstanding Issues

Fighting the spammer is a continuous battle in search engine. Because of the financial value of being shown up in the first page of search result. Many spammers try to manipulate their page. Earlier attempt is to modify a page to repeat the terms many many times (trying to increase the TF/IDF score). The evolution of Page rank has mitigate this to some degree because page rank in based on "out-of-page" information that the site owner is much harder to manipulate.

But people use Link-farms to game the page rank algorithms. The ideas is to trade links between different domains. There is active research in this area about how to catch these patterns and discount their ranks

4 comments:

Mehdi Mousavi said...: Hi,
That was a good intro. However, can you introduce a book that covers an in-depth look at the searching mechanisms?

Keep up the good work,

Mehdi Mousavi; March 7, 2010 at 5:46 AM
SEO Reseller said...: The Crawler, a White Label SEO Reseller's most anticipated visitor, Thanks for the Logic Diagram, but I would have to echo the suggestion of the one before me, if you could make an e-book about it, that would be greatly appreciated. That would make for an interesting read.; June 7, 2010 at 9:28 PM
Deepak said...: Wonderful work Sir. Looking forward for good stuff; June 10, 2010 at 6:40 PM
systemmechanic11 said...: You have chosen a wonderful topic. Sir Could we call it "Data Flow Programming" also.
Thanks; June 7, 2011 at 10:02 PM

Monday, March 1, 2010

Search Engine Basics

4 comments: