Sunday, December 9, 2007

Search Engine and Ranking

Crawling

Start from an HTML, save the file, crawl its links

def collect_doc(doc)
-- save doc
-- for each link in doc
---- collect_doc(link.href)

Build Word Index

Build a set of tuples

{
word1 => {doc1 => [pos1, pos2, pos3], doc2 => [pos4]}
word2 => {...}
}

def build_index(doc)
-- for each word in doc.extract_words
---- index[word][doc] << style="font-weight: bold;" size="5">Search

Given a set of words, locate the relevant docs

A simple example ...

def search(query)
-- Find the most selective word within query
-- for each doc in index[most_selective_word].keys
---- if doc.has_all_words(query)
------ result << doc
-- return result


Ranking

There are many scoring functions and the result need to be able to combine these scoring functions in a flexible way

def scoringFunction(query, doc)
-- do something from query and doc
-- return score

Scoring function can be based on word counts, page rank, or where the location of words within the doc ... etc.
Scoring need to be normalized, say within the same range.
weight is between 0 to 1 and the sum of all weights equals to 1

def score(query, weighted_functions, docs)
-- scored_docs = []
-- for each weighted_function in weighted_functions
---- for each doc in docs
------ scored_docs[doc] += (weight_function[:func](query, doc) * weight_function[:weight])
-- return scored_docs.sort(:rank)

No comments: