The 13th International Conference on Applications of Natural Language to Information Systems (NLDB 2008), June 24-27, 2008, London, United Kingdom
focused crawling; Web search; feature selection; MEMMs
Focused web crawling traverses the Web to collect documents on a specific topic. This is a difficult task, because a focused crawler must identify the most promising link to follow next, based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler. The framework takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to combine content analysis with link structure to capture sequential patterns leading to targets. Experimental results on Web data show that the MEMM-based focused crawler is, in general, very competitive against Best-First crawling in terms of two metrics: Precision and Maximum Average Similarity.
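To make the link-selection step concrete, the following is a minimal Python sketch of a best-first crawl frontier that scores each candidate link from its anchor text and URL keywords. It is an illustration only, not the MEMM framework of the paper: the names `tokenize`, `link_features`, `cosine`, and `Frontier`, and the use of cosine similarity against a bag of topic terms, are assumptions introduced here.

```python
import heapq
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def link_features(anchor_text, url):
    """Bag of words built from the anchor text and the keywords in the URL."""
    return Counter(tokenize(anchor_text)) + Counter(tokenize(url))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Frontier:
    """Hypothetical best-first frontier: the highest-scoring link is popped first."""

    def __init__(self, topic_terms):
        self.topic = Counter(tokenize(topic_terms))
        self._heap = []      # entries are (-score, url); heapq is a min-heap
        self._seen = set()   # URLs already queued, to avoid duplicates

    def add(self, anchor_text, url):
        if url in self._seen:
            return
        self._seen.add(url)
        score = cosine(link_features(anchor_text, url), self.topic)
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawler built on this sketch would call `add` for every outgoing link of a fetched page and `pop` to choose the next page to fetch. A sequential model such as an MEMM would replace the per-link similarity score with one that also depends on the path of pages leading to the link.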