Exploiting Multiple Features with MEMMs for Focused Web Crawling

From National Research Council Canada

Author	Liu, H.; Milios, E.; Korba, Larry
Format	Text, Article
Conference	The 13th International Conference on Applications of Natural Language to Information Systems (NLDB 2008), June 24-27, 2008, London, United Kingdom
Subject	focused crawling; Web search; feature selection; MEMMs
Abstract	Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since focused crawlers need to identify the next most promising link to follow based on the topic and the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler to take advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat the focused web crawling problem as a sequential task and use a combination of content analysis and link structure to capture sequential patterns leading to targets. The experimental results showed that focused crawling using MEMMs is, in general, very competitive with Best-First crawling on Web data in terms of two metrics: Precision and Maximum Average Similarity.
Publication date	2008
In	Proceedings of the 13th International Conference on Applications of Natural Language to Information Systems (NLDB 2008), June 24-27, 2008, London, United Kingdom.
Language	English
NRC number	NRCC 50373
NPARC number	5765089
Record identifier	32528c1e-e4f6-40ce-ba06-414d5bd7f94c
Record created	2009-03-29
Record modified	2020-08-12

Date modified: 2024-12-22