Abstract | This thesis addresses a text classification (TC) problem in a real-world system, the New Brunswick Opportunities Network (NBON), an online tendering system that helps vendors and purchasing agents provide and obtain information about business opportunities. The solution draws mainly on techniques from machine learning and natural language processing (NLP). We use a Naïve Bayes classifier, a simple and effective machine learning approach for TC tasks, to automatically classify the tenders of the NBON system. We implement three smoothing algorithms for the Naïve Bayes classifier, namely no-match, Laplace correction, and Lidstone's law of succession, and we show that the differences in accuracy among the three are negligible. We show that the Naïve Bayes classifier is more effective than three other, equally simple TC techniques: Strong Predictors (a modification of Term Frequency), TF-IDF (Term Frequency-Inverse Document Frequency), and WIDF (Weighted Inverse Document Frequency). NLP tools such as stop lists and stemmers are applied in the text operations on the historical NBON data used to train the classifiers. We experiment with variations of these tools and show that such NLP techniques have little impact on the effectiveness of a classifier. |
---|