Large-Scale Taxonomy Categorization for Noisy Product Listings
E-commerce catalogs include a continuously growing number of products that are constantly updated. Each item in a catalog is characterized by several attributes and identified by a taxonomy label. Categorizing products with their taxonomy labels is fundamental to effectively search and organize listings in a catalog. However, manual and/or rule based approaches to categorization are not scalable. In this paper, we compare several classifiers to product taxonomy categorization of top-level categories. We first investigate a number of feature sets and observe that a combination of word unigrams from product names and navigational breadcrumbs work best for categorization. Secondly, we apply correspondence topic models to detect noisy data and introduce a lightweight manual process to improve dataset quality. Finally, we evaluate linear models, gradient boosted trees (GBTs) and convolutional neural networks (CNNs) with pre-trained word embeddings demonstrating that, compared to other baselines, GBTs and CNNs yield the highest gains in error reduction.