A Dataset and Baseline for E-commerce Product Categorization

Authors: Yiu-Chang Lin, Pradipto Das, Andrew Trotman, Surya Kallumadi


We make available a document collection of one million product titles from 3,008 anonymized categories of the product catalog. The categories were anonymized because of intellectual property rights on the underlying data organization taxonomy. Our analysis of the characteristics of the 800,000 training and 20,000 validation titles shows that they match those of the test set of 180,000 titles. Twenty-six independent teams participated in an automatic product categorization challenge on this dataset. We present results and analysis, and suggest strong baselines for this collection and task.

