A Dataset and Baseline for E-commerce Product Categorization
We make available a document collection of one million product titles from 3,008 anonymized categories of the rakuten.com product catalog. The categories are anonymized because of intellectual property rights on the underlying data organization taxonomy. Our analysis shows that the characteristics of the 800,000 training and 20,000 validation titles match those of the 180,000-title test set. Twenty-six independent teams participated in an automatic product categorization challenge on this dataset. We present results and analysis, and suggest strong baselines for this collection and task.