PUBLICATIONS

A Dataset and Baseline for e-commerce Product categorization

Authors: Yiu-Chang Lin, Pradipto Das, Andrew Trotman, Surya Kallumadi

Sep 2019

ICTIR - 2019 ACM SIGIR International Conference on the Theory of Information Retrieval

ABSTRACT

We make available a document collection of one million product titles from 3,008 anonymized categories of the rakuten.com product catalog. The categories were anonymized because of intellectual property rights on the underlying data organization taxonomy. Our analysis of the 800,000 training and 20,000 validation titles shows that their characteristics match those of the test set of 180,000 titles. Twenty-six independent teams participated in an automatic product categorization challenge on this dataset. We present results and analysis, and suggest strong baselines for this collection and task.

Research Areas : #Language Program #Machine Learning

Tags : #Machine Learning #Natural Language Processing