PUBLICATIONS
Web-Scale Language-Independent Cataloging of Noisy Product Listings for E-Commerce
ABSTRACT
The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search recommendations to query understanding. However, manual and rule-based approaches to categorization are not scalable. In this paper, we compare several classifiers for categorizing listings in both English and Japanese product catalogs. We show empirically that a combination of words from product titles, navigational breadcrumbs, and list prices, when available, improves results significantly. We outline a novel method using correspondence topic models and a lightweight manual process to reduce noise from mislabeled data in the training set. We contrast linear models, gradient boosted trees (GBTs), and convolutional neural networks (CNNs), and show that GBTs and CNNs yield the highest gains in error reduction. Finally, we show that GBTs applied in a language-agnostic way to a large-scale Japanese e-commerce dataset improve taxonomy categorization performance over the current state of the art, which is based on deep belief network models.
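As a rough illustration of the feature combination described in the abstract (title words, breadcrumb words, and list price when available), the following minimal sketch builds one sparse feature dictionary per listing. It is not the paper's implementation: the function names, the field prefixes, the punctuation-based tokenizer, and the log-scale price bucketing are all assumptions made for this example. In particular, splitting on non-word characters is only a stand-in, and Japanese titles would need a proper word segmenter.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of non-word characters (hypothetical tokenizer)."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def price_bucket(price):
    """Map a list price to a coarse log10 bucket (assumed scheme, not from the paper)."""
    if price is None or price <= 0:
        return "price=missing"
    return "price=1e%d" % int(math.floor(math.log10(price)))


def listing_features(title, breadcrumb=None, price=None):
    """Build a sparse feature dict, prefixing each feature with its source field."""
    feats = Counter()
    for tok in tokenize(title):
        feats["title:" + tok] += 1
    if breadcrumb:  # breadcrumbs are optional, as in "when available"
        for tok in tokenize(breadcrumb):
            feats["crumb:" + tok] += 1
    feats[price_bucket(price)] += 1  # always emit exactly one price feature
    return dict(feats)


example = listing_features("USB-C Cable 2m", "Electronics > Cables", 9.99)
```

Dictionaries of this shape could then be vectorized and passed to any of the classifier families the abstract compares. Keeping a separate prefix per field lets a model treat the same word differently depending on whether it appears in the title or in the breadcrumb.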