Results on Public Datasets
Use the Galileo Sandbox environment to explore the enterprise-grade data quality platform
Galileo helps you discover insights and errors in your training dataset within minutes, not days! You can confidently ditch Excel sheets and ad hoc Python scripts, and skip the cumbersome detective work of exploratory dataset analysis.
We trained a pretrained DistilBERT model (until convergence) on four popular public datasets across two tasks (described in the table below). We then used Galileo to inspect, discover, and fix dataset errors using insights surfaced in the UI.
Go ahead and test drive the Galileo Sandbox environment with these datasets. Feel free to follow along with our provided insights (see the table below).
Dataset / Task
Movie Reviews / Sentiment Classification
SST2 (Stanford Sentiment Treebank) dataset for sentiment classification of IMDb movie reviews with
Conversational AI / Intent Classification
Multi-class intent classification dataset for conversational AI. The training set has ~13K samples and the test set has ~700 samples, each query belonging to one of 7 classes.
Product Reviews / Sentiment Classification
Dataset for binary sentiment classification of Amazon reviews in either
Banking Call Center / Intent Classification
Dataset for multi-class intent classification for call-center banking queries. The training set has ~10K samples and the test set has ~3K samples, each query belonging to one of 77 classes.