Automatic Data Insights on your Text Classification or NER dataset
Automatically gets insights on a text classification or NER dataset.
Given a pandas dataframe, a file path, or a huggingface dataset path, this function loads the data, trains a huggingface transformer model, and provides Galileo insights via a link to the Galileo Console.
One of hf_data or train_data should be provided. If neither is, Galileo will load a demo dataset for training.


  • Parameters
    • hf_data (Union[DatasetDict, str, None]) -- Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
    • hf_inference_names (Optional[List[str]]) -- Use this param alongside hf_data if it contains splits you'd like to treat as inference. A list of key names in hf_data to be run as inference runs after training. Every key listed must exist in hf_data.
    • train_data (Union[DataFrame, Dataset, str, None]) -- Optional training data to use. Can be one of: a pandas dataframe, a huggingface dataset, a path to a local file, or a huggingface dataset hub path.
    • val_data (Union[DataFrame, Dataset, str, None]) -- Optional validation data to use. The validation data serves as the evaluation dataset in huggingface and is what early stopping is based on. If it is not provided but test_data is, test_data will be used as the evaluation set. If neither val nor test data is available, the train data will be randomly split 80/20 to create evaluation data. Can be one of: a pandas dataframe, a huggingface dataset, a path to a local file, or a huggingface dataset hub path.
    • test_data (Union[DataFrame, Dataset, str, None]) -- Optional test data to use. If provided together with val_data, the test data is used after training completes, as the held-out set. If no validation data is provided, it is used as the evaluation set instead. Can be one of: a pandas dataframe, a huggingface dataset, a path to a local file, or a huggingface dataset hub path.
    • inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) -- Use this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, use the hf_inference_names param instead. Optional inference datasets to run after training completes. The structure is a dictionary whose keys are inference names and whose values are one of: a pandas dataframe, a huggingface dataset, a path to a local file, or a huggingface dataset hub path.
    • max_padding_length (int) -- The max length for padding the input text during tokenization. Default 200
    • hf_model (str) -- The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased
    • num_train_epochs (int) -- The number of epochs to train for (early stopping will always be active). Default 15
    • labels (Optional[List[str]]) -- Optional list of labels for this dataset. If not provided, the function will attempt to extract them from the data
    • project_name (Optional[str]) -- Optional project name. If not set, a random name will be generated
    • run_name (Optional[str]) -- Optional run name for this data. If not set, a random name will be generated
    • wait (bool) -- Whether to wait for Galileo to complete processing your run. Default True
    • create_data_embs (Optional[bool]) -- Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset, and they will be uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column, or via dq.metrics.get_dataframe(..., include_data_embs=True) in the data_emb column. Only available for TC currently. NER coming soon. Default True if a GPU is available, else default False.
  • Return type
You should use either the huggingface params (hf_data and optionally hf_inference_names) or the non-huggingface params (train_data, test_data, val_data, and optionally inference_data), not both.
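The requirement that every name in hf_inference_names exist as a key in hf_data can be checked before starting a run. The sketch below is illustrative only: missing_inference_keys is a hypothetical helper, not part of dataquality, and a plain dict with made-up split names stands in for a DatasetDict.

```python
from typing import Iterable, List, Mapping


def missing_inference_keys(
    hf_data: Mapping[str, object], hf_inference_names: Iterable[str]
) -> List[str]:
    """Return the inference names that are not keys of hf_data.

    Hypothetical helper for illustration; not part of dataquality.
    """
    return [name for name in hf_inference_names if name not in hf_data]


# A plain dict stands in for a DatasetDict; the split names are made up.
splits = {"train": None, "validation": None, "unlabeled": None}

# An empty result means every requested inference split exists.
problems = missing_inference_keys(splits, ["unlabeled", "prod"])
```

Here "prod" would be reported as missing, so the list of names should be corrected before it is passed as hf_inference_names.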
If you are using the train_data, test_data, and val_data params, note the following:
  • train_data will be used for training the model
  • val_data (if available) will be used as the evaluation set during training. If it is not provided, test_data will be used
  • If val_data and test_data are both provided, test_data will be treated as the held-out set and will be run over after training to view in the Galileo console
If only the training split is provided, it will be randomly split 80/20 to create a validation dataset for training.
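The split rules above can be sketched with pandas. This is a minimal illustration under stated assumptions, not dataquality's internal logic: resolve_splits and its seed argument are hypothetical, and the input dataframe is assumed to have a unique index.

```python
from typing import Optional, Tuple

import pandas as pd


def resolve_splits(
    train: pd.DataFrame,
    val: Optional[pd.DataFrame] = None,
    test: Optional[pd.DataFrame] = None,
    seed: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]:
    """Return (train, evaluation, held_out) following the documented rules.

    Hypothetical helper; the library's own implementation may differ.
    """
    if val is not None:
        # val is the evaluation set; test (if given) is the held-out set.
        return train, val, test
    if test is not None:
        # No val: test becomes the evaluation set and nothing is held out.
        return train, test, None
    # Only train: carve out a random 20% as the evaluation set.
    eval_df = train.sample(frac=0.2, random_state=seed)
    train_df = train.drop(eval_df.index)
    return train_df, eval_df, None


df = pd.DataFrame({"text": [f"sample {i}" for i in range(10)], "label": [0, 1] * 5})
train_df, eval_df, held_out = resolve_splits(df)
```

With ten rows and only a training split, eight rows remain for training and two are used for evaluation.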

Inference data

Along with training data, you can pass in inference data and see the results in Galileo under the various inference splits. The format is a dictionary, where the key is the inference name and the value is the dataframe. For text classification, the text column is required. For NER, the tokens column is required.
If you are using huggingface data (the hf_data param), use the hf_inference_names param instead: a list of keys in your DatasetDict that map to your inference Datasets.
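As a rough sketch of the dictionary format, the snippet below builds an inference_data dict and checks each dataframe for the column the text above requires. The helper invalid_inference_splits, the task labels "text_classification" and "ner", and the split names are all assumptions made for illustration; none of them come from dataquality.

```python
from typing import Dict, List

import pandas as pd

# Required column per task, as stated above. The task labels are hypothetical.
REQUIRED_INFERENCE_COLUMN = {"text_classification": "text", "ner": "tokens"}


def invalid_inference_splits(
    inference_data: Dict[str, pd.DataFrame], task: str
) -> List[str]:
    """Return the names of inference splits missing the required column."""
    required = REQUIRED_INFERENCE_COLUMN[task]
    return [
        name for name, df in inference_data.items() if required not in df.columns
    ]


# Keys are inference names; values are dataframes.
inference_data = {
    "week_1": pd.DataFrame({"text": ["where is the nearest station"]}),
    "week_2": pd.DataFrame({"body": ["this column is misnamed"]}),
}

bad_splits = invalid_inference_splits(inference_data, "text_classification")
```

The "week_2" split would be flagged because its text lives in a column named body rather than text.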

Data Format Requirements

For text classification datasets, the only required columns are text and label
For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example:
tokens                                                ner_tags
[what, is, a, good, action, movie, that, is, r...     [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...     [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...     [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...     [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...     [0, 0, 0, 7, 0, 0, ...
...                                                   ...
NOTE: All other columns in the dataset are automatically uploaded as "metadata" columns
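To make the two formats concrete, the sketch below builds one small dataframe per task and separates the required columns from the extra ones that would be treated as metadata. describe_columns and the source column are hypothetical and exist only for illustration. The NER row is a shortened version of a row from the example table.

```python
from typing import List, Tuple

import pandas as pd


def describe_columns(df: pd.DataFrame) -> Tuple[str, List[str]]:
    """Guess the task from the columns and list the remaining (metadata) columns.

    Hypothetical helper; not part of dataquality.
    """
    cols = list(df.columns)
    if "text" in cols and "label" in cols:
        task, required = "text_classification", {"text", "label"}
    elif "tokens" in cols and ("tags" in cols or "ner_tags" in cols):
        task, required = "ner", {"tokens", "tags", "ner_tags"}
    else:
        raise ValueError("Expected text/label or tokens/tags (ner_tags) columns")
    metadata = [c for c in cols if c not in required]
    return task, metadata


# Text classification: text and label are required; source is extra metadata.
tc_df = pd.DataFrame(
    {
        "text": ["what is a good action movie", "show me political drama movies"],
        "label": ["question", "request"],
        "source": ["web", "app"],
    }
)

# NER: one list of tokens and one list of integer tags per row.
ner_df = pd.DataFrame(
    {
        "tokens": [["show", "me", "political", "drama", "movies"]],
        "ner_tags": [[0, 0, 7, 8, 0]],
    }
)

tc_task, tc_metadata = describe_columns(tc_df)
ner_task, ner_metadata = describe_columns(ner_df)
```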


An example using auto with a hosted huggingface text classification dataset
import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")
Similarly, for NER
import dataquality as dq

dq.auto(hf_data="conll2003")
An example using auto with sklearn data as pandas dataframes
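import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)
# Train on the dataframes; the label names come from the sklearn dataset
dq.auto(
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
)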
An example of using auto with a local CSV file with text and label columns
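import dataquality as dq
# "train.csv" is a placeholder path; point it at your own CSV with text and label columns
dq.auto(train_data="train.csv")
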
Get started with a notebook