dq.auto
Automatic Data Insights on your Text Classification or NER dataset
Automatically gets insights on a text classification or NER dataset.
Given either a pandas dataframe, a path to a local file, or a huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console.
One of hf_data or train_data should be provided. If neither is, a demo dataset will be loaded by Galileo for training.

- Parameters
- hf_data (Union[DatasetDict, str, None]) -- Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
- hf_inference_names (Optional[List[str]]) -- Use this param alongside hf_data if you have splits you'd like to consider as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data.
- train_data (Union[DataFrame, Dataset, str, None]) -- Optional training data to use. Can be one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path.
- val_data (Union[DataFrame, Dataset, str, None]) -- Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path.
- test_data (Union[DataFrame, Dataset, str, None]) -- Optional test data to use. The test data, if provided alongside val_data, will be used after training is complete as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path.
- inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) -- Use this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, use the hf_inference_names param instead. Optional inference datasets to run with after training completes. The structure is a dictionary, with the key being the inference name and the value one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path.
- max_padding_length (int) -- The max length for padding the input text during tokenization. Default 200.
- hf_model (str) -- The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased.
- num_train_epochs (int) -- The number of epochs to train for (early stopping will always be active). Default 15.
- labels (Optional[List[str]]) -- Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data.
- project_name (Optional[str]) -- Optional project name. If not set, a random name will be generated.
- run_name (Optional[str]) -- Optional run name for this data. If not set, a random name will be generated.
- wait (bool) -- Whether to wait for Galileo to complete processing your run. Default True.
- create_data_embs (Optional[bool]) -- Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset, and they will be uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings (in the emb column) or dq.metrics.get_dataframe(..., include_data_embs=True) (in the data_emb column). Only available for text classification currently; NER coming soon. Default True if a GPU is available, else False.
- Return type: None
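If you just want to see auto in action without any data of your own, note that (per the above) calling it with no arguments will have Galileo load a demo dataset for training:

```python
import dataquality as dq

# No hf_data or train_data provided: Galileo loads a demo dataset
dq.auto()
```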
You should either be using the huggingface params (hf_data and optionally hf_inference_names) or the non-huggingface params (train_data, test_data, val_data, and optionally inference_data).

If you are using the train_data, test_data, and val_data params, note the following:
- train_data will be used for training the model
- val_data (if available) will be used as the evaluation set during training. If it is not provided, test_data will be used
- If val_data and test_data are both provided, test_data will be treated as the held-out set and will be run over after training to view in the Galileo console
If only the training split is provided, it will be randomly split 80/20 to create a validation dataset for training.
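For example, a minimal sketch passing all three splits as local files (the file names here are hypothetical):

```python
import dataquality as dq

# val.csv drives evaluation and early stopping during training;
# test.csv is the held-out set, run over after training completes
dq.auto(
    train_data="train.csv",
    val_data="val.csv",
    test_data="test.csv",
)
```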
Along with training data, you can pass in inference data and see the results in Galileo under the various inference splits (see the final example on this page). The format is a dictionary, where the key is the inference name and the value is the dataframe. For text classification, the text column is required. For NER, the tokens column is required.

If you are using huggingface data (the hf_data param), use hf_inference_names instead, which is a list of keys that should exist in your DatasetDict and map to your inference Datasets.

For text classification datasets, the only required columns are text and label.
For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies

```
tokens                                               ner_tags
[what, is, a, good, action, movie, that, is, r...    [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...    [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...    [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...    [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...    [0, 0, 0, 7, 0, 0, ...
...                                                   ...
```
NOTE: All other columns in the dataset are automatically uploaded as "metadata" columns.
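As a sketch of that behavior, an extra column on a training dataframe (here a hypothetical source column, with toy data invented for illustration) will appear as a metadata column in the console:

```python
import dataquality as dq
import pandas as pd

df_train = pd.DataFrame(
    {
        "text": ["what a great movie", "terrible plot"],
        "label": ["positive", "negative"],
        "source": ["forum", "email"],  # uploaded as a "metadata" column
    }
)
dq.auto(train_data=df_train)
```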
An example using auto with a hosted huggingface text classification dataset:

```python
import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")
```
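If your DatasetDict also contains splits you want scored as inference runs after training, pass their key names via hf_inference_names. A sketch, assuming a hypothetical "challenge" split exists in the dataset:

```python
import dataquality as dq

# "challenge" must be a key in the DatasetDict; it will be run
# as an inference split after training completes
dq.auto(
    hf_data="rungalileo/trec6",
    hf_inference_names=["challenge"],
)
```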
Similarly, for NER:

```python
import dataquality as dq

dq.auto(hf_data="conll2003")
```
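An in-memory NER dataset works too. A minimal sketch using a toy huggingface Dataset in the standard tokens/ner_tags format (data invented for illustration; depending on your data you may also need the labels param if tag names cannot be inferred):

```python
import dataquality as dq
from datasets import Dataset

# Toy training split in the standard huggingface NER format
train_ds = Dataset.from_dict(
    {
        "tokens": [
            ["show", "me", "action", "movies"],
            ["list", "a", "crime", "film"],
        ],
        "ner_tags": [[0, 0, 7, 0], [0, 0, 7, 0]],
    }
)
dq.auto(train_data=train_ds)
```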
An example using auto with sklearn data as pandas dataframes:

```python
import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

dq.auto(
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data",
)
```
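Once the run has been processed, you can pull the results back with dq.metrics, as mentioned in the create_data_embs parameter above. A sketch, assuming get_dataframe takes the project name, run name, and split:

```python
import dataquality as dq

# Retrieve the processed training split, including data embeddings
# in the data_emb column
df = dq.metrics.get_dataframe(
    "newsgroups_work",
    "run_1_raw_data",
    "training",
    include_data_embs=True,
)
```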
An example of using auto with a local CSV file with text and label columns:

```python
import dataquality as dq

dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data",
)
```
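Finally, an example passing inference data alongside training data, as described earlier. The inference name and file paths here are hypothetical; each value may be a dataframe, a huggingface dataset, a local file path, or a hub path:

```python
import dataquality as dq

# Each key is an inference run name shown in the console;
# for text classification the files need a text column
dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    inference_data={"production": "prod_unlabeled.csv"},
)
```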