Configuring DQ Auto
Automatic Data Insights on your Seq2Seq dataset
auto
While using auto with default settings is as simple as running dq.auto(), you can also exercise granular control over dataset settings, training parameters, and generation configuration. The auto function takes optional parameters for dataset_config, training_config, and generation_config. If a configuration parameter is omitted, the default values listed below are used.
Example
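A minimal sketch of a fully configured run. The import paths inside run_auto are an assumption (they are not stated on this page and may differ between dataquality versions), and the file path and run name are placeholders. Every keyword used is one of the parameters documented on this page, set to its documented default where one exists.

```python
# Keyword arguments for each config object. All names are parameters
# documented on this page; values are the documented defaults except for
# the placeholder file path.
dataset_kwargs = {
    "train_path": "train.jsonl",  # placeholder path
    "input_col": "text",
    "target_col": "label",
}
training_kwargs = {
    "model": "google/flan-t5-base",
    "epochs": 3,
    "learning_rate": 3e-4,
    "batch_size": 4,
    "accumulation_steps": 4,
}
generation_kwargs = {
    "max_new_tokens": 16,
    "temperature": 0.2,
    "do_sample": False,
    "top_p": 1.0,
    "top_k": 50,
    "generation_splits": ["test"],
}


def run_auto() -> None:
    """Launch an auto run with the three configs above."""
    # ASSUMPTION: these import paths are not given on this page; adjust
    # them to match your installed dataquality version.
    from dataquality.integrations.seq2seq.auto import auto
    from dataquality.integrations.seq2seq.schema import (
        Seq2SeqDatasetConfig,
        Seq2SeqGenerationConfig,
        Seq2SeqTrainingConfig,
    )

    auto(
        project_name="s2s_auto",
        run_name="configured_run",  # placeholder run name
        dataset_config=Seq2SeqDatasetConfig(**dataset_kwargs),
        training_config=Seq2SeqTrainingConfig(**training_kwargs),
        generation_config=Seq2SeqGenerationConfig(**generation_kwargs),
        wait=True,
    )
```

Calling run_auto() starts the run. The imports are kept inside the function so the configuration dictionaries can be inspected without dataquality installed.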
Parameters
project_name (Union[str, None]) -- Optional project name. If not set, a default name will be used. Default "s2s_auto"
run_name (Union[str, None]) -- Optional run name. If not set, a random name will be generated.
train_path (Union[str, None]) -- Optional training data to use. Must be a path to a local file of type .csv, .json, or .jsonl.
dataset_config (Union[Seq2SeqDatasetConfig, None]) -- Optional config for loading the dataset. See Seq2SeqDatasetConfig for more details.
training_config (Union[Seq2SeqTrainingConfig, None]) -- Optional config for training the model. See Seq2SeqTrainingConfig for more details.
generation_config (Union[Seq2SeqGenerationConfig, None]) -- Optional config for post-training model generation. See Seq2SeqGenerationConfig for more details.
wait (bool) -- Whether to wait for Galileo to complete processing your run. Default True
Dataset Config
Use the Seq2SeqDatasetConfig() class to set the dataset for auto training.
Given either a pandas dataframe, local file path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console.
One of hf_data, train_path, or train_data should be provided.
Parameters
hf_data (Union[DatasetDict, str, None]) -- Use this param if you have huggingface data in the hub or in memory. Otherwise see train_path or train_data, val_path or val_data, and test_path or test_data. If provided, other dataset parameters are ignored.
train_path (Union[str, None]) -- Optional training data to use. Must be a path to a local file of type .csv, .json, or .jsonl.
val_path (Union[str, None]) -- Optional validation data to use. Must be a path to a local file of type .csv, .json, or .jsonl. If not provided, but test_path is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data.
test_path (Union[str, None]) -- Optional test data to use. Must be a path to a local file of type .csv, .json, or .jsonl. The test data, if provided with val, will be used after training is complete, as the hold-out set. If no validation data is provided, this will instead be used as the evaluation set.
train_data (Union[DataFrame, Dataset, None]) -- Optional training data to use. Can be one of: a Pandas dataframe, a Huggingface dataset, or a Huggingface dataset hub path.
val_data (Union[DataFrame, Dataset, None]) -- Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of: a Pandas dataframe, a Huggingface dataset, or a Huggingface dataset hub path.
test_data (Union[DataFrame, Dataset, None]) -- Optional test data to use. The test data, if provided with val, will be used after training is complete, as the hold-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of: a Pandas dataframe, a Huggingface dataset, or a Huggingface dataset hub path.
input_col (str) -- Column name of the model input in the provided dataset. Default text
target_col (str) -- Column name of the model target output in the provided dataset. Default label
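The fallback rules for validation and test data described above can be sketched in plain Python. This illustrates the documented behavior only and is not the library's implementation; the function name resolve_splits, the returned keys, and the fixed seed are hypothetical.

```python
import random
from typing import Dict, List, Optional


def resolve_splits(
    train: List[dict],
    val: Optional[List[dict]] = None,
    test: Optional[List[dict]] = None,
    seed: int = 0,
) -> Dict[str, Optional[List[dict]]]:
    """Decide which rows serve as train, evaluation, and hold-out data."""
    if val is not None:
        # Validation data is the evaluation set; test (if any) is the hold-out.
        return {"train": train, "eval": val, "holdout": test}
    if test is not None:
        # No validation data: the test data becomes the evaluation set.
        return {"train": train, "eval": test, "holdout": None}
    # Neither provided: randomly split the training data 80/20.
    rows = list(train)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)
    return {"train": rows[:cut], "eval": rows[cut:], "holdout": None}


splits = resolve_splits([{"text": str(i), "label": str(i)} for i in range(10)])
print(len(splits["train"]), len(splits["eval"]))  # 8 2
```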
Training Config
Use the Seq2SeqTrainingConfig() class to set the training parameters for auto training.
Parameters
model (str) -- The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default google/flan-t5-base
epochs (int) -- The number of epochs to train. Defaults to 3. If set to 0, training/fine-tuning will be skipped and auto will only do a forward pass with the data to gather the information needed to display it in the console.
learning_rate (float) -- Optional learning rate. Defaults to 3e-4
batch_size (int) -- Optional batch size. Default 4
accumulation_steps (int) -- Optional accumulation steps. Default 4
max_input_tokens (int) -- Optional maximum length, in number of tokens, for the inputs to the transformer model. If not set, the tokenizer default is used, or 512 if the tokenizer has no default.
max_target_tokens (int) -- Optional maximum length, in number of tokens, for the target outputs to the transformer model. If not set, the tokenizer default is used, or 128 if the tokenizer has no default.
create_data_embs (Optional[bool]) -- Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset, which are uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(..., include_data_embs=True) in the data_emb col. Default True if a GPU is available, else default False.
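Two of the defaults above interact in ways that a short sketch may make clearer: how a token limit is chosen when max_input_tokens or max_target_tokens is left unset, and how many examples contribute to one optimizer update. The helper names are hypothetical, and the second function assumes accumulation_steps means gradient accumulation, which this page does not spell out.

```python
from typing import Optional


def resolve_max_tokens(
    explicit: Optional[int],
    tokenizer_default: Optional[int],
    fallback: int,
) -> int:
    """Pick the explicit value, else the tokenizer default, else the fallback."""
    if explicit is not None:
        return explicit
    if tokenizer_default is not None:
        return tokenizer_default
    return fallback


def effective_batch_size(batch_size: int = 4, accumulation_steps: int = 4) -> int:
    """Examples per optimizer update, ASSUMING gradient accumulation."""
    return batch_size * accumulation_steps


print(resolve_max_tokens(None, None, 512))  # 512 (documented input fallback)
print(resolve_max_tokens(None, None, 128))  # 128 (documented target fallback)
print(effective_batch_size())               # 16
```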
Generation Config
Use the Seq2SeqGenerationConfig() class to set the generation parameters for auto training.
Parameters
max_new_tokens (int) -- The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Default 16
temperature (float) -- The value used to modulate the next token probabilities. Default 0.2
do_sample (bool) -- Whether or not to use sampling; use greedy decoding otherwise. Default False
top_p (float) -- If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Default 1.0
top_k (int) -- The number of highest probability vocabulary tokens to keep for top-k-filtering. Default 50
generation_splits (Union[List[str], None]) -- Optional list of splits to perform generation on after training the model. These generated outputs will show up in the console for specified splits. Default ["test"]
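To make temperature, top_k, and top_p concrete, here is a simplified pure-Python sketch of how these settings reshape a next-token distribution. It is an illustration of the general technique on a toy vocabulary, not the code that auto or huggingface runs; the function names are hypothetical, and applying top-k before top-p with renormalization in between is an assumption of this sketch.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def filter_top_k_top_p(probs: Dict[str, float], top_k: int = 50, top_p: float = 1.0) -> Dict[str, float]:
    """Keep the top_k most probable tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(tok, p / k_total) for tok, p in ranked]  # renormalize after top-k
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}


def greedy_choice(probs: Dict[str, float]) -> str:
    """What decoding without sampling (do_sample=False) picks: the most probable token."""
    return max(probs, key=probs.get)


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(filter_top_k_top_p(toy_probs, top_p=0.7))  # keeps "a" and "b" only
print(greedy_choice(toy_probs))                  # a
```

With the default do_sample=False, decoding is greedy, so the filtering settings only come into play once sampling is enabled.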
Examples
An example using auto with a hosted huggingface summarization dataset:
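A hedged sketch of such a run. The dataset id billsum and its column names text and summary are choices made for this illustration, not necessarily the dataset the original example used, and should be checked against the hub before running. The import paths are likewise an assumption and may differ between dataquality versions.

```python
# Dataset settings for a hub-hosted summarization dataset.
# ASSUMPTION: "billsum" with "text"/"summary" columns is an illustrative
# choice for this sketch; verify the id and columns on the hub.
hub_dataset_kwargs = {
    "hf_data": "billsum",
    "input_col": "text",
    "target_col": "summary",
}


def run_auto_on_hub_dataset() -> None:
    """Launch auto on the hub dataset configured above."""
    # ASSUMPTION: import paths are not given on this page.
    from dataquality.integrations.seq2seq.auto import auto
    from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

    auto(
        run_name="hub_summarization",  # placeholder run name
        dataset_config=Seq2SeqDatasetConfig(**hub_dataset_kwargs),
    )
```

Because hf_data is provided, the other dataset parameters (train_path, train_data, and so on) would be ignored, as noted in the Dataset Config section.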
An example of using auto with a local jsonl file:
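A minimal sketch, assuming a file named train.jsonl in the working directory whose prompt and completion columns hold the model input and target. The import paths and the run name are assumptions, not taken from this page.

```python
# Dataset settings for a local JSON Lines file. The column names are mapped
# explicitly because they differ from the documented defaults (text, label).
local_dataset_kwargs = {
    "train_path": "train.jsonl",
    "input_col": "prompt",
    "target_col": "completion",
}


def run_auto_on_local_file() -> None:
    """Launch auto on the local file configured above."""
    # ASSUMPTION: import paths are not given on this page.
    from dataquality.integrations.seq2seq.auto import auto
    from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

    auto(
        run_name="local_jsonl",  # placeholder run name
        dataset_config=Seq2SeqDatasetConfig(**local_dataset_kwargs),
    )
```

With no validation or test file given, the training data would be randomly split 80/20 for evaluation, per the Dataset Config rules.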
Where train.jsonl might be a file with prompt and completion columns that looks like:
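A minimal sketch of such a file, built with the standard library. The two rows are hypothetical and invented purely for illustration; the file is written to a temporary directory so the sketch leaves nothing behind in the working directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical rows, invented for illustration only.
rows: List[Dict[str, str]] = [
    {"prompt": "Translate to French: Good morning", "completion": "Bonjour"},
    {"prompt": "Summarize: The meeting moved from Monday to Tuesday.",
     "completion": "Meeting moved to Tuesday."},
]


def write_jsonl(path: Path, records: List[Dict[str, str]]) -> None:
    """Write one JSON object per line, the layout a .jsonl file uses."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


train_file = Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl(train_file, rows)
print(train_file.read_text(encoding="utf-8"))
```

Each line is a standalone JSON object with a prompt key and a completion key, which is why input_col and target_col must be set to those names instead of the defaults.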
Get started with a notebook 📘