SetFit
Log a human-readable version of your dataset. Galileo will join these samples with the model's outputs and present them in the Console. Note the difference between logging with the SetFitTrainer and logging with the SetFitModel.
SetFitModel
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch
from setfit import SetFitModel

model_id = "./trained_setfit_model"
model = SetFitModel.from_pretrained(model_id)
labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)
dq.set_labels_for_run(labels)

# 🔭🌕 Watch the model and return the evaluation function
dq_evaluate = watch(model, finish=False)

# 🔭🌕 Galileo logging for custom split (test)
preds = dq_evaluate(
    dataset["test"],
    split="test",
    column_mapping={
        "sentence": "text",
        "label": "label",
        "idx": "id"
    })

# 🔭🌕 Once all predictions are logged, finish the run
dq.finish()
Log where you are within the training pipeline (epoch and current split); behind the scenes, Galileo will track the different stages of training and combine your model outputs with your logged input data.
SetFitTrainer
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch

from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"sentence": "text", "label": "label", "idx": "id"},
)

labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
watch(trainer, labels=labels,
      project_name=project_name, run_name=run_name)

trainer.train()
model.save_pretrained("./trained_model")
If your model is supposed to run on multiple GPUs or with multiple workers, we recommend the advanced implementation, where you log model outputs from your PyTorch model's forward function. Note: your model must be defined in the PyTorch model-subclass style and execute eagerly. The advanced solution does not require the model to be watched. Example Colab Notebook.
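Below is a minimal sketch of that pattern. The wrapper class, hidden size, and encoder call are illustrative assumptions rather than SetFit-specific code; the logging calls follow the standard dataquality PyTorch workflow (dq.log_model_outputs, dq.set_epoch, dq.set_split).
# A minimal sketch of the advanced approach, assuming a generic PyTorch text classifier.
import torch.nn as nn
import dataquality as dq

class TextClassifier(nn.Module):  # hypothetical PyTorch model-subclass-style model
    def __init__(self, encoder, num_labels, hidden_size=768):
        super().__init__()
        self.encoder = encoder              # e.g. a transformer body that returns pooled embeddings
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, ids):
        embs = self.encoder(input_ids, attention_mask)   # assumed to return [batch, hidden] embeddings
        logits = self.head(embs)
        # 🔭🌕 Galileo logging: log embeddings, logits, and sample ids from the forward pass
        dq.log_model_outputs(embs=embs, logits=logits, ids=ids)
        return logits

# Inside your training loop, tell Galileo where you are before each forward pass:
# dq.set_epoch(epoch)
# dq.set_split("training")   # or "validation" / "test"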
SetFit Inference
SetFit logging for inference data
from setfit import SetFitModel

# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch

inference_data = dataset["test"]
column_mapping = {
    "sentence": "text",
    "label": "label",
    "idx": "id"
}

model = SetFitModel.from_pretrained("./trained_model")

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)

labels = dataset["train"].features["label"].names
# 🔭🌕 Galileo logging
dq.set_labels_for_run(labels)
dq_evaluate = watch(model, finish=False)

# 🔭🌕 Galileo logging
preds = dq_evaluate(
    inference_data,
    split="inference",
    inference_name="inference_test",
    column_mapping=column_mapping)

dq.finish()
The watch function watches a SetFit model or trainer, extracting model outputs for data quality evaluation. It returns a function that can be used to evaluate the model on a dataset.
Parameters:
- setfit: An instance of either SetFitModel or SetFitTrainer which is to be monitored.
- labels: An optional list of labels. Default is None.
- project_name: The name of the project. Default is an empty string.
- run_name: The name of the run. Default is an empty string.
- finish: A boolean indicating whether to run dq.finish after evaluation. Default is True.
- wait: A boolean indicating whether to wait for dq.finish. Default is False.
- batch_size: An optional integer specifying the batch size for evaluation. Default is None.
- meta: An optional list of metadata for evaluation. Default is None.
- validate_before_training: A boolean indicating whether to conduct a test run before training. Default is False.
Returns:
- A Callable function for model evaluation.
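For illustration, here is a hedged usage sketch combining these arguments; the project name, run name, label list, meta column, and dataset split are placeholders, and the column mapping follows the examples above.
# 🔭🌕 Hypothetical usage of watch() with the optional arguments described above
dq_evaluate = watch(
    model,                                # a SetFitModel (or pass a SetFitTrainer)
    labels=["negative", "positive"],      # placeholder label names
    project_name="setfit_watch_demo",     # placeholder project name
    run_name="validation_run",            # placeholder run name
    finish=False,                         # call dq.finish() yourself once every split is logged
    batch_size=64,                        # evaluate in batches of 64 samples
    meta=["idx"],                         # placeholder metadata column to log with each sample
)
preds = dq_evaluate(
    dataset["validation"],
    split="validation",
    column_mapping={"sentence": "text", "label": "label", "idx": "id"})
dq.finish()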
The auto function in the dataquality.integrations.setfit package automatically performs data quality checks on training, validation, and inference. It requires a pre-trained model (setfit_model) and optionally accepts training, validation, and test data, as well as inference data. By integrating it with SetFit, you can use dataquality to enhance your model's performance and better manage the entire data science workflow.
from setfit import SetFitTrainer, SetFitModel
from sentence_transformers.losses import CosineSimilarityLoss
# 🔭🌕 Galileo logging
from dataquality.integrations.setfit import auto
model_id = "intfloat/e5-base"
# "sentence" is the name of the text column in your dataframe
column_mapping = {"sentence": "text", "label": "label", "idx": "id"}
model = SetFitModel.from_pretrained(model_id)
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=1,
    column_mapping=column_mapping,
)
trainer.train()
# 🔭🌕 Galileo logging
model = auto(
    setfit_model=model,
    train_data=train_dataset,
    val_data=eval_dataset,
    test_data=test_dataset,
    inference_data={"inference": inf_dataset},
    project_name="example_setfit",
    run_name="example_run",
    labels=["negative", "positive"],
    column_mapping=column_mapping,
)