
SetFit

Logging the Data Inputs and Model Outputs

Log a human-readable version of your dataset. Galileo joins these samples with the model's outputs and presents them in the Console. Note the difference between logging with the SetFitTrainer and logging with the SetFitModel.
SetFitModel
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch
from setfit import SetFitModel

model_id = "./trained_setfit_model"
model = SetFitModel.from_pretrained(model_id)
labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)

# 🔭🌕 Watch the model and return the evaluation function
dq_evaluate = watch(model, labels=labels, finish=False)

# 🔭🌕 Galileo logging for a custom split (test)
preds = dq_evaluate(
    dataset["test"],
    split="test",
    column_mapping={
        "sentence": "text",
        "label": "label",
        "idx": "id",
    },
)

# 🔭🌕 Once all predictions are logged, finish the run
dq.finish()
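If you would rather log the input data explicitly instead of relying on the SetFit integration to do it for you, a minimal sketch using the generic dq.log_dataset API could look like the following; it assumes the 🤗 dataset exposes the same sentence, label, and idx columns used above.

# 🔭🌕 A hedged sketch of explicit input logging with the generic dataquality API;
# the column names mirror the column_mapping above and are assumptions
import dataquality as dq

dq.set_labels_for_run(labels)
dq.log_dataset(
    dataset["test"],
    text="sentence",   # human-readable text column
    label="label",     # ground-truth label column
    id="idx",          # unique id column
    split="test",
)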

Training the Model

Log where you are in the training pipeline (the current epoch and split); behind the scenes, Galileo tracks the different stages of training and combines your model outputs with your logged input data.
SetFitTrainer
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"sentence": "text", "label": "label", "idx": "id"},
)

labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
watch(trainer, labels=labels,
      project_name=project_name, run_name=run_name)

trainer.train()
model.save_pretrained("./trained_model")

Inference

If your model runs on multiple GPUs or with multiple workers, we recommend the advanced implementation, in which you log model outputs directly from your PyTorch model's forward function. Note: your model must be defined in the PyTorch model-subclass style and execute eagerly. The advanced solution does not require the model to be watched. Example Colab Notebook.
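To make the advanced pattern concrete, below is a minimal, hedged sketch of logging outputs from a forward method. It assumes the generic dataquality manual-logging API (dq.log_model_outputs) and a hypothetical MyTextClassifier module; it is not part of the SetFit integration itself.

# A minimal sketch, assuming dq.log_model_outputs and a hypothetical encoder;
# the MyTextClassifier class and tensor shapes are illustrative placeholders
import dataquality as dq
from torch import nn

class MyTextClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask, ids):
        embs = self.encoder(input_ids, attention_mask)  # (batch, hidden_dim)
        logits = self.head(embs)                        # (batch, num_labels)
        # 🔭🌕 Log per-sample outputs; the epoch and split are assumed to have
        # been set beforehand with dq.set_epoch(...) and dq.set_split(...)
        dq.log_model_outputs(
            ids=ids,
            embs=embs.detach().cpu().numpy(),
            logits=logits.detach().cpu().numpy(),
        )
        return logits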
SetFit Inference
SetFit logging for inference data
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch
from setfit import SetFitModel

inference_data = dataset["test"]
column_mapping = {
    "sentence": "text",
    "label": "label",
    "idx": "id",
}

model = SetFitModel.from_pretrained("./trained_model")

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)

labels = dataset["train"].features["label"].names
# 🔭🌕 Galileo logging
dq.set_labels_for_run(labels)
dq_evaluate = watch(model)

# 🔭🌕 Galileo logging
preds = dq_evaluate(
    inference_data,
    split="inference",
    inference_name="inference_test",
    column_mapping=column_mapping,
)
dq.finish()

Parameters for the watch function

The watch function watches a SetFit model or trainer and extracts model outputs for data quality evaluation. It returns a callable that evaluates the model on a dataset and logs the results; a short usage sketch follows the parameter list below.

Parameters

  • setfit: An instance of either SetFitModel or SetFitTrainer which is to be monitored.
  • labels: An optional list of labels. Default is None.
  • project_name: The name of the project. Default is an empty string.
  • run_name: The name of the run. Default is an empty string.
  • finish: A boolean indicating whether to run dq.finish after evaluation. Default is True.
  • wait: A boolean indicating whether to wait for dq.finish. Default is False.
  • batch_size: An optional integer for specifying batch size for evaluation. Default is None.
  • meta: An optional list of metadata for evaluation. Default is None.
  • validate_before_training: A boolean indicating whether to conduct a test run before training. Default is False.

Returns

  • A Callable function for model evaluation.
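For illustration, here is a hedged sketch that exercises several of these optional parameters. The metadata column name ("source") and the batch size are placeholders, and the dataset is assumed to already use text, label, and id column names so no column_mapping is needed.

# 🔭🌕 A minimal sketch, assuming dq.init has been called and model/labels exist
from dataquality.integrations.setfit import watch

dq_evaluate = watch(
    model,                 # SetFitModel or SetFitTrainer to monitor
    labels=labels,
    batch_size=64,         # evaluate in batches of 64 samples (placeholder)
    meta=["source"],       # hypothetical metadata column to log with each sample
    finish=True,           # run dq.finish() automatically after evaluation
    wait=False,            # do not block while the run is processed
)
preds = dq_evaluate(dataset["test"], split="test")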

Dataquality Auto with SetFit

The auto function in the dataquality.integrations.setfit package automatically performs data quality checks on training, validation, test, and inference data. It requires a pre-trained model (setfit_model) and optionally accepts training, validation, test, and inference data. By integrating it with SetFit, you can enhance your model's performance and better manage the entire data science workflow.
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss
# 🔭🌕 Galileo logging
from dataquality.integrations.setfit import auto

model_id = "intfloat/e5-base"
# "sentence" is the name of the text column in your dataframe
column_mapping = {"sentence": "text", "label": "label", "idx": "id"}

model = SetFitModel.from_pretrained(model_id)
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=1,
    column_mapping=column_mapping,
)
trainer.train()

# 🔭🌕 Galileo logging
model = auto(
    setfit_model=model,
    train_data=train_dataset,
    val_data=eval_dataset,
    test_data=test_dataset,
    inference_data={"inference": inf_dataset},
    project_name="example_setfit",
    run_name="example_run",
    labels=["negative", "positive"],
    column_mapping=column_mapping,
)

Example Notebooks
