watch

The watch feature simplifies training by removing the need for constant manual logging of inputs and outputs. Users provide the expected parameters for each modality they want to watch (as shown below) and proceed with training.
Once training is finished, calling dq.finish() triggers data quality processing and prints a link to the console where users can view their run.

Frameworks

Galileo currently has watch functions for the following frameworks:
  • Transformers (HuggingFace)
  • PyTorch
  • PyTorch-Lightning
  • Keras
Below are a few examples of how to integrate it.

Transformers

from dataquality.integrations.transformers_trainer import watch

def watch(
    trainer: Trainer,
    classifier_layer: Optional[Layer] = None,
    embedding_dim: Optional[DimensionSlice] = None,
    logits_dim: Optional[DimensionSlice] = None,
    embedding_fn: Optional[Callable] = None,
    logits_fn: Optional[Callable] = None,
    last_hidden_state_layer: Optional[Layer] = None,
) -> None
  1. trainer: Trainer object from the transformers library.
  2. classifier_layer: Layer to hook into (usually 'classifier' or 'fc'). Its inputs are the embeddings and its outputs are the logits. If none is provided, we take the last layer.
  3. embedding_dim: Dimension of the embeddings. For example, "[:, 0]" can be used to remove the CLS token. If none is provided, we use the entire embedding vector.
  4. logits_dim: Dimension of the logits from the layer input and the layer output. If the layer is not found, the last_hidden_state_layer will be used.
  5. embedding_fn: Function to process embeddings from the model. If none is provided, we do no additional processing.
  6. logits_fn: Function to process logits from the model. If none is provided, we do no additional processing.
  7. last_hidden_state_layer: Layer to extract the embeddings from.
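
Putting these parameters in context, here is a minimal sketch of wiring watch into a HuggingFace Trainer run. The model name, dataset, labels, and training arguments are illustrative placeholders, not requirements of the integration.

import dataquality as dq
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from dataquality.integrations.transformers_trainer import watch

dq.init(
    task_type="text_classification",
    project_name="example_project",
    run_name="example_run",
)

# Placeholder dataset with "text" and "label" columns; an "id" column is
# added so rows can be matched to model outputs.
ds = load_dataset("imdb")
ds = ds.map(lambda _, idx: {"id": idx}, with_indices=True)
dq.set_labels_for_run(["negative", "positive"])
dq.log_dataset(ds["train"], split="training")
dq.log_dataset(ds["test"], split="test")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
).remove_columns(["text"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        num_train_epochs=1,
        remove_unused_columns=False,  # keep the "id" column in each batch
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding of the batches
)

watch(trainer)   # default hooks: the model's last layer as classifier_layer
trainer.train()
dq.finish()      # processes the run and prints a console link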

PyTorch (Classification, Multi-label Classification, NER)

from dataquality.integrations.torch import watch

def watch(
    model: Module,
    dataloaders: Optional[List[DataLoader]] = [],
    classifier_layer: Optional[Union[str, Module]] = None,
    embedding_dim: Optional[InputDim] = None,
    logits_dim: Optional[InputDim] = None,
    embedding_fn: Optional[Callable] = None,
    logits_fn: Optional[Callable] = None,
    last_hidden_state_layer: Union[Module, str, None] = None,
    unpatch_on_start: bool = False,
    dataloader_random_sampling: bool = False,
) -> None
  1. model: PyTorch model to be wrapped.
  2. dataloaders: List of dataloaders that will be used in training and validation.
  3. classifier_layer: Layer to hook into (usually 'classifier' or 'fc'). Its inputs are the embeddings and its outputs are the logits. If none is provided, we take the last layer.
  4. embedding_dim: Dimension of the embeddings. For example, "[:, 0]" can be used to remove the CLS token. If none is provided, we use the entire embedding vector.
  5. logits_dim: Dimension of the logits from the layer input and the layer output. If the layer is not found, the last_hidden_state_layer will be used.
  6. embedding_fn: Function to process embeddings from the model. If none is provided, we do no additional processing.
  7. logits_fn: Function to process logits from the model. If none is provided, we do no additional processing.
  8. last_hidden_state_layer: Layer to extract the embeddings from.
  9. unpatch_on_start: Force unpatching of the dataloaders instead of global patching.
  10. dataloader_random_sampling: Whether a RandomSampler or WeightedRandomSampler is being used. If random sampling is used, you must set this to True; otherwise logging will fail at the end of training.
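
For reference, here is a minimal sketch of watching a plain PyTorch classifier. The tiny model, synthetic data, and training loop are illustrative placeholders, and the inputs are assumed to have been logged beforehand with dq.log_dataset (with ids matching the dataset indices).

import dataquality as dq
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from dataquality.integrations.torch import watch

dq.init(
    task_type="text_classification",
    project_name="example_project",
    run_name="torch_run",
)
# ... dq.set_labels_for_run(...) and dq.log_dataset(...) per split go here ...

# Synthetic features standing in for real model inputs.
train_ds = TensorDataset(
    torch.randn(100, 32),          # features
    torch.randint(0, 2, (100,)),   # labels
)
train_dl = DataLoader(train_ds, batch_size=16)

model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))

# Hook the final Linear layer: its input is logged as the embedding,
# its output as the logits.
watch(model, dataloaders=[train_dl], classifier_layer=model[2])

optimizer = torch.optim.Adam(model.parameters())
for epoch in range(2):
    dq.set_epoch_and_split(split="training", epoch=epoch)
    for x, y in train_dl:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

dq.finish()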

PyTorch (Semantic Segmentation)

from dataquality.integrations.torch_semantic_segmentation import watch

def watch(
    model: Module,
    imgs_remote_location: str,
    local_path_to_dataset_root: str,
    dataloaders: Dict[str, DataLoader],
    mask_col_name: Optional[str] = None,
    unpatch_on_start: bool = False,
) -> None
  1. model: A PyTorch segmentation model that returns logits for each pixel, with shape either (batch_size, num_classes, height, width) or (batch_size, height, width, num_classes).
  2. imgs_remote_location: The URL of the bucket where the images and masks are stored. Only the parent bucket address is required.
  3. local_path_to_dataset_root: The path to the dataset folder on your local machine. The child directories of your bucket and the child directories of your dataset_path are assumed to be equivalent: the path to any file relative to the bucket must match its path relative to the dataset root. For instance, if cloud storage has gs://bucket_name/train_images/images1.png, the same file should be located locally at dataset_path/train_images/images1.png.
  4. dataloaders: The dataloaders that will be used for analysis, with the expectation that no cropping has been applied. They may or may not be the ones used for training, but they must not have undergone any cropping so that the segmentation masks display accurately in the console.
  5. mask_col_name (optional): The dataloaders provided must return a dictionary with an entry for the mask. If no argument is given, we attempt to find the mask column programmatically.
  6. unpatch_on_start: If watch(model) has already been called on your model, set this to True so that the hooks are applied correctly. Defaults to False.
Important dataloader notes: The dataloaders passed to watch need not be the same ones you use in training; we run separate inference with them post-training. There are two requirements for the dataloaders passed to watch:
  1. They cannot apply any cropping to either the mask or the image. Please use resizing only, as cropping will lead to unexpected behavior in the console.
  2. The batch returned from the dataloader must contain the image, the mask, an entry called image_path, and an entry called mask_path, which can be either the path from your dataset_path folder to the image/mask or the absolute path on your local machine.
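
Here is a minimal sketch of a semantic-segmentation watch call that satisfies the requirements above. The bucket URL, local dataset path, toy model, and toy dataset are all illustrative placeholders.

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from dataquality.integrations.torch_semantic_segmentation import watch

class ToySegDataset(Dataset):
    """Returns the dict batches described above: image, mask, and both paths."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {
            "image": torch.randn(3, 64, 64),               # resized, never cropped
            "mask": torch.randint(0, 2, (64, 64)),
            "image_path": f"train_images/image{idx}.png",  # relative to dataset root
            "mask_path": f"train_masks/mask{idx}.png",
        }

# A toy model emitting per-pixel logits of shape (batch, num_classes, H, W).
model = nn.Conv2d(3, 2, kernel_size=1)

watch(
    model=model,
    imgs_remote_location="gs://bucket_name",          # parent bucket only
    local_path_to_dataset_root="/data/dataset_path",  # mirrors the bucket layout
    dataloaders={"training": DataLoader(ToySegDataset(), batch_size=4)},
    mask_col_name="mask",
)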

PyTorch/HuggingFace (Sequence-to-Sequence)

For the sequence-to-sequence task, the watch function works alongside the standard logging functions. Below we will show you how to log all the necessary data. For a full end-to-end example, have a look at our PyTorch/HuggingFace notebook.
As in any other task, we will have to log the inputs and the outputs. In addition, we will wrap the model and tokenizer in the watch function in order to align tokens and words in the Galileo Console, and to allow generation.
The inputs are logged before training, in a single call (per split), with dq.log_dataset:
dq.log_dataset(
    dataset=ds_train,
    text="article",
    label="summary",
    split="training"
)
The outputs (logits) are logged during the training loop with dq.log_model_outputs:
dq.set_epoch_and_split(split="training", epoch=epoch)
[ ... ]
dq.log_model_outputs(
    logits=logits,
    ids=ids
)
The log_model_outputs method should be called for every batch of logits. Don't forget to set the epoch and split correctly before logging them.
Finally we will wrap the model and tokenizer, as well as define some hyperparameters. Here is the signature of the watch function:
from dataquality.integrations.seq2seq.hf import watch

def watch(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizerFast,
    generation_config: GenerationConfig,
    generation_splits: Optional[List[str]] = None,
    max_input_tokens: Optional[int] = None,
    max_target_tokens: Optional[int] = None,
) -> None
  1. model: Any HuggingFace model, as PreTrainedModel is the base class for all of their models. The model is required for generating outputs after a training run.
  2. tokenizer: Any HuggingFace fast tokenizer, as PreTrainedTokenizerFast is the base class for all of their fast tokenizers. Most tokenizers can be instantiated with the argument use_fast=True to be converted into a fast tokenizer. The tokenizer is used in particular to align tokens to words in the Console.
  3. generation_config: An object of type GenerationConfig that controls generation (temperature, top_k, etc.).
  4. generation_splits: List of strings indicating which splits to generate on. Choose from "training", "validation", and "test". By default, Galileo only generates on the test split.
  5. max_input_tokens: Integer indicating the maximum number of tokens used for the input during training. If no value is set, Galileo uses the value in tokenizer.model_max_length. We suggest setting the exact value used during training to avoid confusing insights.
  6. max_target_tokens: Integer indicating the maximum number of tokens used for the target during training (teacher forcing). If no value is set, Galileo uses the value in tokenizer.model_max_length. We suggest setting the exact value used during training to avoid confusing insights.
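
Putting it together, here is a minimal sketch of the watch call. The model name and generation settings are illustrative placeholders.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from dataquality.integrations.seq2seq.hf import watch

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

watch(
    model=model,
    tokenizer=tokenizer,
    generation_config=GenerationConfig(
        max_new_tokens=64, do_sample=True, temperature=0.7
    ),
    generation_splits=["test"],   # the default: generate only on the test split
    max_input_tokens=512,         # match the values used during training
    max_target_tokens=128,
)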

Examples

Please see our example notebooks for details.