Upon completing a run, you'll be taken to the Galileo Console. The first thing you'll notice is your dataset on the right. On each row, we show you your sample with its Ground Truth annotations, the same sample with your model's prediction, the Data Error Potential of the sample and an error count. By default, your samples are sorted by Data Error Potential.
You can also view your samples in the embeddings space of the model. This can help you get a semantic understanding of your dataset. Using features like Color-By DEP, you might discover pockets of problematic data (e.g. decision boundaries that might benefit from more samples or a cluster of garbage samples).
Your left pane is called the Insights Menu. On the top you can see your dataset size and choose the metric you want to guide your exploration by (F1 by default). Size and metric update as you add filters to your dataset.
Clicking on an Alert will filter the dataset to the subset of data that corresponds to the Alert.
Under metrics, you'll find different charts, such as:
- High Problematic Words
- Error Distribution
- F1 by Class
- Sample Count by Class
- Overlapping Classes
- Top Misclassified Pairs
- DEP Distribution
These charts are dynamic and update as you add different filters. They're also interactive - clicking on a class or group of classes will filter the dataset accordingly, allowing you to inspect and fix the samples.
Once you've identified a problematic subset of data, Galileo allows you to fix your samples with the goal of improving your F1 or performance metric of choice. In NER runs, we allow you to:
- Change Label - Re-assign the label of your sample right in-tool
- Remove - Remove problematic samples you want to discard from your dataset
- Edit Data - Add or Move Spans, fix typos or extraneous characters in your samples
- Export - Download your samples so you can fix them elsewhere
Your changes are tracked in your Edits Cart. There you can view a summary of the changes you've made, undo them, or download a clean and fixed dataset to retrain your model.
Your dataset splits are maintained on Galileo. Your data is logged as Training, Test and/or Validation split. Galileo allows you to explore each split independently. Some alerts, such as Underfitting Classes or Overfitting Classes, look at cross-split performance. However, for the most part, each split is treated independently.
To switch splits, find the Splits dropdown next to your project and run name near the top of the screen. By default, the Training split is shown first.
You can always adjust the DEP slider to filter this view and update the Insights.
Galileo automatically identifies whether any of the following errors are present per row:
a. Span Shift: A count of the misaligned spans that have overlapping predicted and gold spans
b. Wrong Tag: A count of aligned predicted and gold spans that primarily have mismatched labels
c. Missed Span: A count of the spans that have gold spans, but no corresponding predicted spans
d. Ghost Span: A count of the spans that have predicted spans, but no corresponding gold spans
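To make the four error types concrete, here is a minimal Python sketch that counts them for one row. It is an illustration only, not Galileo's implementation: the function names, the `(start, end, label)` span representation with half-open token offsets, and the rule that "aligned" means identical boundaries are all assumptions.

```python
def overlaps(a, b):
    """True if half-open spans a and b share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def count_span_errors(gold, pred):
    """Count the four error types for one row (hypothetical helper).

    gold and pred are lists of (start, end, label) tuples.
    Assumption: "aligned" means the boundaries match exactly.
    """
    counts = {"span_shift": 0, "wrong_tag": 0, "missed_span": 0, "ghost_span": 0}
    for g in gold:
        exact = [p for p in pred if p[:2] == g[:2]]
        if exact:
            # Same boundaries: only the label can be wrong.
            if exact[0][2] != g[2]:
                counts["wrong_tag"] += 1
        elif any(overlaps(g, p) for p in pred):
            # Overlapping but misaligned boundaries.
            counts["span_shift"] += 1
        else:
            # No predicted span touches this gold span.
            counts["missed_span"] += 1
    for p in pred:
        # A prediction that touches no gold span at all.
        if not any(overlaps(p, g) for g in gold):
            counts["ghost_span"] += 1
    return counts


gold = [(0, 2, "ACTOR"), (5, 6, "TITLE"), (8, 10, "DIRECTOR"), (12, 13, "YEAR")]
pred = [(0, 2, "DIRECTOR"), (5, 7, "TITLE"), (15, 16, "GENRE"), (12, 13, "YEAR")]
errors = count_span_errors(gold, pred)
```

In this made-up row, each error type occurs exactly once, and the correctly predicted "YEAR" span contributes nothing.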
Often it is critical to get a high level view of what specific words the model is struggling with most. This NER specific insight lists out the words that are most frequently contained within spans with high DEP scores.
Click on any word to get a filtered view of the high DEP spans containing that word.
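The idea behind this insight can be sketched in a few lines of Python. This is not how Galileo computes the chart: the `(span_text, dep)` input format, the `dep_threshold` of 0.5 for "high DEP", and the function name are assumptions made for illustration.

```python
from collections import Counter


def top_high_dep_words(spans, dep_threshold=0.5, k=3):
    """Return the k words that appear in the most high-DEP spans.

    spans is a list of (span_text, dep) pairs; dep_threshold is an
    assumed cutoff for what counts as "high DEP".
    """
    counts = Counter()
    for text, dep in spans:
        if dep >= dep_threshold:
            # Count each word at most once per span.
            counts.update(set(text.lower().split()))
    return counts.most_common(k)


spans = [
    ("rated r", 0.9),
    ("highly rated", 0.8),
    ("tom hanks", 0.2),  # low DEP, ignored
    ("rated", 0.7),
    ("r", 0.6),
]
top_words = top_high_dep_words(spans, k=2)
```

With this toy data, "rated" appears in three high-DEP spans and "r" in two, so they rank first and second.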
Hover over any region to get a list of spans and their corresponding DEP scores.
Click the region to get a detailed view of the selected span.
After every run, you might want to prune your dataset to either:
a. Prep it for the next training job
b. Send the dataset for re-labeling
You can think of the 'Edits Cart' as a means to capture all the dataset changes done during the discovery phase (removing/re-labeling rows and spans) to collectively take action upon a curated dataset.
At any point you can export the dataset to a CSV file in an easy-to-view format.
As shown in Figure 1, observing the samples that have a high DEP score (i.e. they are hard for the model), and a non-zero count for ghost spans, can help identify samples where the annotators overlooked actual spans. Such annotation errors can cause inconsistencies in the dataset, which can affect model generalization.
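If you export your samples and want to reproduce this filter offline, it might look like the following Python sketch. The row fields (`id`, `dep`, `ghost_span`), the `dep_threshold` of 0.7, and the function name are hypothetical and not part of any Galileo export format.

```python
def likely_missed_annotations(rows, dep_threshold=0.7):
    """Keep rows with high DEP and at least one ghost span, hardest first.

    rows is a list of dicts with assumed keys 'id', 'dep', 'ghost_span'.
    """
    flagged = [
        r for r in rows
        if r["dep"] >= dep_threshold and r["ghost_span"] > 0
    ]
    return sorted(flagged, key=lambda r: r["dep"], reverse=True)


rows = [
    {"id": 0, "dep": 0.95, "ghost_span": 2},
    {"id": 1, "dep": 0.90, "ghost_span": 0},  # hard, but no ghost spans
    {"id": 2, "dep": 0.30, "ghost_span": 1},  # ghost span, but easy sample
    {"id": 3, "dep": 0.80, "ghost_span": 1},
]
candidates = likely_missed_annotations(rows)
```

Only rows 0 and 3 satisfy both conditions, so those are the samples to review first for spans the annotators may have overlooked.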
As shown in Figure 2, inspecting high-DEP spans whose gold and predicted labels form a frequently confused pair in the confusion matrix can help identify samples where the annotators labelled a span with the wrong class tag. Example: An annotator confused "ACTOR" spans with "DIRECTOR" spans, thereby contributing to the model biases.
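As a rough illustration of what a "highly confused pair" means, the Python sketch below tallies (gold label, predicted label) pairs over aligned spans and returns the most frequent mismatches. The input format and the function name are assumptions, and the labels are made-up examples.

```python
from collections import Counter


def top_confused_pairs(aligned_labels, k=3):
    """Return the k most frequent (gold, predicted) label mismatches.

    aligned_labels is a list of (gold_label, predicted_label) pairs,
    one per aligned span (an assumed input format).
    """
    mismatches = Counter(
        (gold, pred) for gold, pred in aligned_labels if gold != pred
    )
    return mismatches.most_common(k)


aligned_labels = (
    [("ACTOR", "DIRECTOR")] * 3
    + [("DIRECTOR", "ACTOR")]
    + [("ACTOR", "ACTOR")] * 5
    + [("YEAR", "YEAR")] * 2
)
worst_pair = top_confused_pairs(aligned_labels, k=1)
```

Here gold "ACTOR" spans predicted as "DIRECTOR" are the most common mismatch, which makes that pair the natural place to look for mislabelled spans.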
As shown in Figure 3, the insights panel provides top erroneous words across all spans in the dataset. These words have the highest average DEP across spans, and should be further inspected for error patterns. Example: "rated" had high DEP because it was inconsistently labelled as "RATING_AVERAGE" or "RATING" by the annotators.
As shown in Figure 4, the model performance charts can be used to identify and filter on the worst-performing class. The erroneously annotated spans surface to the top.
As shown in Figure 5, the "color-by" feature can be used to observe predicted embeddings and see the spans that are present in the ground truth data but were not predicted by the model. These spans are hard for the model to predict.
As shown in Figure 6, the error distribution chart can be used to identify which classes have highly confused spans, where the span class was predicted incorrectly. Sorting by DEP and wrong tag error can help surface such confusing spans.
As shown in Figure 7, the smart features from Galileo allow one to quickly find ill-formed samples. Example: Adding text length as a column and sorting based on it will surface malformed samples.
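The same text-length check can be approximated offline on exported samples. The following Python sketch assumes each sample is a dict with a `text` field; the field names, the helper name, and the sample data are hypothetical.

```python
def with_text_length(rows):
    """Return copies of rows with an added 'text_length' column (characters)."""
    return [{**row, "text_length": len(row["text"])} for row in rows]


samples = [
    {"id": 0, "text": "what movies did tom hanks direct"},
    {"id": 1, "text": "??"},  # suspiciously short
    {"id": 2, "text": "show me a pg rated comedy from the nineties " * 20},  # suspiciously long
]

# Sort ascending so both extremes are easy to inspect.
by_length = sorted(with_text_length(samples), key=lambda r: r["text_length"])
shortest, longest = by_length[0], by_length[-1]
```

Looking at both ends of the sorted list surfaces the near-empty sample and the unusually long one, which are the typical malformed cases.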