The Embeddings View provides a visual playground for you to interact with your datasets. To visualize your datasets, we leverage your model's embeddings logged during training, validation, testing or inference. Given these embeddings, we plot the data points on the 2D plane using the techniques explained below.
After experimenting with a host of dimensionality reduction techniques, we have adopted the principles of UMAP. Given a high-dimensional dataset, UMAP seeks to preserve the positional information of each data sample relative to its neighbors while projecting the data into a lower-dimensional space (the 2D plane in our case). We additionally use a parameterized version of UMAP along with custom compression techniques to efficiently scale our data visualization to millions of samples.
The Embeddings View allows you to visually detect patterns in the data, interactively select dataset subpopulations for further exploration, and visualize different dataset features and insights to identify model decision boundaries and better gauge overall model performance. Visualizing data embeddings is a key step in going beyond traditional dataset-level metrics for analyzing model performance and understanding data quality.
Navigating the Embeddings View is made easy with interactive plotting. While exploring your dataset you can adjust and drag the embedding plane with the Pan tool, zoom in and out on specific data regions with Scroll to Zoom, and reset the visualization with the Reset Axes tool. To interact with individual data samples, simply hover the cursor over a data sample of interest to display information and insights.
Fig. General embeddings view navigation
One powerful feature is the ability to color data points by different data fields, such as ground truth labels or data error potential (DEP). Different coloring schemes reveal different dataset insights (e.g. coloring by predicted labels reveals the model's perceived decision boundaries) and together provide a more holistic view of the data.
Fig. Coloring by different data fields opens the door to a range of insights
Once you have identified a data subset of interest, you can explicitly select it to analyze it further and view its insights. We offer two selection tools: lasso selection and box selection.
After selecting a data subset, the Embeddings View, insights charts, and the general data table are all updated to reflect just the selected data. As shown below, given a cluster of misclassified data points, you can make a lasso selection to easily inspect subset-specific insights. For example, you can view model performance on the selected subpopulation and see which classes are most significantly underperforming.
Fig. Lasso Selection
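To make the idea of a lasso selection concrete, the following is a minimal sketch of how a selection over 2D-projected points could be computed and then summarized. It is not Galileo's implementation: the function names (`point_in_polygon`, `lasso_select`, `subset_accuracy`), the sample coordinates, and the labels are all hypothetical, and the ray-casting test is simply one common way to decide whether a point falls inside a drawn polygon.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: cast a horizontal ray to the right of the point
    and flip the inside/outside state each time it crosses a polygon edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lasso_select(points: Sequence[Point], polygon: Sequence[Point]) -> List[int]:
    """Return the indices of all points that fall inside the lasso polygon."""
    return [i for i, p in enumerate(points) if point_in_polygon(p, polygon)]


def subset_accuracy(indices: Sequence[int], gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of selected samples whose prediction matches the ground truth."""
    if not indices:
        return float("nan")
    correct = sum(gold[i] == pred[i] for i in indices)
    return correct / len(indices)


# Hypothetical 2D coordinates and labels, for illustration only.
points = [(0.1, 0.2), (0.4, 0.5), (0.5, 0.3), (2.0, 2.0), (3.0, 0.5)]
gold = ["cat", "dog", "cat", "dog", "cat"]
pred = ["cat", "cat", "cat", "dog", "dog"]

# A lasso drawn as a closed polygon (here, the unit square).
lasso = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

selected = lasso_select(points, lasso)
accuracy = subset_accuracy(selected, gold, pred)
print(selected, accuracy)
```

A box selection is the special case where the polygon is an axis-aligned rectangle, so the same `lasso_select` helper covers both tools in this sketch.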
In the Embeddings View, you can easily interact with Galileo's similarity search feature. Hovering over a data point reveals the "Show similar" button. When selected, your inspection dataset is restricted to the data samples whose embeddings are most similar to that of the selected data sample, allowing you to quickly inspect model performance over a highly focused data subpopulation. See the similarity search documentation for more details.
Fig. Similarity search enables quick surfacing of similar data samples
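As a rough illustration of what "most similar embeddings" can mean, the sketch below ranks samples by cosine similarity to a chosen sample and keeps the top `k`. This is an assumption made for the example: the text above does not specify which similarity measure or index Galileo uses, and the names `cosine_similarity` and `most_similar`, as well as the toy embeddings, are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(embeddings: Sequence[Sequence[float]], query_index: int, k: int) -> List[int]:
    """Return the indices of the k samples most similar to the query sample,
    excluding the query itself, ordered from most to least similar."""
    query = embeddings[query_index]
    scored = [
        (cosine_similarity(query, emb), i)
        for i, emb in enumerate(embeddings)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings, for illustration only (real embeddings are much higher-dimensional).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
]

neighbors = most_similar(embeddings, query_index=0, k=2)
print(neighbors)
```

This brute-force scan compares the query against every sample, which is fine for small examples; at the scale of millions of samples an approximate nearest-neighbor index would typically be used instead.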