dq.metrics
Helper functions to get your raw and processed data out of Galileo
Experiment to your heart's content using dq.metrics to access your raw probabilities, embeddings, and processed dataframes from Galileo. These helper functions make it easy to access the data you've been looking for your whole life.
Downloads the data processed by Galileo for a run/split as a Vaex dataframe.
Optionally include the raw logged embeddings, probabilities, or text-token-indices (NER only)
Passing "*" in the meta_cols list returns the dataframe with all metadata columns, e.g. meta_cols=["*"] or meta_cols=["perplexity", "*"].
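The "*" expansion behaves roughly as sketched below. This is an illustrative sketch of the semantics only, not the library's implementation, and the metadata column names are made up:

```python
def resolve_meta_cols(meta_cols, available_cols):
    """Sketch of the "*" semantics: a "*" anywhere in meta_cols
    selects every available metadata column."""
    if meta_cols and "*" in meta_cols:
        return list(available_cols)
    return list(meta_cols or [])

# Hypothetical metadata columns logged for a run
available = ["perplexity", "galileo_text_length", "galileo_language_id"]
```

Here `resolve_meta_cols(["perplexity", "*"], available)` would select all three columns, while `resolve_meta_cols(["perplexity"], available)` selects only the one named.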
Special note for NER: By default, the data will be downloaded at the sample level (1 row per sample text), with the spans for each sample in a spans column in a spaCy-compatible JSON format. If include_embs is True, the data will be expanded to the span level (1 row per span, with the sample text repeated for each span row) in order to join the span-level embeddings.
def get_dataframe(
project_name: str,
run_name: str,
split: Split,
inference_name: str = "",
file_type: FileType = FileType.arrow,
include_embs: bool = False,
include_probs: bool = False,
include_token_indices: bool = False,
hf_format: bool = False,
tagging_schema: Optional[TaggingSchema] = None,
filter: Union[FilterParams, Dict] = None,
as_pandas: bool = True,
meta_cols: Optional[List[str]] = None,
) -> DataFrame:
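For reference, a sample-level NER row's spans column resembles spaCy's span JSON. A minimal sketch — the exact field names Galileo uses may differ; start, end, and label here follow spaCy's convention:

```python
# Illustrative sample-level NER row; field names are assumptions
# based on spaCy's span convention, not Galileo's exact schema.
sample = {
    "text": "Galileo was born in Pisa",
    "spans": [{"start": 20, "end": 24, "label": "LOC"}],
}

span = sample["spans"][0]
# Character offsets index directly into the sample text
span_text = sample["text"][span["start"]:span["end"]]
```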
Arguments:
Argument | Description |
---|---|
project_name | project to download data for |
run_name | run to download data for |
split | split to download data for |
file_type | the file type to download the data as. Default arrow. It's suggested to leave the default as is. |
include_embs | Whether to include the full logged embeddings in the data. If True for NER, the sample-level rows will be expanded to span-level rows in order to join the embeddings. Default False |
include_probs | Whether to include the full logged probabilities in the data. Not available for NER runs. Default False |
include_token_indices | (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining |
hf_format | (NER only) Whether to export your data in a huggingface-compatible format. This will return a dataframe with text, tokens, ner_tags, ner_tags_readable, and ner_labels, which is a mapping from your ner_tags to your labels |
tagging_schema | (NER only) If hf_format is True, this must be set. Must be one of BIO, BIOES, or BILOU |
filter | Optional filter to restrict the data to only matching rows. See the FilterParams section below, or, in code, run help(dq.metrics.FilterParams) |
as_pandas | Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False and vaex will memory-map the data. If any returned columns are multi-dimensional (embeddings, probabilities, etc.), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True |
Examples:
import dataquality as dq
from dataquality.schemas.metrics import FilterParams
project = "my_project"
run = "my_run"
split = "training"
df = dq.metrics.get_dataframe(project, run, split)
# This will be a vaex dataframe because embeddings are multi-dimensional
df_with_embs = dq.metrics.get_dataframe(project, run, split, include_embs=True)
# Filter dataframe
df_high_dep = dq.metrics.get_dataframe(
project, run, split, filter={"data_error_potential_low": 0.9}
)
# Or use the FilterParams
df_high_dep = dq.metrics.get_dataframe(
project, run, split, filter=FilterParams(data_error_potential_low=0.9)
)
# NER only
# This df will be at the sample level
df_with_tokens = dq.metrics.get_dataframe(project, run, split, include_token_indices=True)
# This df will be expanded to the span level
df_with_embs_and_tokens = dq.metrics.get_dataframe(project, run, split, include_embs=True, include_token_indices=True)
# Get your data into a huggingface dataset!
hf_df = dq.metrics.get_dataframe(
project, run, split, hf_format=True, tagging_schema="BIO"
)
hf_df.export("data.parquet")
from datasets import Dataset
ds = Dataset.from_parquet("data.parquet")
When using the get_dataframe function, the filter parameter can be passed as a dictionary or as a FilterParams object. You can import FilterParams via dq.metrics.FilterParams and run help(dq.metrics.FilterParams) to see all available filters.
The list of currently available filters:
Filter | Type | Usage |
---|---|---|
ids | List[int] | Filter for specific IDs in the dataframe (span in NER) |
similar_to | Optional[int] | If running similarity search, how many similar samples to get. More will take longer |
text_pat | Optional[StrictStr] | Filter text samples by some text pattern |
regex | Optional[bool] | If searching with text, whether to use regex |
data_error_potential_high | Optional[float] | Only samples with DEP <= this |
data_error_potential_low | Optional[float] | Only samples with DEP >= this |
misclassified_only | Optional[bool] | Only look at misclassified samples |
gold_filter | Optional[List[StrictStr]] | Filter GT classes |
pred_filter | Optional[List[StrictStr]] | Filter prediction classes |
class_filter | Optional[List[StrictStr]] | Filter for samples with these values as the GT OR prediction |
meta_filter | Optional[List[Metafilter]] | Filter on particular metadata columns in the dataframe. see help(dq.schemas.metrics.MetaFilter) |
inference_filter | Optional[InferenceFilter] | Specific filters related to inference data. See help(dq.schemas.metrics.InferenceFilter) |
span_sample_ids | Optional[List[int]] | (NER only) filter for full samples by ID (will return all spans in those samples) |
span_text | Optional[str] | (NER only) filter only on span text |
exclude_ids | List[int] | Opposite of ids filter. Exclude the ids passed in (will apply to spans in NER) |
lasso | Optional[LassoSelection] | Related to making a lasso selection from the UI. See the dq.schemas.metrics.LassoSelection class |
likely_mislabeled | Optional[bool] | Filter for only likely_mislabeled samples. False/None will return all samples |
likely_mislabeled_dep_percentile | Optional[int] | A percentile threshold for likely_mislabeled . This field (ranged 0-100) determines the precision of the likely_mislabeled filter. The threshold is applied against the DEP distribution of the likely_mislabeled samples. A threshold of 0 returns all, 100 returns 1 sample, and 50 will return the top 50% DEP samples that are likely_mislabeled. Higher = more precision, lower = more recall. Default 0. |
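Putting a few of these together, a filter can be expressed as a plain dict. The values below are illustrative; any of the fields in the table above can be combined the same way:

```python
# Combine several filters from the table above; values are illustrative
filter_params = {
    "misclassified_only": True,        # only misclassified samples
    "data_error_potential_low": 0.8,   # only samples with DEP >= 0.8
    "text_pat": "refund",              # only samples matching this text pattern
}
# This dict can be passed directly, e.g.:
# df = dq.metrics.get_dataframe(project, run, split, filter=filter_params)
```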
Downloads the data with all edits from the edits cart applied as a Vaex dataframe.
Optionally include the raw logged embeddings, probabilities, or text-token-indices (NER only).
Note: This function has the identical syntax and signature as get_dataframe, with the exception of the filter parameter, which it does not accept. See get_dataframe above for a full list of parameters and examples. Anything passed into get_dataframe (besides filter) can be passed to get_edited_dataframe in exactly the same way.
def get_edited_dataframe(
project_name: str,
run_name: str,
split: Split,
inference_name: str = "",
file_type: FileType = FileType.arrow,
include_embs: bool = False,
include_probs: bool = False,
include_token_indices: bool = False,
hf_format: bool = False,
tagging_schema: Optional[TaggingSchema] = None,
as_pandas: bool = True
) -> DataFrame:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
# Export the edited dataframe with all edits from the edits cart
edited_df = dq.metrics.get_edited_dataframe(project, run, split)
# See the probabilities with your edited data (this will be a vaex dataframe)
edited_df = dq.metrics.get_edited_dataframe(project, run, split, include_probs=True)
Get the full run summary for a run/split. This provides:
- overall metrics (weighted f1, recall, precision)
- DEP distribution
- misclassified pairs
- top 50 samples sorted by DEP descending
- top DEP words (NER only)
- performance per task (Multi-label only)
def get_run_summary(
project_name: str,
run_name: str,
split: Split,
task: Optional[str] = None,
inference_name: Optional[str] = None,
filter: Union[FilterParams, Dict] = None,
) -> Dict:
Example:
Returns | Description |
---|---|
Dict[str, Any] | A dictionary of many different fields of interest, encompassing a "summary" of this run/split, including performance metrics, some samples, distributions etc. |
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
summary = dq.metrics.get_run_summary(
project, run, split,
)
print(summary)
# See summary for only misclassified samples
mis_summary = dq.metrics.get_run_summary(
project, run, split, filter={"misclassified_only": True}
)
print(mis_summary)
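The summary is a plain dict, so the easiest way to discover what a particular run exposes is to inspect its keys. The keys below are illustrative only, not a guaranteed schema:

```python
# Illustrative shape only; the real keys vary by task type.
# Inspect sorted(summary) on your own run to see the actual fields.
summary = {"f1": 0.92, "recall": 0.90, "precision": 0.94}
available_fields = sorted(summary)
```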
Get metrics for classes grouped by a particular categorical column (or ground truth or prediction)
def get_metrics(
project_name: str,
run_name: str,
split: Split,
task: Optional[str] = None,
inference_name: Optional[str] = None,
category: str = "gold",
filter: Union[FilterParams, Dict] = None,
) -> Dict[str, List]:
Example:
Returns | Description |
---|---|
Dict[str, List] | A dictionary of keys -> list of values. The labels key is your x-axis, and the other keys are potential y-axes (useful for plotting or loading into a Pandas dataframe) |
import dataquality as dq
import pandas as pd
project = "my_project"
run = "my_run"
split = "training"
metrics = dq.metrics.get_metrics(
project, run, split, category="galileo_language_id"
)
metrics_df = pd.DataFrame(metrics)
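Since labels is the x-axis and the other keys are candidate y-axes, each metric column lines up index-by-index with labels. An illustrative sketch of the returned shape — the category values and metric numbers here are made up:

```python
# Illustrative get_metrics-style return value; keys/values are made up
metrics = {
    "labels": ["en", "fr", "de"],   # x-axis: one entry per category
    "f1": [0.91, 0.84, 0.88],       # a candidate y-axis, aligned by index
}
# Pair each category label with its metric value
f1_by_label = dict(zip(metrics["labels"], metrics["f1"]))
```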
Plots the distribution for a continuous column in your data. Defaults to Data Error Potential.
When plotting data error potential, the hard/easy DEP thresholds will be used for coloring. Otherwise no coloring is applied.
Plotly must be installed for this function to work.
def display_distribution(
project_name: str,
run_name: str,
split: Split,
task: Optional[str] = None,
inference_name: Optional[str] = None,
column: str = "data_error_potential",
filter: Union[FilterParams, Dict] = None,
) -> None:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
# Display DEP distribution colored by thresholds
dq.metrics.display_distribution(
project, run, split
)
# Display text length distribution
dq.metrics.display_distribution(
project, run, split, column="galileo_text_length"
)
# Display DEP distribution only for samples with gold/pred = class "APPLE"
dq.metrics.display_distribution(
project, run, split, filter={"class_filter": ["APPLE"]}
)
Returns the list of epochs logged for a run/split
def get_epochs(
project_name: str, run_name: str, split: Split
) -> List[int]:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
logged_epochs = dq.metrics.get_epochs(project, run, split)
Downloads the embeddings for a run/split at an epoch as a Vaex dataframe.
Optionally choose the epoch to get embeddings for, otherwise the latest epoch's embeddings will be chosen.
def get_embeddings(
project_name: str, run_name: str, split: Split, epoch: int = None
) -> DataFrame:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
latest_embs = dq.metrics.get_embeddings(project, run, split)
epochs = sorted(dq.metrics.get_epochs(project, run, split))
second_latest_embs = dq.metrics.get_embeddings(project, run, split, epoch=epochs[-2])
Downloads the probabilities for a run/split at an epoch as a Vaex dataframe.
Optionally choose the epoch to get probabilities for, otherwise the latest epoch's probabilities will be chosen.
def get_probabilities(
project_name: str, run_name: str, split: Split, epoch: int = None
) -> DataFrame:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
latest_probs = dq.metrics.get_probabilities(project, run, split)
epochs = sorted(dq.metrics.get_epochs(project, run, split))
second_latest_probs = dq.metrics.get_probabilities(project, run, split, epoch=epochs[-2])
Downloads the raw logged data for a run/split at an epoch as a Vaex dataframe.
Optionally choose the epoch to get raw data for, otherwise the latest epoch's data will be chosen.
For NER, this will download the text samples and text-token-indices
def get_raw_data(
project_name: str, run_name: str, split: Split, epoch: int = None
) -> DataFrame:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
df = dq.metrics.get_raw_data(project, run, split)
Gets labels for a given run. If multi-label, a task must be provided.
def get_labels_for_run(
project_name: str, run_name: str, task: Optional[str] = None
) -> List[str]:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
labels = dq.metrics.get_labels_for_run(project, run)
# for multi-label
tasks = dq.metrics.get_tasks_for_run(project, run)
labels = dq.metrics.get_labels_for_run(project, run, tasks[0])
Multi-label only. Gets tasks for a given run.
def get_tasks_for_run(project_name: str, run_name: str) -> List[str]:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
tasks = dq.metrics.get_tasks_for_run(project, run)
Get xray cards for a project/run/split.
Xray cards are automatic insights calculated and provided by Galileo on your data.
def get_xray_cards(
project_name: str, run_name: str, split: Split, inference_name: Optional[str] = None
) -> List[Dict[str, str]]:
Examples:
import dataquality as dq
project = "my_project"
run = "my_run"
split = "training"
cards = dq.metrics.get_xray_cards(project, run, split)