You have questions, we have (some) answers!

Q: How do I install the Galileo Python client?

pip install dataquality

Q: I'm seeing errors importing dataquality in Jupyter / Google Colab

Make sure you are running dataquality >= 0.8.6. The first thing to try is restarting your kernel: dataquality uses certain Python packages that require a kernel restart after installation. In Jupyter, click "Kernel -> Restart".
In Colab, click "Runtime -> Disconnect and delete runtime".
If you already had vaex installed on your machine before installing dataquality, there is a known bug when upgrading. Solution:
pip uninstall -y vaex-core vaex-hdf5 && pip install --upgrade --force-reinstall dataquality
Then restart your Jupyter/Colab kernel.
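To confirm which version is actually installed in the current environment, a minimal sketch using only the standard library might look like the following. The helper names (parse_version, meets_minimum, check_dataquality) are hypothetical and not part of dataquality, and the version parsing is deliberately simple: it only understands plain numeric versions such as 0.8.6.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = "0.8.6"  # minimum dataquality version named in this FAQ


def parse_version(text):
    """Turn '0.8.6' into (0, 8, 6). Parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """Compare two plain numeric version strings."""
    return parse_version(installed) >= parse_version(minimum)


def check_dataquality():
    """Report whether the installed dataquality satisfies the minimum version."""
    try:
        installed = version("dataquality")
    except PackageNotFoundError:
        return "dataquality is not installed in this environment"
    if meets_minimum(installed):
        return f"dataquality {installed} is recent enough"
    return f"dataquality {installed} is older than {MIN_VERSION}; upgrade it"


print(check_dataquality())
```

Run it in the same kernel that fails to import dataquality, since a notebook kernel can point at a different environment than your terminal.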

Q: My run finished, but there's no data in the console! What went wrong?

Make sure you ran dq.finish() after the run.
It's possible that:
  • your run hasn't finished processing
  • you've logged some data incorrectly
  • you may have found a bug (congrats!)
First, to see what happened to your data, run dq.wait_for_run(). You can optionally pass in the project and run name; otherwise the most recent run is used.
This function waits for your run to finish processing. If it completes, refresh the console and check again.
If it raises an exception, your run failed to process. You can see the logs from your model training by running dq.get_dq_log_file(), which downloads your log file and returns its path. That may indicate the issue. Feel free to reach out to us for more help!

Q: Can I log custom metadata to my dataset?

Yes (glad you asked)! You can attach any metadata fields you'd like to your original dataset, as long as they are primitive datatypes (numbers and strings).
In all available logging functions for input data, you can attach custom metadata:
import pandas as pd
import dataquality as dq
# Option 1: log a DataFrame and name the columns to treat as metadata
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "text": ["sen 1", "sen 2", "sen 3", "sen 4"],
    "label": [0, 1, 1, 0],
    "customer_score": [0.66, 0.98, 0.12, 0.05],
    "sentiment": ["happy", "sad", "happy", "angry"],
})
dq.log_dataset(df, meta=["customer_score", "sentiment"])
# Option 2: log samples directly and pass metadata as a dict of lists
texts = [
    "Text sample 1",
    "Text sample 2",
    "Text sample 3",
    "Text sample 4",
]
labels = ["B", "C", "A", "A"]
meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, 5.5, 1.2],
}
ids = [0, 1, 2, 3]
split = "training"
dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)
This data will show up in the console under the column dropdown.
You can also see any performance metric grouped by your categorical metadata.
Lastly, once active, you can further filter your data by your metadata fields, helping you find high-value cohorts.
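Because metadata values must be primitive (numbers and strings), it can help to check a metadata dict before logging it. The sketch below is a hypothetical pre-check, not something dataquality provides; it treats int, float and str as the allowed types, which is an assumption about what "primitive" covers here.

```python
ALLOWED_TYPES = (int, float, str)  # assumed reading of "numbers and strings"


def find_invalid_meta(meta):
    """Return {column: [indices]} for values that are not numbers or strings."""
    problems = {}
    for column, values in meta.items():
        bad = [i for i, value in enumerate(values) if not isinstance(value, ALLOWED_TYPES)]
        if bad:
            problems[column] = bad
    return problems


meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, None, 1.2],  # None is not a number or string
    "tags": [["a"], "b", "c", "d"],            # a list is not a primitive
}
print(find_invalid_meta(meta))
```

An empty result means every value passed the check; otherwise the indices point at the samples to fix before logging.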

Q: How do I disable Galileo logging during model training?

Q: How do I load a Galileo exported file for re-training?

from datasets import Dataset, DatasetDict
# Assumed import: hf is taken to be dataquality's Hugging Face integration module
from dataquality.integrations import hf
# tokenizer, data_collator and MINIBATCH_SIZE are assumed to be defined earlier in your training script
file_name_train = "exported_galileo_sample_file_train.parquet"
file_name_val = "exported_galileo_sample_file_val.parquet"
file_name_test = "exported_galileo_sample_file_test.parquet"
# Load each exported split
ds_train = Dataset.from_parquet(file_name_train)
ds_val = Dataset.from_parquet(file_name_val)
ds_test = Dataset.from_parquet(file_name_test)
ds_exported = DatasetDict({"train": ds_train, "validation": ds_val, "test": ds_test})
labels = ds_exported["train"]["ner_labels"][0]
# Tokenize, log to Galileo, and build the dataloaders
tokenized_datasets = hf.tokenize_and_log_dataset(ds_exported, tokenizer, labels)
train_dataloader = hf.get_dataloader(tokenized_datasets["train"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=True)
val_dataloader = hf.get_dataloader(tokenized_datasets["validation"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=False)
test_dataloader = hf.get_dataloader(tokenized_datasets["test"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=False)

Q: How do I get my NER data into huggingface format?

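import dataquality as dq
from datasets import Dataset
# project_name, run_name and split are assumed to be defined for your run
# get_dataframe returns a vaex dataframe
df = dq.metrics.get_dataframe(
    project_name, run_name, split, hf_format=True, tagging_schema="BIO"
)
# Export to parquet so it can be loaded as a Hugging Face Dataset
df.export("data.parquet")
ds = Dataset.from_parquet("data.parquet")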

Q: My spans JSON column for my NER data can't be loaded with json.loads

If you're seeing an error similar to:
JSONDecodeError: Expecting ',' delimiter: line 1 column 84 (char 83)
it's likely that some data in your text field is not valid JSON (extra quotes " or '). Unfortunately, we cannot modify the content of your span text, but we can strip out the text field with a regex. Given a pandas DataFrame df with a column spans (from a Galileo export), you can replace
df["spans"] = df["spans"].apply(json.loads)
with (make sure to import re)
df["spans"] = df["spans"].apply(lambda row: json.loads(re.sub(r',"text".*?}', "}", row)))
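To see the effect of stripping the text field on a single value, here is a small self-contained sketch. The sample string is hypothetical, and the pattern rests on two assumptions about the export: "text" is the last key in each span object, and the span text itself contains no closing brace. The helper name load_spans_without_text is also hypothetical.

```python
import json
import re

# Hypothetical export value: the first span's text contains an unescaped quote.
raw_spans = (
    '[{"start":0,"end":6,"label":"PER","text":"O"Neil"},'
    '{"start":10,"end":14,"label":"LOC","text":"Rome"}]'
)

# Matches from ,"text" up to the first closing brace (non-greedy),
# so each span's text field is removed separately.
TEXT_FIELD = re.compile(r',\s*"text".*?}')


def load_spans_without_text(value):
    """Drop each span's text field, then parse what is left as JSON."""
    return json.loads(TEXT_FIELD.sub("}", value))


try:
    json.loads(raw_spans)
except json.JSONDecodeError as err:
    print("plain json.loads fails:", err.msg)

print(load_spans_without_text(raw_spans))
```

The start, end and label fields survive, so the span text can still be recovered by slicing the sample text with the character indices if you need it.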

Q: Galileo marked an incorrect span as a span shift error, but it looks like a wrong tag error. What's going on?

Great observation! Let's take a real example below, from the WikiNER IT dataset. As you can see, the Anemone apennina clearly looks like a wrong tag error (correct span boundaries, incorrect class prediction), but is marked as a span shift.
We can further validate this with dq.metrics.get_dataframe. We can see that there are 2 spans with identical character boundaries, one with a label and one without (which is the prediction span).
So what is going on here? When Galileo computes error types for each span, they are computed at the byte-pair (BPE) level using the span token indices, not the character indices. When looking at the console, however, you are seeing the character-level indices, because that's a much more intuitive view of your data. That conversion from token (fine-grained) to character (coarse-grained) level indices can cause index differences to overlap as a result of less-granular information.
We can again validate this with dq.metrics by looking at the raw data logged to Galileo. As we can see, at the token level, the span start and end indices do not align, and in fact overlap (ids 21948 and 21950), which is the reason for the span_shift error 🤗

Q: What do you mean when you say the deployment logs are written to Google Cloud?

We manage deployments and updates to the versions of services running in your cluster via GitHub Actions. Each deployment or update produces logs that go into a bucket on Galileo's cloud (GCP). During our private deployment process (for Enterprise users), customers can provide us with their email addresses to get access to these deployment logs.

Q: Where are the client logs stored?

The client logs are stored in the home (~) folder of the machine where the training occurs.

Q: Does Galileo store data in the cloud?

For Enterprise users, data does not leave the customer VPC/data center. For users of the Free version of our product, we store data and model outputs on secured servers in the cloud. We pride ourselves on taking data security very seriously.

Q: Do you offer air-gapped deployments?

Yes, we do! Contact us to learn more.

Q: How do I contact Galileo?

You can write us at team[at]

Q: How do I convert my vaex dataframe to a pandas DataFrame when using dq.metrics.get_dataframe?

Simply append .to_pandas_df() to the call: dq.metrics.get_dataframe(...).to_pandas_df()

Q: Importing dataquality throws a permissions error PermissionError

Galileo creates a folder in your system's HOME directory. If you are seeing a PermissionError, it means that your process does not have access to your current HOME directory. This may happen in an automated CI system like AWS Glue. To overcome this, set the HOME environment variable from Python to somewhere accessible before importing dataquality, for example the current working directory:
import os
# Set the HOME directory to the current working directory
os.environ["HOME"] = os.getcwd()
import dataquality as dq
This only affects the current Python runtime; it does not change your system's HOME directory. Because of that, you will need to set the HOME variable again in each new Python runtime or script.
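If the same script runs both locally and in CI, you may prefer to override HOME only when the current value is not writable. The following is a minimal sketch of that idea; ensure_writable_home is a hypothetical helper, and falling back to a temporary directory is an assumption rather than a Galileo recommendation.

```python
import os
import tempfile


def ensure_writable_home(fallback=None):
    """Keep HOME if it is a writable directory; otherwise point it at a fallback.

    Returns the directory that HOME refers to afterwards.
    """
    home = os.environ.get("HOME", "")
    if home and os.path.isdir(home) and os.access(home, os.W_OK):
        return home
    # Use the caller's fallback, or create a fresh temporary directory.
    target = fallback or tempfile.mkdtemp(prefix="dq_home_")
    os.environ["HOME"] = target
    return target


print("HOME is", ensure_writable_home())
# import dataquality as dq  # import only after HOME has been checked
```

As with the snippet above, the change applies only to the current process, so call the helper before importing dataquality in every new runtime.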

Q: vaex-core fails to build with Python 3.10 on macOS Monterey

When installing dataquality with Python 3.10 on macOS Monterey, you might encounter an issue when building the vaex-core binaries. To fix it, follow the instructions in the failure output, which may include running xcodebuild -runFirstLaunch and allowing any clang permission requests that pop up.