🙋
FAQs
You have questions, we have (some) answers!
- 1. Most Frequent High DEP words
- 2. Span-level Embeddings
- 3. What do the different Error Types mean?
pip install dataquality
Make sure you are running at least
dataquality >= 0.8.6
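If it helps, the version requirement above can be checked from Python before going further. This is a minimal sketch using only the standard library; `parse_version`, `meets_minimum` and `installed_dataquality_version` are hypothetical helper names, not part of dataquality, and pre-release suffixes such as `rc1` are simply ignored.

```python
import re
from importlib import metadata

MINIMUM = "0.8.6"  # minimum version named in this FAQ


def parse_version(version: str) -> tuple:
    """Turn '0.8.6' (or '0.8.6rc1') into a comparable tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_dataquality_version():
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version("dataquality")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_dataquality_version()
    if found is None:
        print("dataquality is not installed")
    else:
        print(found, "OK" if meets_minimum(found) else "too old, please upgrade")
```

Comparing tuples of integers avoids the string-comparison trap where "0.10.0" would sort before "0.8.6".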
The first thing to try in this case is to restart your kernel. Dataquality uses certain Python packages that require your kernel to be restarted after installation.
In Jupyter you can click "Kernel -> Restart"
In Colab you can click "Runtime -> Disconnect and delete runtime"
If you already had vaex installed on your machine prior to installing
dataquality,
there is a known bug when upgrading.
Solution:
pip uninstall -y vaex-core vaex-hdf5 && pip install --upgrade --force-reinstall dataquality
And then restart your Jupyter/Colab kernel.
Make sure you ran
dq.finish()
after the run. It's possible that:
- your run hasn't finished processing
- you've logged some data incorrectly
- you may have found a bug (congrats!)
First, to see what happened to your data, you can run
dq.wait_for_run()
(you can optionally pass in the project and run name; otherwise the most recent run is used). This function will wait for your run to finish processing. If it's completed, check the console again by refreshing.
If that shows an exception, your run failed to be processed. You can see the logs from your model training by running
dq.get_dq_log_file()
which will download and return the path to your logfile. That may indicate the issue. Feel free to reach out to us for more help!
Yes (glad you asked)! You can attach any metadata fields you'd like to your original dataset, as long as they are primitive datatypes (numbers and strings).
In all available logging functions for input data, you can attach custom metadata:
df = pd.DataFrame(
{
"id": [0,1,2,3],
"text": ["sen 1","sen 2","sen 3","sen 4"],
"label": [0, 1, 1, 0],
"customer_score": [0.66, 0.98, 0.12, 0.05],
"sentiment": ["happy", "sad", "happy", "angry"]
}
)
dq.log_dataset(df, meta=["customer_score", "sentiment"])
texts = [
"Text sample 1",
"Text sample 2",
"Text sample 3",
"Text sample 4"
]
labels = ["B", "C", "A", "A"]
meta = {
"sample_importance": ["high", "low", "low", "medium"],
"quality_ranking": [9.7, 2.4, 5.5, 1.2]
}
ids = [0, 1, 2, 3]
split = "training"
dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)
This data will show up in the console under the column dropdown
And you can see any performance metric grouped by your categorical metadata
Lastly, once active, you can further filter your data by your metadata fields, helping find high-value cohorts
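Since metadata fields must be primitive datatypes (numbers and strings), it can help to sanity-check a metadata dictionary before logging it. The sketch below is only an illustration: `validate_meta` is a hypothetical helper, not a dataquality function, and the rule that each field needs one value per sample is an assumption based on the examples above.

```python
def validate_meta(meta: dict, n_samples: int) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, values in meta.items():
        # Assumption: every metadata field carries one value per sample.
        if len(values) != n_samples:
            problems.append(
                f"{name}: expected {n_samples} values, got {len(values)}"
            )
        for i, value in enumerate(values):
            # Only plain numbers and strings are treated as primitive here.
            if not isinstance(value, (str, int, float)):
                problems.append(
                    f"{name}[{i}]: {type(value).__name__} is not a number or string"
                )
                break  # one report per field is enough
    return problems


meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, 5.5, 1.2],
}
print(validate_meta(meta, n_samples=4))  # []
```

Returning a list of problems instead of raising lets you report every bad field in one pass before calling any logging function.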
from datasets import Dataset, dataset_dict
file_name_train = "exported_galileo_sample_file_train.parquet"
file_name_val = "exported_galileo_sample_file_val.parquet"
file_name_test = "exported_galileo_sample_file_test.parquet"
ds_train = Dataset.from_parquet(file_name_train)
ds_val = Dataset.from_parquet(file_name_val)
ds_test = Dataset.from_parquet(file_name_test)
ds_exported = dataset_dict.DatasetDict({"train": ds_train, "validation": ds_val, "test": ds_test})
labels = ds_exported["train"]["ner_labels"][0]
tokenized_datasets = hf.tokenize_and_log_dataset(ds_exported, tokenizer, labels)
train_dataloader = hf.get_dataloader(tokenized_datasets["train"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=True)
val_dataloader = hf.get_dataloader(tokenized_datasets["validation"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=False)
test_dataloader = hf.get_dataloader(tokenized_datasets["test"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=False)
import dataquality as dq
from datasets import Dataset
dq.login()
# A vaex dataframe
df = dq.metrics.get_dataframe(
project_name, run_name, split, hf_format=True, tagging_schema="BIO"
)
df.export("data.parquet")
ds = Dataset.from_parquet("data.parquet")
If you're seeing an error similar to:
JSONDecodeError: Expecting ',' delimiter: line 1 column 84 (char 83)
It's likely that the text field of your spans contains data that is not valid JSON (extra quotes, " or '). Unfortunately, we cannot modify the content of your span text, but we can strip out the text field with some regex.
Given a pandas dataframe df with column spans (from a Galileo export), you can replace
df["spans"] = df["spans"].apply(json.loads)
with the following (make sure to import re). This assumes text is the last key in each span:
df["spans"] = df["spans"].apply(lambda row: json.loads(re.sub(r',\s*"text".*?}', "}", row)))
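To see the effect of stripping the text field on a concrete string, here is a small self-contained sketch. The sample span strings are made up, `load_spans` is a hypothetical helper, and the pattern assumes that "text" is the last key in each span object and that the span text contains no "}" character.

```python
import json
import re

# Matches from the comma before "text" up to the first closing brace.
# Assumes "text" is the last key in each span and contains no "}".
TEXT_FIELD = re.compile(r',\s*"text":.*?\}')


def load_spans(raw: str) -> list:
    """Parse a spans string, dropping the "text" field only if it breaks JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(TEXT_FIELD.sub("}", raw))


# A made-up span whose text contains a stray double quote.
raw = '[{"start": 0, "end": 5, "label": "PER", "text": "O"Neil"}]'
print(load_spans(raw))  # [{'start': 0, 'end': 5, 'label': 'PER'}]
```

Trying the plain `json.loads` first keeps the text field for rows that were valid to begin with; with a dataframe this would be applied as `df["spans"].apply(load_spans)`.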
Great observation! Let's take a real example below, from the WikiNER IT dataset. As you can see, the Anemone apennina span clearly looks like a wrong tag error (correct span boundaries, incorrect class prediction), but is marked as a span shift.
We can further validate this with dq.metrics.get_dataframe. We can see that there are 2 spans with identical character boundaries, one with a label and one without (which is the prediction span).
So what is going on here? When Galileo computes error types for each span, they are computed at the byte-pair (BPE) token level using the span token indices, not the character indices. When looking at the console, however, you are seeing the character-level indices, because that's a much more intuitive view of your data. That conversion from token (fine-grained) to **character** (coarse-grained) level indices can hide index differences: spans whose token boundaries differ may appear to have identical character boundaries as a result of the less-granular information.
We can again validate this with dq.metrics by looking at the raw data logged to Galileo. As we can see, at the token level, the span start and end indices do not align, and in fact overlap (ids 21948 and 21950), which is the reason for the span_shift error 🤗
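The toy example below illustrates how a coarser view can hide a token-level mismatch. It is not Galileo's actual conversion, which is not described here: the tokenization, the offsets, and the rule that token indices are displayed at the boundaries of the word containing them are all assumptions made for illustration.

```python
# Hypothetical sub-word tokens for "Anemone apennina" as
# (token, char_start, char_end, word_index). Made up for illustration.
TOKENS = [
    ("An", 0, 2, 0),
    ("emone", 2, 7, 0),
    ("ap", 8, 10, 1),
    ("enn", 10, 13, 1),
    ("ina", 13, 16, 1),
]


def word_bounds(word_index):
    """Character start/end of a whole word, computed from its tokens."""
    starts = [s for _, s, _, w in TOKENS if w == word_index]
    ends = [e for _, _, e, w in TOKENS if w == word_index]
    return min(starts), max(ends)


def token_span_to_chars(tok_start, tok_end):
    """Map a [tok_start, tok_end) token span to word-aligned characters."""
    first_word = TOKENS[tok_start][3]
    last_word = TOKENS[tok_end - 1][3]
    return word_bounds(first_word)[0], word_bounds(last_word)[1]


gold_tokens = (0, 5)  # gold span covers all five tokens
pred_tokens = (1, 5)  # prediction starts one token late

gold_chars = token_span_to_chars(*gold_tokens)
pred_chars = token_span_to_chars(*pred_tokens)

print(gold_tokens == pred_tokens)  # False: a shift at the token level
print(gold_chars == pred_chars)    # True: identical at the character level
```

Under these assumptions the two spans look identical in a character-level view while still differing by one token, which is the kind of situation that would be reported as a span shift.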
We manage deployments and updates to the versions of services running in your cluster via GitHub Actions. Each deployment/update produces logs that go into a bucket on Galileo's cloud (GCP). During our private deployment process (for Enterprise users), we allow customers to provide us with their emails, so they can have access to these deployment logs.
The client logs are stored in the home (~) folder of the machine where the training occurs.
For Enterprise users, data does not leave the customer VPC/Data Center. For users of the Free version of our product, we store data and model outputs in secured servers in the cloud. We pride ourselves on taking data security very seriously.
Yes, we do! Contact us to learn more.
You can write us at team[at]rungalileo.io
Simply add
dq.metrics.get_dataframe(...).to_pandas_df()
Galileo creates a folder in your system's HOME directory. If you are seeing a PermissionError, it means that your process does not have access to your current HOME directory. This may happen in an automated CI system like AWS Glue. To overcome this, simply change your HOME environment variable in Python to somewhere accessible, for example the current directory you are in:
import os
# Set the HOME directory to the current working directory
os.environ["HOME"] = os.getcwd()
import dataquality as dq
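A slightly more defensive variant, as a sketch: only override HOME when the current value is not a writable directory. `ensure_writable_home` is a hypothetical helper name, and the temporary-directory fallback is an assumption rather than something dataquality requires.

```python
import os
import tempfile


def ensure_writable_home(fallback=None):
    """Point HOME at a writable directory if the current one is not.

    Returns the directory HOME ends up pointing to.
    """
    home = os.environ.get("HOME", "")
    if home and os.path.isdir(home) and os.access(home, os.W_OK):
        return home  # current HOME is usable, leave it alone
    target = fallback or os.getcwd()
    if not os.access(target, os.W_OK):
        # Last resort: a fresh temporary directory.
        target = tempfile.mkdtemp(prefix="dq_home_")
    os.environ["HOME"] = target
    return target


ensure_writable_home()
# import dataquality as dq  # import only after HOME has been set
```

As with the shorter version, the call has to run before dataquality is imported.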
This will only affect the current Python runtime; it will not change your system's HOME directory. Because of that, if you run a new Python script in this environment again, you will need to set the HOME variable in each new runtime.
When installing dataquality with Python 3.10 on macOS Monterey you might encounter an issue when building vaex-core binaries. To fix any issues that come up, please follow the instructions in the failure output, which may include running
xcodebuild -runFirstLaunch
and also allowing any clang permission requests that pop up.
For larger datasets you can speed up model training by running CUDA.
Note: You must be running CUDA 11.X for this functionality to work.
The cuML libraries require CUDA 11.X to work properly. You can check your CUDA version by running
nvcc -V
Do not rely on nvidia-smi; it does not give you the true CUDA version. To learn more about this installation or to do it manually, see the installation guide.
If you are training on datasets in the millions and notice that Galileo processing slows down at the "Dimensionality Reduction" stage, you can optionally run those steps on the GPU/TPU that you are training your model with.
In order to leverage this feature, simply install dataquality with the [cuda] extra.
pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/
We pass the extra-index-url to the install because the extra required packages are hosted by Nvidia on its own PyPI repository, not the standard PyPI repository.
After running that installation, dataquality will automatically pick up the available libraries and leverage your GPU/TPU to apply the dimensionality reduction.
Please validate that the installation ran correctly by running
import cuml
in your environment. This must complete successfully.
To manually install these packages (at your own risk), you can run
pip install cuml-cu11 ucx-py-cu11 rmm-cu11 raft-dask-cu11 pylibraft-cu11 dask-cudf-cu11 cudf-cu11 --extra-index-url=https://pypi.nvidia.com/
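As a pre-flight check before either install path, the CUDA 11.X requirement noted above can be verified from Python by parsing the output of nvcc -V. This is a sketch only: `parse_nvcc_release` and `cuda_11_available` are hypothetical helper names, and the parser relies on the usual "release X.Y" text that nvcc prints.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output: str):
    """Extract (major, minor) from `nvcc -V` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_11_available() -> bool:
    """True if nvcc is on PATH and reports a CUDA 11.X toolkit."""
    if shutil.which("nvcc") is None:
        return False
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    release = parse_nvcc_release(result.stdout)
    return release is not None and release[0] == 11


sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))  # (11, 8)
print(cuda_11_available())
```

Keeping the parsing separate from the subprocess call makes the check easy to test on machines without a GPU.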