Advanced Usage: Faster processing

For larger datasets, you can speed up Galileo's data processing by running part of it on your GPU.

Note: You must be running CUDA 11.X for this functionality to work.

Nvidia's cuML libraries require CUDA 11.X to work properly. You can check your CUDA version by running nvcc -V. Do not rely on nvidia-smi: it reports the highest CUDA version your driver supports, not the version of the CUDA toolkit that is actually installed. To learn more about this installation or to do it manually, see the installation guide.
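The version check above can also be scripted. The following is a minimal sketch, assuming nvcc prints a line of the form "release 11.8, V11.8.89" as current toolkits do. The helper names parse_nvcc_release and installed_cuda_version are hypothetical and are not part of dataquality.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc -V` output, or None if no release is found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_version() -> Optional[Tuple[int, int]]:
    """Run `nvcc -V` when nvcc is on PATH; return None when it is not available."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "-V"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    version = installed_cuda_version()
    if version is None:
        print("nvcc not found; could not determine the CUDA toolkit version")
    elif version[0] == 11:
        print(f"CUDA {version[0]}.{version[1]} detected (11.X, as required)")
    else:
        print(f"CUDA {version[0]}.{version[1]} detected; CUDA 11.X is required")
```

Parsing is kept separate from running the command so the version logic can be exercised on machines without a CUDA toolkit.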
If you are training on datasets with millions of samples and notice that Galileo processing slows down at the "Dimensionality Reduction" stage, you can optionally run those steps on the GPU that you are training your model with.
To use this feature, install dataquality with the [cuda] extra:
pip install 'dataquality[cuda]' --extra-index-url=
We pass --extra-index-url to the install because the additional required packages are hosted by Nvidia on its own package index, not on the standard PyPI repository.
After installation, dataquality automatically detects the available libraries and uses your GPU for the dimensionality reduction.
Please validate that the installation ran correctly by running import cuml in your environment. This must complete successfully.
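The import check can be wrapped in a small script, for example to run at the start of a training job. This is a minimal sketch. The helper name can_import is hypothetical, and the only assumption about cuml is that it is importable under that module name, as stated above.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Try to import a module and report whether the import succeeded."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        # ImportError if the package is missing; other errors can surface
        # when the package is present but the CUDA setup is broken.
        print(f"Could not import {module_name}: {exc!r}")
        return False
    return True


if __name__ == "__main__":
    if can_import("cuml"):
        print("cuml imported successfully")
    else:
        print("cuml is not usable in this environment")
```

The script performs a real import instead of only looking the package up, because a package can be installed and still fail to load when its CUDA libraries are missing.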

Some notes for those interested

We install the Nvidia libraries via pip. This is experimental, and there are a number of other ways to install these libraries. If you'd like to install them without dataquality, you can do so by following Nvidia's official documentation.
Dataquality specifically needs the cuml-cu11 library, but it is dependent on the following
[As of 03-13-2023] We specifically install all libraries pinned at version 22.12. This is because after 22.12, Nvidia changed the location where they host their packages. This caused a number of issues, and the current suggestion from the community is to stay pinned at 22.12 until the issue is resolved. For more information, see this issue.
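To confirm that the pin took effect in your environment, you can inspect the installed package metadata. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and it assumes the installed distribution is named cuml-cu11 as noted above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_pinned_release(version: Optional[str], release: str = "22.12") -> bool:
    """True when `version` is `release` itself or a patch of it (e.g. 22.12.0)."""
    if version is None:
        return False
    return version == release or version.startswith(release + ".")


if __name__ == "__main__":
    found = installed_version("cuml-cu11")
    if found is None:
        print("cuml-cu11 is not installed")
    elif is_pinned_release(found):
        print(f"cuml-cu11 {found} matches the 22.12 pin")
    else:
        print(f"cuml-cu11 {found} does not match the 22.12 pin")
```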