Advanced Usage: Faster processing
For larger datasets, you can speed up Galileo data processing by running parts of it on a GPU.
Nvidia's cuML libraries require CUDA 11.x to work properly. You can check your CUDA version by running `nvcc -V`. Do not rely on `nvidia-smi`; it reports the highest CUDA version supported by your driver, not the installed toolkit version. To learn more about this installation or to do it manually, see the installation guide.
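A quick sketch of checking the toolkit version from a terminal (the parsing assumes the usual `nvcc -V` output format, e.g. "Cuda compilation tools, release 11.8, V11.8.89"):

```shell
# Read the toolkit version from nvcc (not nvidia-smi, which reports
# the driver's supported CUDA version instead). Fall back to a dummy
# string if nvcc is not on the PATH.
nvcc_output=$(nvcc -V 2>/dev/null || echo "release 0.0")

# Extract just the "11.x"-style version number from the output.
cuda_version=$(echo "$nvcc_output" | grep -o 'release [0-9.]*' | awk '{print $2}')
echo "CUDA toolkit version: $cuda_version"
```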
If you are training on datasets with millions of samples and notice that Galileo processing slows down at the "Dimensionality Reduction" stage, you can optionally run those steps on the GPU that you are training your model with.
In order to leverage this feature, simply install:

```shell
pip install 'dataquality[cuda]' --extra-index-url=https://pypi.ngc.nvidia.com/
```
We pass `--extra-index-url` to the install because the extra required packages are hosted by Nvidia on Nvidia's own PyPI index (https://pypi.ngc.nvidia.com/), not the standard PyPI repository.
After running that installation, dataquality will automatically detect the available libraries and leverage your GPU to apply the dimensionality reduction.
Please validate that the installation ran correctly by running `import cuml` in your environment. This must complete successfully.
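A minimal sketch of that validation in Python, with a graceful fallback. Note that the `HAS_CUML` flag and `reduction_backend` helper are hypothetical illustrations, not part of the dataquality API:

```python
import importlib.util

# Detect whether cuML is importable without raising ImportError.
# find_spec returns None when the package is not installed.
HAS_CUML = importlib.util.find_spec("cuml") is not None


def reduction_backend() -> str:
    """Report which backend dimensionality reduction would run on."""
    return "gpu (cuml)" if HAS_CUML else "cpu"


print(reduction_backend())
```

If this prints `cpu` after installing `dataquality[cuda]`, the cuML wheels did not install correctly and the steps above should be re-checked.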
We install the Nvidia libraries via pip. This is experimental, and there are a number of other ways to install these libraries. If you'd like to install them without dataquality, you can do so by following Nvidia's official documentation.
Dataquality specifically needs the `cuml-cu11` library, which in turn depends on the following
[As of 03-13-2023] We specifically install all libraries pinned at version `22.12`. This is because after `22.12`, Nvidia changed the location where they host their packages from https://pypi.ngc.nvidia.com/ to https://pypi.nvidia.com/. This caused a number of issues, and the current suggestion from the community is to keep the version pinned at `22.12` until the issue is resolved. For more information, see this issue.
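If you need to reproduce that pin outside of dataquality's own dependency resolution, a requirements-file fragment might look like the following. This is a sketch: the wildcard pin and the index line are illustrative, and assume the `22.12` wheels remain available on the NGC index:

```
# requirements.txt fragment: pin cuML at the last NGC-hosted release
--extra-index-url https://pypi.ngc.nvidia.com/
cuml-cu11==22.12.*
```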