# index.html.md

# NVIDIA NIM Microservices

## Natural Language Processing

### NeMo Retriever

Get access to state-of-the-art models for building text Q&A retrieval pipelines with high accuracy.

- Text Embedding
- Text Reranking

# index.html.md

# Local

Choose your preferred installation method for running RAPIDS:

- **Conda**: Install RAPIDS using conda
- **Docker**: Install RAPIDS using Docker
- **pip**: Install RAPIDS using pip
- **WSL2**: Install RAPIDS on Windows using Windows Subsystem for Linux version 2 (WSL2)

# index.html.md

# HPC

RAPIDS works extremely well in traditional HPC (High Performance Computing) environments where GPUs are often co-located with accelerated networking hardware such as InfiniBand. Deploying on HPC often means using queue management systems such as SLURM, LSF, PBS, etc.

## SLURM

#### WARNING

This is a legacy page and may contain outdated information. We are working hard to update our documentation with the latest and greatest information, thank you for bearing with us.

If you are unfamiliar with SLURM or need a refresher, we recommend the [quickstart guide](https://slurm.schedmd.com/quickstart.html).

Depending on how your nodes are configured, additional settings may be required, such as defining the number of GPUs desired (`--gpus`) or the number of GPUs per node (`--gpus-per-node`). In the following example, we assume each allocation runs on a DGX-1 with access to all eight GPUs.

### Start Scheduler

First, start the scheduler with the following SLURM script. This and the following scripts can be deployed with `salloc` for interactive usage or `sbatch` for batched runs.

```bash
#!/usr/bin/env bash
#SBATCH -J dask-scheduler
#SBATCH -n 1
#SBATCH -t 00:10:00

module load cuda/11.0.3

CONDA_ROOT=/nfs-mount/user/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids

LOCAL_DIRECTORY=/nfs-mount/dask-local-directory
mkdir -p $LOCAL_DIRECTORY

CUDA_VISIBLE_DEVICES=0 dask-scheduler \
    --protocol tcp \
    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" &

dask-cuda-worker \
    --rmm-pool-size 14GB \
    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"
```

Notice that we configure the scheduler to write a `scheduler-file` to an NFS-accessible location. This file contains metadata about the scheduler, including its IP address and port, and serves as input to the workers, telling them which address and port to connect to. The scheduler doesn't need the whole node to itself, so we also start a worker on this node to fill out the unused resources.

### Start Dask CUDA Workers

Next, start the other [dask-cuda workers](https://docs.rapids.ai/api/dask-cuda/nightly/). Dask-CUDA extends the traditional Dask `Worker` class with specific options and enhancements for GPU environments. Unlike the scheduler and client scripts, the worker script should be scalable and allow users to tune how many workers are created. For example, we can scale the number of nodes to 3: `sbatch/salloc -N3 dask-cuda-worker.script`. In this case, because we have 8 GPUs per node and 3 nodes, our job will have 24 workers.

```bash
#!/usr/bin/env bash
#SBATCH -J dask-cuda-workers
#SBATCH -t 00:10:00

module load cuda/11.0.3

CONDA_ROOT=/nfs-mount/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids

LOCAL_DIRECTORY=/nfs-mount/dask-local-directory
mkdir -p $LOCAL_DIRECTORY

dask-cuda-worker \
    --rmm-pool-size 14GB \
    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"
```

### cuDF Example Workflow

Lastly, we can now run a job on the established Dask cluster.
```bash
#!/usr/bin/env bash
#SBATCH -J dask-client
#SBATCH -n 1
#SBATCH -t 00:10:00

module load cuda/11.0.3

CONDA_ROOT=/nfs-mount/miniconda3
source $CONDA_ROOT/etc/profile.d/conda.sh
conda activate rapids

LOCAL_DIRECTORY=/nfs-mount/dask-local-directory

# Write the example script; the unquoted heredoc lets $LOCAL_DIRECTORY expand
cat << EOF > /tmp/dask-cudf-example.py
import cudf
import dask.dataframe as dd
from dask.distributed import Client

client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json")

cdf = cudf.datasets.timeseries()
ddf = dd.from_pandas(cdf, npartitions=10)

res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute()
print(res)
EOF

python /tmp/dask-cudf-example.py
```

### Confirm Output

Putting the above together will result in the following output:

```bash
                      x                           y
                   mean        sum count       mean        sum count
id   name
1077 Laura     0.028305   1.868120    66  -0.098905  -6.527731    66
1026 Frank     0.001536   1.414839   921  -0.017223 -15.862306   921
1082 Patricia  0.072045   3.602228    50   0.081853   4.092667    50
1007 Wendy     0.009837  11.676199  1187   0.022978  27.275216  1187
976  Wendy    -0.003663  -3.267674   892   0.008262   7.369577   892
...                 ...        ...   ...        ...        ...   ...
912  Michael   0.012409   0.459119    37   0.002528   0.093520    37
1103 Ingrid   -0.132714  -1.327142    10   0.108364   1.083638    10
998  Tim       0.000587   0.747745  1273   0.001777   2.262094  1273
941  Yvonne    0.050258  11.358393   226   0.080584  18.212019   226
900  Michael  -0.134216  -1.073729     8   0.008701   0.069610     8

[6449 rows x 6 columns]
```
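Because the worker jobs may sit in the SLURM queue for a while, it can also help to confirm that all expected workers have registered before submitting real work from the client script. Below is a minimal sketch, assuming the same NFS-mounted scheduler file as above and the three-node, eight-GPU-per-node allocation described earlier.

```python
from dask.distributed import Client

# Connect using the scheduler file written to the shared NFS mount
client = Client(scheduler_file="/nfs-mount/dask-local-directory/dask-scheduler.json")

# Block until all 24 workers (3 nodes x 8 GPUs) have registered,
# or raise after 10 minutes if the SLURM jobs have not started yet
client.wait_for_workers(n_workers=24, timeout=600)

# Print a short summary of the cluster before running the workload
info = client.scheduler_info()
print(f"{len(info['workers'])} workers connected")
```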

# index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Custom RAPIDS Docker Guide This guide provides instructions for building custom RAPIDS Docker containers. This approach allows you to select only the RAPIDS libraries you need, which is ideal for creating minimal, customizable images that can be tuned to your requirements. #### NOTE For quick setup with pre-built containers that include the full RAPIDS suite, please see the [Official RAPIDS Docker Installation Guide](https://docs.rapids.ai/install#docker). ## Overview Building a custom RAPIDS container offers several advantages: - **Minimal Image Sizes**: By including only the libraries you need, you can reduce the final image size. - **Flexible Configuration**: You have full control over library versions and dependencies. ## Getting Started To begin, you will need to create a few local files for your custom build: a `Dockerfile` and a configuration file (`env.yaml` for conda or `requirements.txt` for pip). The templates for these files is provided in the Docker Templates section below for you to copy. 1. **Create a Project Directory**: It’s best practice to create a dedicated folder for your custom build. ```bash mkdir rapids-custom-build && cd rapids-custom-build ``` 2. **Prepare Your Project Files**: Based on your chosen approach (conda or pip), create the necessary files in your project directory from the corresponding tab in the Docker Templates section below. 3. **Customize Your Build**: - When using **conda**, edit your local `env.yaml` file to add the desired RAPIDS libraries. - When using **pip**, edit your local `requirements.txt` file with your desired RAPIDS libraries. 4. **Build the Image**: Use the commands provided in the Build and Run section to create and start your custom container. --- ## Package Manager Differences The choice of base image depends on how your package manager handles CuPy (a dependency for most RAPIDS libraries) and CUDA library dependencies: ### Conda → Uses `cuda-base` ```dockerfile FROM nvidia/cuda:12.9.1-base-ubuntu24.04 ``` This approach works because conda can install both Python and non-Python dependencies, including system-level CUDA libraries like `libcudart` and `libnvrtc`. When installing RAPIDS libraries via conda, the package manager automatically pulls the required CUDA runtime libraries alongside CuPy and other dependencies, providing complete dependency management in a single installation step. ### Pip → Uses `cuda-runtime` ```dockerfile FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04 ``` This approach is necessary because CuPy wheels distributed via PyPI do not currently bundle CUDA runtime libraries (`libcudart`, `libnvrtc`) within the wheel packages themselves. Since pip cannot install system-level CUDA libraries, CuPy expects these libraries to already be present in the system environment. The `cuda-runtime` image provides the necessary CUDA runtime libraries that CuPy requires, eliminating the need for manual library installation. ## Docker Templates The complete source code for the Dockerfiles and their configurations are included here. Choose your preferred package manager. ### conda This method uses conda and is ideal for workflows that are based on `conda`. **`rapids-conda.Dockerfile`** ```dockerfile # syntax=docker/dockerfile:1 # Copyright (c) 2024-2025, NVIDIA CORPORATION. 
ARG CUDA_VER=12.9.1 ARG LINUX_DISTRO=ubuntu ARG LINUX_DISTRO_VER=24.04 FROM nvidia/cuda:${CUDA_VER}-base-${LINUX_DISTRO}${LINUX_DISTRO_VER} SHELL ["/bin/bash", "-euo", "pipefail", "-c"] # Install system dependencies RUN apt-get update && \ apt-get install -y --no-install-recommends \ wget \ curl \ git \ ca-certificates \ && rm -rf /var/lib/apt/lists/* # Install Miniforge RUN wget -qO /tmp/miniforge.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" && \ bash /tmp/miniforge.sh -b -p /opt/conda && \ rm /tmp/miniforge.sh && \ /opt/conda/bin/conda clean --all --yes # Add conda to PATH and activate base environment ENV PATH="/opt/conda/bin:${PATH}" ENV CONDA_DEFAULT_ENV=base ENV CONDA_PREFIX=/opt/conda # Create conda group and rapids user RUN groupadd -g 1001 conda && \ useradd -rm -d /home/rapids -s /bin/bash -g conda -u 1001 rapids && \ chown -R rapids:conda /opt/conda USER rapids WORKDIR /home/rapids # Copy the environment file template COPY --chmod=644 env.yaml /home/rapids/env.yaml # Update the base environment with user's packages from env.yaml # Note: The -n base flag ensures packages are installed to the base environment # overriding any 'name:' specified in the env.yaml file RUN /opt/conda/bin/conda env update -n base -f env.yaml && \ /opt/conda/bin/conda clean --all --yes CMD ["bash"] ``` **`env.yaml`** (Conda environment configuration) ```yaml name: base channels: - "rapidsai-nightly" - conda-forge - nvidia dependencies: - python=3.12 - cudf=25.12 ``` ### pip This approach uses Python virtual environments and is ideal for workflows that are already based on `pip`. **`rapids-pip.Dockerfile`** ```dockerfile # syntax=docker/dockerfile:1 # Copyright (c) 2024-2025, NVIDIA CORPORATION. ARG CUDA_VER=12.9.1 ARG PYTHON_VER=3.12 ARG LINUX_DISTRO=ubuntu ARG LINUX_DISTRO_VER=24.04 # Use CUDA runtime image for pip FROM nvidia/cuda:${CUDA_VER}-runtime-${LINUX_DISTRO}${LINUX_DISTRO_VER} ARG PYTHON_VER SHELL ["/bin/bash", "-euo", "pipefail", "-c"] # Install system dependencies RUN apt-get update && \ apt-get install -y --no-install-recommends \ python${PYTHON_VER} \ python${PYTHON_VER}-venv \ python3-pip \ wget \ curl \ git \ ca-certificates \ && rm -rf /var/lib/apt/lists/* # Create symbolic links for python and pip RUN ln -sf /usr/bin/python${PYTHON_VER} /usr/bin/python && \ ln -sf /usr/bin/python${PYTHON_VER} /usr/bin/python3 # Create rapids user RUN groupadd -g 1001 rapids && \ useradd -rm -d /home/rapids -s /bin/bash -g rapids -u 1001 rapids USER rapids WORKDIR /home/rapids # Create and activate virtual environment RUN python -m venv /home/rapids/venv ENV PATH="/home/rapids/venv/bin:$PATH" ENV VIRTUAL_ENV="/home/rapids/venv" # Upgrade pip RUN pip install --no-cache-dir --upgrade pip setuptools wheel # Copy the requirements file COPY --chmod=644 requirements.txt /home/rapids/requirements.txt # Install all packages RUN pip install --no-cache-dir -r requirements.txt CMD ["bash"] ``` **`requirements.txt`** (Pip package requirements) ```text # RAPIDS libraries (pip versions) cudf-cu12==25.12.*,>=0.0.0a0 ``` --- ## Build and Run ### Conda After copying the source files into your local directory: ```bash # Build the image docker build -f rapids-conda.Dockerfile -t rapids-conda-base . # Start a container with an interactive shell docker run --gpus all -it rapids-conda-base ``` ### Pip After copying the source files into your local directory: ```bash # Build the image docker build -f rapids-pip.Dockerfile -t rapids-pip-base . 
# Start a container with an interactive shell docker run --gpus all -it rapids-pip-base ``` #### IMPORTANT When using `pip`, you must specify the CUDA version in the package name (e.g., `cudf-cu12`, `cuml-cu12`). This ensures you install the version of the library that is compatible with the CUDA toolkit. #### NOTE **GPU Access with `--gpus all`**: The `--gpus` flag uses the NVIDIA Container Toolkit to dynamically mount GPU device files (`/dev/nvidia*`), NVIDIA driver libraries (`libcuda.so`, `libnvidia-ml.so`), and utilities like `nvidia-smi` from the host system into your container at runtime. This is why `nvidia-smi` becomes available even though it’s not installed in your Docker image. Your container only needs to provide the CUDA runtime libraries (like `libcudart`) that RAPIDS requires—the host system’s NVIDIA driver handles the rest. ### Image Size Comparison One of the key benefits of building custom RAPIDS containers is the significant reduction in image size compared to the pre-built RAPIDS images. Here are actual measurements from containers containing only cuDF: | Image Type | Contents | Size | |----------------------|-------------------|-------------| | **Custom conda** | cuDF only | **6.83 GB** | | **Custom pip** | cuDF only | **6.53 GB** | | **Pre-built RAPIDS** | Full RAPIDS suite | **12.9 GB** | Custom builds are smaller in size when you only need specific RAPIDS libraries like cuDF. These size reductions result in faster container pulls and deployments, reduced storage costs in container registries, lower bandwidth usage in distributed environments, and quicker startup times for containerized applications. ## Extending the Container One of the benefits of building custom RAPIDS containers is the ability to easily add your own packages to the environment. You can add any combination of RAPIDS and non-RAPIDS libraries to create a fully featured container for your workloads. ### Using conda To add packages to the Conda environment, add them to the `dependencies` list in your `env.yaml` file. **Example: Adding `scikit-learn` and `xgboost` to a conda image containing `cudf`** ```yaml name: base channels: - rapidsai-nightly - conda-forge - nvidia dependencies: - cudf=25.12 - scikit-learn - xgboost ``` ### Using pip To add packages to the Pip environment, add them to your `requirements.txt` file. **Example: Adding `scikit-learn` and `lightgbm` to a pip image containing `cudf`** ```text cudf-cu12==25.12.*,>=0.0.0a0 scikit-learn lightgbm ``` After modifying your configuration file, rebuild the Docker image. The new packages will be automatically included in your custom RAPIDS environment. ## Build Configuration You can customize the build by modifying the version variables at the top of each Dockerfile. These variables control the CUDA version, Python version, and Linux distribution used in your container. ### Available Configuration Variables The following variables can be modified at the top of each Dockerfile to customize your build: | Variable | Default Value | Description | Example Values | |-------------------------|-----------------|--------------------------------------------------------|----------------------| | `CUDA_VER` | `12.9.1` | Sets the CUDA version for the base image and packages. | `12.0` | | `PYTHON_VER` (pip only) | `3.12` | Defines the Python version to install and use. | `3.11`, `3.10` | | `LINUX_DISTRO` | `ubuntu` | The Linux distribution being used | `rockylinux9`, `cm2` | | `LINUX_DISTRO_VER` | `24.04` | The version of the Linux distribution. 
| `20.04` | #### NOTE For conda installations, you can choose the required python version in the `env.yaml` file ## Verifying Your Installation After starting your container, you can quickly test that RAPIDS is installed and running correctly. The container launches directly into a `bash` shell where you can install the [RAPIDS CLI](https://github.com/rapidsai/rapids-cli) command line utility to verify your installation. 1. **Run the Container Interactively** This command starts your container and drops you directly into a bash shell. ```bash # For Conda builds docker run --gpus all -it rapids-conda-base # For Pip builds docker run --gpus all -it rapids-pip-base ``` 2. **Install RAPIDS CLI** Inside the containers, install the RAPIDS CLI: ```bash pip install rapids-cli ``` 3. **Test the installation using the Doctor subcommand** Once RAPIDS CLI is installed, you can use the `rapids doctor` subcommand to perform health checks. ```bash rapids doctor ``` 4. **Expected Output** If your installation is successful, you will see output similar to this: ```bash 🧑‍⚕️ Performing REQUIRED health check for RAPIDS Running checks All checks passed! ``` # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md To access Jupyter, navigate to `:8888` in the browser. In a Python notebook, check that you can import and use RAPIDS libraries like `cudf`. ```ipython In [1]: import cudf In [2]: df = cudf.datasets.timeseries() In [3]: df.head() Out[3]: id name x y timestamp 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 ``` Open `cudf/10min.ipynb` and execute the cells to explore more of how `cudf` works. When running a Dask cluster you can also visit `:8787` to monitor the Dask cluster status. # index.html.md Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected. ```bash cat << EOF | kubectl create -f - apiVersion: v1 kind: Pod metadata: name: cuda-vectoradd spec: restartPolicy: OnFailure containers: - name: cuda-vectoradd image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04" resources: limits: nvidia.com/gpu: 1 EOF ``` ```console $ kubectl logs pod/cuda-vectoradd [Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device CUDA kernel launch with 196 blocks of 256 threads Copy output data from the CUDA device to the host memory Test PASSED Done ``` If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly. Next, clean up that Pod. ```console $ kubectl delete pod cuda-vectoradd pod "cuda-vectoradd" deleted ``` # index.html.md There are a selection of methods you can use to install RAPIDS which you can see via the [RAPIDS release selector](https://docs.rapids.ai/install#selector). For this example we are going to run the RAPIDS Docker container so we need to know the name of the most recent container. On the release selector choose **Docker** in the **Method** column. 
Then copy the commands shown: ```bash docker pull rapidsai/notebooks:25.12a-cuda12-py3.13 docker run --gpus all --rm -it \ --shm-size=1g --ulimit memlock=-1 \ -p 8888:8888 -p 8787:8787 -p 8786:8786 \ rapidsai/notebooks:25.12a-cuda12-py3.13 ``` #### NOTE If you see a “docker socket permission denied” error while running these commands try closing and reconnecting your SSH window. This happens because your user was added to the `docker` group only after you signed in. # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Does the Dask scheduler need a GPU? A common question from users deploying Dask clusters is whether the scheduler has different minimum requirements to the workers. This question is compounded when using RAPIDS and GPUs. #### WARNING This guide outlines our current advice on scheduler hardware requirements, but this may be subject to change. **TLDR; It is strongly suggested that your Dask scheduler has matching hardware/software capabilities to the other components in your cluster.** Therefore, if your workers have GPUs and the RAPIDS libraries installed we recommend that your scheduler does too. However the GPU attached to your scheduler doesn’t need to be as powerful as the GPUs on your workers, as long as it has the same capabilities and driver/CUDA versions. ## What does the scheduler use a GPU for? The Dask client generates a task graph of operations that it wants to be performed and serializes any data that needs to be sent to the workers. The scheduler handles allocating those tasks to the various Dask workers and passes serialized data back and forth. The workers deserialize the data, perform calculations, serialize the result and pass it back. This can lead users to logically ask if the scheduler needs the same capabilities as the workers/client. It doesn’t handle the actual data or do any of the user calculations, it just decides where work should go. Taking this even further you could even ask “Does the Dask scheduler even need to be written in Python?”. Some folks even [experimented with a Rust implementation of the scheduler](https://github.com/It4innovations/rsds) a couple of years ago. There are two primary reasons why we recommend that the scheduler has the same capabilities: - There are edge cases where the scheduler does deserialize data. - Some scheduler optimizations require high-level graphs to be pickled on the client and unpickled on the scheduler. If your workload doesn’t trigger any edge-cases and you’re not using the high-level graph optimizations then you could likely get away with not having a GPU. But it is likely you will run into problems eventually and the failure-modes will be potentially hard to debug. ### Known edge cases When calling [`client.submit`](https://docs.dask.org/en/latest/futures.html#distributed.Client.submit) and passing data directly to a function the whole graph is serialized and sent to the scheduler. In order for the scheduler to figure out what to do with it the graph is deserialized. If the data uses GPUs this can cause the scheduler to import RAPIDS libraries, attempt to instantiate a CUDA context and populate the data into GPU memory. If those libraries are missing and/or there are no GPUs this will cause the scheduler to fail. Many Dask collections also have a meta object which represents the overall collection but without any data. 
For example, a Dask DataFrame has a meta pandas DataFrame which has the same meta properties and is used during scheduling. If the underlying data is instead a cuDF DataFrame then the meta object will be one too, and it is deserialized on the scheduler.

### Example failure modes

When using the default TCP communication protocol, the scheduler generally does *not* inspect data communicated between clients and workers, so many workflows will not provoke failure. For example, suppose we set up a Dask cluster and do not provide the scheduler with a GPU. The following simple computation with [CuPy](https://cupy.dev)-backed Dask arrays completes successfully:

```python
import cupy
from distributed import Client, wait
import dask.array as da

client = Client(scheduler_file="scheduler.json")

x = cupy.arange(10)
y = da.arange(1000, like=x)
z = (y * 2).persist()
wait(z)
# Now let's look at some results
print(z[:10].compute())
```

We can run this code, giving the scheduler no access to a GPU:

```sh
$ CUDA_VISIBLE_DEVICES="" dask scheduler --protocol tcp --scheduler-file scheduler.json &
$ dask cuda worker --protocol tcp --scheduler-file scheduler.json &
$ python test.py
...
[ 0  2  4  6  8 10 12 14 16 18]
...
```

In contrast, if you provision an [InfiniBand-enabled system](azure/infiniband.md) and wish to take advantage of the high-performance network, you will want to use the [UCX](https://openucx.org/) protocol rather than TCP. Using such a setup without a GPU on the scheduler will not succeed. When the client or workers communicate with the scheduler, any GPU-allocated buffers will be sent directly between GPUs (avoiding a round trip to host memory). This is more efficient, but it will not work if the scheduler does not *have* a GPU. Running the same example from above, but this time using UCX, we obtain an error:

```sh
$ CUDA_VISIBLE_DEVICES="" dask scheduler --protocol ucx --scheduler-file scheduler.json &
$ dask cuda worker --protocol ucx --scheduler-file scheduler.json &
$ python test.py
...
2023-01-27 11:01:28,263 - distributed.core - ERROR - CUDA error at: .../rmm/include/rmm/cuda_device.hpp:56: cudaErrorNoDevice no CUDA-capable device is detected Traceback (most recent call last): File ".../distributed/distributed/utils.py", line 741, in wrapper return await func(*args, **kwargs) File ".../distributed/distributed/comm/ucx.py", line 372, in read frames = [ File ".../distributed/distributed/comm/ucx.py", line 373, in device_array(each_size) if is_cuda else host_array(each_size) File ".../distributed/distributed/comm/ucx.py", line 171, in device_array return rmm.DeviceBuffer(size=n) File "device_buffer.pyx", line 85, in rmm._lib.device_buffer.DeviceBuffer.__cinit__ RuntimeError: CUDA error at: .../rmm/include/rmm/cuda_device.hpp:56: cudaErrorNoDevice no CUDA-capable device is detected 2023-01-27 11:01:28,263 - distributed.core - ERROR - Exception while handling op gather Traceback (most recent call last): File ".../distributed/distributed/core.py", line 820, in _handle_comm result = await result File ".../distributed/distributed/scheduler.py", line 5687, in gather data, missing_keys, missing_workers = await gather_from_workers( File ".../distributed/distributed/utils_comm.py", line 80, in gather_from_workers r = await c File ".../distributed/distributed/worker.py", line 2872, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File ".../distributed/distributed/utils_comm.py", line 419, in retry_operation return await retry( File ".../distributed/distributed/utils_comm.py", line 404, in retry return await coro() File ".../distributed/distributed/worker.py", line 2852, in _get_data response = await send_recv( File ".../distributed/distributed/core.py", line 986, in send_recv response = await comm.read(deserializers=deserializers) File ".../distributed/distributed/utils.py", line 741, in wrapper return await func(*args, **kwargs) File ".../distributed/distributed/comm/ucx.py", line 372, in read frames = [ File ".../distributed/distributed/comm/ucx.py", line 373, in device_array(each_size) if is_cuda else host_array(each_size) File ".../distributed/distributed/comm/ucx.py", line 171, in device_array return rmm.DeviceBuffer(size=n) File "device_buffer.pyx", line 85, in rmm._lib.device_buffer.DeviceBuffer.__cinit__ RuntimeError: CUDA error at: .../rmm/include/rmm/cuda_device.hpp:56: cudaErrorNoDevice no CUDA-capable device is detected Traceback (most recent call last): File "test.py", line 15, in print(z[:10].compute()) File ".../dask/dask/base.py", line 314, in compute (result,) = compute(self, traverse=False, **kwargs) File ".../dask/dask/base.py", line 599, in compute results = schedule(dsk, keys, **kwargs) File ".../distributed/distributed/client.py", line 3144, in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) File ".../distributed/distributed/client.py", line 2313, in gather return self.sync( File ".../distributed/distributed/utils.py", line 338, in sync return sync( File ".../distributed/distributed/utils.py", line 405, in sync raise exc.with_traceback(tb) File ".../distributed/distributed/utils.py", line 378, in f result = yield future File ".../tornado/gen.py", line 769, in run value = future.result() File ".../distributed/distributed/client.py", line 2205, in _gather response = await future File ".../distributed/distributed/client.py", line 2256, in _gather_remote response = await retry_operation(self.scheduler.gather, keys=keys) File ".../distributed/distributed/utils_comm.py", line 419, in 
retry_operation return await retry( File ".../distributed/distributed/utils_comm.py", line 404, in retry return await coro() File ".../distributed/distributed/core.py", line 1221, in send_recv_from_rpc return await send_recv(comm=comm, op=key, **kwargs) File ".../distributed/distributed/core.py", line 1011, in send_recv raise exc.with_traceback(tb) File ".../distributed/distributed/core.py", line 820, in _handle_comm result = await result File ".../distributed/distributed/scheduler.py", line 5687, in gather data, missing_keys, missing_workers = await gather_from_workers( File ".../distributed/distributed/utils_comm.py", line 80, in gather_from_workers r = await c File ".../distributed/distributed/worker.py", line 2872, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File ".../distributed/distributed/utils_comm.py", line 419, in retry_operation return await retry( File ".../distributed/distributed/utils_comm.py", line 404, in retry return await coro() File ".../distributed/distributed/worker.py", line 2852, in _get_data response = await send_recv( File ".../distributed/distributed/core.py", line 986, in send_recv response = await comm.read(deserializers=deserializers) File ".../distributed/distributed/utils.py", line 741, in wrapper return await func(*args, **kwargs) File ".../distributed/distributed/comm/ucx.py", line 372, in read frames = [ File ".../distributed/distributed/comm/ucx.py", line 373, in device_array(each_size) if is_cuda else host_array(each_size) File ".../distributed/distributed/comm/ucx.py", line 171, in device_array return rmm.DeviceBuffer(size=n) File "device_buffer.pyx", line 85, in rmm._lib.device_buffer.DeviceBuffer.__cinit__ RuntimeError: CUDA error at: .../rmm/include/rmm/cuda_device.hpp:56: cudaErrorNoDevice no CUDA-capable device is detected ... ``` The critical error comes from [RMM](https://docs.rapids.ai/api/rmm/nightly/), we’re attempting to allocate a [`DeviceBuffer`](https://docs.rapids.ai/api/rmm/nightly/basics.html#devicebuffers) on the scheduler, but there is no GPU available to do so: ```pytb File ".../distributed/distributed/comm/ucx.py", line 171, in device_array return rmm.DeviceBuffer(size=n) File "device_buffer.pyx", line 85, in rmm._lib.device_buffer.DeviceBuffer.__cinit__ RuntimeError: CUDA error at: .../rmm/include/rmm/cuda_device.hpp:56: cudaErrorNoDevice no CUDA-capable device is detected ``` ### Scheduler optimizations and High-Level graphs The Dask community is actively working on implementing high-level graphs which will both speed up client -> scheduler communication and allow the scheduler to make advanced optimizations such as predicate pushdown. Much effort has been put into using existing serialization strategies to communicate the HLG but this has proven prohibitively difficult to implement. The current plan is to simplify HighLevelGraph/Layer so that the entire HLG can be pickled on the client, sent to the scheduler as a single binary blob, and then unpickled/materialized (HLG->dict) on the scheduler. The problem with this new plan is that the pickle/un-pickle convention will require the scheduler to have the same environment as the client. If any Layer logic also requires a device allocation, then this approach also requires the scheduler to have access to a GPU. ## So what are the minimum requirements of the scheduler? From a software perspective we recommend that the Python environment on the client, scheduler and workers all match. 
Given that the user is expected to ensure the worker has the same environment as the client it is not much of a burden to ensure the scheduler also has the same environment. From a hardware perspective we recommend that the scheduler has the same capabilities, but not necessarily the same quantity of resource. Therefore if the workers have one or more GPUs we recommend that the scheduler has access to one GPU with matching NVIDIA driver and CUDA versions. In a large multi-node cluster deployment on a cloud platform this may mean the workers are launched on VMs with 8 GPUs and the scheduler is launched on a smaller VM with one GPU. You could also select a less powerful GPU such as those intended for inferencing for your scheduler like a T4, provided it has the same CUDA capabilities, NVIDIA driver version and CUDA/CUDA Toolkit version. This balance means we can guarantee things function as intended, but reduces cost because placing the scheduler on an 8 GPU node would be a waste of resources. # index.html.md # Colocate Dask workers on Kubernetes while using nodes with multiple GPUs To optimize performance when working with nodes that have multiple GPUs, a best practice is to schedule Dask workers in a tightly grouped manner, thereby minimizing communication overhead between worker pods. This guide provides a step-by-step process for adding Pod affinities to worker pods ensuring they are scheduled together as much as possible on Google Kubernetes Engine (GKE), but the principles can be adapted for use with other Kubernetes distributions. ## Prerequisites First you’ll need to have the [`gcloud` CLI tool](https://cloud.google.com/sdk/gcloud) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc for managing Kubernetes. Ensure you are logged into the `gcloud` CLI. ```bash $ gcloud init ``` ## Create the Kubernetes cluster Now we can launch a GPU enabled GKE cluster. ```bash $ gcloud container clusters create rapids-gpu \ --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \ --zone us-central1-c --release-channel stable ``` With this command, you’ve launched a GKE cluster called `rapids-gpu`. You’ve specified that it should use nodes of type a2-highgpu-2g, each with two A100 GPUs. ## Install drivers Next, [install the NVIDIA drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) onto each node. ```console $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml daemonset.apps/nvidia-driver-installer created ``` Verify that the NVIDIA drivers are successfully installed. ```console $ kubectl get po -A --watch | grep nvidia kube-system nvidia-driver-installer-6zwcn 1/1 Running 0 8m47s kube-system nvidia-driver-installer-8zmmn 1/1 Running 0 8m47s kube-system nvidia-driver-installer-mjkb8 1/1 Running 0 8m47s kube-system nvidia-gpu-device-plugin-5ffkm 1/1 Running 0 13m kube-system nvidia-gpu-device-plugin-d599s 1/1 Running 0 13m kube-system nvidia-gpu-device-plugin-jrgjh 1/1 Running 0 13m ``` After your drivers are installed, you are ready to test your cluster. Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected. 
```bash cat << EOF | kubectl create -f - apiVersion: v1 kind: Pod metadata: name: cuda-vectoradd spec: restartPolicy: OnFailure containers: - name: cuda-vectoradd image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04" resources: limits: nvidia.com/gpu: 1 EOF ``` ```console $ kubectl logs pod/cuda-vectoradd [Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device CUDA kernel launch with 196 blocks of 256 threads Copy output data from the CUDA device to the host memory Test PASSED Done ``` If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly. Next, clean up that Pod. ```console $ kubectl delete pod cuda-vectoradd pod "cuda-vectoradd" deleted ``` ### Installing Dask operator with Helm The operator has a Helm chart which can be used to manage the installation of the operator. Follow the instructions provided in the [Dask documentation](https://kubernetes.dask.org/en/latest/installing.html#installing-with-helm), or alternatively can be installed via: ```console $ helm install --create-namespace -n dask-operator --generate-name --repo https://helm.dask.org dask-kubernetes-operator NAME: dask-kubernetes-operator-1666875935 NAMESPACE: dask-operator STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: Operator has been installed successfully. ``` ## Configuring a RAPIDS `DaskCluster` To configure the `DaskCluster` resource to run RAPIDS you need to set a few things: - The container image must contain RAPIDS, the [official RAPIDS container images](https://docs.rapids.ai/install/#docker) are a good choice for this. - The Dask workers must be configured with one or more NVIDIA GPU resources. - The worker command must be set to `dask-cuda-worker`. ## Creating a RAPIDS `DaskCluster` using `kubectl` Here is an example resource manifest for launching a RAPIDS Dask cluster with worker Pod affinity ```yaml # rapids-dask-cluster.yaml apiVersion: kubernetes.dask.org/v1 kind: DaskCluster metadata: name: rapids-dask-cluster labels: dask.org/cluster-name: rapids-dask-cluster spec: worker: replicas: 2 spec: containers: - name: worker image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" args: - dask-cuda-worker - --name - $(DASK_WORKER_NAME) resources: limits: nvidia.com/gpu: "1" affinity: podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchExpressions: - key: dask.org/component operator: In values: - worker topologyKey: kubernetes.io/hostname scheduler: spec: containers: - name: scheduler image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" env: args: - dask-scheduler ports: - name: tcp-comm containerPort: 8786 protocol: TCP - name: http-dashboard containerPort: 8787 protocol: TCP readinessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 5 periodSeconds: 10 livenessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 15 periodSeconds: 20 resources: limits: nvidia.com/gpu: "1" service: type: ClusterIP selector: dask.org/cluster-name: rapids-dask-cluster dask.org/component: scheduler ports: - name: tcp-comm protocol: TCP port: 8786 targetPort: "tcp-comm" - name: http-dashboard protocol: TCP port: 8787 targetPort: "http-dashboard" ``` You can create this cluster with `kubectl`. 
```bash $ kubectl apply -f rapids-dask-cluster.yaml ``` ### Manifest breakdown Most of this manifest is explained in the [Dask Operator](https://docs.rapids.ai/deployment/stable/tools/kubernetes/dask-operator/#example-using-kubecluster) documentation in the tools section of the RAPIDS documentation. The only addition made to the example from the above documentation page is the following section in the worker configuration ```yaml # ... affinity: podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchExpressions: - key: dask.org/component operator: In values: - worker topologyKey: kubernetes.io/hostname # ... ``` For the Dask Worker Pod configuration, we are setting a Pod affinity using the name of the node as the topology key. [Pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) in Kubernetes allows you to constrain which nodes the Pod can be scheduled on and allows you to configure a set of workloads that should be co-located in the same defined topology, in this case, preferring to place two worker pods on the same node. This is also intended to be a soft requirement as we are using the `preferredDuringSchedulingIgnoredDuringExecution` type of Pod affinity. The Kubernetes scheduler tries to find a node which meets the rule. If a matching node is not available, the Kubernetes scheduler still schedules the Pod on any available node. This ensures that you will not face any issues with the Dask cluster even if placing worker pods on nodes already in use is not possible. ### Accessing your Dask cluster Once you have created your `DaskCluster` resource we can use `kubectl` to check the status of all the other resources it created for us. ```console $ kubectl get all -l dask.org/cluster-name=rapids-dask-cluster -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod/rapids-dask-cluster-default-worker-12a055b2db-7b5bf8f66c-9mb59 1/1 Running 0 2s 10.244.2.3 gke-rapids-gpu-1-default-pool-d85b49-2545 pod/rapids-dask-cluster-default-worker-34437735ae-6fdd787f75-sdqzg 1/1 Running 0 2s 10.244.2.4 gke-rapids-gpu-1-default-pool-d85b49-2545 pod/rapids-dask-cluster-scheduler-6656cb88f6-cgm4t 0/1 Running 0 3s 10.244.3.3 gke-rapids-gpu-1-default-pool-d85b49-2f31 NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR service/rapids-dask-cluster-scheduler ClusterIP 10.96.231.110 8786/TCP,8787/TCP 3s dask.org/cluster-name=rapids-dask-cluster,dask.org/component=scheduler ``` Here you can see our scheduler Pod and two worker pods along with the scheduler service. The two worker pods are placed in the same node as desired, while the scheduler Pod is placed on a different node. If you have a Python session running within the Kubernetes cluster (like the [example one on the Kubernetes page](../platforms/kubernetes.md)) you should be able to connect a Dask distributed client directly. ```python from dask.distributed import Client client = Client("rapids-dask-cluster-scheduler:8786") ``` Alternatively if you are outside of the Kubernetes cluster you can change the `Service` to use [`LoadBalancer`](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) or [`NodePort`](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport) or use `kubectl` to port forward the connection locally. 
```console $ kubectl port-forward svc/rapids-dask-cluster-service 8786:8786 Forwarding from 127.0.0.1:8786 -> 8786 ``` ```python from dask.distributed import Client client = Client("localhost:8786") ``` ## Example using `KubeCluster` In addition to creating clusters via `kubectl` you can also do so from Python with [`dask_kubernetes.operator.KubeCluster`](https://kubernetes.dask.org/en/latest/operator_kubecluster.html#dask_kubernetes.operator.KubeCluster). This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. You can also generate a spec with make_cluster_spec() which KubeCluster uses internally and then modify it with your custom options. We will use this to add node affinity to the scheduler. In the following example, the same cluster configuration as the `kubectl` example is used. ```python from dask_kubernetes.operator import KubeCluster, make_cluster_spec spec = make_cluster_spec( name="rapids-dask-cluster", image="rapidsai/base:25.12a-cuda12-py3.13", n_workers=2, resources={"limits": {"nvidia.com/gpu": "1"}}, worker_command="dask-cuda-worker", ) ``` To add the node affinity to the worker, you can create a custom dictionary specifying the type of Pod affinity and the topology key. ```python affinity_config = { "podAffinity": { "preferredDuringSchedulingIgnoredDuringExecution": [ { "weight": 100, "podAffinityTerm": { "labelSelector": { "matchExpressions": [ { "key": "dask.org/component", "operator": "In", "values": ["worker"], } ] }, "topologyKey": "kubernetes.io/hostname", }, } ] } } ``` Now you can add this configuration to the spec created in the previous step, and create the Dask cluster using this custom spec. ```python spec["spec"]["worker"]["spec"]["affinity"] = affinity_config cluster = KubeCluster(custom_cluster_spec=spec) ``` If we check with `kubectl` we can see the above Python generated the same `DaskCluster` resource as the `kubectl` example above. ```console $ kubectl get daskclusters NAME AGE rapids-dask-cluster 3m28s $ kubectl get all -l dask.org/cluster-name=rapids-dask-cluster -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod/rapids-dask-cluster-default-worker-12a055b2db-7b5bf8f66c-9mb59 1/1 Running 0 2s 10.244.2.3 gke-rapids-gpu-1-default-pool-d85b49-2545 pod/rapids-dask-cluster-default-worker-34437735ae-6fdd787f75-sdqzg 1/1 Running 0 2s 10.244.2.4 gke-rapids-gpu-1-default-pool-d85b49-2545 pod/rapids-dask-cluster-scheduler-6656cb88f6-cgm4t 0/1 Running 0 3s 10.244.3.3 gke-rapids-gpu-1-default-pool-d85b49-2f31 NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR service/rapids-dask-cluster-scheduler ClusterIP 10.96.231.110 8786/TCP,8787/TCP 3s dask.org/cluster-name=rapids-dask-cluster,dask.org/component=scheduler ``` With this cluster object in Python we can also connect a client to it directly without needing to know the address as Dask will discover that for us. It also automatically sets up port forwarding if you are outside of the Kubernetes cluster. ```python from dask.distributed import Client client = Client(cluster) ``` This object can also be used to scale the workers up and down. ```python cluster.scale(5) ``` And to manually close the cluster. ```python cluster.close() ``` #### NOTE By default the `KubeCluster` command registers an exit hook so when the Python process exits the cluster is deleted automatically. You can disable this by setting `KubeCluster(..., shutdown_on_close=False)` when launching the cluster. 
This is useful if you have a multi-stage pipeline made up of multiple Python processes and you want your Dask cluster to persist between them. You can also connect a `KubeCluster` object to your existing cluster with `cluster = KubeCluster.from_name(name="rapids-dask")` if you wish to use the cluster or manually call `cluster.close()` in the future.

# index.html.md

# GPU optimization for the Dask scheduler on Kubernetes

An optimization users can make while deploying Dask clusters is to place the scheduler on a node with a less powerful GPU to reduce overall cost. [This previous guide](https://docs.rapids.ai/deployment/stable/guides/scheduler-gpu-requirements/) explains why the scheduler needs access to the same environment as the workers: there are a few edge cases where the scheduler does deserialize data and unpickles high-level graphs.

#### WARNING

This guide outlines our current advice on scheduler hardware requirements, but this may be subject to change.

However, when working with nodes with multiple GPUs, placing the scheduler on one of these nodes would be a waste of resources. This guide walks through the steps to create a Kubernetes cluster on GKE, along with a nodepool of less powerful NVIDIA Tesla T4 GPUs, and to place the scheduler on that node using Kubernetes node affinity.

## Prerequisites

First you'll need to have the [`gcloud` CLI tool](https://cloud.google.com/sdk/gcloud) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc. for managing Kubernetes.

Ensure you are logged into the `gcloud` CLI.

```bash
$ gcloud init
```

## Create the Kubernetes cluster

Now we can launch a GPU-enabled GKE cluster.

```bash
$ gcloud container clusters create rapids-gpu \
  --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \
  --zone us-central1-c --release-channel stable
```

With this command, you've launched a GKE cluster called `rapids-gpu`. You've specified that it should use nodes of type `a2-highgpu-2g`, each with two A100 GPUs.

## Create the dedicated nodepool for the scheduler

Now create a new nodepool on this GPU cluster.

```bash
$ gcloud container node-pools create scheduler-pool --cluster rapids-gpu \
  --accelerator type=nvidia-tesla-t4,count=1 --machine-type n1-standard-2 \
  --num-nodes 1 --node-labels dedicated=scheduler --zone us-central1-c
```

With this command, you've created an additional nodepool called `scheduler-pool` with 1 node. You've also specified that it should use a node of type `n1-standard-2`, with one T4 GPU. We also add a Kubernetes label `dedicated=scheduler` to the node in this nodepool, which will be used to place the scheduler onto this node.

## Install drivers

Next, [install the NVIDIA drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) onto each node.

```console
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
daemonset.apps/nvidia-driver-installer created
```

Verify that the NVIDIA drivers are successfully installed.
```console $ kubectl get po -A --watch | grep nvidia kube-system nvidia-driver-installer-6zwcn 1/1 Running 0 8m47s kube-system nvidia-driver-installer-8zmmn 1/1 Running 0 8m47s kube-system nvidia-driver-installer-mjkb8 1/1 Running 0 8m47s kube-system nvidia-gpu-device-plugin-5ffkm 1/1 Running 0 13m kube-system nvidia-gpu-device-plugin-d599s 1/1 Running 0 13m kube-system nvidia-gpu-device-plugin-jrgjh 1/1 Running 0 13m ``` After your drivers are installed, you are ready to test your cluster. Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected. ```bash cat << EOF | kubectl create -f - apiVersion: v1 kind: Pod metadata: name: cuda-vectoradd spec: restartPolicy: OnFailure containers: - name: cuda-vectoradd image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04" resources: limits: nvidia.com/gpu: 1 EOF ``` ```console $ kubectl logs pod/cuda-vectoradd [Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device CUDA kernel launch with 196 blocks of 256 threads Copy output data from the CUDA device to the host memory Test PASSED Done ``` If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly. Next, clean up that Pod. ```console $ kubectl delete pod cuda-vectoradd pod "cuda-vectoradd" deleted ``` ### Installing Dask operator with Helm The operator has a Helm chart which can be used to manage the installation of the operator. The chart is published in the [Dask Helm Repo](https://helm.dask.org) repository, and can be installed via: ```console $ helm repo add dask https://helm.dask.org "dask" has been added to your repositories ``` ```console $ helm repo update Hang tight while we grab the latest from your chart repositories... ...Successfully got an update from the "dask" chart repository Update Complete. ⎈Happy Helming!⎈ ``` ```console $ helm install --create-namespace -n dask-operator --generate-name dask/dask-kubernetes-operator NAME: dask-kubernetes-operator-1666875935 NAMESPACE: dask-operator STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: Operator has been installed successfully. ``` Then you should be able to list your Dask clusters via `kubectl`. ```console $ kubectl get daskclusters No resources found in default namespace. ``` We can also check the operator Pod is running: ```console $ kubectl get pods -A -l app.kubernetes.io/name=dask-kubernetes-operator NAMESPACE NAME READY STATUS RESTARTS AGE dask-operator dask-kubernetes-operator-775b8bbbd5-zdrf7 1/1 Running 0 74s ``` ## Configuring a RAPIDS `DaskCluster` To configure the `DaskCluster` resource to run RAPIDS you need to set a few things: - The container image must contain RAPIDS, the [official RAPIDS container images](https://docs.rapids.ai/install/#docker) are a good choice for this. - The Dask workers must be configured with one or more NVIDIA GPU resources. - The worker command must be set to `dask-cuda-worker`. 
## Creating a RAPIDS `DaskCluster` using `kubectl` Here is an example resource manifest for launching a RAPIDS Dask cluster with the scheduler optimization ```yaml # rapids-dask-cluster.yaml apiVersion: kubernetes.dask.org/v1 kind: DaskCluster metadata: name: rapids-dask-cluster labels: dask.org/cluster-name: rapids-dask-cluster spec: worker: replicas: 2 spec: containers: - name: worker image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" args: - dask-cuda-worker - --name - $(DASK_WORKER_NAME) resources: limits: nvidia.com/gpu: "1" scheduler: spec: containers: - name: scheduler image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" env: args: - dask-scheduler ports: - name: tcp-comm containerPort: 8786 protocol: TCP - name: http-dashboard containerPort: 8787 protocol: TCP readinessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 5 periodSeconds: 10 livenessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 15 periodSeconds: 20 resources: limits: nvidia.com/gpu: "1" affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 preference: matchExpressions: - key: dedicated operator: In values: - scheduler service: type: ClusterIP selector: dask.org/cluster-name: rapids-dask-cluster dask.org/component: scheduler ports: - name: tcp-comm protocol: TCP port: 8786 targetPort: "tcp-comm" - name: http-dashboard protocol: TCP port: 8787 targetPort: "http-dashboard" ``` You can create this cluster with `kubectl`. ```bash $ kubectl apply -f rapids-dask-cluster.yaml ``` ### Manifest breakdown Most of this manifest is explained in the [Dask Operator](https://docs.rapids.ai/deployment/stable/tools/kubernetes/dask-operator/#example-using-kubecluster) documentation in the tools section of the RAPIDS documentation. The only addition made to the example from the above documentation page is the following section in the scheduler configuration ```yaml # ... affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 preference: matchExpressions: - key: dedicated operator: In values: - scheduler # ... ``` For the Dask scheduler Pod we are setting a node affinity using the label previously specified on the dedicated node. Node affinity in Kubernetes allows you to constrain which nodes your Pod can be scheduled based on node labels. This is also intended to be a soft requirement as we are using the `preferredDuringSchedulingIgnoredDuringExecution` type of node affinity. The Kubernetes scheduler tries to find a node which meets the rule. If a matching node is not available, the Kubernetes scheduler still schedules the Pod on any available node. This ensures that you will not face any issues with the Dask cluster even if the T4 node is unavailable. ### Accessing your Dask cluster Once you have created your `DaskCluster` resource we can use `kubectl` to check the status of all the other resources it created for us. ```console $ kubectl get all -l dask.org/cluster-name=rapids-dask-cluster NAME READY STATUS RESTARTS AGE pod/rapids-dask-cluster-default-worker-group-worker-0c202b85fd 1/1 Running 0 4m13s pod/rapids-dask-cluster-default-worker-group-worker-ff5d376714 1/1 Running 0 4m13s pod/rapids-dask-cluster-scheduler 1/1 Running 0 4m14s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/rapids-dask-cluster-service ClusterIP 10.96.223.217 8786/TCP,8787/TCP 4m13s ``` Here you can see our scheduler Pod and two worker Pods along with the scheduler service. 
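To confirm that the node affinity actually placed the scheduler on the dedicated T4 node, you can check which node each Pod landed on (for example with `kubectl get pods -o wide`). As a rough sketch, the same check can be done from Python with the `kubernetes` client library, which is an assumption here since it is not installed as part of this guide; the `default` namespace is also assumed.

```python
from kubernetes import client, config

# Load credentials from your local kubeconfig (e.g. as set up by gcloud)
config.load_kube_config()
v1 = client.CoreV1Api()

# List the Pods belonging to the Dask cluster and print the node each one runs on
pods = v1.list_namespaced_pod(
    namespace="default",  # assumed namespace
    label_selector="dask.org/cluster-name=rapids-dask-cluster",
)
for pod in pods.items:
    component = pod.metadata.labels.get("dask.org/component", "unknown")
    print(f"{component}: {pod.metadata.name} -> {pod.spec.node_name}")
```

The scheduler Pod should report a node from the `scheduler-pool` nodepool, while the worker Pods remain on the A100 nodes.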
If you have a Python session running within the Kubernetes cluster (like the [example one on the Kubernetes page](../platforms/kubernetes.md)) you should be able to connect a Dask distributed client directly. ```python from dask.distributed import Client client = Client("rapids-dask-cluster-scheduler:8786") ``` Alternatively if you are outside of the Kubernetes cluster you can change the `Service` to use [`LoadBalancer`](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) or [`NodePort`](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport) or use `kubectl` to port forward the connection locally. ```console $ kubectl port-forward svc/rapids-dask-cluster-service 8786:8786 Forwarding from 127.0.0.1:8786 -> 8786 ``` ```python from dask.distributed import Client client = Client("localhost:8786") ``` ## Example using `KubeCluster` In addition to creating clusters via `kubectl` you can also do so from Python with [`dask_kubernetes.operator.KubeCluster`](https://kubernetes.dask.org/en/latest/operator_kubecluster.html#dask_kubernetes.operator.KubeCluster). This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. You can also generate a spec with `make_cluster_spec()` which KubeCluster uses internally and then modify it with your custom options. We will use this to add node affinity to the scheduler. ```python from dask_kubernetes.operator import KubeCluster, make_cluster_spec spec = make_cluster_spec( name="rapids-dask-cluster", image="rapidsai/base:25.12a-cuda12-py3.13", n_workers=2, resources={"limits": {"nvidia.com/gpu": "1"}}, worker_command="dask-cuda-worker", ) ``` To add the node affinity to the scheduler, you can create a custom dictionary specifying the type of node affinity and the label of the node. ```python affinity_config = { "nodeAffinity": { "preferredDuringSchedulingIgnoredDuringExecution": [ { "weight": 100, "preference": { "matchExpressions": [ {"key": "dedicated", "operator": "In", "values": ["scheduler"]} ] }, } ] } } ``` Now you can add this configuration to the spec created in the previous step, and create the Dask cluster using this custom spec. ```python spec["spec"]["scheduler"]["spec"]["affinity"] = affinity_config cluster = KubeCluster(custom_cluster_spec=spec) ``` If we check with `kubectl` we can see the above Python generated the same `DaskCluster` resource as the `kubectl` example above. ```console $ kubectl get daskclusters NAME AGE rapids-dask-cluster 3m28s $ kubectl get all -l dask.org/cluster-name=rapids-dask-cluster NAME READY STATUS RESTARTS AGE pod/rapids-dask-cluster-default-worker-group-worker-07d674589a 1/1 Running 0 3m30s pod/rapids-dask-cluster-default-worker-group-worker-a55ed88265 1/1 Running 0 3m30s pod/rapids-dask-cluster-scheduler 1/1 Running 0 3m30s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/rapids-dask-cluster-service ClusterIP 10.96.200.202 8786/TCP,8787/TCP 3m30s ``` With this cluster object in Python we can also connect a client to it directly without needing to know the address as Dask will discover that for us. It also automatically sets up port forwarding if you are outside of the Kubernetes cluster. ```python from dask.distributed import Client client = Client(cluster) ``` This object can also be used to scale the workers up and down. ```python cluster.scale(5) ``` And to manually close the cluster. 
```python cluster.close() ``` #### NOTE By default the `KubeCluster` command registers an exit hook so when the Python process exits the cluster is deleted automatically. You can disable this by setting `KubeCluster(..., shutdown_on_close=False)` when launching the cluster. This is useful if you have a multi-stage pipeline made up of multiple Python processes and you want your Dask cluster to persist between them. You can also connect a `KubeCluster` object to your existing cluster with `cluster = KubeCluster.from_name(name="rapids-dask")` if you wish to use the cluster or manually call `cluster.close()` in the future. # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Caching Docker Images For Autoscaling Workloads The [Dask Autoscaler](https://kubernetes.dask.org/en/latest/operator_resources.html#daskautoscaler) leverages Dask’s adaptive mode and allows the scheduler to scale the number of workers up and down based on the task graph. When scaling the Dask cluster up or down, there is no guarantee that newly created worker Pods will be scheduled on the same node as previously removed workers. As a result, when a new node is allocated for a worker Pod, the cluster will incur a pull penalty due to the need to download the Docker image. ## Using a Daemonset to cache images To guarantee that each node runs a consistent workload, we will deploy a Kubernetes [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) utilizing the RAPIDS image. This DaemonSet will prevent Dask worker Pods created from this image from entering a pending state when tasks are scheduled. This is an example manifest to deploy a Daemonset with the RAPIDS container. ```yaml #caching-daemonset.yaml apiVersion: apps/v1 kind: DaemonSet metadata: name: prepuller namespace: image-cache spec: selector: matchLabels: name: prepuller template: metadata: labels: name: prepuller spec: initContainers: - name: prepuller-1 image: "rapidsai/base:25.12a-cuda12-py3.13" command: ["sh", "-c", "'true'"] containers: - name: pause image: gcr.io/google_containers/pause:3.2 resources: limits: cpu: 1m memory: 8Mi requests: cpu: 1m memory: 8Mi ``` You can create this Daemonset with `kubectl`. ```bash $ kubectl apply -f caching-daemonset.yaml ``` The DaemonSet is deployed in the `image-cache` namespace. In the `initContainers` section, we specify the image to be pulled and cached within the cluster, utilizing any executable command that terminates successfully. Additionally, the `pause` container is used to ensure the Pod transitions into a Running state without consuming resources or running any processes. When deploying the DaemonSet, after all pre-puller Pods are running successfully, you can confirm that the images have been cached across all nodes in the cluster. As the Kubernetes cluster is scaled up or down, the DaemonSet will automatically pull and cache the necessary images on any newly added nodes, ensuring consistent image availability throughout # index.html.md # Building RAPIDS containers from a custom base image This guide provides instructions to add RAPIDS and CUDA to your existing Docker images. This approach allows you to integrate RAPIDS libraries into containers that must start from a specific base image, such as application-specific containers. The CUDA installation steps are sourced from the official [NVIDIA CUDA Container Images Repository](https://gitlab.com/nvidia/container-images/cuda). 
#### WARNING We strongly recommend that you use the official CUDA container images published by NVIDIA. This guide is intended for those extreme situations where you cannot use the CUDA images as the base and need to manually install CUDA components on your containers. This approach introduces significant complexity and potential issues that can be difficult to debug. We cannot provide support for users beyond what is on this page. If you have the flexibility to choose your base image, see the [Custom RAPIDS Docker Guide](../custom-docker.md) which starts from NVIDIA’s official CUDA images for a simpler setup. ## Overview If you cannot use NVIDIA’s CUDA container images, you will need to manually install CUDA components in your existing Docker image. The components you need depends on the package manager used to install RAPIDS: - **For conda installations**: You need the components from the NVIDIA `base` CUDA images - **For pip installations**: You need the components from the NVIDIA `runtime` CUDA images ## Understanding CUDA Image Components NVIDIA provides three tiers of CUDA container images, each building on the previous: ### Base Components (Required for RAPIDS on conda) The **base** images provide the minimal CUDA runtime environment: | Component | Package Name | Purpose | |--------------------|----------------|---------------------------------------------------| | CUDA Runtime | `cuda-cudart` | Core CUDA runtime library (`libcudart.so`) | | CUDA Compatibility | `cuda-compat` | Forward compatibility libraries for older drivers | ### Runtime Components (Required for RAPIDS on pip) The **runtime** images include all the base components plus additional CUDA packages such as: | Component | Package Name | Purpose | |-------------------------------|------------------|----------------------------------------------------| | **All Base Components** | (see above) | Core CUDA runtime | | CUDA Libraries | `cuda-libraries` | Comprehensive CUDA library collection | | CUDA Math Libraries | `libcublas` | Basic Linear Algebra Subprograms (BLAS) | | NVIDIA Performance Primitives | `libnpp` | Image, signal and video processing primitives | | Sparse Matrix Library | `libcusparse` | Sparse matrix operations | | Profiling Tools | `cuda-nvtx` | NVIDIA Tools Extension for profiling | | Communication Library | `libnccl2` | Multi-GPU and multi-node collective communications | ### Development Components (Optional) The **devel** images add development tools to runtime images such as: - CUDA development headers and static libraries - CUDA compiler (`nvcc`) - Debugger and profiler tools - Additional development utilities #### NOTE Development components are typically not needed for RAPIDS usage unless you plan to compile CUDA code within your container. For the complete and up to date list of runtime and devel components, see the respective Dockerfiles in the [NVIDIA CUDA Container Images Repository](https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist). ## Getting the Right Components for Your Setup The [NVIDIA CUDA Container Images repository](https://gitlab.com/nvidia/container-images/cuda) contains a `dist/` directory with pre-built Dockerfiles organized by CUDA version, Linux distribution, and container type (base, runtime, devel). ### Supported Distributions CUDA components are available for most popular Linux distributions. For the complete and current list of supported distributions for your desired version, check the repository linked above. 
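For example, one way to browse those Dockerfiles locally is to clone the repository and list the `dist/` tree for your CUDA version and distribution. This is only a sketch; directory names such as `12.9.1/ubuntu2404` are assumptions and vary by release:

```bash
$ git clone https://gitlab.com/nvidia/container-images/cuda.git
$ ls cuda/dist/12.9.1/ubuntu2404/
# Expect subdirectories such as base/, runtime/ and devel/, each containing a
# Dockerfile from which you can copy the relevant ENV and RUN instructions.
```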
### Key Differences by Distribution Type **Ubuntu/Debian distributions:** - Use `apt-get install` commands - Repository setup uses GPG keys and `.list` files **RHEL/CentOS/Rocky Linux distributions:** - Use `yum install` or `dnf install` commands - Repository setup uses `.repo` configuration files - Include repository files: `cuda.repo-x86_64`, `cuda.repo-arm64` ### Installing CUDA components on your container 1. Navigate to `dist/{cuda_version}/{your_os}/base/` or `runtime/` in the [repository](https://gitlab.com/nvidia/container-images/cuda) 2. Open the `Dockerfile` for your target distribution 3. Copy all `ENV` variables for package versioning and NVIDIA Container Toolkit support (see the Essential Environment Variables section below) 4. Copy the `RUN` commands for installing the packages 5. If you are using the `runtime` components, make sure to copy the `ENV` and `RUN` commands from the `base` Dockerfile as well 6. For RHEL-based systems, also copy any `.repo` configuration files needed #### NOTE Package versions change between CUDA releases. Always check the specific Dockerfile for your desired CUDA version and distribution to get the correct versions. ### Installing RAPIDS libraries on your container Refer to the Docker Templates in the [Custom RAPIDS Docker Guide](../custom-docker.md) to configure your RAPIDS installation, adding the conda or pip installation commands after the CUDA components are installed. ## Essential Environment Variables These environment variables are **required** when building CUDA containers, as they control GPU access and CUDA functionality through the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html) | Variable | Purpose | |------------------------------|----------------------------------| | `NVIDIA_VISIBLE_DEVICES` | Specifies which GPUs are visible | | `NVIDIA_DRIVER_CAPABILITIES` | Required driver capabilities | | `NVIDIA_REQUIRE_CUDA` | Driver version constraints | | `PATH` | Include CUDA binaries | | `LD_LIBRARY_PATH` | Include CUDA libraries | ## Complete Integration Examples Here are complete examples showing how to build a RAPIDS container with CUDA 12.9.1 components on an Ubuntu 24.04 base image: ### conda ### RAPIDS with conda (Base Components) Create an `env.yaml` file alongside your Dockerfile with your desired RAPIDS packages following the configuration described in the [Custom RAPIDS Docker Guide](../custom-docker.md). Set the `TARGETARCH` build argument to match your target architecture (`amd64` for x86_64 or `arm64` for ARM processors). 
```dockerfile FROM ubuntu:24.04 # Build arguments ARG TARGETARCH=amd64 # Architecture detection and setup ENV NVARCH=${TARGETARCH/amd64/x86_64} ENV NVARCH=${NVARCH/arm64/sbsa} SHELL ["/bin/bash", "-euo", "pipefail", "-c"] # NVIDIA Repository Setup (Ubuntu 24.04) RUN apt-get update && apt-get install -y --no-install-recommends \ gnupg2 curl ca-certificates && \ curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/${NVARCH}/3bf863cc.pub | apt-key add - && \ echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/${NVARCH} /" > /etc/apt/sources.list.d/cuda.list && \ apt-get purge --autoremove -y curl && \ rm -rf /var/lib/apt/lists/* # CUDA Base Package Versions (from CUDA 12.9.1 base image) ENV NV_CUDA_CUDART_VERSION=12.9.79-1 ENV CUDA_VERSION=12.9.1 # NVIDIA driver constraints ENV NVIDIA_REQUIRE_CUDA="cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571" # Install Base CUDA Components (from base image) RUN apt-get update && apt-get install -y --no-install-recommends \ cuda-cudart-12-9=${NV_CUDA_CUDART_VERSION} \ cuda-compat-12-9 && \ rm -rf /var/lib/apt/lists/* # CUDA Environment Configuration ENV PATH=/usr/local/cuda/bin:${PATH} ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64 # NVIDIA Container Runtime Configuration ENV NVIDIA_VISIBLE_DEVICES=all ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility # Required for nvidia-docker v1 RUN echo "/usr/local/cuda/lib64" >> 
/etc/ld.so.conf.d/nvidia.conf # Install system dependencies RUN apt-get update && \ apt-get install -y --no-install-recommends \ wget \ curl \ git \ ca-certificates \ && rm -rf /var/lib/apt/lists/* # Install Miniforge RUN wget -qO /tmp/miniforge.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" && \ bash /tmp/miniforge.sh -b -p /opt/conda && \ rm /tmp/miniforge.sh && \ /opt/conda/bin/conda clean --all --yes # Add conda to PATH and activate base environment ENV PATH="/opt/conda/bin:${PATH}" ENV CONDA_DEFAULT_ENV=base ENV CONDA_PREFIX=/opt/conda # Create conda group and rapids user RUN groupadd -g 1001 conda && \ useradd -rm -d /home/rapids -s /bin/bash -g conda -u 1001 rapids && \ chown -R rapids:conda /opt/conda USER rapids WORKDIR /home/rapids # Copy the environment file template COPY --chmod=644 env.yaml /home/rapids/env.yaml # Update the base environment with user's packages from env.yaml # Note: The -n base flag ensures packages are installed to the base environment # overriding any 'name:' specified in the env.yaml file RUN /opt/conda/bin/conda env update -n base -f env.yaml && \ /opt/conda/bin/conda clean --all --yes CMD ["bash"] ``` ### pip ### RAPIDS with pip (Runtime Components) Create a `requirements.txt` file alongside your Dockerfile with your desired RAPIDS packages following the configuration described in the [Custom RAPIDS Docker Guide](../custom-docker.md). Set the `TARGETARCH` build argument to match your target architecture (`amd64` for x86_64 or `arm64` for ARM processors). You can also customize the Python version by changing the `PYTHON_VER` build argument. ```dockerfile FROM ubuntu:24.04 # Build arguments ARG PYTHON_VER=3.12 ARG TARGETARCH=amd64 # Architecture detection and setup ENV NVARCH=${TARGETARCH/amd64/x86_64} ENV NVARCH=${NVARCH/arm64/sbsa} SHELL ["/bin/bash", "-euo", "pipefail", "-c"] # NVIDIA Repository Setup (Ubuntu 24.04) RUN apt-get update && apt-get install -y --no-install-recommends \ gnupg2 curl ca-certificates && \ curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/${NVARCH}/3bf863cc.pub | apt-key add - && \ echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/${NVARCH} /" > /etc/apt/sources.list.d/cuda.list && \ apt-get purge --autoremove -y curl && \ rm -rf /var/lib/apt/lists/* # CUDA Package Versions (from CUDA 12.9.1 base and runtime images) ENV NV_CUDA_CUDART_VERSION=12.9.79-1 ENV NV_CUDA_LIB_VERSION=12.9.1-1 ENV NV_NVTX_VERSION=12.9.79-1 ENV NV_LIBNPP_VERSION=12.4.1.87-1 ENV NV_LIBNPP_PACKAGE=libnpp-12-9=${NV_LIBNPP_VERSION} ENV NV_LIBCUSPARSE_VERSION=12.5.10.65-1 ENV NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-9 ENV NV_LIBCUBLAS_VERSION=12.9.1.4-1 ENV NV_LIBCUBLAS_PACKAGE=${NV_LIBCUBLAS_PACKAGE_NAME}=${NV_LIBCUBLAS_VERSION} ENV NV_LIBNCCL_PACKAGE_NAME=libnccl2 ENV NV_LIBNCCL_PACKAGE_VERSION=2.27.3-1 ENV NCCL_VERSION=2.27.3-1 ENV NV_LIBNCCL_PACKAGE=${NV_LIBNCCL_PACKAGE_NAME}=${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.9 ENV CUDA_VERSION=12.9.1 # NVIDIA driver constraints ENV NVIDIA_REQUIRE_CUDA="cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 
brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571" # Install Base CUDA Components RUN apt-get update && apt-get install -y --no-install-recommends \ cuda-cudart-12-9=${NV_CUDA_CUDART_VERSION} \ cuda-compat-12-9 && \ rm -rf /var/lib/apt/lists/* # Install Runtime CUDA Components RUN apt-get update && apt-get install -y --no-install-recommends \ cuda-libraries-12-9=${NV_CUDA_LIB_VERSION} \ ${NV_LIBNPP_PACKAGE} \ cuda-nvtx-12-9=${NV_NVTX_VERSION} \ libcusparse-12-9=${NV_LIBCUSPARSE_VERSION} \ ${NV_LIBCUBLAS_PACKAGE} \ ${NV_LIBNCCL_PACKAGE} && \ rm -rf /var/lib/apt/lists/* # Keep apt from auto upgrading the cublas and nccl packages RUN apt-mark hold ${NV_LIBCUBLAS_PACKAGE_NAME} ${NV_LIBNCCL_PACKAGE_NAME} # CUDA Environment Configuration ENV PATH=/usr/local/cuda/bin:${PATH} ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64 # NVIDIA Container Runtime Configuration ENV NVIDIA_VISIBLE_DEVICES=all ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility # Required for nvidia-docker v1 RUN echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/nvidia.conf # Install system dependencies RUN apt-get update && \ apt-get install -y --no-install-recommends \ python${PYTHON_VER} \ python${PYTHON_VER}-venv \ python3-pip \ wget \ curl \ git \ ca-certificates \ && rm -rf /var/lib/apt/lists/* # Create symbolic links for python and pip RUN ln -sf /usr/bin/python${PYTHON_VER} /usr/bin/python && \ ln -sf /usr/bin/python${PYTHON_VER} /usr/bin/python3 # Create rapids user RUN groupadd -g 1001 rapids && \ useradd -rm -d /home/rapids -s /bin/bash -g rapids -u 1001 rapids USER rapids WORKDIR /home/rapids # Create and activate virtual environment RUN python -m venv /home/rapids/venv ENV PATH="/home/rapids/venv/bin:$PATH" ENV VIRTUAL_ENV="/home/rapids/venv" # Upgrade pip RUN pip install --no-cache-dir --upgrade pip setuptools wheel # Copy the requirements file COPY --chmod=644 requirements.txt 
/home/rapids/requirements.txt # Install all packages RUN pip install --no-cache-dir -r requirements.txt CMD ["bash"] ``` ## Verifying Your Installation After starting your container, you can quickly test that RAPIDS is installed and running correctly. The container launches directly into a `bash` shell where you can install the [RAPIDS CLI](https://github.com/rapidsai/rapids-cli) command line utility to verify your installation. 1. **Run the Container Interactively** This command starts your container and drops you directly into a bash shell. ```bash # Build the conda-based container (requires env.yaml in build context) docker build -f conda-rapids.Dockerfile -t rapids-conda-cuda . # Build the pip-based container (requires requirements.txt in build context) docker build -f pip-rapids.Dockerfile -t rapids-pip-cuda . # Run conda container with GPU access docker run --gpus all -it rapids-conda-cuda # Run pip container with GPU access docker run --gpus all -it rapids-pip-cuda ``` 2. **Install RAPIDS CLI** Inside the containers, install the RAPIDS CLI: ```bash pip install rapids-cli ``` 3. **Test the installation using the Doctor subcommand** Once RAPIDS CLI is installed, you can use the `rapids doctor` subcommand to perform health checks. ```bash rapids doctor ``` 4. **Expected Output** If your installation is successful, you will see output similar to this: ```console 🧑‍⚕️ Performing REQUIRED health check for RAPIDS Running checks All checks passed! ``` For more on running RAPIDS with Docker, see the [Custom RAPIDS Docker Guide](../custom-docker.md) and the [RAPIDS installation guide](https://docs.rapids.ai/install/). # index.html.md # Multi-Instance GPU (MIG) [Multi-Instance GPU](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/) is a technology that allows partitioning a single GPU into multiple instances, making each one appear as a completely independent GPU. Each instance then receives a certain slice of the GPU computational resources and a pre-defined block of memory that is detached from the other instances by on-chip protections. Because of the protection layer that makes MIG secure, certain limitations exist. One such limitation that is generally important for HPC applications is the lack of support for [CUDA Inter-Process Communication (IPC)](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#interprocess-communication), which enables transfers over NVLink and NVSwitch to greatly speed up communication between physical GPUs. When using MIG, [NVLink and NVSwitch](https://www.nvidia.com/en-us/data-center/nvlink/) are thus completely unavailable, forcing the application to take a more expensive communication channel via the system (CPU) memory. Given these limitations in communication capability, we advise users to first understand the tradeoffs involved when attempting to set up a cluster of MIG instances. While the partitioning could be beneficial to certain applications that need only a certain amount of compute capability, communication bottlenecks may be a problem and thus need to be considered carefully. ## Dask Cluster Dask clusters of MIG instances are supported via Dask-CUDA as long as all MIG instances are identical with respect to memory. Much like in a cluster of physical GPUs, mixing GPUs with different memory sizes is generally not a good idea, as Dask may not be able to balance work correctly, which could eventually lead to more frequent out-of-memory errors.
For example, partitioning two GPUs into 7 x 10GB instances each and setting up a cluster with all 14 instances should be OK. However, partitioning one of the GPUs into 7 x 10GB instances and another with 3 x 20GB should be avoided. Unlike for a system composed of unpartitioned GPUs, Dask-CUDA cannot automatically infer the GPUs to be utilized for the cluster. In a MIG setup, the user is then required to specify the GPU instances to be used by the cluster. This is achieved by specifying the `CUDA_VISIBLE_DEVICES` environment variable for either [`dask_cuda.LocalCUDACluster`](https://docs.rapids.ai/api/dask-cuda/stable/api/#dask_cuda.LocalCUDACluster) or `dask-cuda-worker`, or by passing the argument of the same name to [`dask_cuda.LocalCUDACluster`](https://docs.rapids.ai/api/dask-cuda/stable/api/#dask_cuda.LocalCUDACluster). Physical GPUs can be addressed by their indices `[0..N)` (where `N` is the total number of GPUs installed) or by their names, composed of the `GPU-` prefix followed by the GPU's UUID. MIG instances have no indices and can only be addressed by their names, composed of the `MIG-` prefix followed by the instance's UUID. The name of a MIG instance will then look similar to: `MIG-41b3359c-e721-56e5-8009-12e5797ed514`. ### Determine MIG Names The simplest way to determine the names of MIG instances is to run `nvidia-smi -L` on the command line. ```console $ nvidia-smi -L GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb) MIG 2g.10gb Device 0: (UUID: MIG-41b3359c-e721-56e5-8009-12e5797ed514) MIG 2g.10gb Device 1: (UUID: MIG-65b79fff-6d3c-5490-a288-b31ec705f310) MIG 2g.10gb Device 2: (UUID: MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0) ``` In the example above, the system has one NVIDIA A100 with 3 x 10GB MIG instances. In the next sections we will see how to use the instance names to start up a Dask cluster composed of MIG GPUs. Please note that once a GPU is partitioned, the physical GPU (named `GPU-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb` above) is inaccessible for CUDA compute and cannot be used as part of a Dask cluster. Alternatively, MIG instance names can be obtained programmatically using [NVML](https://developer.nvidia.com/nvidia-management-library-nvml) or [PyNVML](https://pypi.org/project/nvidia-ml-py/). Please refer to the [NVML API](https://docs.nvidia.com/deploy/nvml-api/) to write appropriate utilities for that purpose.
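For example, a minimal PyNVML sketch (using the `nvidia-ml-py` bindings and NVML's MIG query functions; this assumes MIG mode is already enabled on the GPUs) might look like the following:

```python
# Sketch: list MIG instance names with PyNVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            max_mig = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
        except pynvml.NVMLError:
            continue  # GPU does not support MIG
        for j in range(max_mig):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
            except pynvml.NVMLError:
                continue  # MIG slot not populated
            uuid = pynvml.nvmlDeviceGetUUID(mig)
            # Older bindings return bytes, newer ones return str
            print(uuid.decode() if isinstance(uuid, bytes) else uuid)
finally:
    pynvml.nvmlShutdown()
```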
### LocalCUDACluster Suppose you have 3 MIG instances on the local system: - `MIG-41b3359c-e721-56e5-8009-12e5797ed514` - `MIG-65b79fff-6d3c-5490-a288-b31ec705f310` - `MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0` To start a [`dask_cuda.LocalCUDACluster`](https://docs.rapids.ai/api/dask-cuda/stable/api/#dask_cuda.LocalCUDACluster), the user would run the following: ```python from dask_cuda import LocalCUDACluster cluster = LocalCUDACluster( CUDA_VISIBLE_DEVICES=[ "MIG-41b3359c-e721-56e5-8009-12e5797ed514", "MIG-65b79fff-6d3c-5490-a288-b31ec705f310", "MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0", ], # Other `LocalCUDACluster` arguments ) ``` ### dask-cuda-worker Suppose you have 3 MIG instances on the local system: - `MIG-41b3359c-e721-56e5-8009-12e5797ed514` - `MIG-65b79fff-6d3c-5490-a288-b31ec705f310` - `MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0` To start a `dask-cuda-worker` whose scheduler address is provided in the `scheduler.json` file, the user would run the following: ```console $ CUDA_VISIBLE_DEVICES="MIG-41b3359c-e721-56e5-8009-12e5797ed514,MIG-65b79fff-6d3c-5490-a288-b31ec705f310,MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0" dask-cuda-worker scheduler.json # --other-arguments ``` Please note that in the example above we created 3 Dask-CUDA workers on one node. For a multi-node cluster, the correct MIG names need to be specified, and they will always be different for each host. ## XGBoost with Dask Cluster Currently [XGBoost](https://www.nvidia.com/en-us/glossary/data-science/xgboost/) only exposes support for GPU communication via NCCL, which does not support MIG. For this reason, a Dask cluster that utilizes XGBoost would have to fall back to TCP for all communications, which will likely cause considerable performance degradation. Therefore, using XGBoost with MIG is not recommended. # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # dask-cuda [Dask-CUDA](https://docs.rapids.ai/api/dask-cuda/nightly/) is a library extending `LocalCluster` from `dask.distributed` to enable multi-GPU workloads. ## LocalCUDACluster You can use `LocalCUDACluster` to create a cluster of one or more GPUs on your local machine. You can launch a Dask scheduler on LocalCUDACluster to parallelize and distribute your RAPIDS workflows across multiple GPUs on a single node. In addition to enabling multi-GPU computation, `LocalCUDACluster` also provides a simple interface for managing the cluster, such as starting and stopping the cluster, querying the status of the nodes, and monitoring the workload distribution. ## Pre-requisites Before running these instructions, ensure you have installed the [`dask`](https://docs.dask.org/en/stable/install.html) and [`dask-cuda`](https://docs.rapids.ai/api/dask-cuda/nightly/install.html) packages in your local environment. ## Cluster setup ### Instantiate a LocalCUDACluster object The `LocalCUDACluster` class autodetects the GPUs in your system, so if you create it on a machine with two GPUs it will create a cluster with two workers, each of which is responsible for executing tasks on a separate GPU. ```python from dask_cuda import LocalCUDACluster from dask.distributed import Client cluster = LocalCUDACluster() ``` You can also restrict your cluster to use specific GPUs by setting the `CUDA_VISIBLE_DEVICES` environment variable, or as a keyword argument.
```python cluster = LocalCUDACluster( CUDA_VISIBLE_DEVICES="0,1" ) # Creates one worker for GPUs 0 and 1 ``` ### Connecting a Dask client The Dask scheduler coordinates the execution of tasks, whereas the Dask client is the user-facing interface that submits tasks to the scheduler and monitors their progress. ```python client = Client(cluster) ``` ## Test RAPIDS To test RAPIDS, create a `distributed` client for the cluster and query for the GPU model. ```python def get_gpu_model(): import pynvml pynvml.nvmlInit() return pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0)) result = client.submit(get_gpu_model).result() print(result) # b'Tesla V100-SXM2-16GB ``` # index.html.md # Kubernetes RAPIDS integrates with Kubernetes in many ways depending on your use case. ## Interactive Notebook For single-user interactive sessions you can run the [RAPIDS docker image](https://docs.rapids.ai/install/#docker) which contains a conda environment with the RAPIDS libraries and Jupyter for interactive use. You can run this directly on Kubernetes as a `Pod` and expose Jupyter via a `Service`. For example: ```yaml # rapids-notebook.yaml apiVersion: v1 kind: Service metadata: name: rapids-notebook labels: app: rapids-notebook spec: type: NodePort ports: - port: 8888 name: http targetPort: 8888 nodePort: 30002 selector: app: rapids-notebook --- apiVersion: v1 kind: Pod metadata: name: rapids-notebook labels: app: rapids-notebook spec: securityContext: fsGroup: 0 containers: - name: rapids-notebook image: "rapidsai/notebooks:25.12a-cuda12-py3.13" resources: limits: nvidia.com/gpu: 1 ports: - containerPort: 8888 name: notebook ``` ### Optional: Extended notebook configuration to enable launching multi-node Dask clusters Deploying an interactive single-user notebook can provide a great place to launch further resources. For example you could install `dask-kubernetes` and use the [dask-operator](../tools/kubernetes/dask-operator.md) to create multi-node Dask clusters from your notebooks. To do this you’ll need to create a couple of extra resources when launching your notebook `Pod`. ### Service account and role To be able to interact with the Kubernetes API from within your notebook and create Dask resources you’ll need to create a service account with an attached role. ```yaml apiVersion: v1 kind: ServiceAccount metadata: name: rapids-dask --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: rapids-dask rules: - apiGroups: [""] resources: ["pods", "services"] verbs: ["get", "list", "watch", "create", "delete"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get", "list"] - apiGroups: [kubernetes.dask.org] resources: ["*"] verbs: ["*"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: rapids-dask roleRef: apiGroup: rbac.authorization.k8s.io kind: Role name: rapids-dask subjects: - kind: ServiceAccount name: rapids-dask ``` Then you need to augment the `Pod` spec above with a reference to this service account. ```yaml apiVersion: v1 kind: Pod metadata: name: rapids-notebook labels: app: rapids-notebook spec: serviceAccountName: rapids-dask ... ``` ### Proxying the Dask dashboard and other services The RAPIDS container comes with the [jupyter-server-proxy](https://jupyter-server-proxy.readthedocs.io/en/latest/) plugin preinstalled which you can use to access other services running in your notebook via the Jupyter URL. 
However, by default [this is restricted to only proxying services running within your Jupyter Pod](https://jupyter-server-proxy.readthedocs.io/en/latest/arbitrary-ports-hosts.html). To access other resources like Dask clusters that have been launched in the Kubernetes cluster we need to configure Jupyter to allow this. First we create a `ConfigMap` with our configuration file. ```yaml apiVersion: v1 kind: ConfigMap metadata: name: jupyter-server-proxy-config data: jupyter_server_config.py: | c.ServerProxy.host_allowlist = lambda app, host: True ``` Then we further modify out `Pod` spec to mount in this `ConfigMap` to the right location. ```yaml apiVersion: v1 kind: Pod ... spec: containers - name: rapids-notebook ... volumeMounts: - name: jupyter-server-proxy-config mountPath: /root/.jupyter/jupyter_server_config.py subPath: jupyter_server_config.py volumes: - name: jupyter-server-proxy-config configMap: name: jupyter-server-proxy-config ``` We also might want to configure Dask to know where to look for the Dashboard via the proxied URL. We can set this via an environment variable in our `Pod`. ```yaml apiVersion: v1 kind: Pod ... spec: containers - name: rapids-notebook ... env: - name: DASK_DISTRIBUTED__DASHBOARD__LINK value: "/proxy/{host}:{port}/status" ``` ### Putting it all together Here’s an extended `rapids-notebook.yaml` spec putting all of this together. ```yaml # rapids-notebook.yaml (extended) apiVersion: v1 kind: ServiceAccount metadata: name: rapids-dask --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: rapids-dask rules: - apiGroups: [""] resources: ["pods", "services"] verbs: ["get", "list", "watch", "create", "delete"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get", "list"] - apiGroups: [kubernetes.dask.org] resources: ["*"] verbs: ["*"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: rapids-dask roleRef: apiGroup: rbac.authorization.k8s.io kind: Role name: rapids-dask subjects: - kind: ServiceAccount name: rapids-dask --- apiVersion: v1 kind: ConfigMap metadata: name: jupyter-server-proxy-config data: jupyter_server_config.py: | c.ServerProxy.host_allowlist = lambda app, host: True --- apiVersion: v1 kind: Service metadata: name: rapids-notebook labels: app: rapids-notebook spec: type: ClusterIP ports: - port: 8888 name: http targetPort: notebook selector: app: rapids-notebook --- apiVersion: v1 kind: Pod metadata: name: rapids-notebook labels: app: rapids-notebook spec: serviceAccountName: rapids-dask securityContext: fsGroup: 0 containers: - name: rapids-notebook image: rapidsai/notebooks:25.12a-cuda12-py3.13 resources: limits: nvidia.com/gpu: 1 ports: - containerPort: 8888 name: notebook env: - name: DASK_DISTRIBUTED__DASHBOARD__LINK value: "/proxy/{host}:{port}/status" volumeMounts: - name: jupyter-server-proxy-config mountPath: /root/.jupyter/jupyter_server_config.py subPath: jupyter_server_config.py volumes: - name: jupyter-server-proxy-config configMap: name: jupyter-server-proxy-config ``` ```bash $ kubectl apply -f rapids-notebook.yaml ``` The container creation takes approximately 7 min, you can check the status of the Pod by doing: ```bash $ kubectl get pods ``` Once it’s ready, Jupyter will be accessible on port `30002` of your Kubernetes nodes via `NodePort` service. 
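For example, with the `NodePort` service from the manifest above you can look up one of your node addresses and open Jupyter on port `30002`. This is only a sketch; which address column you need depends on your cluster networking:

```bash
# Find a node IP to connect to (EXTERNAL-IP or INTERNAL-IP depending on your setup)
$ kubectl get nodes -o wide

# Then browse to http://<node-ip>:30002 to reach Jupyter
```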
Alternatively you could use a `LoadBalancer` service type [if you have one configured](https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/) or a `ClusterIP` and use `kubectl` to port forward the port locally and access it that way. ```bash $ kubectl port-forward service/rapids-notebook 8888 ``` Then you can open port `8888` in your browser to access Jupyter and use RAPIDS. ![Screenshot of the RAPIDS container running Jupyter showing the nvidia-smi command with a GPU listed](images/kubernetes-jupyter.png) #### NOTE Once you are done, make sure to delete your cluster to stop billing. ## Dask Operator [Dask has an operator](https://kubernetes.dask.org/en/latest/operator.html) that empowers users to create Dask clusters as native Kubernetes resources. This is useful for creating, scaling and removing Dask clusters dynamically and in a flexible way. Usually this is used in conjunction with an interactive session such as the [interactive notebook](#interactive-notebook) example above or from another service like [KubeFlow Notebooks](kubeflow.md). By dynamically launching Dask clusters configured to use RAPIDS on Kubernetes, users can burst beyond their notebook session to many GPUs spread across many nodes. Find out more on the [Dask Operator page](../tools/kubernetes/dask-operator.md). ## Helm Chart Individual users can also install the [Dask Helm Chart](https://helm.dask.org) which provides a `Pod` running Jupyter alongside a Dask cluster consisting of Pods running the Dask scheduler and worker components. You can customize this helm chart to run the RAPIDS container images as both the notebook server and Dask cluster components so that everything can benefit from GPU acceleration. Find out more on the [Dask Helm Chart page](../tools/kubernetes/dask-helm-chart.md). ## Dask Gateway Some organisations may want to provide Dask cluster provisioning as a central service where users are abstracted from the underlying platform like Kubernetes. This can be useful for reducing user permissions, limiting resources that users can consume and exposing things in a centralised way. For this you can deploy Dask Gateway, which provides a server that users interact with programmatically and in turn launches Dask clusters on Kubernetes and proxies the connection back to the user. Users can configure what they want their Dask cluster to look like, so it is possible to utilize GPUs and RAPIDS for an accelerated cluster (a brief connection sketch is shown just before the Related Examples below). ## KubeFlow If you are using KubeFlow you can integrate RAPIDS right away by using the RAPIDS container images within notebooks and pipelines and by using the Dask Operator to launch GPU accelerated Dask clusters. Find out more on the [KubeFlow page](kubeflow.md).
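As a rough illustration of the Dask Gateway workflow mentioned above, connecting from Python might look like the following sketch. The gateway address is hypothetical, and the cluster options (including any GPU worker settings) depend entirely on how your administrator configured the gateway:

```python
from dask_gateway import Gateway

# Hypothetical gateway address provided by your organisation
gateway = Gateway("https://dask-gateway.example.com")

# Inspect the options this deployment exposes (GPU and worker settings vary per install)
options = gateway.cluster_options()

# Launch a cluster and connect a Dask client to it
cluster = gateway.new_cluster(options)
client = cluster.get_client()
```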
### Related Examples Scaling up Hyperparameter Optimization with Multi-GPU Workload on Kubernetes library/xgboost library/optuna library/dask library/dask-kubernetes library/scikit-learn workflow/hpo dataset/nyc-taxi data-storage/gcs data-format/csv platforms/kubeflow platforms/kubernetes Autoscaling Multi-Tenant Kubernetes Deep-Dive cloud/gcp/gke tools/dask-operator library/cuspatial library/dask library/cudf data-format/parquet data-storage/gcs platforms/kubernetes Perform Time Series Forecasting on Google Kubernetes Engine with NVIDIA GPUs cloud/gcp/gke tools/dask-operator workflow/hpo workflow/xgboost library/dask library/dask-cuda library/xgboost library/optuna data-storage/gcs platforms/kubernetes Scaling up Hyperparameter Optimization with Kubernetes and XGBoost GPU Algorithm library/xgboost library/optuna library/dask tools/dask-kubernetes library/scikit-learn workflow/hpo platforms/kubeflow platforms/kubernetes # index.html.md # Coiled You can deploy RAPIDS on cloud VMs with GPUs using [Coiled](https://www.coiled.io/). Coiled is a software platform that manages Cloud VMs on your behalf. It manages software environments and can launch Python scripts, Jupyter Notebook servers, Dask clusters or even just individual Python functions. Remote machines are booted just in time and shut down when not in use or idle. By using the [`coiled`](https://anaconda.org/conda-forge/coiled) Python library, you can setup and manage Dask clusters with GPUs and RAPIDs on cloud computing environments such as GCP or AWS. ## Setup Head over to [Coiled](https://docs.coiled.io/user_guide/setup/index) and register for an account. Once your account is set up, install the coiled Python library/CLI tool. ```bash $ pip install coiled ``` Then you can authenticate with your Coiled account. ```bash $ coiled login ``` For more information see the [Coiled Getting Started documentation](https://docs.coiled.io/user_guide/setup/index). ## Notebook Quickstart The simplest way to get up and running with RAPIDS on Coiled is to launch a Jupyter notebook server using the RAPIDS notebook container. ```bash $ coiled notebook start --gpu --container rapidsai/notebooks:25.12a-cuda12-py3.13 ``` ![Screenshot of Jupyterlab running on Coiled executing some cudf GPU code](_static/images/platforms/coiled/coiled-jupyter.png) ## Software Environments By default when running remote operations Coiled will [attempt to create a copy of your local software environment](https://docs.coiled.io/user_guide/software/sync.html) which can be loaded onto the remote VMs. While this is an excellent feature it’s likely that you do not have all of the GPU software libraries you wish to use installed locally. In this case we need to tell Coiled which software environment to use. ### Container images All Coiled commands can be passed a container image to use. This container will be pulled onto the remote VM at launch time. ```bash $ coiled notebook start --gpu --container rapidsai/notebooks:25.12a-cuda12-py3.13 ``` This is often the most convenient way to try out existing software environments, but is often not the most performant due to the way container images are unpacked. ### Coiled Software Environments You can also created Coiled software environments ahead of time. These environments are built and cached on the cloud and can be pulled onto new VMs very quickly. You can create a RAPIDS software environment using a conda `environment.yaml` file or a pip `requirements.txt` file. 
#### Conda example Create an environment file containing the RAPIDS packages ```yaml # rapids-environment.yaml name: rapidsai-notebooks channels: - { { rapids_conda_channel } } - conda-forge - nvidia dependencies: # RAPIDS packages - rapids=25.12 - python=3.12 - cuda-version>=12.0,<=12.9 # (optional) Jupyter packages, necessary for Coiled Notebooks and Dask clusters with Jupyter enabled - jupyterlab - jupyterlab-nvdashboard - dask-labextension ``` ```bash $ coiled env create --name rapids --gpu-enabled --conda rapids-environment.yaml ``` Then you can specify this software environment when starting new Coiled resources. ```bash $ coiled notebook start --gpu --software rapidsai-notebooks ``` ## CLI Jobs You can execute a script in a container on an ephemeral VM with [Coiled CLI Jobs](https://docs.coiled.io/user_guide/cli-jobs.html). ```bash $ coiled run python my_code.py # Boots a VM on the cloud, runs the scripts, then shuts down again ``` We can use this to run GPU code on a remote environment using the RAPIDS container. You can set the coiled CLI to keep the VM around for a few minutes after execution is complete just in case you want to run it again and reuse the same hardware. ```bash $ coiled run --gpu --name rapids-demo --keepalive 5m --container rapidsai/base:25.12a-cuda12-py3.13 -- python my_code.py ... ``` This works very nicely when paired with the cudf.pandas CLI tool. For example we can run `python -m cudf.pandas my_script` to GPU accelerate our Pandas code without having to rewrite anything. For example [this script](https://gist.github.com/jacobtomlinson/2481ecf2e1d2787ae2864a6712eef97b#file-cudf_pandas_coiled_demo-py) processes some open NYC parking data. With `pandas` it takes around a minute, but with `cudf.pandas` it only takes a few seconds. ```bash $ coiled run --gpu --name rapids-demo --keepalive 5m --container rapidsai/base:25.12a-cuda12-py3.13 -- python -m cudf.pandas cudf_pandas_coiled_demo.py Output ------ This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf Calculate violations by state took: 3.470 seconds Calculate violations by vehicle type took: 0.145 seconds Calculate violations by day of week took: 1.238 seconds ``` ## Notebooks To start an interactive Jupyter notebook session with [Coiled Notebooks](https://docs.coiled.io/user_guide/notebooks.html) run the RAPIDS notebook container via the notebook service. ```bash $ coiled notebook start --gpu --container rapidsai/notebooks:25.12a-cuda12-py3.13 ``` Note that the `--gpu` flag will automatically select a `g4dn.xlarge` instance with a T4 GPU on AWS. You could additionally add the `--vm-type` flag to explicitly choose another machine type with different GPU configuration. For example to choose a machine with 4 L4 GPUs you would run the following. ```bash $ coiled notebook start --gpu --vm-type g6.24xlarge --container nvcr.io/nvidia/rapidsai/notebooks:24.12-cuda12.5-py3.12 ``` ## Dask Clusters Coiled’s [managed Dask clusters](https://docs.coiled.io/user_guide/dask.html) can also provision clusters using [dask-cuda](https://docs.rapids.ai/api/dask-cuda/nightly/) to enable using RAPIDS in a distributed way. 
```python cluster = coiled.Cluster( container="rapidsai/notebooks:25.12a-cuda12-py3.13", # specify the software env to use jupyter=True, # run Jupyter server on scheduler scheduler_gpu=True, # add GPU to scheduler n_workers=4, worker_gpu=1, # single T4 per worker worker_class="dask_cuda.CUDAWorker", # recommended ) ``` Once the cluster has started you can also get the Jupyter URL and navigate to Jupyter Lab running on the Dask Scheduler node. ```python >>> print(cluster.jupyter_link) https://cluster-abc123.dask.host/jupyter/lab?token=dddeeefff444555666 ``` We can run `!nvidia-smi` in our notebook to see information on the GPU available to Jupyter. We can also connect a Dask client to see that information for the workers too. ```python from dask.distributed import Client client = Client() client ``` ![Screenshot of Jupyter Lab running on a Coiled Dask Cluster with GPUs](_static/images/platforms/coiled/jupyter-on-coiled.png) From this Jupyter session we can see that our notebook server has a GPU and we can connect to the Dask cluster with no configuration and see all the Dask Workers have GPUs too. # index.html.md # Kubeflow You can use RAPIDS with Kubeflow in a single Pod with [Kubeflow Notebooks](https://www.kubeflow.org/docs/components/notebooks/) or you can scale out to many Pods on many nodes of the Kubernetes cluster with the [dask-operator](../tools/kubernetes/dask-operator.md). #### NOTE These instructions were tested against [Kubeflow v1.5.1](https://github.com/kubeflow/manifests/releases/tag/v1.5.1) running on [Kubernetes v1.21](https://kubernetes.io/blog/2021/04/08/kubernetes-1-21-release-announcement/). Visit [Installing Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/) for instructions on installing Kubeflow on your Kubernetes cluster. ## Kubeflow Notebooks The [RAPIDS docker images](https://docs.rapids.ai/install/#docker) can be used directly in Kubeflow Notebooks with no additional configuration. To find the latest image head to [the RAPIDS install page](https://docs.rapids.ai/install), as shown in below, and choose a version of RAPIDS to use. Typically we want to choose the container image for the latest release. Verify the Docker image is selected when installing the latest RAPIDS release. Be sure to match the CUDA version in the container image with that installed on your Kubernetes nodes. The default CUDA version installed on GKE Stable is 11.4 for example, so we would want to choose that. From 11.5 onwards it doesn’t matter as they will be backward compatible. Copy the container image name from the install command (i.e. `rapidsai/base:25.12a-cuda12-py3.13`). #### NOTE You can [check your CUDA version](https://jacobtomlinson.dev/posts/2022/how-to-check-your-nvidia-driver-and-cuda-version-in-kubernetes/) by creating a Pod and running `nvidia-smi`. For example: ```console $ kubectl run nvidia-smi --restart=Never --rm -i --tty --image nvidia/cuda:11.0.3-base-ubuntu20.04 -- nvidia-smi +-----------------------------------------------------------------------------+ | NVIDIA-SMI 495.46 Driver Version: 495.46 CUDA Version: 11.5 | |-------------------------------+----------------------+----------------------+ ... ``` Now in Kubeflow, access the Notebooks tab on the left and click “New Notebook”. ![Screenshot of the Kubeflow Notebooks page with the “New Notebook” button highlighted](images/kubeflow-create-notebook.png) On this page, we must set a few configuration options. First, let’s give it a name like `rapids`. 
We need to check the “use custom image” box and paste in the container image we got from the RAPIDS release selector. Then, we want to set the CPU and RAM to something a little higher (i.e. 2 CPUs and 8GB memory) and set the number of NVIDIA GPUs to 1. ![Screenshot of the Kubeflow Notebooks page](images/kubeflow-new-notebook.png) Then, you can scroll to the bottom of the page and hit launch. You should see it starting up in your list. The RAPIDS container images are packed full of amazing tools so this step can take a little while. ![Screenshot of the Kubeflow Notebooks page showing the rapids notebook starting up](images/kubeflow-notebook-running.png) You can verify everything works okay by opening a terminal in Jupyter and running: ```bash $ nvidia-smi ``` ![Screenshot of a terminal open in Juputer Lab with the output of the nvidia-smi command listing one A100 GPU](images/kubeflow-jupyter-nvidia-smi.png) The RAPIDS container also comes with some example notebooks which you can find in `/rapids/notebooks`. You can make a symbolic link to these from your home directory so you can easily navigate using the file explorer on the left `ln -s /rapids/notebooks /home/jovyan/notebooks`. Now you can navigate those example notebooks and explore all the libraries RAPIDS offers. For example, ETL developers that use [Pandas](https://pandas.pydata.org/) should check out the [cuDF](https://docs.rapids.ai/api/cudf/nightly/) notebooks for examples of accelerated dataframes. ![Screenshot of Jupyter Lab with the “10 minutes to cuDF and dask-cuDF” notebook open](images/kubeflow-jupyter-example-notebook.png) ## Scaling out to many GPUs Many of the RAPIDS libraries also allow you to scale out your computations onto many GPUs spread over many nodes for additional acceleration. To do this we leverage [Dask](https://www.dask.org/), an open source Python library for distributed computing. To use Dask, we need to create a scheduler and some workers that will perform our calculations. These workers will also need GPUs and the same Python environment as your notebook session. Dask has [an operator for Kubernetes](../tools/kubernetes/dask-operator.md) that you can use to manage Dask clusters on your Kubeflow cluster. ### Installing the Dask Kubernetes operator To install the operator we need to create any custom resources and the operator itself, please [refer to the documentation](https://kubernetes.dask.org/en/latest/installing.html) to find up-to-date installation instructions. From the terminal run the following command. ```console $ helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator NAME: dask-kubernetes-operator-1666875935 NAMESPACE: dask-operator STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: Operator has been installed successfully. ``` Verify our resources were applied successfully by listing our Dask clusters. Don’t expect to see any resources yet but the command should succeed. ```console $ kubectl get daskclusters No resources found in default namespace. ``` You can also check the operator Pod is running and ready to launch new Dask clusters. ```console $ kubectl get pods -A -l app.kubernetes.io/name=dask-kubernetes-operator NAMESPACE NAME READY STATUS RESTARTS AGE dask-operator dask-kubernetes-operator-775b8bbbd5-zdrf7 1/1 Running 0 74s ``` Lastly, ensure that your notebook session can create and manage Dask custom resources. To do this you need to edit the `kubeflow-kubernetes-edit` cluster role that gets applied to notebook Pods. 
Add a new rule to the rules section for this role to allow everything in the `kubernetes.dask.org` API group. ```console $ kubectl edit clusterrole kubeflow-kubernetes-edit … rules: … - apiGroups: - "kubernetes.dask.org" verbs: - "*" resources: - "*" … ``` ### Creating a Dask cluster Now you can create `DaskCluster` resources in Kubernetes that will launch all the necessary Pods and services for our cluster to work. This can be done in YAML via the Kubernetes API or using the Python API from a notebook session as shown in this section. In a Jupyter session, create a new notebook and install the `dask-kubernetes` package which you will need to launch Dask clusters. ```ipython !pip install dask-kubernetes ``` Next, create a Dask cluster using the `KubeCluster` class. Set the container image to match the one used for your notebook environment and set the number of GPUs to 1. Also tell the RAPIDS container not to start Jupyter by default and run our Dask command instead. This can take a similar amount of time to starting up the notebook container as it will also have to pull the RAPIDS docker image. ```python from dask_kubernetes.experimental import KubeCluster cluster = KubeCluster( name="rapids-dask", image="rapidsai/base:25.12a-cuda12-py3.13", worker_command="dask-cuda-worker", n_workers=2, resources={"limits": {"nvidia.com/gpu": "1"}}, ) ``` ![Screenshot of the Dask cluster widget in Jupyter Lab showing two workers with A100 GPUs](images/kubeflow-jupyter-dask-cluster-widget.png) You can scale this cluster up and down either with the scaling tab in the widget in Jupyter or by calling `cluster.scale(n)` to set the number of workers (and therefore the number of GPUs). Now you can connect a Dask client to our cluster and from that point on any RAPIDS libraries that support dask such as `dask_cudf` will use our cluster to distribute our computation over all of our GPUs. ![Screenshot of some cudf code in Jupyter Lab that leverages Dask](images/kubeflow-jupyter-using-dask.png) ## Accessing the Dask dashboard from notebooks When working interactively in a notebook and leveraging a Dask cluster it can be really valuable to see the Dask dashboard. The dashboard is available on the scheduler `Pod` in the Dask cluster so we need to set some extra configuration to make this available from our notebook `Pod`. To do this, we can apply the following manifest. ```yaml # configure-dask-dashboard.yaml apiVersion: "kubeflow.org/v1alpha1" kind: PodDefault metadata: name: configure-dask-dashboard spec: selector: matchLabels: configure-dask-dashboard: "true" desc: "configure dask dashboard" env: - name: DASK_DISTRIBUTED__DASHBOARD__LINK value: "{NB_PREFIX}/proxy/{host}:{port}/status" volumeMounts: - name: jupyter-server-proxy-config mountPath: /root/.jupyter/jupyter_server_config.py subPath: jupyter_server_config.py volumes: - name: jupyter-server-proxy-config configMap: name: jupyter-server-proxy-config --- apiVersion: v1 kind: ConfigMap metadata: name: jupyter-server-proxy-config data: jupyter_server_config.py: | c.ServerProxy.host_allowlist = lambda app, host: True ``` Create a file with the above contents, and then apply it into your user’s namespace with `kubectl`. For the default `user@example.com` user it would look like this. ```bash $ kubectl apply -n kubeflow-user-example-com -f configure-dask-dashboard.yaml ``` This configuration file does two things. 
First it configures the [jupyter-server-proxy](https://github.com/jupyterhub/jupyter-server-proxy) running in your Notebook container to allow proxying to all hosts. We can do this safely because we are relying on Kubernetes (and specifically Istio) to enforce network access controls. It also sets the `distributed.dashboard-link` config option in Dask so that the widgets and `.dashboard_link` attributes of the `KubeCluster` and `Client` objects show a url that uses the Jupyter server proxy. Once you have created this configuration option you can select it when launching new notebook instances. ![Screenshot of the Kubeflow new notebook form with the “configure dask dashboard” configuration option selected](images/kubeflow-configure-dashboard-option.png) You can then follow the links provided by the widgets in your notebook to open the Dask Dashboard in a new tab. ![Screenshot of the Dask dashboard](images/kubeflow-dask-dashboard.png) You can also use the [Dask Jupyter Lab extension](https://github.com/dask/dask-labextension) to view various plots and stats about your Dask cluster right in Jupyter Lab. Open up the Dask tab on the left side menu and click the little search icon, this will connect Jupyter lab to the dashboard via the client in your notebook. Then you can click the various plots you want to see and arrange them in Jupyter Lab however you like by dragging the tabs around. ![Screenshot of Jupyter Lab with the Dask Lab extension open on the left and various Dask plots arranged on the screen](images/kubeflow-jupyter-dask-labextension.png) ### Related Examples Scaling up Hyperparameter Optimization with Multi-GPU Workload on Kubernetes library/xgboost library/optuna library/dask library/dask-kubernetes library/scikit-learn workflow/hpo dataset/nyc-taxi data-storage/gcs data-format/csv platforms/kubeflow platforms/kubernetes Scaling up Hyperparameter Optimization with Kubernetes and XGBoost GPU Algorithm library/xgboost library/optuna library/dask tools/dask-kubernetes library/scikit-learn workflow/hpo platforms/kubeflow platforms/kubernetes # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Anaconda Cloud Notebooks You can run RAPIDS workloads on [Anaconda Cloud Notebooks](https://www.anaconda.com/products/notebooks) by leveraging remote runtimes. ## Overview To get started, sign up for an Anaconda account and choose the [Starter Tier](https://www.anaconda.com/pricing). Navigate to [nb.anaconda.com](https://nb.anaconda.com/) and start your server. Once logged into JupyterLab, open the launcher and select “Launch a Remote Runtime”. ![Screenshot of the "Jupyter Lab Launcher" UI](_static/images/platforms/anaconda/launcher.png) Select an NVIDIA runtime. ![Screenshot of the "runtime selector" UI with the NVIDIA A10g environment selected](_static/images/platforms/anaconda/nvidia-runtime.png) Create a notebook and change the current runtime to your NVIDIA runtime. ![Screenshot of the "notebook runtime" UI with the NVIDIA A10g environment selected](_static/images/platforms/anaconda/select-nvidia-runtime.png) You will find common RAPIDS libraries including `cudf` and `cuml` already available in this environment and ready to use. # index.html.md # Snowflake You can access `cuDF` and `cuML` in the [Snowflake Notebooks on Container Runtime for ML](https://docs.snowflake.com/en/developer-guide/snowflake-ml/notebooks-on-spcs). 
Or you can install RAPIDS on [Snowflake](https://www.snowflake.com) via [Snowpark Container Services](https://docs.snowflake.com/en/developer-guide/snowpark-container-services/overview). ## Snowflake requirements - A non-trial Snowflake account in AWS or Azure for Notebooks, and for container services an account in a supported [AWS region](https://docs.snowflake.com/en/developer-guide/snowpark-container-services/overview#available-regions) - A Snowflake account login with a role that has the `ACCOUNTADMIN` role. If not, you will need to work with your `ACCOUNTADMIN` to perform the initial account setup. - Access to an `INSTANCE_FAMILY` with NVIDIA GPUs. For this guide we will use `GPU_NV_S` (1 NVIDIA A10G - the smallest NVIDIA GPU size available for Snowpark Containers to get started, and the smallest instance type available for Notebooks) ## `cuDF` and `cuML` in Snowflake Notebooks ML Runtime The [Snowflake Notebooks on Container Runtime for ML](https://docs.snowflake.com/en/developer-guide/snowflake-ml/notebooks-on-spcs) has `cuDF` and `cuML` built into the environment. If you want more control over your environment, or a closer experience to a Jupyter Notebook setup, follow the instructions for [RAPIDS on Snowflake via Snowpark Container Services](#rapids-snowpark). #### NOTE The following instructions are an adaptation of the [Getting Started with Snowflake Notebook Container Runtime](https://quickstarts.snowflake.com/guide/notebook-container-runtime/#1) and the [Train an XGBoost Model with GPUs using Snowflake Notebooks](https://quickstarts.snowflake.com/guide/train-an-xgboost-model-with-gpus-using-snowflake-notebooks/#1) guides from the Snowflake documentation. ### Set up the Snowflake Notebooks In a SQL worksheet in Snowflake, run the following commands to create all the necessary requirements to get started: ```sql USE ROLE accountadmin; CREATE OR REPLACE DATABASE container_runtime_lab; CREATE SCHEMA notebooks; CREATE OR REPLACE ROLE container_runtime_lab_user; GRANT ROLE container_runtime_lab_user to USER naty; GRANT USAGE ON DATABASE container_runtime_lab TO ROLE container_runtime_lab_user; GRANT ALL ON SCHEMA container_runtime_lab.notebooks TO ROLE container_runtime_lab_user; GRANT CREATE STAGE ON SCHEMA container_runtime_lab.notebooks TO ROLE container_runtime_lab_user; GRANT CREATE NOTEBOOK ON SCHEMA container_runtime_lab.notebooks TO ROLE container_runtime_lab_user; GRANT CREATE SERVICE ON SCHEMA container_runtime_lab.notebooks TO ROLE container_runtime_lab_user; CREATE OR REPLACE WAREHOUSE CONTAINER_RUNTIME_WH AUTO_SUSPEND = 60; GRANT ALL ON WAREHOUSE CONTAINER_RUNTIME_WH TO ROLE container_runtime_lab_user; -- Create and grant access to EAIs -- Create network rules (these are schema-level objects; end users do not need direct access to the network rules) create network rule allow_all_rule TYPE = 'HOST_PORT' MODE= 'EGRESS' VALUE_LIST = ('0.0.0.0:443','0.0.0.0:80'); -- Create external access integration (these are account-level objects; end users need access to this to access -- the public internet with endpoints defined in network rules) -- If you need to restrict access and create a different network rule, check pypi_network_rule example in -- https://quickstarts.snowflake.com/guide/notebook-container-runtime/#1 CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION allow_all_integration ALLOWED_NETWORK_RULES = (allow_all_rule) ENABLED = true; GRANT USAGE ON INTEGRATION allow_all_integration TO ROLE container_runtime_lab_user; -- Create compute pool to leverage multiple GPUs (see docs -
https://docs.snowflake.com/en/developer-guide/snowpark-container-services/working-with-compute-pool) CREATE COMPUTE POOL IF NOT EXISTS GPU_NV_S_compute_pool MIN_NODES = 1 MAX_NODES = 1 INSTANCE_FAMILY = GPU_NV_S; -- Grant usage of compute pool to newly created role GRANT USAGE ON COMPUTE POOL GPU_NV_S_compute_pool to ROLE container_runtime_lab_user; ``` ### Create or Upload a new Notebook 1. Make sure under your user you select the role `container_runtime_lab_user` that you just created during the setup step. ![Screenshot of how to switch role to container_runtime_lab_user](images/snowflake_container_runtime_lab_user.png) 1. In the Snowflake app, on the left panel, go to **Projects** -> **Notebooks**. Once there you’ll be able to create a new notebook by selecting the `+ Notebook` button, or if you click the dropdown you’ll be able to import one. In either case, you will need to make some selections, make sure you select the right database, runtime version, compute pool, etc. ![Screenshot of Notebook creation setup](images/snowflake_notebook_creation_setup.png) 1. For this example we suggest you upload the following [notebook cuml example](https://github.com/rapidsai/deployment/tree/main/source/examples/cuml-snowflake-nb/notebook.ipynb). 2. Once the notebook is uploaded, we need to make sure we have access to the internet before we can get started. Go to the three dots at the top right of your Snowflake app and select **Network settings**, then go to **External access** and toggle on the network access `ALLOW_ALL_INTEGRATION` we created in the setup step, and hit **Save** ![Screenshot of how to access Notebook settings](images/snowflake_notebook_settings.png)![Screenshot of Notebook setting external access](images/snowflake_allow_all_integration.png) 1. On the top right hit **Start** to get the compute pool going. After a few minutes you will see the status is **Active**, run the notebook to see `cuml.accel` in action. 2. When you are done, end your session and suspend the compute pool. ## RAPIDS on Snowflake via Snowpark Container Services #### NOTE The following instructions are an adaptation of the [Introduction to Snowpark container Services](https://quickstarts.snowflake.com/guide/intro_to_snowpark_container_services/#0) guide from the Snowflake documentation. 
### Set up the Snowflake environment In a SQL worksheet in Snowflake, run the following commands to create the role, database, warehouse, and stage that we need to get started: ```sql -- Create a CONTAINER_USER_ROLE with required privileges USE ROLE ACCOUNTADMIN; CREATE ROLE CONTAINER_USER_ROLE; GRANT CREATE DATABASE ON ACCOUNT TO ROLE CONTAINER_USER_ROLE; GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE CONTAINER_USER_ROLE; GRANT CREATE COMPUTE POOL ON ACCOUNT TO ROLE CONTAINER_USER_ROLE; GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE CONTAINER_USER_ROLE; GRANT MONITOR USAGE ON ACCOUNT TO ROLE CONTAINER_USER_ROLE; GRANT BIND SERVICE ENDPOINT ON ACCOUNT TO ROLE CONTAINER_USER_ROLE; GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE CONTAINER_USER_ROLE; -- Grant CONTAINER_USER_ROLE to ACCOUNTADMIN grant role CONTAINER_USER_ROLE to role ACCOUNTADMIN; -- Create Database, Warehouse, and Image spec stage USE ROLE CONTAINER_USER_ROLE; CREATE OR REPLACE DATABASE CONTAINER_HOL_DB; CREATE OR REPLACE WAREHOUSE CONTAINER_HOL_WH WAREHOUSE_SIZE = XSMALL AUTO_SUSPEND = 120 AUTO_RESUME = TRUE; CREATE STAGE IF NOT EXISTS specs ENCRYPTION = (TYPE='SNOWFLAKE_SSE'); CREATE STAGE IF NOT EXISTS volumes ENCRYPTION = (TYPE='SNOWFLAKE_SSE') DIRECTORY = (ENABLE = TRUE); ``` Then we proceed to create the external access integration, the compute pool (with GPU resources), and the image repository: ```sql USE ROLE ACCOUNTADMIN; CREATE OR REPLACE NETWORK RULE ALLOW_ALL_RULE TYPE = 'HOST_PORT' MODE = 'EGRESS' VALUE_LIST= ('0.0.0.0:443', '0.0.0.0:80'); CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION ALLOW_ALL_EAI ALLOWED_NETWORK_RULES = (ALLOW_ALL_RULE) ENABLED = true; GRANT USAGE ON INTEGRATION ALLOW_ALL_EAI TO ROLE CONTAINER_USER_ROLE; USE ROLE CONTAINER_USER_ROLE; CREATE COMPUTE POOL IF NOT EXISTS CONTAINER_HOL_POOL MIN_NODES = 1 MAX_NODES = 1 INSTANCE_FAMILY = GPU_NV_S; -- instance with GPU CREATE IMAGE REPOSITORY CONTAINER_HOL_DB.PUBLIC.IMAGE_REPO; SHOW IMAGE REPOSITORIES IN SCHEMA CONTAINER_HOL_DB.PUBLIC; ``` ### Docker image push via SnowCLI The next step in the process is to push the docker image you want to run via the service to the image registry. #### Build Docker image locally For this guide, we build an image that starts from the RAPIDS notebook image and adds some extra Snowflake packages. Create a Dockerfile as follows: ```Dockerfile FROM rapidsai/notebooks:25.12a-cuda12-py3.13 RUN pip install "snowflake-snowpark-python[pandas]" snowflake-connector-python ``` #### NOTE - The `python=3.11` version is the latest supported by the Snowflake connector package. - The use of the `amd64` platform is required by Snowflake. Build the image in the directory where your Dockerfile is located. Notice that no GPU is needed to build this image. ```bash $ docker build --platform=linux/amd64 -t /rapids-nb-snowflake:latest . ``` #### Install SnowCLI Install the SnowCLI following the instructions for your preferred method in the [documentation](https://docs.snowflake.com/en/developer-guide/snowflake-cli/installation/installation). Once installed, configure your Snowflake CLI connection and follow the wizard: #### NOTE When you follow the wizard you will need the `-` (your organization and account names); you can obtain them by running the following in the Snowflake SQL worksheet. ```sql SELECT CURRENT_ORGANIZATION_NAME(); --org SELECT CURRENT_ACCOUNT_NAME(); --account name ``` ```bash $ snow connection add ``` ```bash connection name : CONTAINER_HOL account : - # e.g.
MYORGANIZATION-MYACCOUNT user : password : role: CONTAINER_USER_ROLE warehouse : CONTAINER_HOL_WH database : CONTAINER_HOL_DB schema : public host: port: region: authenticator: username_password_mfa # only needed if MFA and MFA caching are enabled private key file: token file path: ``` Test the connection: ```bash $ snow connection test --connection "CONTAINER_HOL" ``` To be able to push the docker image we need to get the Snowflake registry hostname from the repository url. In a Snowflake SQL worksheet run: ```sql USE ROLE CONTAINER_USER_ROLE; SHOW IMAGE REPOSITORIES IN SCHEMA CONTAINER_HOL_DB.PUBLIC; ``` You will see that the repository url is `org-account.registry.snowflakecomputing.com/container_hol_db/public/image_repo`, where `org-account` refers to your organization and account. The `SNOWFLAKE_REGISTRY_HOSTNAME` is the url up to the `.com`, i.e. `org-account.registry.snowflakecomputing.com`. First we log in to the Snowflake image registry via the terminal: #### NOTE If you have **MFA** activated you will want to allow [client MFA caching](https://docs.snowflake.com/en/user-guide/security-mfa#using-mfa-token-caching-to-minimize-the-number-of-prompts-during-authentication-optional) to reduce the number of prompts that must be acknowledged while connecting and authenticating to Snowflake. To enable this, you need the `ACCOUNTADMIN` system role; in a SQL worksheet run: ```sql ALTER ACCOUNT SET ALLOW_CLIENT_MFA_CACHING = TRUE; ``` and if you are using the Snowflake Connector for Python you need: ```bash $ pip install "snowflake-connector-python[secure-local-storage]" ``` ```bash $ snow spcs image-registry login --connection CONTAINER_HOL ``` We tag and push the image; make sure you replace the repository url with `org-account.registry.snowflakecomputing.com/container_hol_db/public/image_repo`: ```bash $ docker tag /rapids-nb-snowflake:latest /rapids-nb-snowflake:dev ``` Verify that the new tagged image exists by running: ```bash $ docker image list ``` Push the image to Snowflake: ```bash $ docker push /rapids-nb-snowflake:dev ``` #### NOTE This step will take some time; while it completes we can continue with the next step to configure and push the Spec YAML. When the `docker push` command completes, you can verify that the image exists in your Snowflake Image Repository by running the following in the Snowflake SQL worksheet: ```sql USE ROLE CONTAINER_USER_ROLE; CALL SYSTEM$REGISTRY_LIST_IMAGES('/CONTAINER_HOL_DB/PUBLIC/IMAGE_REPO'); ``` ### Configure and Push Spec YAML Snowpark Container Services are defined and configured using YAML files. There is support for many configuration parameters; refer to the [Snowpark container services specification reference](https://docs.snowflake.com/en/developer-guide/snowpark-container-services/specification-reference) for more information.
Locally, create the following file `rapids-snowpark.yaml`: ```yaml spec: containers: - name: rapids-nb-snowpark image: .registry.snowflakecomputing.com/container_hol_db/public/image_repo/rapids-nb-snowflake:dev volumeMounts: - name: rapids-notebooks mountPath: /home/rapids/notebooks/workspace resources: requests: nvidia.com/gpu: 1 limits: nvidia.com/gpu: 1 endpoints: - name: jupyter port: 8888 public: true - name: dask-client port: 8786 protocol: TCP - name: dask-dashboard port: 8787 public: true volumes: - name: rapids-notebooks source: "@volumes/rapids-notebooks" uid: 1001 # rapids user's UID gid: 1000 ``` Notice that we mounted the `@volumes/rapids-notebooks` internal stage location to the `/home/rapids/notebooks/workspace` directory inside our running container. Anything that is added to this directory will persist. We use `snow-cli` to push this `yaml` file: ```bash $ snow stage copy rapids-snowpark.yaml @specs --overwrite --connection CONTAINER_HOL ``` Verify that your `yaml` was pushed properly by running the following SQL in the Snowflake worksheet: ```sql USE ROLE CONTAINER_USER_ROLE; LS @CONTAINER_HOL_DB.PUBLIC.SPECS; ``` ### Create and Test the Service Now that we have successfully pushed the image and the spec YAML, we have all the components in Snowflake to create our service. We only need a service name, a compute pool and the spec file. Run this SQL in the Snowflake worksheet: ```sql USE ROLE CONTAINER_USER_ROLE; CREATE SERVICE CONTAINER_HOL_DB.PUBLIC.rapids_snowpark_service in compute pool CONTAINER_HOL_POOL from @specs specification_file='rapids-snowpark.yaml' external_access_integrations = (ALLOW_ALL_EAI); ``` Run the following to verify that the service is successfully running. ```sql CALL SYSTEM$GET_SERVICE_STATUS('CONTAINER_HOL_DB.PUBLIC.rapids_snowpark_service'); ``` Since we specified the `jupyter` endpoint to be public, Snowflake will generate a url that can be used to access the service via the browser. To get the url, run the following in the Snowflake SQL worksheet: ```sql SHOW ENDPOINTS IN SERVICE RAPIDS_SNOWPARK_SERVICE; ``` Copy the jupyter `ingress_url` into the browser. You will see a Jupyter Lab with a set of notebooks to get started with RAPIDS. ![Screenshot of Jupyter Lab with rapids example notebooks directories.](images/snowflake_jupyter.png) ### Shutdown and Cleanup If you no longer need the service and the compute pool up and running, you can stop the service and suspend the compute pool to avoid incurring any charges. In the Snowflake SQL worksheet run: ```sql USE ROLE CONTAINER_USER_ROLE; ALTER COMPUTE POOL CONTAINER_HOL_POOL STOP ALL; ALTER COMPUTE POOL CONTAINER_HOL_POOL SUSPEND; ``` If you want to clean up completely and remove all of the objects created, run the following: ```sql USE ROLE CONTAINER_USER_ROLE; ALTER COMPUTE POOL CONTAINER_HOL_POOL STOP ALL; ALTER COMPUTE POOL CONTAINER_HOL_POOL SUSPEND; DROP COMPUTE POOL CONTAINER_HOL_POOL; DROP DATABASE CONTAINER_HOL_DB; DROP WAREHOUSE CONTAINER_HOL_WH; USE ROLE ACCOUNTADMIN; DROP ROLE CONTAINER_USER_ROLE; DROP EXTERNAL ACCESS INTEGRATION ALLOW_ALL_EAI; ``` #### Related Examples Getting Started with cuML’s accelerator mode (cuml.accel) in Snowflake Notebooks library/cuml platforms/snowflake Getting Started with cudf.pandas and Snowflake library/cudf platforms/snowflake # index.html.md # Databricks You can install RAPIDS on Databricks in a few different ways: 1. Accelerate machine learning workflows in a single-node GPU notebook environment 2.
Spark users can install [RAPIDS Accelerator for Apache Spark 3.x on Databricks](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/databricks.html) 3. Install Dask alongside Spark and then use libraries like `dask-cudf` for multi-node workloads ## Single-node GPU Notebook environment ### Create init-script To get started, you must first configure an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install RAPIDS libraries and all other dependencies for your project. Databricks recommends using [cluster-scoped](https://docs.databricks.com/en/init-scripts/cluster-scoped.html) init scripts stored in the workspace files. Navigate to the top-left **Workspace** tab and click on your **Home** directory then select **Add** > **File** from the menu. Create an `init.sh` script with contents: ```bash #!/bin/bash set -e # Install RAPIDS libraries pip install \ --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \ "cudf-cu12>=25.12.*,>=0.0.0a0" "cuml-cu12>=25.12.*,>=0.0.0a0" \ "dask-cuda>=25.12.*,>=0.0.0a0" ``` ### Launch cluster To get started, navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose **“Single node”**. ![Screenshot of the Databricks compute page](images/databricks-create-compute.png) In order to launch a GPU node uncheck **Use Photon Acceleration** and select any `15.x`, `16.x` or `17.x` ML LTS runtime with GPU support. For example for long-term support releases you could select the `15.4 LTS ML (includes Apache Spark 3.5.0, GPU, Scala 2.12)` runtime version. The “GPU accelerated” nodes should now be available in the **Node type** dropdown. ![Screenshot of selecting a g4dn.xlarge node type](images/databricks-choose-gpu-node.png) Then expand the **Advanced Options** section, open the **Init Scripts** tab and enter the file path to the init-script in your Workspace directory starting with `/Users//.sh` and click **“Add”**. ![Screenshot of init script path](images/databricks-dask-init-script.png) Select **Create Compute** ### Test RAPIDS Once your cluster has started, you can create a new notebook or open an existing one from the `/Workspace` directory then attach it to your running cluster. ```python import cudf gdf = cudf.DataFrame({"a":[1,2,3],"b":[4,5,6]}) gdf a b 0 1 4 1 2 5 2 3 6 ``` #### Quickstart with cuDF Pandas RAPIDS recently introduced cuDF’s [pandas accelerator mode](https://rapids.ai/cudf-pandas/) to accelerate existing pandas workflows with zero changes to code. Using `cudf.pandas` in Databricks on a single-node can offer significant performance improvements over traditional pandas when dealing with large datasets; operations are optimized to run on the GPU (cuDF) whenever possible, seamlessly falling back to the CPU (pandas) when necessary, with synchronization happening in the background. 
Below is a quick example how to load the `cudf.pandas` extension in a Jupyter notebook: ```python %load_ext cudf.pandas %%time import pandas as pd df = pd.read_parquet( "nyc_parking_violations_2022.parquet", columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"] ) (df[["Registration State", "Violation Description"]] .value_counts() .groupby("Registration State") .head(1) .sort_index() .reset_index() ) ``` Upload the [10 Minutes to RAPIDS cuDF Pandas notebook](https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3) in your single-node Databricks cluster and run through the cells. **NOTE**: cuDF pandas is open beta and under active development. You can [learn more through the documentation](https://docs.rapids.ai/api/cudf/nightly/?_gl=1*1oyfbsi*_ga*MTc5NDYzNzYyNC4xNjgzMDc2ODc2*_ga_RKXFW6CM42*MTcwNTU4NDUyNS4yMC4wLjE3MDU1ODQ1MjUuNjAuMC4w) and the [release blog](https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/). # index.html.md # NVIDIA AI Workbench [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/) is a developer toolkit for data science, machine learning, and AI projects. It lets you develop on your laptop/workstation and then easily transition workloads to scalable GPU resources in a data center or the cloud. AI Workbench is free, you can install it in minutes on both local or remote computers, and offers a desktop application as well as a command-line interface (CLI). ## Installation You can install AI Workbench locally, or on a remote computer that you have SSH access to. Follow the [AI Workbench installation](https://docs.nvidia.com/ai-workbench/user-guide/latest/installation/overview.html) documentation for instructions on installing on different operating systems. ## Configure your system Once you have installed AI Workbench you can launch the desktop application. On first run it will talk you through installing some dependencies if they aren’t available already. Then you will be able to choose between using your local environment or working on a remote system (you can switch between them later very easily). If you wish to configure a remote system click the “Add Remote System” button and enter the configuration information for that system. ![Screenshot of adding a new remote location with a form where you can enter SSH information](_static/images/platforms/nvidia-ai-workbench/add-remote-system-dialog.png) Once configured select the system you wish to use. You will then be greeted with a screen where you can create a new project or clone an existing one. ![Screenshot of ](_static/images/platforms/nvidia-ai-workbench/new-project.png) Select “Start a new project” and give it a name and description. You can also change the default location to store the project files. ![Screenshot of the "Start a new project" button](_static/images/platforms/nvidia-ai-workbench/create-project.png) Then scroll down and select “RAPIDS with CUDA” from the list of templates. ![Screenshot of the template selector with "RAPIDS with CUDA" highlighted](_static/images/platforms/nvidia-ai-workbench/rapids-with-cuda.png) The new project will then be created. AI Workbench will automatically build a container for this project, this may take a few minutes. ![Screenshot of the AI workbench UI. 
In the bottom corner the build status says "Building" and the "Open Jupyterlab" button is greyed out](_static/images/platforms/nvidia-ai-workbench/project-building.png) Once the project has built you can select “Open Jupyterlab” to launch Jupyter in your RAPIDS environment. ![Screenshot of the AI workbench UI. In the bottom corner the build status says "Build Ready" and the "Open Jupyterlab" button is highlighted](_static/images/platforms/nvidia-ai-workbench/open-jupyter.png) Then you can start working with the RAPIDS libraries in your notebooks. ![Screenshot of Jupyterlab running some cudf code to demonstrate that the RAPIDS libraries are available and working](_static/images/platforms/nvidia-ai-workbench/cudf-example.png) ## Further reading For more information and to learn more about what you can do with NVIDIA AI Workbench [see the documentation](https://docs.nvidia.com/ai-workbench/user-guide/latest/overview/introduction.html). # index.html.md # KServe [KServe](https://kserve.github.io/website) is a standard model inference platform built for Kubernetes. It provides consistent interface for multiple machine learning frameworks. In this page, we will show you how to deploy RAPIDS models using KServe. #### NOTE These instructions were tested against KServe v0.10 running on [Kubernetes v1.21](https://kubernetes.io/blog/2021/04/08/kubernetes-1-21-release-announcement/). ## Setting up Kubernetes cluster with GPU access First, you should set up a Kubernetes cluster with access to NVIDIA GPUs. Visit [the Cloud Section](../cloud/index.md) for guidance. ## Installing KServe Visit [Getting Started with KServe](https://kserve.github.io/website/latest/get_started/) to install KServe in your Kubernetes cluster. If you are starting out, we recommend the use of the “Quickstart” script (`quick_install.sh`) provided in the page. On the other hand, if you are setting up a production-grade system, follow direction in [Administration Guide](https://kserve.github.io/website/latest/admin/serverless/serverless) instead. ## Setting up First InferenceService Once KServe is installed, visit [First InferenceService](https://kserve.github.io/website/latest/get_started/first_isvc/) to quickly set up a first inference endpoint. (The example uses the [Support Vector Machine from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) to classify [the Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).) Follow through all the steps carefully and make sure everything works. In particular, you should be able to submit inference requests using cURL. ## Setting up InferenceService with Triton-FIL [The FIL backend for Triton Inference Server](https://github.com/triton-inference-server/fil_backend) (Triton-FIL in short) is an optimized inference runtime for many kinds of tree-based models including: XGBoost, LightGBM, scikit-learn, and cuML RandomForest. We can use Triton-FIL together with KServe and serve any tree-based models. 
The following manifest sets up an inference endpoint using Triton-FIL: ```yaml # triton-fil.yaml apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: name: triton-fil spec: predictor: triton: storageUri: gs://path-to-gcloud-storage-bucket/model-directory runtimeVersion: 22.12-py3 ``` where `model-directory` is set up with the following hierarchy: ```text model-directory/ \__ model/ \__ config.pbtxt \__ 1/ \__ [model file goes here] ``` where `config.pbtxt` contains the configuration for the Triton-FIL backend. A typical `config.pbtxt` is given below, with explanation interspersed as `#` comments. Before use, make sure to remove the `#` comments and fill in the blanks. ```text backend: "fil" max_batch_size: 32768 input [ { name: "input__0" data_type: TYPE_FP32 dims: [ ___ ] # Number of features (columns) in the training data } ] output [ { name: "output__0" data_type: TYPE_FP32 dims: [ 1 ] } ] instance_group [{ kind: KIND_AUTO }] # Triton-FIL will intelligently choose between CPU and GPU parameters [ { key: "model_type" value: { string_value: "_____" } # Can be "xgboost", "xgboost_json", "lightgbm", or "treelite_checkpoint" # See subsections for examples }, { key: "output_class" value: { string_value: "____" } # true (if classifier), or false (if regressor) }, { key: "threshold" value: { string_value: "0.5" } # Threshold for predicting the positive class in a binary classifier } ] dynamic_batching {} ``` We will show you concrete examples below. But first some general notes: - The payload JSON will look different from the First InferenceService example: ```json { "inputs" : [ { "name" : "input__0", "shape" : [ 1, 6 ], "datatype" : "FP32", "data" : [0, 0, 0, 0, 0, 0] } ], "outputs" : [ { "name" : "output__0", "parameters" : { "classification" : 2 } } ] } ``` - Triton-FIL uses the v2 version of the KServe protocol, so make sure to use the `v2` URL when sending inference requests: ```bash $ INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \ -o jsonpath='{.status.loadBalancer.ingress[0].ip}') ``` ```bash $ INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \ -o jsonpath='{.spec.ports[?(@.name=="http2")].port}') ``` ```bash $ SERVICE_HOSTNAME=$(kubectl get inferenceservice -n kserve-test \ -o jsonpath='{.status.url}' | cut -d "/" -f 3) ``` ```bash $ curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \ "http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models//infer" \ -d @./payload.json ``` ### XGBoost To deploy an XGBoost model, save it using the JSON format: ```python import xgboost as xgb clf = xgb.XGBClassifier(...) clf.fit(X, y) clf.save_model("my_xgboost_model.json") # Note the .json extension ``` Rename the model file to `xgboost.json`, as this is the convention used by Triton-FIL. After moving the model file into the model directory, the directory should look like this: ```text model-directory/ \__ model/ \__ config.pbtxt \__ 1/ \__ xgboost.json ``` In `config.pbtxt`, set `model_type="xgboost_json"`. ### cuML RandomForest To deploy a cuML random forest, save it as a Treelite checkpoint file: ```python from cuml.ensemble import RandomForestClassifier as cumlRandomForestClassifier clf = cumlRandomForestClassifier(...) clf.fit(X, y) clf.convert_to_treelite_model().to_treelite_checkpoint("./checkpoint.tl") ``` Rename the checkpoint file to `checkpoint.tl`, as this is the convention used by Triton-FIL.
After moving the model file into the model directory, the directory should look like this: ```text model-directory/ \__ model/ \__ config.pbtxt \__ 1/ \__ checkpoint.tl ``` ### Configuring Triton-FIL Triton-FIL offers many configuration options, and we only showed you a few of them. Please visit [FIL Backend Model Configuration](https://github.com/triton-inference-server/fil_backend/blob/main/docs/model_config.md) to check out the rest. # index.html.md # RAPIDS on Google Colab ## Overview RAPIDS cuDF is preinstalled on Google Colab and instantly accelerates Pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This guide is applicable for users who want to utilize the full suite of the RAPIDS libraries for their workflows. It is broken into two sections: 1. [RAPIDS Quick Install](#colab-quick) - applicable for most users and quickly installs all the RAPIDS Stable packages. 2. [RAPIDS Custom Setup Instructions](#colab-custom) - step by step set up instructions covering the **must haves** for when a user needs to adapt instance to their workflows. In both sections, we will be installing RAPIDS on colab using pip. The pip installation allows users to install libraries like cuDF, cuML, cuGraph, and cuXfilter stable versions in a few minutes. RAPIDS install on Colab strives to be an “always working” solution, and sometimes will **pin** RAPIDS versions to ensure compatibility. ## Section 1: RAPIDS Quick Install ### Links Please follow the links below to our install templates: #### Pip 1. Open the pip template link by clicking this button –> Open In Colab . 2. Click **Runtime** > **Run All**. 3. Wait a few minutes for the installation to complete without errors. 4. Add your code in the cells below the template. ## Section 2: User Customizable RAPIDS Install Instructions ### 1. Launch notebook To get started in [Google Colab](https://colab.research.google.com/), click `File` at the top toolbar to Create new or Upload existing notebook ### 2. Set the Runtime Click the `Runtime` dropdown and select `Change Runtime Type` ![Screenshot of create runtime and runtime type](images/googlecolab-select-runtime-type.png) Choose GPU for Hardware Accelerator ![Screenshot of gpu for hardware accelerator](images/googlecolab-select-gpu-hardware-accelerator.png) ### 3. Check GPU type Check the output of `!nvidia-smi` to make sure you’ve been allocated a Rapids Compatible GPU ([see the RAPIDS install docs](https://docs.rapids.ai/install/#system-req)). ![Screenshot of nvidia-smi](images/googlecolab-output-nvidia-smi.png) ### 4. Install RAPIDS on Colab You can install RAPIDS using pip. The script first checks GPU compatibility with RAPIDS, then installs the latest **stable** versions of some core RAPIDS libraries (e.g. cuDF, cuML, cuGraph, and xgboost) using `pip`. ```bash # Colab warns and provides remediation steps if the GPUs is not compatible with RAPIDS. !git clone https://github.com/rapidsai/rapidsai-csp-utils.git !python rapidsai-csp-utils/colab/pip-install.py ``` ### 5. Test RAPIDS Run the following in a Python cell. ```python import cudf gdf = cudf.DataFrame({"a":[1,2,3], "b":[4,5,6]}) gdf a b 0 1 4 1 2 5 2 3 6 ``` ### 6. Next steps Try a more thorough example of using cuDF on Google Colab, “10 Minutes to RAPIDS cuDF’s pandas accelerator mode (cudf.pandas)” ([Google Colab link](https://nvda.ws/rapids-cudf)). # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. 
single-node # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Virtual Server for VPC ## Create Instance Create a new [Virtual Server (for VPC)](https://www.ibm.com/cloud/virtual-servers) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). 1. Open the [**Virtual Server Dashboard**](https://cloud.ibm.com/vpc-ext/compute/vs). 2. Select **Create**. 3. Give the server a **name** and select your **resource group**. 4. Under **Operating System** choose **Ubuntu Linux**. 5. Under **Profile** select **View all profiles** and select a profile with NVIDIA GPUs. 6. Under **SSH Keys** choose your SSH key. 7. Under network settings create a security group (or choose an existing) that allows SSH access on port `22` and also allow ports `8888,8786,8787` to access Jupyter and Dask. 8. Select **Create Virtual Server**. ## Create floating IP To access the virtual server we need to attach a public IP address. 1. Open [**Floating IPs**](https://cloud.ibm.com/vpc-ext/network/floatingIPs) 2. Select **Reserve**. 3. Give the Floating IP a **name**. 4. Under **Resource to bind** select the virtual server you just created. ## Connect to the instance Next we need to connect to the instance. 1. Open [**Floating IPs**](https://cloud.ibm.com/vpc-ext/network/floatingIPs) 2. Locate the IP you just created and note the address. 3. In your terminal run `ssh root@` #### NOTE For a short guide on launching your instance and accessing it, read the [Getting Started with IBM Virtual Server Documentation](https://cloud.ibm.com/docs/virtual-servers?topic=virtual-servers-getting-started-tutorial). ## Install NVIDIA Drivers Next we need to install the NVIDIA drivers and container runtime. 1. Ensure build essentials are installed `apt-get update && apt-get install build-essential -y`. 2. Install the [NVIDIA drivers](https://www.nvidia.com/Download/index.aspx?lang=en-us). 3. Install [Docker and the NVIDIA Docker runtime](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). ### How do I check everything installed successfully? You can check everything installed correctly by running `nvidia-smi` in a container. ```console $ docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla V100-PCIE... 
Off | 00000000:04:01.0 Off | 0 | | N/A 33C P0 36W / 250W | 0MiB / 16384MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ ``` ## Install RAPIDS There are a selection of methods you can use to install RAPIDS which you can see via the [RAPIDS release selector](https://docs.rapids.ai/install#selector). For this example we are going to run the RAPIDS Docker container so we need to know the name of the most recent container. On the release selector choose **Docker** in the **Method** column. Then copy the commands shown: ```bash docker pull rapidsai/notebooks:25.12a-cuda12-py3.13 docker run --gpus all --rm -it \ --shm-size=1g --ulimit memlock=-1 \ -p 8888:8888 -p 8787:8787 -p 8786:8786 \ rapidsai/notebooks:25.12a-cuda12-py3.13 ``` #### NOTE If you see a “docker socket permission denied” error while running these commands try closing and reconnecting your SSH window. This happens because your user was added to the `docker` group only after you signed in. ## Test RAPIDS To access Jupyter, navigate to `:8888` in the browser. In a Python notebook, check that you can import and use RAPIDS libraries like `cudf`. ```ipython In [1]: import cudf In [2]: df = cudf.datasets.timeseries() In [3]: df.head() Out[3]: id name x y timestamp 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 ``` Open `cudf/10min.ipynb` and execute the cells to explore more of how `cudf` works. When running a Dask cluster you can also visit `:8787` to monitor the Dask cluster status. ### Related Examples HPO with dask-ml and cuml dataset/airline library/numpy library/pandas library/xgboost library/dask library/dask-cuda library/dask-ml library/cuml cloud/aws/ec2 cloud/azure/azure-vm cloud/gcp/compute-engine cloud/ibm/virtual-server library/sklearn data-storage/s3 workflow/hpo # index.html.md # Azure Virtual Machine ## Create Virtual Machine Create a new [Azure Virtual Machine](https://azure.microsoft.com/en-gb/products/virtual-machines/) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). NVIDIA maintains a [Virtual Machine Image (VMI) that pre-installs NVIDIA drivers and container runtimes](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nvidia.ngc_azure_17_11?tab=Overview), we recommend using this image as the starting point. ### via Azure Portal 1. Select a resource group or create one if needed. 2. Select the latest **NVIDIA GPU-Optimized VMI** version from the drop down list, then select **Get It Now** (if there are multiple `Gen` versions, select the latest). 3. If already logged in on Azure, select continue clicking **Create**. 4. In **Create a virtual machine** interface, fill in required information for the vm. - Select a GPU enabled VM size (see [recommended VM types](https://docs.rapids.ai/deployment/stable/cloud/azure/)). - In “Configure security features” select Standard. 
- Make sure you create ssh keys and download them. #### NOTE Not all regions support availability zones with GPU VMs. If a GPU VM size is not selectable and shows the notice **The size is not available in zone x. No zones are supported.**, it means that VM size does not support availability zones. Try other availability options. ![azure-gpuvm-availability-zone-error](_static/azure_availability_zone.PNG) Click **Review+Create** to start the virtual machine. ### via Azure CLI Prepare the following environment variables. | Name | Description | Example | |--------------------|----------------------|----------------------------------------------------------------| | `AZ_VMNAME` | Name for VM | `RapidsAI-V100` | | `AZ_RESOURCEGROUP` | Resource group of VM | `rapidsai-deployment` | | `AZ_LOCATION` | Region of VM | `westus2` | | `AZ_IMAGE` | URN of image | `nvidia:ngc_azure_17_11:ngc-base-version-22_06_0-gen2:22.06.0` | | `AZ_SIZE` | VM Size | `Standard_NC6s_v3` | | `AZ_USERNAME` | User name of VM | `rapidsai` | | `AZ_SSH_KEY` | public ssh key | `~/.ssh/id_rsa.pub` | ```bash $ az vm create \ --name ${AZ_VMNAME} \ --resource-group ${AZ_RESOURCEGROUP} \ --image ${AZ_IMAGE} \ --location ${AZ_LOCATION} \ --size ${AZ_SIZE} \ --admin-username ${AZ_USERNAME} \ --ssh-key-value ${AZ_SSH_KEY} ``` #### NOTE Use `az vm image list --publisher Nvidia --all --output table` to inspect URNs of official NVIDIA images on Azure. #### NOTE See [this link](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/mac-create-ssh-keys) for supported ssh keys on Azure. ## Create Network Security Group Next we need to allow network traffic to the VM so we can access Jupyter and Dask. ### via Azure Portal 1. After creating the VM, select **Go to resource** to access the VM. 2. Select **Networking** -> **Networking Settings** in the left panel. 3. Select **+Create port rule** -> **Add inbound port rule**. 4. Set **Destination port ranges** to `8888,8787`. 5. Modify the “Name” to avoid the `,` or any other symbols. See the example port settings below. ![set-ports-inbound-sec](_static/azure-set-ports-inbound-sec.png) 6. Keep the rest unchanged. Select **Add**. ### via Azure CLI | Name | Description | Example | |------------------|---------------------|----------------------------| | `AZ_NSGNAME` | NSG name for the VM | `${AZ_VMNAME}NSG` | | `AZ_NSGRULENAME` | Name for NSG rule | `Allow-Dask-Jupyter-ports` | ```bash $ az network nsg rule create \ -g ${AZ_RESOURCEGROUP} \ --nsg-name ${AZ_NSGNAME} \ -n ${AZ_NSGRULENAME} \ --priority 1050 \ --destination-port-ranges 8888 8787 ``` ## Install RAPIDS Next, we can SSH into our VM to install RAPIDS. SSH instructions can be found by selecting **Connect** in the left panel. There are a selection of methods you can use to install RAPIDS which you can see via the [RAPIDS release selector](https://docs.rapids.ai/install#selector). For this example we are going to run the RAPIDS Docker container so we need to know the name of the most recent container. On the release selector choose **Docker** in the **Method** column. Then copy the commands shown: ```bash docker pull rapidsai/notebooks:25.12a-cuda12-py3.13 docker run --gpus all --rm -it \ --shm-size=1g --ulimit memlock=-1 \ -p 8888:8888 -p 8787:8787 -p 8786:8786 \ rapidsai/notebooks:25.12a-cuda12-py3.13 ``` #### NOTE If you see a “docker socket permission denied” error while running these commands try closing and reconnecting your SSH window. This happens because your user was added to the `docker` group only after you signed in.
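Before opening Jupyter you can optionally confirm, from Python inside the running container, that the GPU is visible. This is a minimal sketch, assuming the `pynvml` bindings available in the RAPIDS images; it mirrors the `pynvml` check used later in this guide for Dask clusters:

```python
# Minimal GPU visibility check using NVIDIA's NVML bindings (pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle))  # prints the GPU model name
```

If this prints your GPU model, the driver and container runtime are working and you can move on to testing RAPIDS itself.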
## Test RAPIDS To access Jupyter, navigate to `:8888` in the browser. In a Python notebook, check that you can import and use RAPIDS libraries like `cudf`. ```ipython In [1]: import cudf In [2]: df = cudf.datasets.timeseries() In [3]: df.head() Out[3]: id name x y timestamp 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 ``` Open `cudf/10min.ipynb` and execute the cells to explore more of how `cudf` works. When running a Dask cluster you can also visit `:8787` to monitor the Dask cluster status. ### Useful Links - [Using NGC with Azure](https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-azure/index.html) #### Related Examples Measuring Performance with the One Billion Row Challenge tools/dask-cuda data-format/csv library/cudf library/cupy library/dask library/pandas cloud/aws/ec2 cloud/aws/sagemaker cloud/azure/azure-vm cloud/azure/ml cloud/gcp/compute-engine cloud/gcp/vertex-ai HPO with dask-ml and cuml dataset/airline library/numpy library/pandas library/xgboost library/dask library/dask-cuda library/dask-ml library/cuml cloud/aws/ec2 cloud/azure/azure-vm cloud/gcp/compute-engine cloud/ibm/virtual-server library/sklearn data-storage/s3 workflow/hpo # index.html.md # Azure Machine Learning RAPIDS can be deployed at scale using [Azure Machine Learning Service](https://learn.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-machine-learning) and can be scaled up to any size needed. ## Pre-requisites Use existing or create new Azure Machine Learning workspace through the [Azure portal](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=azure-portal#create-a-workspace), [Azure ML Python SDK](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python#create-a-workspace), [Azure CLI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli?tabs=createnewresources) or [Azure Resource Manager templates](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-workspace-template?tabs=azcli). Follow these high-level steps to get started: **1. Create.** Create your Azure Resource Group. **2. Workspace.** Within the Resource Group, create an Azure Machine Learning service Workspace. **3. Quota.** Check your subscription Usage + Quota to ensure you have enough quota within your region to launch your desired compute instance. ## Azure ML Compute instance Although it is possible to install Azure Machine Learning on your local computer, it is recommended to utilize [Azure’s ML Compute instances](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), fully managed and secure development environments that can also serve as a [compute target](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target?view=azureml-api-2) for ML training. The compute instance provides an integrated Jupyter notebook service, JupyterLab, Azure ML Python SDK, CLI, and other essential tools. ### Select your instance Sign in to [Azure Machine Learning Studio](https://ml.azure.com/) and navigate to your workspace on the left-side menu. 
Select **New** > **Compute instance** (Create compute instance) > choose an [Azure RAPIDS compatible GPU](https://docs.rapids.ai/deployment/stable/cloud/azure/) VM size (e.g., `Standard_NC12s_v3`). ![Screenshot of create new notebook with a gpu-instance](images/azureml-create-notebook-instance.png) ### Provision RAPIDS setup script Navigate to the **Applications** section. Choose “Provision with a creation script” to install RAPIDS and dependencies. Put the following in a local file called `rapids-azure-startup.sh`: #### NOTE The script below has `set -e` to avoid silent fails. In case of failure, remove this line from the script; the VM will then boot and you can inspect it by running each line of the script to see where it fails. ```bash #!/bin/bash set -e sudo -u azureuser -i <<'EOF' source /anaconda/etc/profile.d/conda.sh conda create -y -n rapids \ --override-channels \ -c rapidsai-nightly -c conda-forge -c nvidia \ -c microsoft \ rapids=25.12 python=3.13 'cuda-version>=12.0,<=12.9' \ 'azure-identity>=1.19' \ ipykernel conda activate rapids pip install 'azure-ai-ml>=1.24' python -m ipykernel install --user --name rapids echo "kernel install completed" EOF ``` Select `local file`, then `Browse`, and upload that script. ![Screenshot of the provision setup script screen](images/azureml-provision-setup-script.png) Refer to [Azure ML documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-customize-compute-instance) for more details on how to create the setup script. Launch the instance. ### Select the RAPIDS environment Once your Notebook Instance is `Running`, open “JupyterLab” and select the `rapids` kernel when working with a new notebook. ## Azure ML Compute cluster In the next section we will launch Azure’s [ML Compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python) to distribute your RAPIDS training jobs across a cluster of single or multi-GPU compute nodes. The Compute cluster scales up automatically when a job is submitted, and executes in a containerized environment, packaging your model dependencies in a Docker container. ### Instantiate workspace Use Azure’s client libraries to set up some resources. ```python from azure.ai.ml import MLClient from azure.identity import DefaultAzureCredential # Get a handle to the workspace. # # Azure ML places the workspace config at the default working # directory for notebooks by default. # # If it isn't found, open a shell and look in the # directory indicated by 'echo ${JUPYTER_SERVER_ROOT}'. ml_client = MLClient.from_config( credential=DefaultAzureCredential(), path="./config.json", ) ``` ### Create AMLCompute You will need to create a [compute target](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target?view=azureml-api-2#azure-machine-learning-compute-managed) using Azure ML managed compute ([AmlCompute](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-ai-ml/0.1.0b4/azure.ai.ml.entities.html)) for remote training. #### NOTE Be sure to check instance availability and its limits within the region where you created your compute instance. This [article](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas?view=azureml-api-2#azure-machine-learning-compute) includes details on the default limits and how to request more quota. [**size**]: The VM family of the nodes.
Specify from one of **NC_v2**, **NC_v3**, **ND** or **ND_v2** GPU virtual machines (e.g `Standard_NC12s_v3`) [**max_instances**]: The max number of nodes to autoscale up to when you run a job #### NOTE You may choose to use low-priority VMs to run your workloads. These VMs don’t have guaranteed availability but allow you to take advantage of Azure’s unused capacity at a significant cost savings. The amount of available capacity can vary based on size, region, time of day, and more. ```python from azure.ai.ml.entities import AmlCompute gpu_compute = AmlCompute( name="rapids-cluster", type="amlcompute", size="Standard_NC12s_v3", # this VM type needs to be available in your current region max_instances=3, idle_time_before_scale_down=300, # Seconds of idle time before scaling down tier="low_priority", # optional ) ml_client.begin_create_or_update(gpu_compute).result() ``` If you name your cluster `"rapids-cluster"` you can check [https://ml.azure.com/compute/rapids-cluster/details](https://ml.azure.com/compute/rapids-cluster/details) to see the details about your cluster. ### Access Datastore URI A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls&view=azureml-api-2#access-data-from-a-datastore-uri-like-a-filesystem-preview) is a reference to a blob storage location (path) on your Azure account. You can copy-and-paste the datastore URI from the AzureML Studio UI: 1. Select **Data** from the left-hand menu > **Datastores** > choose your datastore name > **Browse** 2. Find the file/folder containing your dataset and click the ellipsis (…) next to it. 3. From the menu, choose **Copy URI** and select **Datastore URI** format to copy into your notebook. ![Screenshot of access datastore uri screen](images/azureml-access-datastore-uri.png) ### Custom RAPIDS Environment To run an AzureML experiment, you must specify an [environment](https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments?view=azureml-api-2) that contains all the necessary software dependencies to run the training script on distributed nodes.
You can define an environment from a [pre-built](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-docker-image) docker image or create your own from a [Dockerfile](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-docker-build-context) or [conda](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-conda-specification) specification file. In a notebook cell, run the following to copy the example code from this documentation into a new folder, and to create a Dockerfile that builds an image starting from a RAPIDS image and installs additional packages needed for the workflow. ```ipython %%bash mkdir -p ./training-code repo_url='https://raw.githubusercontent.com/rapidsai/deployment/refs/heads/main/source/examples' # download training scripts wget -O ./training-code/train_rapids.py "${repo_url}/rapids-azureml-hpo/train_rapids.py" wget -O ./training-code/rapids_csp_azure.py "${repo_url}/rapids-azureml-hpo/rapids_csp_azure.py" touch ./training-code/__init__.py # create a Dockerfile defining the image the code will run in cat > ./training-code/Dockerfile <<EOF FROM rapidsai/base:25.12a-cuda12-py3.13 RUN conda install --yes -c conda-forge 'dask-ml>=2024.4.4' \ && pip install azureml-mlflow EOF ``` Now create the Environment, making sure to label and provide a description: ```python from azure.ai.ml.entities import Environment, BuildContext # NOTE: 'path' should be a filepath pointing to a directory containing a file named 'Dockerfile' env_docker_image = Environment( build=BuildContext(path="./training-code/"), name="rapids-mlflow", # label description="RAPIDS environment with azureml-mlflow", ) ml_client.environments.create_or_update(env_docker_image) ``` ### Submit RAPIDS Training jobs Now that we have our environment and custom logic, we can configure and run the `command` [class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) to submit training jobs. `inputs` is a dictionary of command-line arguments to pass to the training script. ```python from azure.ai.ml import command, Input # replace this with your own dataset datastore_name = "workspaceartifactstore" dataset = "airline_20000000.parquet" data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}" command_job = command( environment=f"{env_docker_image.name}:{env_docker_image.version}", experiment_name="test_rapids_mlflow", code="./training-code", command="python train_rapids.py \ --data_dir ${{inputs.data_dir}} \ --n_bins ${{inputs.n_bins}} \ --cv_folds ${{inputs.cv_folds}} \ --n_estimators ${{inputs.n_estimators}} \ --max_depth ${{inputs.max_depth}} \ --max_features ${{inputs.max_features}}", inputs={ "data_dir": Input(type="uri_file", path=data_uri), "n_bins": 32, "cv_folds": 5, "n_estimators": 50, "max_depth": 10, "max_features": 1.0, }, compute=gpu_compute.name, ) # submit training job returned_job = ml_client.jobs.create_or_update(command_job) returned_job # displays status and details page of the experiment ``` After creating the job, click on the details page provided in the output of `returned_job`, or go to [the “Experiments” page](https://ml.azure.com/experiments) to view logs, metrics, and outputs.
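If you prefer to follow the run from the notebook rather than the Studio UI, a minimal sketch using the `ml_client` and `returned_job` objects created above (the `studio_url` attribute and `jobs.stream()` call come from the `azure-ai-ml` v2 SDK) is:

```python
# Print the link to the job's details page, then block while streaming
# the job's logs into the notebook until it reaches a terminal state.
print(returned_job.studio_url)

ml_client.jobs.stream(returned_job.name)
```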
#### NOTE For reference this job took ~7 min while using `size="Standard_NC6s_v3"` in the `gpu_compute` creation ![Screenshot of job under the test_rapids_mlflow experiment](images/azureml_returned_job_completed.png) Next, we can perform a sweep over a set of hyperparameters. ```python from azure.ai.ml.sweep import Choice, Uniform # define hyperparameter space to sweep over command_job_for_sweep = command_job( n_estimators=Choice(values=range(50, 500)), max_depth=Choice(values=range(5, 19)), max_features=Uniform(min_value=0.2, max_value=1.0), ) # apply hyperparameter sweep_job sweep_job = command_job_for_sweep.sweep( compute=gpu_compute.name, sampling_algorithm="random", primary_metric="Accuracy", goal="Maximize", ) # setting a very small limit of trials for demo purposes sweep_job.set_limits( max_total_trials=3, max_concurrent_trials=3, timeout=18000, trial_timeout=3600 ) # submit job returned_sweep_job = ml_client.create_or_update(sweep_job) returned_sweep_job ``` Once the job is created, click on the details page provided in the output of `returned_sweep_job`, or go to [the “Experiments” page](https://ml.azure.com/experiments) to view logs, metrics, and outputs. The three trials set in the `sweep_job.set_limits(...)` take between 20-40 min to complete when using `size="Standard_NC6s_v3"`. ### Clean Up When you’re done, remove the compute resources. ```python ml_client.compute.begin_delete(gpu_compute.name).wait() ``` Then check [https://ml.azure.com/compute/list/instances](https://ml.azure.com/compute/list/instances) and make sure your compute instance is also stopped, and deleted if desired. # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Azure VM Cluster (via Dask) ## Create a Cluster using Dask Cloud Provider The easiest way to setup a multi-node, multi-GPU cluster on Azure is to use [Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/azure.html). ### 1. Install Dask Cloud Provider Dask Cloud Provider can be installed via `conda` or `pip`. The Azure-specific capabilities will need to be installed via the `[azure]` pip extra. ```shell $ pip install dask-cloudprovider[azure] ``` ### 2. Configure your Azure Resources Set up your [Azure Resource Group](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups), [Virtual Network](https://cloudprovider.dask.org/en/latest/azure.html#virtual-networks), and [Security Group](https://cloudprovider.dask.org/en/latest/azure.html#security-groups) according to [Dask Cloud Provider instructions](https://cloudprovider.dask.org/en/latest/azure.html#authentication). ### 3. Create a Cluster In Python terminal, a cluster can be created using the `dask_cloudprovider` package. The below example creates a cluster with 2 workers in `westus2` with `Standard_NC12s_v3` VMs. The VMs should have at least 100GB of disk space in order to accommodate the RAPIDS container image and related dependencies. ```python from dask_cloudprovider.azure import AzureVMCluster resource_group = "" vnet = "" security_group = "" subscription_id = "" cluster = AzureVMCluster( resource_group=resource_group, vnet=vnet, security_group=security_group, subscription_id=subscription_id, location="westus2", vm_size="Standard_NC12s_v3", public_ingress=True, disk_size=100, n_workers=2, worker_class="dask_cuda.CUDAWorker", docker_image="rapidsai/base:25.12a-cuda12-py3.13", docker_args="-p 8787:8787 -p 8786:8786", ) ``` ### 4. 
To test RAPIDS, create a distributed client for the cluster and query for the GPU model.

```python
from dask.distributed import Client

client = Client(cluster)


def get_gpu_model():
    import pynvml

    pynvml.nvmlInit()
    return pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0))


client.submit(get_gpu_model).result()
```

```shell
Out[5]: b'Tesla V100-PCIE-16GB'
```

### 5. Cleanup

Once done with the cluster, ensure the `cluster` and `client` are closed:

```python
client.close()
cluster.close()
```

#### Related Examples

Multi-Node Multi-GPU XGBoost Example on Azure using dask-cloudprovider

cloud/azure/azure-vm-multi tools/dask-cloudprovider library/cudf library/cuml library/xgboost library/dask library/fil data-storage/azure-data-lake dataset/nyc-taxi workflow/xgboost

# index.html.md

# Azure Kubernetes Service

RAPIDS can be deployed on Azure via the [Azure Kubernetes Service](https://azure.microsoft.com/en-us/products/kubernetes-service/) (AKS).

To run RAPIDS you’ll need a Kubernetes cluster with GPUs available.

## Prerequisites

First you’ll need to have the [`az` CLI tool](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc. for managing Kubernetes.

Ensure you are logged into the `az` CLI.

```bash
$ az login
```

## Create the Kubernetes cluster

Now we can launch a GPU enabled AKS cluster.

```bash
$ az aks create -g -n rapids \
    --enable-managed-identity \
    --node-count 1 \
    --enable-addons monitoring \
    --enable-msi-auth-for-monitoring \
    --generate-ssh-keys
```

Once the cluster has been created, we need to pull the credentials into our local config.

```console
$ az aks get-credentials -g --name rapids
Merged "rapids" as current context in ~/.kube/config
```

Next we need to add an additional node group with GPUs which you can [learn more about in the Azure docs](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster).

#### NOTE

You will need the `GPUDedicatedVHDPreview` feature enabled so that NVIDIA drivers are installed automatically. You can check if this is enabled with:

```console
$ az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/GPUDedicatedVHDPreview')].{Name:name,State:properties.state}"
Name                                                State
-------------------------------------------------   -------------
Microsoft.ContainerService/GPUDedicatedVHDPreview   NotRegistered
```

### If you see NotRegistered follow these instructions

If it is not registered for you, you’ll need to register it, which can take a few minutes.

```console
$ az feature register --name GPUDedicatedVHDPreview --namespace Microsoft.ContainerService
Once the feature 'GPUDedicatedVHDPreview' is registered, invoking 'az provider register -n Microsoft.ContainerService' is required to get the change propagated
Name
-------------------------------------------------
Microsoft.ContainerService/GPUDedicatedVHDPreview
```

Keep checking until it goes into a registered state.
```console
$ az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/GPUDedicatedVHDPreview')].{Name:name,State:properties.state}"
Name                                                State
-------------------------------------------------   -----------
Microsoft.ContainerService/GPUDedicatedVHDPreview   Registered
```

When the status shows as registered, refresh the registration of the `Microsoft.ContainerService` resource provider by using the `az provider register` command:

```bash
$ az provider register --namespace Microsoft.ContainerService
```

Then install the aks-preview CLI extension with the following Azure CLI command:

```bash
$ az extension add --name aks-preview
```

Now add a GPU node pool to the cluster:

```bash
$ az aks nodepool add \
    --resource-group \
    --cluster-name rapids \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC48ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```

Here we have added a new pool made up of `Standard_NC48ads_A100_v4` instances which each have two A100 GPUs. We’ve also enabled autoscaling between one and three nodes on the pool.

Then we can install the NVIDIA drivers.

```bash
$ helm install --wait --generate-name --repo https://helm.ngc.nvidia.com/nvidia \
    -n gpu-operator --create-namespace \
    gpu-operator \
    --set operator.runtimeClass=nvidia-container-runtime
```

Once our new pool has been created and configured, we can test the cluster. Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected.

```bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

```console
$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly.

Next, clean up that Pod.

```console
$ kubectl delete pod cuda-vectoradd
pod "cuda-vectoradd" deleted
```

## Install RAPIDS

Now that you have a GPU enabled Kubernetes cluster on AKS you can install RAPIDS with [any of the supported methods](../../platforms/kubernetes.md).

## Clean up

You can also delete the AKS cluster to stop billing with the following command.

```console
$ az aks delete -g -n rapids
/ Running ..
```

# index.html.md

# Compute Engine Instance

## Create Virtual Machine

Create a new [Compute Engine Instance](https://cloud.google.com/compute/docs/instances) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime).

NVIDIA maintains a [Virtual Machine Image (VMI) that pre-installs NVIDIA drivers and container runtimes](https://console.cloud.google.com/marketplace/product/nvidia-ngc-public/nvidia-gpu-optimized-vmi); we recommend using this image.

1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances).
2. Select **Create Instance**.
3. Select the **Create VM from..** option at the top.
4. Select **Marketplace**.
5. Search for “nvidia” and select **NVIDIA GPU-Optimized VMI**, then select **Launch**.
6.
In the **New NVIDIA GPU-Optimized VMI deployment** interface, fill in the name and any required information for the vm (the defaults should be fine for most users). 7. **Read and accept** the Terms of Service 8. Select **Deploy** to start the virtual machine. ## Allow network access To access Jupyter and Dask we will need to set up some firewall rules to open up some ports. ### Create the firewall rule 1. Open [**VPC Network**](https://console.cloud.google.com/networking/networks/list). 2. Select **Firewall** and **Create firewall rule** 3. Give the rule a name like `rapids` and ensure the network matches the one you selected for the VM. 4. Add a tag like `rapids` which we will use to assign the rule to our VM. 5. Set your source IP range. We recommend you restrict this to your own IP address or your corporate network rather than `0.0.0.0/0` which will allow anyone to access your VM. 6. Under **Protocols and ports** allow TCP connections on ports `22,8786,8787,8888`. ### Assign it to the VM 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 2. Select your VM and press **Edit**. 3. Scroll down to **Networking** and add the `rapids` network tag you gave your firewall rule. 4. Select **Save**. ## Connect to the VM Next we need to connect to the VM. 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 2. Locate your VM and press the **SSH** button which will open a new browser tab with a terminal. 3. **Read and accept** the NVIDIA installer prompts. ## Install RAPIDS There are a selection of methods you can use to install RAPIDS which you can see via the [RAPIDS release selector](https://docs.rapids.ai/install#selector). For this example we are going to run the RAPIDS Docker container so we need to know the name of the most recent container. On the release selector choose **Docker** in the **Method** column. Then copy the commands shown: ```bash docker pull rapidsai/notebooks:25.12a-cuda12-py3.13 docker run --gpus all --rm -it \ --shm-size=1g --ulimit memlock=-1 \ -p 8888:8888 -p 8787:8787 -p 8786:8786 \ rapidsai/notebooks:25.12a-cuda12-py3.13 ``` #### NOTE If you see a “docker socket permission denied” error while running these commands try closing and reconnecting your SSH window. This happens because your user was added to the `docker` group only after you signed in. ## Test RAPIDS To access Jupyter, navigate to `:8888` in the browser. In a Python notebook, check that you can import and use RAPIDS libraries like `cudf`. ```ipython In [1]: import cudf In [2]: df = cudf.datasets.timeseries() In [3]: df.head() Out[3]: id name x y timestamp 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 ``` Open `cudf/10min.ipynb` and execute the cells to explore more of how `cudf` works. When running a Dask cluster you can also visit `:8787` to monitor the Dask cluster status. ## Clean up Once you are finished head back to the [Deployments](https://console.cloud.google.com/dm/deployments) page and delete the marketplace deployment you created. 
### Related Examples Measuring Performance with the One Billion Row Challenge tools/dask-cuda data-format/csv library/cudf library/cupy library/dask library/pandas cloud/aws/ec2 cloud/aws/sagemaker cloud/azure/azure-vm cloud/azure/ml cloud/gcp/compute-engine cloud/gcp/vertex-ai HPO with dask-ml and cuml dataset/airline library/numpy library/pandas library/xgboost library/dask library/dask-cuda library/dask-ml library/cuml cloud/aws/ec2 cloud/azure/azure-vm cloud/gcp/compute-engine cloud/ibm/virtual-server library/sklearn data-storage/s3 workflow/hpo # index.html.md # Vertex AI RAPIDS can be deployed on [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench). ## Create a new Notebook Instance 1. From the Google Cloud UI, navigate to [**Vertex AI**](https://console.cloud.google.com/vertex-ai/workbench/user-managed) -> Notebook -> **Workbench** 2. Select **Instances** and select **+ CREATE NEW**. 3. In the **Details** section give the instance a name. 4. Check the “Attach 1 NVIDIA T4 GPU” option. 5. After customizing any other aspects of the machine you wish, click **CREATE**. ## Install RAPIDS Once the instance has started select **OPEN JUPYTER LAB** and at the top of a notebook install the RAPIDS libraries you wish to use. #### WARNING Installing RAPIDS via `pip` in the default environment is [not currently possible](https://github.com/rapidsai/deployment/issues/517), for now you must create a new `conda` environment. Vertex AI currently ships with CUDA Toolkit 11 system packages as of the [Jan 2025 Vertex AI release](https://cloud.google.com/vertex-ai/docs/release-notes#January_31_2025). The default Python environment also contains the `cupy-cuda12x` package. This means it’s not possible to install RAPIDS package like `cudf` via `pip` as `cudf-cu12` will conflict with the CUDA Toolkit version but `cudf-cu11` will conflict with the `cupy` version. You can find out your current system CUDA Toolkit version by running `ls -ld /usr/local/cuda*`. You can create a new RAPIDS conda environment and register it with `ipykernel` for use in Jupyter Lab. Open a new terminal in Jupyter and run the following commands. ```bash # Create a new environment $ conda create -y -n rapids \ -c rapidsai-nightly -c conda-forge -c nvidia \ rapids=25.12 python=3.13 'cuda-version>=12.0,<=12.9' \ ipykernel ``` ```bash # Activate the environment $ conda activate rapids ``` ```bash # Register the environment with Jupyter $ python -m ipykernel install --prefix "${DL_ANACONDA_HOME}/envs/rapids" --name rapids --display-name rapids ``` Then refresh the Jupyter Lab page and open the launcher. You will see a new “rapids” kernel available. ![Screenshot of the Jupyter Lab launcher showing the RAPIDS kernel](images/vertex-ai-launcher.png) ## Test RAPIDS You should now be able to open a notebook and use RAPIDS. For example we could import and use RAPIDS libraries like `cudf`. 
```ipython In [1]: import cudf In [2]: df = cudf.datasets.timeseries() In [3]: df.head() Out[3]: id name x y timestamp 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 ``` ### Related Examples Measuring Performance with the One Billion Row Challenge tools/dask-cuda data-format/csv library/cudf library/cupy library/dask library/pandas cloud/aws/ec2 cloud/aws/sagemaker cloud/azure/azure-vm cloud/azure/ml cloud/gcp/compute-engine cloud/gcp/vertex-ai # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # Google Kubernetes Engine RAPIDS can be deployed on Google Cloud via the [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine) (GKE). To run RAPIDS you’ll need a Kubernetes cluster with GPUs available. ## Prerequisites First you’ll need to have the [`gcloud` CLI tool](https://cloud.google.com/sdk/gcloud) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc for managing Kubernetes. Ensure you are logged into the `gcloud` CLI. ```bash $ gcloud init ``` ## Create the Kubernetes cluster Now we can launch a GPU enabled GKE cluster. ```bash $ gcloud container clusters create rapids-gpu-kubeflow \ --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \ --zone us-central1-c --release-channel stable ``` With this command, you’ve launched a GKE cluster called `rapids-gpu-kubeflow`. You’ve specified that it should use nodes of type a2-highgpu-2g, each with two A100 GPUs. #### NOTE After creating your cluster, if you get a message saying ```text CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin ``` you will need to install the `gke-gcloud-auth-plugin` to be able to get the credentials. To do so, ```bash $ gcloud components install gke-gcloud-auth-plugin ``` ## Get the cluster credentials ```bash $ gcloud container clusters get-credentials rapids-gpu-kubeflow \ --region=us-central1-c ``` With this command, your `kubeconfig` is updated with credentials and endpoint information for the `rapids-gpu-kubeflow` cluster. ## Install drivers Next, [install the NVIDIA drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) onto each node. ```console $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml daemonset.apps/nvidia-driver-installer created ``` Verify that the NVIDIA drivers are successfully installed. ```console $ kubectl get po -A --watch | grep nvidia kube-system nvidia-gpu-device-plugin-medium-cos-h5kkz 2/2 Running 0 3m42s kube-system nvidia-gpu-device-plugin-medium-cos-pw89w 2/2 Running 0 3m42s kube-system nvidia-gpu-device-plugin-medium-cos-wdnm9 2/2 Running 0 3m42s ``` After your drivers are installed, you are ready to test your cluster. Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected. 
```bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

```console
$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly.

Next, clean up that Pod.

```console
$ kubectl delete pod cuda-vectoradd
pod "cuda-vectoradd" deleted
```

## Install RAPIDS

Now that you have a GPU enabled Kubernetes cluster on GKE you can install RAPIDS with [any of the supported methods](../../platforms/kubernetes.md).

## Clean up

You can also delete the GKE cluster to stop billing with the following command.

```console
$ gcloud container clusters delete rapids-gpu-kubeflow --zone us-central1-c
Deleting cluster rapids...⠼
```

### Related Examples

Autoscaling Multi-Tenant Kubernetes Deep-Dive

cloud/gcp/gke tools/dask-operator library/cuspatial library/dask library/cudf data-format/parquet data-storage/gcs platforms/kubernetes

Perform Time Series Forecasting on Google Kubernetes Engine with NVIDIA GPUs

cloud/gcp/gke tools/dask-operator workflow/hpo workflow/xgboost library/dask library/dask-cuda library/xgboost library/optuna data-storage/gcs platforms/kubernetes

# index.html.md

# Dataproc

RAPIDS can be deployed on Google Cloud Dataproc using Dask. For more details, see our **[detailed instructions and helper scripts.](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids)**

**0. Copy initialization actions to your own Cloud Storage bucket.**

Don’t create clusters that reference initialization actions located in `gs://goog-dataproc-initialization-actions-REGION` public buckets. These scripts are provided as reference implementations and are synchronized with ongoing [GitHub repository](https://github.com/GoogleCloudDataproc/initialization-actions) changes.

It is strongly recommended that you copy the initialization scripts into your own Storage bucket to prevent unintended upgrades from upstream in the cluster:

```bash
$ REGION=
```

```bash
$ GCS_BUCKET=
```

```bash
$ gcloud storage buckets create gs://$GCS_BUCKET
```

```bash
$ gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh gs://$GCS_BUCKET
```

```bash
$ gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/dask/dask.sh gs://$GCS_BUCKET
```

```bash
$ gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh gs://$GCS_BUCKET
```

**1. Create Dataproc cluster with Dask RAPIDS.**

Use the gcloud command to create a new cluster. Because of an Anaconda version conflict, script deployment on older images is slow; we recommend using Dask with Dataproc 2.0+.

#### WARNING

At the time of writing [Dataproc only supports RAPIDS version 23.12 and earlier with CUDA<=11.8 and Ubuntu 18.04](https://github.com/GoogleCloudDataproc/initialization-actions/issues/1137). Please ensure that your setup complies with this compatibility requirement. Using newer RAPIDS versions may result in unexpected behavior or errors.
```bash $ CLUSTER_NAME= ``` ```bash $ DASK_RUNTIME=yarn ``` ```bash $ RAPIDS_VERSION=23.12 ``` ```bash $ CUDA_VERSION=11.8 ``` ```bash $ gcloud dataproc clusters create $CLUSTER_NAME\ --region $REGION\ --image-version 2.0-ubuntu18\ --master-machine-type n1-standard-32\ --master-accelerator type=nvidia-tesla-t4,count=2\ --worker-machine-type n1-standard-32\ --worker-accelerator type=nvidia-tesla-t4,count=2\ --initialization-actions=gs://$GCS_BUCKET/install_gpu_driver.sh,gs://$GCS_BUCKET/dask.sh,gs://$GCS_BUCKET/rapids.sh\ --initialization-action-timeout 60m\ --optional-components=JUPYTER\ --metadata gpu-driver-provider=NVIDIA,dask-runtime=$DASK_RUNTIME,rapids-runtime=DASK,rapids-version=$RAPIDS_VERSION,cuda-version=$CUDA_VERSION\ --enable-component-gateway ``` [GCS_BUCKET] = name of the bucket to use.
[CLUSTER_NAME] = name of the cluster.
[REGION] = name of region where cluster is to be created.
[DASK_RUNTIME] = Dask runtime; can be set to either `yarn` or `standalone`.

**2. Run Dask RAPIDS Workload.**

Once the cluster has been created, the Dask scheduler listens for workers on `port 8786`, and its status dashboard is on `port 8787` on the Dataproc master node.

To connect to the Dask web interface, you will need to create an SSH tunnel as described in the [Dataproc web interfaces documentation.](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) You can also connect using the Dask Client Python API from a Jupyter notebook, or from a Python script or interpreter session.

# index.html.md

# SageMaker

RAPIDS can be used in a few ways with [AWS SageMaker](https://aws.amazon.com/sagemaker/).

## SageMaker Notebooks

To get started head to [the SageMaker console](https://console.aws.amazon.com/sagemaker/) and create a [new SageMaker Notebook Instance](https://console.aws.amazon.com/sagemaker/home#/notebook-instances/create). Choose `Applications and IDEs > Notebooks > Create notebook instance`.

### Select your instance

If a field is not mentioned below, leave the default values:

- **Notebook instance name** = Name of the notebook instance
- **Notebook instance type** = Type of notebook instance. Select a RAPIDS-compatible GPU ([see the RAPIDS docs](https://docs.rapids.ai/install#system-req)) as the SageMaker Notebook instance type (e.g., `ml.p3.2xlarge`).
- **Platform identifier** = ‘Amazon Linux 2, Jupyter Lab 4’

![Screenshot of the create new notebook screen with a ml.p3.2xlarge selected](images/sagemaker-create-notebook-instance.png)

### Create a RAPIDS lifecycle configuration

[SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) can be augmented with a RAPIDS conda environment.

We can add a RAPIDS conda environment to the set of Jupyter ipython kernels available in our SageMaker notebook instance by installing it in a [lifecycle configuration script](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html).

Create a new lifecycle configuration (via the ‘Additional Configuration’ dropdown).

![Screenshot of the create lifecycle configuration screen](images/sagemaker-create-lifecycle-configuration.png)

Give your configuration a name like `rapids` and paste the following script into the “start notebook” script.

```bash
#!/bin/bash

set -e

sudo -u ec2-user -i <<'EOF'

mamba create -y -n rapids -c rapidsai -c conda-forge -c nvidia rapids=24.12 python=3.12 cuda-version=12.4 \
    boto3 \
    ipykernel \
    'sagemaker-python-sdk>=2.239.0'

conda activate rapids

python -m ipykernel install --user --name rapids

echo "kernel install completed"
EOF
```

#### WARNING

RAPIDS `>24.12` will not be installable on SageMaker Notebook Instances until those instances support Amazon Linux 2023 or other Linux distributions with GLIBC of at least 2.28. For more details, see [rapidsai/deployment#520](https://github.com/rapidsai/deployment/issues/520).

Set the volume size to at least `15GB`, to accommodate the conda environment.

Then launch the instance.

### Select the RAPIDS environment

Once your Notebook Instance is `InService` select “Open JupyterLab”.

#### NOTE

If you see `Pending` to the right of the notebook instance in the Status column, your notebook is still being created. The status will change to `InService` when the notebook is ready for use.

Then in Jupyter select the `rapids` kernel when working with a new notebook.
![Screenshot of Jupyter with the rapids kernel highlighted](images/sagemaker-choose-rapids-kernel.png)

### Run the Example Notebook

Once inside JupyterLab you should be able to upload the [Running RAPIDS hyperparameter experiments at scale](../../examples/rapids-sagemaker-higgs/notebook.md) example notebook and continue following those instructions.

## SageMaker Estimators

RAPIDS can also be used in [SageMaker Estimators](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). Estimators allow you to launch training jobs on ephemeral VMs which SageMaker manages for you. With this option, your Notebook Instance doesn’t need to have a GPU; you are only charged for GPU instances for the time that your training job is running.

All you’ll need to do is bring in your RAPIDS training script and libraries as a Docker container image and ask Amazon SageMaker to run copies of it in parallel on a specified number of GPU instances.

Let’s take a closer look at how this works through a step-by-step approach:

- Training script should accept hyperparameters as command line arguments (a minimal sketch of such a script appears at the end of this section). Starting with the base RAPIDS container (pulled from [Docker Hub](https://hub.docker.com/u/rapidsai)), use a `Dockerfile` to augment it by copying your training code and set `WORKDIR` path to the code.
- Install the [sagemaker-training toolkit](https://github.com/aws/sagemaker-training-toolkit) to make the container compatible with SageMaker. Add other packages as needed for your workflow, e.g. python, flask (model serving), dask-ml, etc.
- Push the image to a container registry (ECR).
- Having built our container and custom logic, we can now assemble all components into an Estimator. We can now test the Estimator and run parallel hyperparameter optimization tuning jobs.

Estimators follow an API roughly like this:

```python
# set up configuration for the estimator
estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_type=instance_type,
    instance_count=instance_count,
    input_mode=input_mode,
    output_path=output_path,
    use_spot_instances=use_spot_instances,
    max_run=86400,
    sagemaker_session=sagemaker_session,
)

# launch a single remote training job
estimator.fit(inputs=s3_data_input, job_name=job_name)

# set up configuration for HyperparameterTuner
hpo = sagemaker.tuner.HyperparameterTuner(
    estimator=estimator,
    metric_definitions=metric_definitions,
    objective_metric_name=objective_metric_name,
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy=strategy,
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
)

# launch multiple training jobs (one per combination of hyperparameters)
hpo.fit(inputs=s3_data_input, job_name=tuning_job_name, wait=True, logs="All")
```

For a hands-on demo of this, try [“Deep Dive into running Hyper Parameter Optimization on AWS SageMaker”](../../examples/rapids-sagemaker-higgs/notebook.md).

## Further reading

We’ve also written a **[detailed blog post](https://medium.com/rapids-ai/running-rapids-experiments-at-scale-using-amazon-sagemaker-d516420f165b)** on how to use SageMaker with RAPIDS.
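As a concrete illustration of the first bullet in the Estimators walkthrough above, here is a minimal, hypothetical sketch of a training script entry point: it reads hyperparameters from command-line flags and loads data from the directory SageMaker mounts for the `training` input channel (exposed via the standard `SM_CHANNEL_TRAINING` environment variable). The flag names, file name, label column, and model choice are illustrative only and are not taken from the example notebook.

```python
import argparse
import os

import cudf
from cuml.ensemble import RandomForestClassifier


def parse_args():
    # hyperparameters arrive as command-line flags, e.g. "--n_estimators 100"
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=16)
    # SageMaker mounts the "training" channel inside the container and
    # exposes its path through SM_CHANNEL_TRAINING
    parser.add_argument(
        "--data_dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    return parser.parse_args()


def main():
    args = parse_args()
    # hypothetical file and label column; point these at your own dataset
    df = cudf.read_csv(os.path.join(args.data_dir, "train.csv"))
    X, y = df.drop(columns=["label"]), df["label"]
    model = RandomForestClassifier(
        n_estimators=args.n_estimators, max_depth=args.max_depth
    )
    model.fit(X, y)
    print("training complete")


if __name__ == "__main__":
    main()
```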
### Related Examples Running RAPIDS Hyperparameter Experiments at Scale on Amazon SageMaker cloud/aws/sagemaker workflow/hpo library/cudf library/cuml library/scikit-learn data-format/csv data-storage/s3 Measuring Performance with the One Billion Row Challenge tools/dask-cuda data-format/csv library/cudf library/cupy library/dask library/pandas cloud/aws/ec2 cloud/aws/sagemaker cloud/azure/azure-vm cloud/azure/ml cloud/gcp/compute-engine cloud/gcp/vertex-ai Deep Dive into Running Hyper Parameter Optimization on AWS SageMaker cloud/aws/sagemaker workflow/hpo library/xgboost library/cuml library/cupy library/cudf library/dask data-storage/s3 data-format/parquet # index.html.md # Elastic Container Service (ECS) RAPIDS can be deployed on a multi-node ECS cluster using Dask’s dask-cloudprovider management tools. For more details, see our **[blog post on deploying on ECS.](https://medium.com/rapids-ai/getting-started-with-rapids-on-aws-ecs-using-dask-cloud-provider-b1adfdbc9c6e)** ## Run from within AWS The following steps assume you are running from within the same AWS VPC. One way to ensure this is to use [AWS EC2 Single Instance](https://docs.rapids.ai/deployment/stable/cloud/aws/ec2.html) as your development environment. ### Setup AWS credentials First, you will need AWS credentials to interact with the AWS CLI. If someone else manages your AWS account, you will need to get these keys from them.
You can provide these credentials to dask-cloudprovider in a number of ways, but the easiest is to set up your local environment using the AWS command line tools:

```shell
$ pip install awscli
$ aws configure
```

### Install dask-cloudprovider

To install, you will need to run the following:

```shell
$ pip install dask-cloudprovider[aws]
```

## Create an ECS cluster

In the AWS console, visit the ECS dashboard and on the left-hand side, click “Clusters”, then **Create Cluster**.

Give the cluster a name, e.g. `rapids-cluster`.

For Networking, select the default VPC and all the subnets available in that VPC.

Select “Amazon EC2 instances” for the Infrastructure type and configure your settings:

- Operating system: must be Linux-based architecture
- EC2 instance type: must support RAPIDS-compatible GPUs ([see the RAPIDS docs](https://docs.rapids.ai/install#system-req))
- Desired capacity: maximum number of instances to launch (the default maximum is 5)
- SSH Key pair

Review your settings, then click on the “Create” button and wait for the cluster creation to complete.

## Create a Dask cluster

Get the Amazon Resource Name (ARN) for the cluster you just created.

Set the `AWS_REGION` environment variable to your **[default region](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions)**, for instance `us-east-1`:

```shell
AWS_REGION=[REGION]
```

Create the ECSCluster object in your Python session:

```python
from dask_cloudprovider.aws import ECSCluster

cluster = ECSCluster(
    cluster_arn="",
    n_workers=,
    worker_gpu=,
    skip_cleanup=True,
    scheduler_timeout="20 minutes",
)
```

#### NOTE

When you call this command for the first time, `ECSCluster()` will automatically create a **security group** with the same name as the ECS cluster you created above. However, if the Dask cluster creation fails or you’d like to reuse the same ECS cluster for subsequent runs of `ECSCluster()`, then you will need to provide this security group value.

```shell
security_groups=["sg-0fde781be42651"]
```

[**cluster_arn**] = ARN of an existing ECS cluster to use for launching tasks
[**n_workers**] = number of workers to start on cluster creation
[**worker_gpu**] = number of GPUs to expose to each worker; this must be less than or equal to the number of GPUs in the instance type you selected for the ECS cluster (e.g. `1` for `p3.2xlarge`).
[**skip_cleanup**] = if True, Dask workers won’t be automatically terminated when cluster is shut down
[**execution_role_arn**] = ARN of the IAM role that allows the Dask cluster to create and manage ECS resources
[**task_role_arn**] = ARN of the IAM role that the Dask workers assume when they run
[**scheduler_timeout**] = maximum time scheduler will wait for workers to connect to the cluster ## Test RAPIDS Create a distributed client for our cluster: ```python from dask.distributed import Client client = Client(cluster) ``` Load sample data and test the cluster! ```python import dask, cudf, dask_cudf ddf = dask.datasets.timeseries() gdf = ddf.map_partitions(cudf.from_pandas) gdf.groupby("name").id.count().compute().head() ``` ```shell Out[34]: Xavier 99495 Oliver 100251 Charlie 99354 Zelda 99709 Alice 100106 Name: id, dtype: int64 ``` ## Cleanup You can scale down or delete the Dask cluster, but the ECS cluster will continue to run (and incur charges!) until you also scale it down or shut down altogether.
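Shutting down the Dask side can be done from the same Python session using the `cluster` and `client` objects created above. This is a minimal sketch; `scale()` and `close()` are standard Dask cluster-manager methods rather than anything ECS-specific, and the underlying ECS cluster still has to be scaled down or deleted in the AWS console.

```python
# remove all Dask workers but keep the Dask scheduler task running
cluster.scale(0)

# or shut the Dask cluster down entirely when you are finished
client.close()
cluster.close()
```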
If you are planning to use the ECS cluster again soon, it is probably preferable to reduce the nodes to zero.

# index.html.md

# AWS Elastic Kubernetes Service (EKS)

RAPIDS can be deployed on AWS via the [Elastic Kubernetes Service](https://aws.amazon.com/eks/) (EKS).

To run RAPIDS you’ll need a Kubernetes cluster with GPUs available.

## Prerequisites

First you’ll need to have the [`aws` CLI tool](https://aws.amazon.com/cli/) and [`eksctl` CLI tool](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc. for managing Kubernetes.

Ensure you are logged into the `aws` CLI.

```bash
$ aws configure
```

## Create the Kubernetes cluster

Now we can launch a GPU enabled EKS cluster with `eksctl`.

#### NOTE

1. You will need to create or import a public SSH key to be able to execute the following command. In your AWS console, under `EC2`, in the side panel under Network & Security > Key Pairs, you can create a key pair or import (see the “Actions” dropdown) one you’ve created locally.
2. If you are not using your default AWS profile, add `--profile ` to the following command.
3. The `--ssh-public-key` argument is the name assigned during creation of your key in the AWS console.

```bash
$ eksctl create cluster rapids \
    --version 1.30 \
    --nodes 3 \
    --node-type=g4dn.xlarge \
    --timeout=40m \
    --ssh-access \
    --ssh-public-key \
    --region us-east-1 \
    --zones=us-east-1c,us-east-1b,us-east-1d \
    --auto-kubeconfig
```

With this command, you’ve launched an EKS cluster called `rapids`. You’ve specified that it should use nodes of type `g4dn.xlarge`.

To access the cluster we need to pull down the credentials. Add `--profile ` if you are not using the default profile.

```bash
$ aws eks --region us-east-1 update-kubeconfig --name rapids
```

## Install drivers

As we selected a GPU node type, EKS will automatically install drivers for us. We can verify this by listing the NVIDIA driver plugin Pods.

```console
$ kubectl get po -n kube-system -l name=nvidia-device-plugin-ds
NAME                                    READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-daemonset-kv7t5    1/1     Running   0          52m
nvidia-device-plugin-daemonset-rhmvx    1/1     Running   0          52m
nvidia-device-plugin-daemonset-thjhc    1/1     Running   0          52m
```

#### NOTE

By default this plugin will install the latest version of the NVIDIA drivers on every Node. If you need more control over your driver installation we recommend that when creating your cluster you set `eksctl create cluster --install-nvidia-plugin=false ...` and then install drivers yourself using the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html).

After you have confirmed your drivers are installed, you are ready to test your cluster. Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected.
```bash cat << EOF | kubectl create -f - apiVersion: v1 kind: Pod metadata: name: cuda-vectoradd spec: restartPolicy: OnFailure containers: - name: cuda-vectoradd image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04" resources: limits: nvidia.com/gpu: 1 EOF ``` ```console $ kubectl logs pod/cuda-vectoradd [Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device CUDA kernel launch with 196 blocks of 256 threads Copy output data from the CUDA device to the host memory Test PASSED Done ``` If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly. Next, clean up that Pod. ```console $ kubectl delete pod cuda-vectoradd pod "cuda-vectoradd" deleted ``` ## Install RAPIDS Now that you have a GPU enabled Kubernetes cluster on EKS you can install RAPIDS with [any of the supported methods](../../platforms/kubernetes.md). ## Clean up You can also delete the EKS cluster to stop billing with the following command. ```console $ eksctl delete cluster --region=us-east-1 --name=rapids Deleting cluster rapids...⠼ ``` # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # EC2 Cluster (via Dask) To launch a multi-node cluster on AWS EC2 we recommend you use [Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/), a native cloud integration for Dask. It helps manage Dask clusters on different cloud platforms. ## Local Environment Setup Before running these instructions, ensure you have installed RAPIDS. #### NOTE This method of deploying RAPIDS effectively allows you to burst beyond the node you are on into a cluster of EC2 VMs. This does come with the caveat that you are on a RAPIDS capable environment with GPUs. If you are using a machine with an NVIDIA GPU then follow the [local install instructions](https://docs.rapids.ai/install). Alternatively if you do not have a GPU locally consider using a remote environment like a [SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html). ### Install the AWS CLI Install the AWS CLI tools following the [official instructions](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). ### Install Dask Cloud Provider Also install `dask-cloudprovider` and ensure you select the `aws` optional extras. ```bash $ pip install "dask-cloudprovider[aws]" ``` ## Cluster setup We’ll now setup the [EC2Cluster](https://cloudprovider.dask.org/en/latest/aws.html#elastic-compute-cloud-ec2) from Dask Cloud Provider. To do this, you’ll first need to run `aws configure` and ensure the credentials are updated. [Learn more about the setup](https://cloudprovider.dask.org/en/latest/aws.html#authentication). The API also expects a security group that allows access to ports 8786-8787 and all traffic between instances in the security group. If you do not pass a group here, `dask-cloudprovider` will create one for you. ```python from dask_cloudprovider.aws import EC2Cluster cluster = EC2Cluster( instance_type="g4dn.12xlarge", # 4 T4 GPUs docker_image="rapidsai/base:25.12a-cuda12-py3.13", worker_class="dask_cuda.CUDAWorker", worker_options={"rmm-managed-memory": True}, security_groups=[""], docker_args="--shm-size=256m", n_workers=3, security=False, availability_zone="us-east-1a", region="us-east-1", ) ``` #### WARNING Instantiating this class can take upwards of 30 minutes. 
See the [Dask docs](https://cloudprovider.dask.org/en/latest/packer.html) on prebuilding AMIs to speed this up. ### If you have non-default credentials you may need to pass your credentials manually. Here’s a small utility for parsing credential profiles. ```python import os import configparser import contextlib def get_aws_credentials(*, aws_profile="default"): parser = configparser.RawConfigParser() parser.read(os.path.expanduser("~/.aws/config")) config = parser.items( f"profile {aws_profile}" if aws_profile != "default" else "default" ) parser.read(os.path.expanduser("~/.aws/credentials")) credentials = parser.items(aws_profile) all_credentials = {key.upper(): value for key, value in [*config, *credentials]} with contextlib.suppress(KeyError): all_credentials["AWS_REGION"] = all_credentials.pop("REGION") return all_credentials ``` ```python cluster = EC2Cluster(..., env_vars=get_aws_credentials(aws_profile="foo")) ``` ## Connecting a client Once your cluster has started you can connect a Dask client to submit work. ```python from dask.distributed import Client client = Client(cluster) ``` ```python import cudf import dask_cudf df = dask_cudf.from_cudf(cudf.datasets.timeseries(), npartitions=2) df.x.mean().compute() ``` ## Clean up When you create your cluster Dask Cloud Provider will register a finalizer to shutdown the cluster. So when your Python process exits the cluster will be cleaned up. You can also explicitly shutdown the cluster with: ```python client.close() cluster.close() ``` ### Related Examples Multi-node Multi-GPU Example on AWS using dask-cloudprovider cloud/aws/ec2-multi library/cuml library/dask library/numpy library/dask-ml library/cudf workflow/randomforest tools/dask-cloudprovider data-format/csv data-storage/gcs # index.html.md # Elastic Compute Cloud (EC2) ## Create Instance Create a new [EC2 Instance](https://aws.amazon.com/ec2/) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). NVIDIA maintains an [Amazon Machine Image (AMI) that pre-installs NVIDIA drivers and container runtimes](https://aws.amazon.com/marketplace/pp/prodview-7ikjtg3um26wq), we recommend using this image as the starting point. ### via AWS Console 1. Open the [**EC2 Dashboard**](https://console.aws.amazon.com/ec2/home). 2. Select **Launch Instance**. 3. In the AMI selection box search for “nvidia”, then switch to the **AWS Marketplace AMIs** tab. 4. Select **NVIDIA GPU-Optimized AMI** and click “Select”. Then, in the new popup, select **Subscribe on Instance Launch**. 5. In **Key pair** select your SSH keys (create these first if you haven’t already). 6. Under network settings create a security group (or choose an existing) with inbound rules that allows SSH access on port `22` and also allow ports `8888,8786,8787` to access Jupyter and Dask. For outbound rules, allow all traffic. 7. Select **Launch**. ### via AWS CLI 1. Set the following environment variables first. Edit any of them to match your preferred region, instance type, or naming convention. ```bash REGION=us-east-1 INSTANCE_TYPE=g5.xlarge KEY_NAME=rapids-ec2-key SG_NAME=rapids-ec2-sg VM_NAME=rapids-ec2 ``` 2. Accept the NVIDIA Marketplace subscription before using the AMI: open the [NVIDIA GPU-Optimized AMI listing](https://aws.amazon.com/marketplace/pp/prodview-7ikjtg3um26wq), choose **Continue to Subscribe**, then select **Accept Terms**. Wait for the status to show as active. 3. 
Find the most recent NVIDIA Marketplace AMI ID in `us-east-1`.

```bash
AMI_ID=$(aws ec2 describe-images \
    --region "$REGION" \
    --filters "Name=name,Values=*NVIDIA*VMI*Base*" "Name=state,Values=available" \
    --query 'Images | sort_by(@, &CreationDate)[-1].ImageId' \
    --output text)
echo "$AMI_ID"
```

4. Create an SSH key pair and secure it locally (if you already have a key, update `KEY_NAME` and skip this step).

```bash
aws ec2 create-key-pair --region "$REGION" --key-name "$KEY_NAME" \
    --query 'KeyMaterial' --output text > "${KEY_NAME}.pem"
chmod 400 "${KEY_NAME}.pem"
```

5. Create a security group that allows SSH on `22` plus the Jupyter (`8888`) and Dask (`8786`, `8787`) ports, and keep outbound traffic open. Replace `ALLOWED_CIDR` with something more restrictive if you want to limit inbound access. Use `ALLOWED_CIDR="$(curl ifconfig.co)/32"` to restrict access to your current IP address.

```bash
ALLOWED_CIDR=0.0.0.0/0
```

```bash
VPC_ID=$(aws ec2 describe-vpcs \
    --region "$REGION" \
    --filters Name=isDefault,Values=true \
    --query 'Vpcs[0].VpcId' \
    --output text)
echo "$VPC_ID"

SG_ID=$(aws ec2 create-security-group \
    --region "$REGION" \
    --group-name "$SG_NAME" \
    --description "RAPIDS EC2 security group" \
    --vpc-id "$VPC_ID" \
    --query 'GroupId' \
    --output text)
echo "$SG_ID"

SUBNET_ID=$(aws ec2 describe-subnets \
    --region "$REGION" \
    --filters "Name=vpc-id,Values=$VPC_ID" \
    --query 'Subnets[0].SubnetId' \
    --output text)
echo "$SUBNET_ID"
```

```bash
aws ec2 authorize-security-group-ingress --region "$REGION" --group-id "$SG_ID" \
    --protocol tcp --port 22 --no-cli-pager --cidr "$ALLOWED_CIDR"
aws ec2 authorize-security-group-ingress --region "$REGION" --group-id "$SG_ID" \
    --protocol tcp --port 8888 --no-cli-pager --cidr "$ALLOWED_CIDR"
aws ec2 authorize-security-group-ingress --region "$REGION" --group-id "$SG_ID" \
    --protocol tcp --port 8786 --no-cli-pager --cidr "$ALLOWED_CIDR"
aws ec2 authorize-security-group-ingress --region "$REGION" --group-id "$SG_ID" \
    --protocol tcp --port 8787 --no-cli-pager --cidr "$ALLOWED_CIDR"
```

6. Launch an EC2 instance with the NVIDIA AMI.

```bash
INSTANCE_ID=$(aws ec2 run-instances \
    --region "$REGION" \
    --image-id "$AMI_ID" \
    --count 1 \
    --instance-type "$INSTANCE_TYPE" \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SG_ID" \
    --subnet-id "$SUBNET_ID" \
    --associate-public-ip-address \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$VM_NAME}]" \
    --query 'Instances[0].InstanceId' \
    --output text)
echo "$INSTANCE_ID"
```

## Connect to the instance

Next we need to connect to the instance.

### via AWS Console

1. Open the [**EC2 Dashboard**](https://console.aws.amazon.com/ec2/home).
2. Locate your VM and note the **Public IP Address**.
3. In your terminal run `ssh ubuntu@`.

#### NOTE

If you use the AWS Console, please use the default `ubuntu` user to ensure the NVIDIA driver installs on the first boot.

### via AWS CLI

1. Wait for the instance to pass health checks.

```bash
aws ec2 wait instance-status-ok --region "$REGION" --instance-ids "$INSTANCE_ID"
```

2. Retrieve the public IP address and use it to connect via SSH.

```bash
PUBLIC_IP=$(aws ec2 describe-instances \
    --region "$REGION" \
    --instance-ids "$INSTANCE_ID" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' \
    --output text)
echo "$PUBLIC_IP"
```

3. Connect over SSH using the key created earlier.

```bash
ssh -i "${KEY_NAME}.pem" ubuntu@"$PUBLIC_IP"
```

#### NOTE

If you see `WARNING: UNPROTECTED PRIVATE KEY FILE!`, run `chmod 400 rapids-ec2-key.pem` before retrying.
## Install RAPIDS There are a selection of methods you can use to install RAPIDS which you can see via the [RAPIDS release selector](https://docs.rapids.ai/install#selector). For this example we are going to run the RAPIDS Docker container so we need to know the name of the most recent container. On the release selector choose **Docker** in the **Method** column. Then copy the commands shown: ```bash docker pull rapidsai/notebooks:25.12a-cuda12-py3.13 docker run --gpus all --rm -it \ --shm-size=1g --ulimit memlock=-1 \ -p 8888:8888 -p 8787:8787 -p 8786:8786 \ rapidsai/notebooks:25.12a-cuda12-py3.13 ``` #### NOTE If you see a “docker socket permission denied” error while running these commands try closing and reconnecting your SSH window. This happens because your user was added to the `docker` group only after you signed in. #### NOTE If you see a “modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-1011-aws” while first connecting to the EC2 instance, try logging out and reconnecting again. ## Test RAPIDS To access Jupyter, navigate to `:8888` in the browser. In a Python notebook, check that you can import and use RAPIDS libraries like `cudf`. ```ipython In [1]: import cudf In [2]: df = cudf.datasets.timeseries() In [3]: df.head() Out[3]: id name x y timestamp 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 ``` Open `cudf/10min.ipynb` and execute the cells to explore more of how `cudf` works. When running a Dask cluster you can also visit `:8787` to monitor the Dask cluster status. ## Clean up ### via AWS Console 1. In the **EC2 Dashboard**, select your instance, choose **Instance state** → **Terminate**, and confirm. 2. Under **Key Pairs**, delete the key pair if you generated one and you no longer need it. 3. Under **Security Groups**, find the group you created (for example `rapids-ec2-sg`), choose **Actions** → **Delete security group**. ### via AWS CLI 1. Terminate the instance and wait until it is fully shut down. ```bash aws ec2 terminate-instances --region "$REGION" --instance-ids "$INSTANCE_ID" --no-cli-pager aws ec2 wait instance-terminated --region "$REGION" --instance-ids "$INSTANCE_ID" ``` 2. Delete the key pair and remove the local `.pem` file if you created it just for this guide. ```bash aws ec2 delete-key-pair --region "$REGION" --key-name "$KEY_NAME" rm -f "${KEY_NAME}.pem" ``` 3. Delete the security group. 
```bash aws ec2 delete-security-group --region "$REGION" --group-id "$SG_ID" ``` ### Related Examples HPO Benchmarking with RAPIDS and Dask cloud/aws/ec2 data-storage/s3 workflow/randomforest workflow/hpo workflow/xgboost library/dask library/dask-cuda library/xgboost library/optuna library/sklearn library/dask-ml Measuring Performance with the One Billion Row Challenge tools/dask-cuda data-format/csv library/cudf library/cupy library/dask library/pandas cloud/aws/ec2 cloud/aws/sagemaker cloud/azure/azure-vm cloud/azure/ml cloud/gcp/compute-engine cloud/gcp/vertex-ai HPO with dask-ml and cuml dataset/airline library/numpy library/pandas library/xgboost library/dask library/dask-cuda library/dask-ml library/cuml cloud/aws/ec2 cloud/azure/azure-vm cloud/gcp/compute-engine cloud/ibm/virtual-server library/sklearn data-storage/s3 workflow/hpo # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # NVIDIA Brev The [NVIDIA Brev](https://brev.nvidia.com/) platform provides you a one stop menu of available GPU instances across many cloud providers, including [Amazon Web Services](https://aws.amazon.com/) and [Google Cloud](https://cloud.google.com), with CUDA, Python, Jupyter Lab, all set up. ## Brev Instance Setup There are two options to get you up and running with RAPIDS in a few steps, thanks to the Brev RAPIDS quickstart: 1. Brev GPU Instances - quickly get the GPU, across most clouds, to get your work done. 2. Brev Launchables - quickly create one-click starting, reusable instances that you customized to your MLOps needs. ### Option 1. Setting up your Brev GPU Instance 1. Navigate to the [Brev bash](https://brev.nvidia.com/org) and click on “Create new instance”. ![Screenshot of the "Create your first instance" UI](_static/images/platforms/brev/brev1.png) 1. Choose a compute type. #### HINT New users commonly choose `L4` GPUs for trying things out. ![Screenshot of the "Choose a compute type" UI](_static/images/platforms/brev/brev-compute.png) 1. Select the button to change the container or runtime (the default is “VM Mode w/ Jupyter”) ![Screenshot of the "Changing container or runtime" UI](_static/images/platforms/brev/brev-runtime.png) 1. Select “Featured Containers”. ![Screenshot showing "Featured Containers" highlighted](_static/images/platforms/brev/brev-featured-containers.png) 1. Attach the “NVIDIA RAPIDS” Container and choose “Apply”. ![Screenshot showing the "NVIDIA RAPIDS" container highlighted](_static/images/platforms/brev/brev-rapids-container.png) 1. Give your instance a name and hit “Deploy”. ![Screenshot of the instance creation summary screen with the deploy button highlighted](_static/images/platforms/brev/brev-deploy.png) ### Option 2. Setting up your Brev Launchable Brev Launchables are shareable environment configurations that combine code, containers, and compute into a single portable recipe. This option is most applicable if you want to set up a custom environment for a blueprint, like our [Single-cell Analysis Blueprint](https://github.com/NVIDIA-AI-Blueprints/single-cell-analysis-blueprint/). However, you can use this to create quick-start templates for many different kinds of projects when you want users to drop into an environment that is ready to go (e.g. tutorials, workshops, demos, etc.). You can read more about Brev Launchables in the [Getting Started Guide](https://docs.nvidia.com/brev/latest/launchables-getting-started.html). 1. 
Go to [Brev’s Launchable Creator](https://brev.nvidia.com/launchables/create) (requires an account).
2. When asked **How would you like to provide your code files?**:
   - Select “I have code files in a git repository”, and provide the link to a GitHub repository, if you have one that you’d like to be mounted in the instance once it is up.
   - Otherwise, select “I don’t have any code files”.
3. When asked **What type of runtime environment do you need?** select “With container(s)”, and proceed.

![Screenshot showing the Brev launchable setup with container](_static/images/platforms/brev/brev-launchable-setup-start.png)

1. When prompted to **Choose a Container Configuration**, you have two options:

1. **“Featured Container”**: select the “NVIDIA RAPIDS” container for a ready-to-go environment with the entire RAPIDS stack and Jupyter configured.
   - Select your desired compute environment. Make sure you select sufficient disk size to download the datasets you want to work with. Note: you will not be able to resize the instance once created.
   - Create a name for your launchable, and deploy.
2. **Docker Compose**: for a custom container that you can tailor to your needs.
   - You can provide a `docker-compose.yaml` via URL or from a local file. In the following template, make sure to replace `` in the `volumes` path with the name of your repository if you have one. Otherwise, remove the `volumes` entry.

```yaml
services:
  backend:
    image: "rapidsai/notebooks:25.12a-cuda12-py3.13"
    pull_policy: always
    ulimits:
      memlock: -1
      stack: 67108864
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      EXTRA_CONDA_PACKAGES: "hdbscan>=0.8.39 umap-learn>=0.5.7" # example of packages
    ports:
      - "8888:8888" # Expose JupyterLab
    volumes:
      - /home/ubuntu/:/notebooks/ # e.g. tutorial if repo at https://github.com/rapidsai-community/tutorial
    user: root
    working_dir: /notebooks
    entrypoint: ["/home/rapids/entrypoint.sh"]
    command: python -m jupyter lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.password='' --notebook-dir=/notebooks
    restart: unless-stopped
```

- Click “Validate”.
- Select your desired compute environment. Make sure you select sufficient disk size to download the datasets you want to work with. Note: you will not be able to resize the instance once created.
- On the next page, when asked **Do you want a Jupyter Notebook experience?** select **No, I don’t want Jupyter**. This is because the RAPIDS notebook container already has Jupyter set up. For convenience, name the Secure Link `jupyter`.

![Screenshot showing the Brev launchable Jupyter experience setup](_static/images/platforms/brev/brev-launchable-jupyter-setup-docker-compose.png)

- Create a name for your launchable, and deploy.

## Accessing your instance

There are a few ways to access your instance:

1. Directly access Jupyter Lab from the Brev GUI
2. Using the Brev CLI to connect to your instance
3. Using Visual Studio Code
4. Using SSH via your terminal
5. Access using the Brev tunnel
6. Sharing a service with others

### 1. Jupyter Notebook

To create and use a Jupyter Notebook, click “Open Notebook” at the top right after the page has deployed.

![Screenshot of the instance UI with the "Open Notebook" button highlighted](_static/images/platforms/brev/brev8.png)

### 2.
Brev CLI Install If you want to access your launched Brev instance(s) via Visual Studio Code or SSH using terminal, you need to install the [Brev CLI according to these instructions](https://docs.nvidia.com/brev/latest/brev-cli.html) or this code below: ```bash $ sudo bash -c "$(curl -fsSL https://raw.githubusercontent.com/brevdev/brev-cli/main/bin/install-latest.sh)" && brev login ``` #### 2.1 Brev CLI using Visual Studio Code To connect to your Brev instance from VS Code open a new VS Code window and run: ```bash $ brev open ``` It will automatically open a new VS Code window for you to use with RAPIDS. #### 2.2 Brev CLI using SSH via your Terminal To access your Brev instance from the terminal run: ```bash $ brev shell ``` ##### Forwarding a Port Locally Assuming your Jupyter Notebook is running on port `8888` in your Brev environment, you can forward this port to your local machine using the following SSH command: ```bash $ ssh -L 8888:localhost:8888 @ -p 22 ``` This command forwards port `8888` on your local machine to port `8888` on the remote Brev environment. Or for port `2222` (default port). ```bash $ ssh @ -p 2222 ``` Replace `username` with your username and `ip` with the ip listed if it’s different. ##### Accessing the Service After running the command, open your web browser and navigate to your local host. You will be able to access the Jupyter Notebook running in your Brev environment as if it were running locally. #### 3. Access the Jupyter Notebook via the Tunnel The “Deployments” section will show that your Jupyter Notebook is running on port `8888`, and it is accessible via a shareable URL Ex: `jupyter0-i55ymhsr8.brevlab.com`. Click on the link or copy and paste the URL into your web browser’s address bar to access the Jupyter Notebook interface directly. ##### 4. Share the Service If you want to share access to this service with others, you can click on the “Share a Service” button. You can also manage access by clicking “Edit Access” to control who has the ability to use this service. ### Check that your notebook has GPU Capabilities You can verify that you have your requested GPU by running the `nvidia-smi` command. ![Screenshot of a notebook terminal running the command nvidia-smi and showing the NVIDIA T4 GPU in the output](_static/images/platforms/brev/brev6.png) ## Testing your RAPIDS Instance You can verify your RAPIDS installation is working by importing `cudf` and creating a GPU dataframe. ```python import cudf gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) print(gdf) ``` ## Resources and tips - [Brev Docs](https://brev.dev/) - Please note: Git is not preinstalled in the RAPIDS container, but can be installed into the container when it is running using ```bash $ apt update ``` ```bash $ apt install git -y ``` # index.html.md Compute Engine Instance Launch a Compute Engine instance and run RAPIDS. single-node Vertex AI Launch the RAPIDS container in Vertex AI managed notebooks. single-node Google Kubernetes Engine (GKE) Launch a RAPIDS cluster on managed Kubernetes. multi-node Dataproc Launch a RAPIDS cluster on Dataproc. multi-node # index.html.md IBM Virtual Server Launch a virtual server and run RAPIDS. single-node # index.html.md GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md Azure Virtual Machine Launch an Azure VM instance and run RAPIDS. single-node Azure Kubernetes Service (AKS) Launch a RAPIDS cluster on managed Kubernetes. 
multi-node Azure Cluster via Dask Launch a RAPIDS cluster on Azure VMs or Azure ML with Dask. multi-node Azure Machine Learning (Azure ML) Launch RAPIDS Experiment on Azure ML. single-node multi-node # index.html.md Elastic Compute Cloud (EC2) Launch an EC2 instance and run RAPIDS. single-node EC2 Cluster (with Dask) Launch a RAPIDS cluster on EC2 with Dask. multi-node Elastic Kubernetes Service (EKS) Launch a RAPIDS cluster on managed Kubernetes. multi-node Elastic Container Service (ECS) Launch a RAPIDS cluster on managed container service. multi-node Sagemaker Launch the RAPIDS container as a Sagemaker notebook. single-node multi-node # index.html.md Brev.dev Deploy and run RAPIDS on NVIDIA Brev. single-node # index.html.md # GitHub Actions GitHub Actions is a popular way to automatically run tests against code hosted on GitHub. GitHub’s free tier includes basic runners (the machines that will run your code) and the paid tier includes support for [hosted runners with NVIDIA GPUs](https://github.blog/changelog/2024-07-08-github-actions-gpu-hosted-runners-are-now-generally-available/). This allows GPU-specific code to be exercised as part of a CI workflow. ## Cost As GPU runners are not included in the free tier, projects will have to pay for GPU CI resources. Typically GPU runners cost a few cents per minute; check out the [GitHub documentation](https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#per-minute-rates-for-gpu-powered-larger-runners) for more information. We recommend that projects set a [spending limit](https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#about-spending-limits) on their account/organization. That way your monthly bill will never be a surprise. We also recommend that you only run GPU CI intentionally rather than on every pull request from every contributor. Check out the best practices section for more information. ## Getting started ### Setting up your GPU runners First you’ll need to set up a way to pay GitHub. You can do this by [adding a payment method](https://docs.github.com/en/billing/managing-your-github-billing-settings/adding-or-editing-a-payment-method) to your organization. While you’re in your billing settings, decide the maximum you wish to spend on GPU CI functionality and then set a [spending limit](https://docs.github.com/en/billing/managing-billing-for-github-actions/managing-your-spending-limit-for-github-actions) on your account. Next you can go into the GitHub Actions settings for your account and configure a [larger runner](https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/about-larger-runners). You can find this settings page by visiting `https://github.com/organizations//settings/actions/runners`. ![Screenshot of the GitHub Actions runner configuration page with the new runner button highlighted](_static/images/developer/ci/github-actions/new-hosted-runner.png) Next you need to give your runner a name, for example `linux-nvidia-gpu`; you’ll need to remember this when configuring your workflows later.
Then you need to choose your runner settings: - Under “Platform” select “Linux x64” - Under “Image” switch to the “Partner” tab and choose “NVIDIA GPU-Optimized Image for AI and HPC” - Under “Size” switch to the “GPU-powered” tab and select your preferred NVIDIA hardware ![Screenshot of the GitHub Actions runner configuration page with a new GPU runner configured](_static/images/developer/ci/github-actions/new-runner-config.png) Then set your preferred maximum concurrency and choose “Create runner”. ### Configuring your workflows To configure your workflow to use your new GPU runners you need to set the `runs-on` property to match the name you gave the runner group. ```yaml name: GitHub Actions GPU Demo run-name: ${{ github.actor }} is testing out GPU GitHub Actions 🚀 on: [push] jobs: gpu-workflow: runs-on: linux-nvidia-gpu steps: - name: Check GPU is available run: nvidia-smi ``` ## Best practices Adding GitHub Actions runners that cost money to your project requires some extra thought about when you want those workflows to run. Setting a spending cap allows you to keep control of how much you are spending, but you still want to get the most for your money. Here are some tips on how to use GPU runners effectively in your projects. ### Use labels to trigger workflows Instead of triggering your GPU workflows on every push or pull request, you can use labels to trigger them. This is a great option if your project is public and anyone can make a pull request with any arbitrary code. You may want to have a mechanism for a trusted maintainer or collaborator to trigger the GPU workflow manually. The scikit-learn project solved this by having a label that triggers the workflow. ```yaml name: NVIDIA GPU workflow on: pull_request: types: - labeled jobs: tests: if: contains(github.event.pull_request.labels.*.name, 'GPU CI') runs-on: group: linux-nvidia-gpu steps: ... ``` The above config specifies that the workflow should only run when the `GPU CI` label is added to the pull request. They then have a second [label remover workflow](https://github.com/scikit-learn/scikit-learn/blob/9d39f57399d6f1f7d8e8d4351dbc3e9244b98d28/.github/workflows/cuda-label-remover.yml) which removes the label again, allowing a maintainer to re-add it and trigger the GPU CI workflow any number of times during the review of the pull request. ### Run nightly Some projects might not need to run GPU tests for every pull request, but instead might prefer to run a nightly regression test to ensure that nothing that has been merged has broken GPU functionality. You can configure a GitHub Actions workflow to run on a schedule and use [an action](https://github.com/marketplace/actions/failed-build-issue) to open an issue if the workflow fails. ```yaml name: Nightly GPU Tests on: schedule: - cron: "0 0 * * *" # Run every day at 00:00 UTC jobs: tests: name: GPU Tests runs-on: linux-nvidia-gpu steps: - uses: actions/checkout@v4 - name: Run tests run: | # Run tests here - name: Notify failed build uses: jayqi/failed-build-issue-action@v1 if: failure() && github.event.pull_request == null with: github-token: ${{ secrets.GITHUB_TOKEN }} ``` ### Run only on certain codepaths You may also want to only run your GPU CI tests when code at certain paths has been modified. To do this you can use the [`on.push.paths` filter](https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#example-including-paths).
```yaml name: GPU Tests on: push: paths: - "src/gpu_submodule/**/*.py" jobs: tests: name: GPU Tests runs-on: linux-nvidia-gpu steps: ... ``` ## Further Reading - [Blog from scikit-learn Developers on their experiences](https://betatim.github.io/posts/github-action-with-gpu/) # index.html.md # Continuous Integration GitHub Actions Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs. single-node # index.html.md # How to Setup InfiniBand on Azure [Azure GPU optimized virtual machines](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) provide a low latency and high bandwidth InfiniBand network. This guide walks through the steps to enable InfiniBand to optimize network performance. ## Build a Virtual Machine Start by creating a GPU optimized VM from the Azure portal. Below is an example that we will use for demonstration. - Create new VM instance. - Select `East US` region. - Change `Availability options` to `Availability set` and create a set. - If building multiple instances put additional instances in the same set. - Use the 2nd Gen Ubuntu 24.04 image. - Search all images for `Ubuntu Server 24.04` and choose the second one down on the list. - Change size to `ND40rs_v2`. - Set password login with credentials. - User `someuser` - Password `somepassword` - Leave all other options as default. Then connect to the VM using your preferred method. ## Install Software Before installing the drivers ensure the system is up to date. ```shell sudo apt-get update sudo apt-get upgrade -y ``` ### NVIDIA Drivers The commands below should work for Ubuntu. See the [CUDA Toolkit documentation](https://docs.nvidia.com/cuda/index.html#installation-guides) for details on installing on other operating systems. ```shell sudo apt-get install -y linux-headers-$(uname -r) distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g') wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb sudo dpkg -i cuda-keyring_1.1-1_all.deb sudo apt-get update sudo apt-get -y install cuda-drivers ``` Restart VM instance ```shell sudo reboot ``` Once the VM boots, reconnect and run `nvidia-smi` to verify driver installation. ```shell nvidia-smi ``` ```shell Mon Nov 14 20:32:39 2022 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla V100-SXM2... On | 00000001:00:00.0 Off | 0 | | N/A 34C P0 41W / 300W | 445MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 Tesla V100-SXM2... On | 00000002:00:00.0 Off | 0 | | N/A 37C P0 43W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 2 Tesla V100-SXM2... On | 00000003:00:00.0 Off | 0 | | N/A 34C P0 42W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 3 Tesla V100-SXM2... 
On | 00000004:00:00.0 Off | 0 | | N/A 35C P0 44W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 4 Tesla V100-SXM2... On | 00000005:00:00.0 Off | 0 | | N/A 35C P0 41W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 5 Tesla V100-SXM2... On | 00000006:00:00.0 Off | 0 | | N/A 36C P0 43W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 6 Tesla V100-SXM2... On | 00000007:00:00.0 Off | 0 | | N/A 37C P0 44W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 7 Tesla V100-SXM2... On | 00000008:00:00.0 Off | 0 | | N/A 38C P0 44W / 300W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 1396 G /usr/lib/xorg/Xorg 427MiB | | 0 N/A N/A 1762 G /usr/bin/gnome-shell 16MiB | | 1 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | | 2 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | | 3 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | | 4 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | | 5 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | | 6 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | | 7 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB | +-----------------------------------------------------------------------------+ ``` ### InfiniBand Driver On Ubuntu 24.04 ```shell sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \ libgfortran5 bison chrpath flex graphviz gfortran tk quilt swig tcl ibverbs-utils ``` Check install ```shell ibv_devinfo ``` ```shell hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 16.28.4000 node_guid: 0015:5dff:fe33:ff2c sys_image_guid: 0c42:a103:00b3:2f68 vendor_id: 0x02c9 vendor_part_id: 4120 hw_ver: 0x0 board_id: MT_0000000010 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 7 port_lid: 115 port_lmc: 0x00 link_layer: InfiniBand hca_id: rdmaP36305p0s2 transport: InfiniBand (0) fw_ver: 2.43.7008 node_guid: 6045:bdff:feed:8445 sys_image_guid: 043f:7203:0003:d583 vendor_id: 0x02c9 vendor_part_id: 4100 hw_ver: 0x0 board_id: MT_1090111019 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 1024 (3) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet ``` #### Enable IPoIB ```shell sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf ``` Reboot and reconnect. 
```shell sudo reboot ``` #### Check IB ```shell ip addr show ``` ```shell 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff inet 10.6.0.5/24 brd 10.6.0.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::6245:bdff:fea7:42cc/64 scope link valid_lft forever preferred_lft forever 3: eth1: mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 00:15:5d:33:ff:16 brd ff:ff:ff:ff:ff:ff 4: enP44906s1: mtu 1500 qdisc mq master eth0 state UP group default qlen 1000 link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff altname enP44906p0s2 5: ibP59423s2: mtu 4092 qdisc noop state DOWN group default qlen 256 link/infiniband 00:00:09:27:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:16 brd 00:ff:ff:ff:ff:12:40:1b:80:1d:00:00:00:00:00:00:ff:ff:ff:ff altname ibP59423p0s2 ``` ```shell nvidia-smi topo -m ``` ```shell GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 CPU Affinity NUMA Affinity GPU0 X NV2 NV1 NV2 NODE NODE NV1 NODE NODE 0-19 0 GPU1 NV2 X NV2 NV1 NODE NODE NODE NV1 NODE 0-19 0 GPU2 NV1 NV2 X NV1 NV2 NODE NODE NODE NODE 0-19 0 GPU3 NV2 NV1 NV1 X NODE NV2 NODE NODE NODE 0-19 0 GPU4 NODE NODE NV2 NODE X NV1 NV1 NV2 NODE 0-19 0 GPU5 NODE NODE NODE NV2 NV1 X NV2 NV1 NODE 0-19 0 GPU6 NV1 NODE NODE NODE NV1 NV2 X NV2 NODE 0-19 0 GPU7 NODE NV1 NODE NODE NV2 NV1 NV2 X NODE 0-19 0 mlx5_0 NODE NODE NODE NODE NODE NODE NODE NODE X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ``` ### Install UCXX and tools ```shell wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh bash Mambaforge-Linux-x86_64.sh ``` Accept the default and allow conda init to run. ```shell ~/mambaforge/bin/conda init ``` Then start a new shell. Create a conda environment (see [UCXX](https://docs.rapids.ai/api/ucxx/nightly/install/) docs) ```shell mamba create -n ucxx -c rapidsai-nightly -c conda-forge -c nvidia rapids=25.12 python=3.13 'cuda-version>=12.0,<=12.9' ipython dask distributed distributed-ucxx numpy cupy pytest pynvml -y mamba activate ucxx ``` Clone UCXX repo locally ```shell git clone https://github.com/rapidsai/ucxx.git cd ucxx ``` ### Run Tests Start by running the UCXX test suite, from within the `ucxx` repo: ```shell pytest -vs python/ucxx/ucxx/_lib/tests pytest -vs python/ucxx/ucxx/_lib_async/tests ``` Now check to see if InfiniBand works, for that you can run some of the benchmarks that we include in UCXX, for example: ```shell # cd out of the ucxx directory cd .. 
# Let UCX pick the best transport (expecting NVLink when available, # otherwise InfiniBand, or TCP in worst case) on devices 0 and 1 python -m ucxx.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB # Force TCP-only on devices 0 and 1 UCX_TLS=tcp,cuda_copy python -m ucxx.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB ``` We expect the first case above to have much higher bandwidth than the second. If you happen to have both NVLink and InfiniBand connectivity, then you may limit UCX to a specific transport by setting `UCX_TLS`, e.g.: ```shell # NVLink (if available) or TCP UCX_TLS=tcp,cuda_copy,cuda_ipc # InfiniBand (if available) or TCP UCX_TLS=tcp,cuda_copy,rc ``` ## Run Benchmarks Finally, let’s run the [merge benchmark](https://github.com/rapidsai/dask-cuda/blob/HEAD/dask_cuda/benchmarks/local_cudf_merge.py) from `dask-cuda`. This benchmark uses Dask to perform a merge of two dataframes that are distributed across all the available GPUs on your VM. Merges are a challenging benchmark in a distributed setting since they require communication-intensive shuffle operations of the participating dataframes (see the [Dask documentation](https://docs.dask.org/en/stable/dataframe-best-practices.html#avoid-full-data-shuffling) for more on this type of operation). To perform the merge, each dataframe is shuffled such that rows with the same join key appear on the same GPU. This results in an [all-to-all](https://en.wikipedia.org/wiki/All-to-all_(parallel_pattern)) communication pattern which requires a lot of communication between the GPUs. As a result, network performance will be very important for the throughput of the benchmark. Below we are running on devices 0 through 7 (inclusive); you will want to adjust that for the number of devices available on your VM (the default is to run on GPU 0 only). Additionally, `--chunk-size 100_000_000` is a safe value for 32GB GPUs; you may adjust it in proportion to the memory of your GPU (it scales linearly, so `50_000_000` should be good for 16GB and `150_000_000` for 48GB).
```shell # Default Dask TCP communication protocol python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --no-show-p2p-bandwidth ``` ```shell Merge benchmark -------------------------------------------------------------------------------- Backend | dask Merge type | gpu Rows-per-chunk | 100000000 Base-chunks | 8 Other-chunks | 8 Broadcast | default Protocol | tcp Device(s) | 0,1,2,3,4,5,6,7 RMM Pool | True Frac-match | 0.3 Worker thread(s) | 1 Data processed | 23.84 GiB Number of workers | 8 ================================================================================ Wall clock | Throughput -------------------------------------------------------------------------------- 48.51 s | 503.25 MiB/s 47.85 s | 510.23 MiB/s 41.20 s | 592.57 MiB/s ================================================================================ Throughput | 532.43 MiB/s +/- 22.13 MiB/s Bandwidth | 44.76 MiB/s +/- 0.93 MiB/s Wall clock | 45.85 s +/- 3.30 s ``` ```shell # UCX protocol python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx --no-show-p2p-bandwidth ``` ```shell Merge benchmark -------------------------------------------------------------------------------- Backend | dask Merge type | gpu Rows-per-chunk | 100000000 Base-chunks | 8 Other-chunks | 8 Broadcast | default Protocol | ucx Device(s) | 0,1,2,3,4,5,6,7 RMM Pool | True Frac-match | 0.3 TCP | None InfiniBand | None NVLink | None Worker thread(s) | 1 Data processed | 23.84 GiB Number of workers | 8 ================================================================================ Wall clock | Throughput -------------------------------------------------------------------------------- 9.57 s | 2.49 GiB/s 6.01 s | 3.96 GiB/s 9.80 s | 2.43 GiB/s ================================================================================ Throughput | 2.82 GiB/s +/- 341.13 MiB/s Bandwidth | 159.89 MiB/s +/- 8.96 MiB/s Wall clock | 8.46 s +/- 1.73 s ``` # index.html.md # Dask Operator Many libraries in RAPIDS can leverage Dask to scale out computation onto multiple GPUs and multiple nodes. [Dask has an operator for Kubernetes](https://kubernetes.dask.org/en/latest/) which allows you to launch Dask clusters as native Kubernetes resources. With the operator and associated Custom Resource Definitions (CRDs) you can create `DaskCluster`, `DaskWorkerGroup` and `DaskJob` resources that describe your Dask components and the operator will create the appropriate Kubernetes resources like `Pods` and `Services` to launch the cluster. ## Installation Your Kubernetes cluster must have GPU nodes and have [up to date NVIDIA drivers installed](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html). To install the Dask operator follow the [instructions in the Dask documentation](https://kubernetes.dask.org/en/latest/installing.html). ## Configuring a RAPIDS `DaskCluster` To configure the `DaskCluster` resource to run RAPIDS you need to set a few things: - The container image must contain RAPIDS, the [official RAPIDS container images](https://docs.rapids.ai/install/#docker) are a good choice for this. - The Dask workers must be configured with one or more NVIDIA GPU resources. - The worker command must be set to `dask-cuda-worker`. ## Example using `kubectl` Here is an example resource manifest for launching a RAPIDS Dask cluster. 
```yaml # rapids-dask-cluster.yaml apiVersion: kubernetes.dask.org/v1 kind: DaskCluster metadata: name: rapids-dask-cluster labels: dask.org/cluster-name: rapids-dask-cluster spec: worker: replicas: 2 spec: containers: - name: worker image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" args: - dask-cuda-worker - --name - $(DASK_WORKER_NAME) resources: limits: nvidia.com/gpu: "1" scheduler: spec: containers: - name: scheduler image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" env: args: - dask-scheduler ports: - name: tcp-comm containerPort: 8786 protocol: TCP - name: http-dashboard containerPort: 8787 protocol: TCP readinessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 5 periodSeconds: 10 livenessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 15 periodSeconds: 20 service: type: ClusterIP selector: dask.org/cluster-name: rapids-dask-cluster dask.org/component: scheduler ports: - name: tcp-comm protocol: TCP port: 8786 targetPort: "tcp-comm" - name: http-dashboard protocol: TCP port: 8787 targetPort: "http-dashboard" ``` You can create this cluster with `kubectl`. ```bash $ kubectl apply -f rapids-dask-cluster.yaml ``` ### Manifest breakdown Let’s break this manifest down section by section. #### Metadata At the top we see the `DaskCluster` resource type and general metadata. ```yaml apiVersion: kubernetes.dask.org/v1 kind: DaskCluster metadata: name: rapids-dask-cluster labels: dask.org/cluster-name: rapids-dask-cluster spec: worker: # ... scheduler: # ... ``` Then inside the `spec` we have `worker` and `scheduler` sections. #### Worker The worker contains a `replicas` option to set how many workers you need and a `spec` that describes what each worker Pod should look like. The spec is a nested [`Pod` spec](https://kubernetes.io/docs/concepts/workloads/pods/) that the operator will use when creating new `Pod` resources. ```yaml # ... spec: worker: replicas: 2 spec: containers: - name: worker image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" args: - dask-cuda-worker - --name - $(DASK_WORKER_NAME) resources: limits: nvidia.com/gpu: "1" scheduler: # ... ``` Inside our Pod spec we are configuring one container that uses the `rapidsai/base` container image. It also sets the `args` to start the `dask-cuda-worker` and configures one NVIDIA GPU. #### Scheduler Next we have a `scheduler` section that also contains a `spec` for the scheduler Pod and a `service` which will be used by the operator to create a `Service` resource to expose the scheduler. ```yaml # ... spec: worker: # ... scheduler: spec: containers: - name: scheduler image: "rapidsai/base:25.12a-cuda12-py3.13" imagePullPolicy: "IfNotPresent" args: - dask-scheduler ports: - name: tcp-comm containerPort: 8786 protocol: TCP - name: http-dashboard containerPort: 8787 protocol: TCP readinessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 5 periodSeconds: 10 livenessProbe: httpGet: port: http-dashboard path: /health initialDelaySeconds: 15 periodSeconds: 20 service: # ... ``` For the scheduler Pod we are also setting the `rapidsai/base` container image, mainly to ensure our Dask versions match between the scheduler and workers. We ensure that the `dask-scheduler` command is configured. Then we configure both the Dask communication port on `8786` and the Dask dashboard on `8787` and add some probes so that Kubernetes can monitor the health of the scheduler. 
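If you want to exercise the same `/health` endpoint yourself once the cluster is running, a quick sketch (assuming the service name from the manifest above and that `kubectl` is pointed at the right namespace) is to port-forward the dashboard port and query it:

```console
$ kubectl port-forward svc/rapids-dask-cluster-service 8787:8787 &
$ curl http://localhost:8787/health
```

A healthy scheduler answers with a small plain-text body and an HTTP 200, which is what the readiness and liveness probes rely on.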
#### NOTE The ports must have the `tcp-` and `http-` prefixes if your Kubernetes cluster uses [Istio](https://istio.io/) to ensure the [Envoy proxy](https://www.envoyproxy.io/) doesn’t mangle the traffic. Then we configure the `Service`. ```yaml # ... spec: worker: # ... scheduler: spec: # ... service: type: ClusterIP selector: dask.org/cluster-name: rapids-dask-cluster dask.org/component: scheduler ports: - name: tcp-comm protocol: TCP port: 8786 targetPort: "tcp-comm" - name: http-dashboard protocol: TCP port: 8787 targetPort: "http-dashboard" ``` This example shows using a `ClusterIP` service which will not expose the Dask cluster outside of Kubernetes. If you prefer you could set this to [`LoadBalancer`](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) or [`NodePort`](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport) to make this externally accessible. It has a `selector` that matches the scheduler Pod and the same ports configured. ### Accessing your Dask cluster Once you have created your `DaskCluster` resource we can use `kubectl` to check the status of all the other resources it created for us. ```console $ kubectl get all -l dask.org/cluster-name=rapids-dask-cluster NAME READY STATUS RESTARTS AGE pod/rapids-dask-cluster-default-worker-group-worker-0c202b85fd 1/1 Running 0 4m13s pod/rapids-dask-cluster-default-worker-group-worker-ff5d376714 1/1 Running 0 4m13s pod/rapids-dask-cluster-scheduler 1/1 Running 0 4m14s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/rapids-dask-cluster-service ClusterIP 10.96.223.217 8786/TCP,8787/TCP 4m13s ``` Here you can see our scheduler Pod and two worker Pods along with the scheduler service. If you have a Python session running within the Kubernetes cluster (like the [example one on the Kubernetes page](../../platforms/kubernetes.md)) you should be able to connect a Dask distributed client directly. ```python from dask.distributed import Client client = Client("rapids-dask-cluster-scheduler:8786") ``` Alternatively if you are outside of the Kubernetes cluster you can change the `Service` to use [`LoadBalancer`](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) or [`NodePort`](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport) or use `kubectl` to port forward the connection locally. ```console $ kubectl port-forward svc/rapids-dask-cluster-service 8786:8786 Forwarding from 127.0.0.1:8786 -> 8786 ``` ```python from dask.distributed import Client client = Client("localhost:8786") ``` ## Example using `KubeCluster` In addition to creating clusters via `kubectl` you can also do so from Python with [`dask_kubernetes.operator.KubeCluster`](https://kubernetes.dask.org/en/latest/operator_kubecluster.html#dask_kubernetes.operator.KubeCluster). This class implements the Dask Cluster Manager interface and under the hood creates and manages the `DaskCluster` resource for you. ```python from dask_kubernetes.operator import KubeCluster cluster = KubeCluster( name="rapids-dask", image="rapidsai/base:25.12a-cuda12-py3.13", n_workers=3, resources={"limits": {"nvidia.com/gpu": "1"}}, worker_command="dask-cuda-worker", ) ``` If we check with `kubectl` we can see the above Python generated the same `DaskCluster` resource as the `kubectl` example above. 
```console $ kubectl get daskclusters NAME AGE rapids-dask-cluster 3m28s $ kubectl get all -l dask.org/cluster-name=rapids-dask-cluster NAME READY STATUS RESTARTS AGE pod/rapids-dask-cluster-default-worker-group-worker-07d674589a 1/1 Running 0 3m30s pod/rapids-dask-cluster-default-worker-group-worker-a55ed88265 1/1 Running 0 3m30s pod/rapids-dask-cluster-default-worker-group-worker-df785ab050 1/1 Running 0 3m30s pod/rapids-dask-cluster-scheduler 1/1 Running 0 3m30s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/rapids-dask-cluster-service ClusterIP 10.96.200.202 8786/TCP,8787/TCP 3m30s ``` With this cluster object in Python we can also connect a client to it directly without needing to know the address as Dask will discover that for us. It also automatically sets up port forwarding if you are outside of the Kubernetes cluster. ```python from dask.distributed import Client client = Client(cluster) ``` This object can also be used to scale the workers up and down. ```python cluster.scale(5) ``` And to manually close the cluster. ```python cluster.close() ``` #### NOTE By default the `KubeCluster` class registers an exit hook, so the cluster is deleted automatically when the Python process exits. You can disable this by setting `KubeCluster(..., shutdown_on_close=False)` when launching the cluster. This is useful if you have a multi-stage pipeline made up of multiple Python processes and you want your Dask cluster to persist between them. You can also connect a `KubeCluster` object to your existing cluster with `cluster = KubeCluster.from_name(name="rapids-dask")` if you wish to use the cluster or manually call `cluster.close()` in the future. # index.html.md # Dask Helm Chart Dask has a [Helm Chart](https://github.com/dask/helm-chart) that creates the following resources: - 1 x Jupyter server (preconfigured to access the Dask cluster) - 1 x Dask scheduler - 3 x Dask workers that connect to the scheduler (scalable) This helm chart can be configured to run RAPIDS by providing GPUs to the Jupyter server and Dask workers and by using container images with the RAPIDS libraries available. ## Configuring RAPIDS Building on top of the Dask Helm Chart, the `rapids-config.yaml` file below contains the additional configuration required to set up a RAPIDS environment. ```yaml # rapids-config.yaml scheduler: image: repository: "rapidsai/base" tag: "25.12a-cuda12-py3.13" worker: image: repository: "rapidsai/base" tag: "25.12a-cuda12-py3.13" dask_worker: "dask_cuda_worker" replicas: 3 resources: limits: nvidia.com/gpu: 1 jupyter: image: repository: "rapidsai/notebooks" tag: "25.12a-cuda12-py3.13" servicePort: 8888 # Default password hash for "rapids" password: "argon2:$argon2id$v=19$m=10240,t=10,p=8$TBbhubLuX7efZGRKQqIWtw$RG+jCBB2KYF2VQzxkhMNvHNyJU9MzNGTm2Eu2/f7Qpc" resources: limits: nvidia.com/gpu: 1 ``` `[jupyter|scheduler|worker].image.*` is updated with the RAPIDS “runtime” image from the stable release, which includes the environment necessary to run the accelerated libraries in RAPIDS and to scale up and down via Dask. Note that the scheduler, worker and Jupyter Pods are all required to use the same image. This ensures that the Dask scheduler and worker versions match. `[jupyter|worker].resources` explicitly requests a GPU for each worker Pod and the Jupyter Pod, which is required by many accelerated libraries in RAPIDS. `worker.dask_worker` is the launch command for the Dask worker inside the worker Pod. To leverage the GPUs assigned to each Pod, the [`dask_cuda_worker`](https://docs.rapids.ai/api/dask-cuda/nightly/index.html) command is launched in place of the regular `dask_worker`. If you want a Jupyter password different from the default, compute the hash for your chosen password and update `jupyter.password`. You can compute the password hash by following the [jupyter notebook guide](https://jupyter-notebook.readthedocs.io/en/stable/public_server.html?highlight=passwd#preparing-a-hashed-password), or see the sketch below.
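A minimal sketch of generating such a hash, assuming the `jupyter_server` package is available in your local Python environment (the password shown is only a placeholder):

```python
# Sketch: generate an argon2 hash for a new Jupyter password.
# "my-new-password" is a placeholder; replace it with your own password.
from jupyter_server.auth import passwd

print(passwd("my-new-password"))
# Copy the printed hash into the jupyter.password field of rapids-config.yaml
```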
### Installing the Helm Chart ```bash $ helm install rapids-release --repo https://helm.dask.org dask -f rapids-config.yaml ``` This will deploy the cluster with the same topology as the Dask Helm Chart; see the [Dask Helm Chart documentation](https://artifacthub.io/packages/helm/dask/dask) for details. #### NOTE By default, the Dask Helm Chart will not create an `Ingress` resource. A custom `Ingress` may be configured to consume external traffic and redirect it to the corresponding services. For simplicity, this guide will set up access to the Jupyter server via port forwarding. ## Running a RAPIDS Notebook First, set up port forwarding from the cluster to your local machine: ```bash # For the Jupyter server $ kubectl port-forward --address 127.0.0.1 service/rapids-release-dask-jupyter 8888:8888 ``` ```bash # For the Dask dashboard $ kubectl port-forward --address 127.0.0.1 service/rapids-release-dask-scheduler 8787:8787 ``` Open a browser and visit `localhost:8888` to access Jupyter, and `localhost:8787` for the Dask dashboard. Enter the password (default is `rapids`) and access the notebook environment. ### Notebooks and Cluster Scaling Now we can verify that everything is working correctly by running some of the example notebooks. Open the `10 Minutes to cuDF and Dask-cuDF` notebook under `cudf/10-min.ipynb`. Add a new cell at the top to connect to the Dask cluster. Conveniently, the Helm Chart preconfigures the scheduler address in the client’s environment, so you do not need to pass any configuration to the `Client` object. ```python from dask.distributed import Client client = Client() client ``` By default, we can see 3 workers are created and each has 1 GPU assigned. ![dask worker](_static/daskworker.PNG) Walk through the examples to validate that the Dask cluster is set up correctly and that GPUs are accessible to the workers. Worker metrics can be examined in the Dask dashboard. ![dask worker](_static/workingdask.PNG) In case you want to scale up the cluster with more GPU workers, you may do so via `kubectl` or via `helm upgrade`. ```bash $ kubectl scale deployment rapids-release-dask-worker --replicas=8 ``` or ```bash $ helm upgrade --set worker.replicas=8 rapids-release dask/dask ``` ![dask worker](_static/eightworkers.PNG) # index.html.md # Measuring Performance with the One Billion Row Challenge *January, 2024* The [One Billion Row Challenge](https://www.morling.dev/blog/one-billion-row-challenge/) is a programming competition aimed at Java developers to write the most efficient code to process a one billion line text file and calculate some metrics. The challenge has inspired solutions in many languages beyond Java including [Python](https://github.com/gunnarmorling/1brc/discussions/62). In this notebook we will explore how we can use RAPIDS to build an efficient solution in Python and how we can use dashboards to understand how performant our code is. ## The Problem The input data of the challenge is a ~13GB text file containing one billion lines of temperature measurements.
The file is structured with one measurement per line with the name of the weather station and the measurement separated by a semicolon. ```text Hamburg;12.0 Bulawayo;8.9 Palembang;38.8 St. John's;15.2 Cracow;12.6 ... ``` Our goal is to calculate the min, mean, and max temperature per weather station sorted alphabetically by station name as quickly as possible. ## Reference Implementation A reference implementation written with popular PyData tools would likely be something along the lines of the following Pandas code (assuming you have enough RAM to fit the data into memory). ```python import pandas as pd df = pd.read_csv( "measurements.txt", sep=";", header=None, names=["station", "measure"], engine='pyarrow' ) df = df.groupby("station").agg(["min", "max", "mean"]) df.columns = df.columns.droplevel() df = df.sort_values("station") ``` Here we use `pandas.read_csv()` to open the text file and specify the `;` separator and also set some column names. We also set the engine to `pyarrow` to give us some extra performance out of the box. Then we group the measurements by their station name and calculate the min, max and mean. Finally we sort the grouped dataframe by the station name. Running this on a workstation with a 12-core CPU completes the task in around **4 minutes**. ## Deploying RAPIDS To run this notebook we will need a machine with one or more GPUs. There are many ways you can get this: - Have a laptop, desktop or workstation with GPUs. - Run a VM on the cloud using [AWS EC2](../../cloud/aws/ec2.md), [Google Compute Engine](../../cloud/gcp/compute-engine.md), [Azure VMs](../../cloud/azure/azure-vm.md), etc. - Use a managed notebook service like [SageMaker](../../cloud/aws/sagemaker.md), [Vertex AI](../../cloud/gcp/vertex-ai.md), [Azure ML](../../cloud/azure/azureml.md) or [Databricks](../../platforms/databricks.md). - Run a container in a [Kubernetes cluster with GPUs](../../platforms/kubernetes.md). Once you have a GPU machine you will need to [install RAPIDS](https://docs.rapids.ai/install/). You can do this with [pip](https://docs.rapids.ai/install#pip), [conda](https://docs.rapids.ai/install#conda) or [docker](https://docs.rapids.ai/install#docker). We are also going to use Jupyter Lab with the RAPIDS [nvdashboard extension](https://github.com/rapidsai/jupyterlab-nvdashboard) and the [Dask Lab Extension](https://github.com/dask/dask-labextension) so that we can understand what our machine is doing. If you are using the Docker container these will already be installed for you, otherwise you will need to install them yourself. ### Dashboards Once you have Jupyter up and running with the extensions installed and this notebook downloaded you can open some performance dashboards so we can monitor our hardware as our code runs. Let’s start with nvdashboard which has the GPU icon in the left toolbar. ![](_static/images/examples/rapids-1brc-single-node/nvdashboard-sidebar.png) Start by opening the “Machine Resources” table, “GPU Utilization” graph and “GPU Memory” graph and moving them over to the right hand side. ![](_static/images/examples/rapids-1brc-single-node/nvdashboard-resources.png) ## Data Generation Before we get started with our problem we need to generate the input data. The 1BRC repo has a [Java implementation](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CreateMeasurements.java) which takes around 15 minutes to generate the file. 
If you were to run the Java implementation you would see the CPU get busy but disk bandwidth remain low, suggesting this is a compute-bound problem. We can accelerate this on the GPU using cuDF and CuPy. Download the [`lookup.csv`](./lookup.csv) table of stations and their mean temperatures as we will use this to generate our data file containing `n` rows of random temperatures. To generate each row we choose a random station from the lookup table, then generate a random temperature measurement from a normal distribution around the mean temp. We assume the standard deviation is `10.0` for all stations. ```ipython3 import time from pathlib import Path import cudf import cupy as cp ``` ```ipython3 def generate_chunk(filename, chunksize, std, lookup_df): """Generate some sample data based on the lookup table.""" df = cudf.DataFrame( { # Choose a random station from the lookup table for each row in our output "station": cp.random.randint(0, len(lookup_df) - 1, int(chunksize)), # Generate a normal distribution around zero for each row in our output # Because the std is the same for every station we can adjust the mean for each row afterwards "measure": cp.random.normal(0, std, int(chunksize)), } ) # Offset each measurement by the station's mean value df.measure += df.station.map(lookup_df.mean_temp) # Round the temperature to one decimal place df.measure = df.measure.round(decimals=1) # Convert the station index to the station name df.station = df.station.map(lookup_df.station) # Append this chunk to the output file with open(filename, "a") as fh: df.to_csv(fh, sep=";", chunksize=10_000_000, header=False, index=False) ``` ### Configuration ```ipython3 n = 1_000_000_000 # Number of rows of data to generate lookup_df = cudf.read_csv( "lookup.csv" ) # Load our lookup table of stations and their mean temperatures std = 10.0 # We assume temperatures are normally distributed with a standard deviation of 10 chunksize = 2e8 # Set the number of rows to generate in one go (reduce this if you run into GPU RAM limits) filename = Path("measurements.txt") # Choose where to write to filename.unlink() if filename.exists() else None # Delete the file if it exists already ``` ### Run the data generation ```ipython3 %%time # Loop over chunks and generate data start = time.time() for i in range(int(n / chunksize)): # Generate a chunk generate_chunk(filename, chunksize, std, lookup_df) # Update the progress bar percent_complete = int(((i + 1) * chunksize) / n * 100) time_taken = int(time.time() - start) time_remaining = int((time_taken / percent_complete) * 100) - time_taken print( ( f"Writing {int(n / 1e9)} billion rows to {filename}: {percent_complete}% " f"in {time_taken}s ({time_remaining}s remaining)" ), end="\r", ) print() ``` ```myst-ansi Writing 1 billion rows to measurements.txt: 100% in 25s (0s remaining) CPU times: user 10.1 s, sys: 18 s, total: 28.2 s Wall time: 25.3 s ``` If you watch the graphs while this cell is running you should see a burst of GPU utilization when the GPU generates the random numbers followed by a burst of Disk IO when that data is written to disk. This pattern will happen for each chunk that is generated. #### NOTE We could improve performance even further here by generating the next chunk while the current chunk is writing to disk, but a 30x speedup seems good enough for now. ### Check the files Now we can verify our dataset is the size we expected and contains rows that follow the format needed by the challenge.
```ipython3 !ls -lh {filename} ``` ```myst-ansi -rw-r--r-- 1 rapids conda 13G Jan 22 16:54 measurements.txt ``` ```ipython3 !head {filename} ``` ```myst-ansi Guatemala City;17.3 Launceston;24.3 Bulawayo;8.7 Tbilisi;9.5 Napoli;26.8 Sarajevo;27.5 Chihuahua;29.2 Ho Chi Minh City;8.4 Johannesburg;19.2 Cape Town;16.3 ``` ## GPU Solution with RAPIDS Now let’s look at using RAPIDS to speed up our Pandas implementation of the challenge. If you directly convert the reference implementation from Pandas to cuDF you will run into some [limitations cuDF has with string columns](https://github.com/rapidsai/cudf/issues/13733). Also, depending on your GPU, you may run into memory limits, as cuDF will read the whole dataset into memory and machines typically have less GPU memory than CPU memory. Therefore to solve this with RAPIDS we also need to use [Dask](https://dask.org) to partition the dataset and stream it through GPU memory; cuDF can then process each partition in a performant way. ### Deploying Dask We are going to use [dask-cuda](../../tools/dask-cuda.md) to start a GPU Dask cluster. ```ipython3 from dask.distributed import Client from dask_cuda import LocalCUDACluster client = Client(LocalCUDACluster()) ``` Creating a `LocalCUDACluster()` inspects the machine and starts one Dask worker for each detected GPU. We then pass that to a Dask client which means that all following code in the notebook will leverage the GPU workers. ### Dask Dashboard We can also make use of the [Dask Dashboard](https://docs.dask.org/en/latest/dashboard.html) to see what is going on. If you select the Dask logo from the left-hand toolbar and then click the search icon it should detect our `LocalCUDACluster` automatically and show us a long list of graphs to choose from. ![](_static/images/examples/rapids-1brc-single-node/dask-labextension-graphs.png) When working with GPUs, the “GPU Utilization” and “GPU Memory” graphs will show us the same as the nvdashboard plots but for all machines in our Dask cluster. This is very helpful when working on a multi-node cluster but doesn’t help us in this single-node configuration. To see what Dask is doing in this challenge you should open the “Progress” and “Task Stream” graphs which will show all of the operations being performed. But feel free to open other graphs and explore all of the different metrics Dask can give you. ### Dask + cuDF Solution Now that we have our input data and a Dask cluster we can write some Dask code that leverages cuDF under the hood to perform the compute operations. First we need to import `dask.dataframe` and tell it to use the `cudf` backend. ```ipython3 import dask import dask.dataframe as dd dask.config.set({"dataframe.backend": "cudf"}) ``` Now we can run our Dask code, which is almost identical to the Pandas code we used before. ```ipython3 %%timeit -n 3 -r 4 df = dd.read_csv("measurements.txt", sep=";", header=None, names=["station", "measure"]) df = df.groupby("station").agg(["min", "max", "mean"]) df.columns = df.columns.droplevel() # We need to switch back to Pandas for the final sort at the time of writing due to rapidsai/cudf#14794 df = df.compute().to_pandas() df = df.sort_values("station") ``` ```myst-ansi 4.59 s ± 124 ms per loop (mean ± std. dev. of 4 runs, 3 loops each) ``` Running this notebook on a desktop workstation with two NVIDIA RTX 8000 GPUs completes the challenge in around **4 seconds** (a **60x speedup** over Pandas).
Watching the progress bars, you should see them fill and reset a total of 12 times as our `%%timeit` operation solves the challenge multiple times to get an average speed. ![](_static/images/examples/rapids-1brc-single-node/dask-labextension-processing.png) In the above screenshot you can see that on a dual-GPU system Dask was leveraging both GPUs. But it’s also interesting to note that the GPU utilization never reaches 100%. This is because the SSD in the machine has now become the bottleneck. The GPUs are performing the calculations so efficiently that we can’t read data from disk fast enough to fully saturate them. ## Conclusion RAPIDS can accelerate existing workflows written with libraries like Pandas with little to no code changes. GPUs can accelerate computations by orders of magnitude, which can move performance bottlenecks to other parts of the system. Using dashboarding tools like nvdashboard and the Dask dashboard allows you to see and understand how your system is performing. Perhaps in this example upgrading the SSD is the next step to achieving even more performance. # index.html.md # HPO Benchmarking with RAPIDS and Dask *August, 2023* Hyper-Parameter Optimization (HPO) helps to find the best version of a model by exploring the space of possible configurations. While generally desirable, this search is computationally expensive and time-consuming. In the notebook demo below, we compare benchmarking results to show how GPUs can accelerate HPO tuning jobs relative to CPUs. For instance, we find a 48x speedup in wall clock time (0.71 hrs vs 34.6 hrs) for XGBoost and 16x (3.86 hrs vs 63.2 hrs) for RandomForest when comparing `p3.8xlarge` Tesla V100 GPU and `c5.24xlarge` CPU EC2 instances on 100 HPO trials of the 3-year Airline Dataset. **Preamble** You can set up a local environment, but it is recommended to launch a virtual machine on a cloud service (Azure, AWS, GCP, etc.). For the purposes of this notebook, we will be using the [Amazon Machine Image (AMI)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-tensorflow-2-12-amazon-linux-2/) as the starting point. **Python ML Workflow** In order to work with the RAPIDS container, the entrypoint logic should parse arguments, load, preprocess and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting. Let’s have a step-by-step look at each stage of the ML workflow: Dataset We leverage the `Airline` dataset, a large public tracker of US domestic flight logs, which we offer in various sizes (1 year, 3 year, and 10 year) and in [Parquet](https://parquet.apache.org/) (compressed column storage) format. The machine learning objective with this dataset is to predict whether flights will be more than 15 minutes late arriving to their destination. We host the demo dataset in public S3 demo buckets in both the `us-east-1` and `us-west-2` regions. To optimize performance, we recommend that you access the S3 bucket in the same region as your EC2 instance to reduce network latency and data transfer costs.
For this demo, we are using the **`3_year`** dataset, which includes the following features, to mention a few: * Date and distance ( Year, Month, Distance ) * Airline / carrier ( Flight_Number_Reporting_Airline ) * Actual departure and arrival times ( DepTime and ArrTime ) * Difference between scheduled & actual times ( ArrDelay and DepDelay ) * Binary encoded version of late, aka our target variable ( ArrDelay15 ) Configure AWS credentials for access to S3 storage: ```default aws configure ``` Download the dataset from the S3 bucket to your current working directory: ```default aws s3 cp --recursive s3://sagemaker-rapids-hpo-us-west-2/3_year/ ./data/ ``` Algorithm From an ML/algorithm perspective, we offer `XGBoost` and `RandomForest`. You are free to switch between these algorithm choices and everything in the example will continue to work. ```python parser = argparse.ArgumentParser() parser.add_argument( "--model-type", type=str, required=True, choices=["XGBoost", "RandomForest"] ) ``` We can also optionally increase robustness via reshuffles of the train-test split (i.e., cross-validation folds). Typical values are between 3 and 10 folds. We will use ```python n_cv_folds = 5 ``` Dask Cluster To maximize efficiency, we launch a Dask `LocalCluster` for CPU or a `LocalCUDACluster` that utilizes GPUs for distributed computing. Then we connect a Dask `Client` to submit and manage computations on the cluster. We can then ingest the data and “persist” it in memory using Dask as follows: ```python if args.mode == "gpu": cluster = LocalCUDACluster() else: # mode == "cpu" cluster = LocalCluster(n_workers=os.cpu_count()) with Client(cluster) as client: dataset = ingest_data(mode=args.mode) client.persist(dataset) ``` Search Range One of the most important choices when running HPO is choosing the bounds of the hyperparameter search process. In this notebook, we leverage the power of `Optuna`, a widely used Python library for hyperparameter optimization. Here are the quick steps for getting started with Optuna: 1. Define the Objective Function, which represents the model training and evaluation process. It takes hyperparameters as inputs and returns a metric to optimize (e.g., accuracy in our case). Refer to `train_xgboost()` and `train_randomforest()` in `hpo.py`. 2. Specify the search space using the `Trial` object’s methods to define the hyperparameters and their corresponding value ranges or distributions. For example: ```python "max_depth": trial.suggest_int("max_depth", 4, 8), "max_features": trial.suggest_float("max_features", 0.1, 1.0), "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.1, log=True), "min_samples_split": trial.suggest_int("min_samples_split", 2, 1000, log=True), ``` 3. Create an Optuna study object to keep track of trials and their corresponding hyperparameter configurations and evaluation metrics. ```python study = optuna.create_study( sampler=RandomSampler(seed=args.seed), direction="maximize" ) ``` 4. Select an optimization algorithm to determine how Optuna explores and exploits the search space to find optimal configurations. For instance, the `RandomSampler` is an algorithm provided by the Optuna library that samples hyperparameter configurations randomly from the search space. 5. Run the optimization by calling Optuna’s `optimize()` function on the study object. You can specify the number of trials or number of parallel jobs to run.
```python study.optimize(lambda trial: train_xgboost( trial, dataset=dataset, client=client, mode=args.mode ), n_trials=100, n_jobs=1, ) ``` **Run HPO** Let’s try this out! The example file `hpo.py` included here implements the patterns described above. First make sure you have the correct CUDA Toolkit version by running `nvidia-smi`. See the RAPIDS installation docs ([link](https://docs.rapids.ai/install/#system-req)) for details on the supported range of GPUs and drivers. ```ipython3 !nvidia-smi ``` Executing benchmark tests can be an arduous and time-consuming procedure that may extend over multiple days. By using a tool like [tmux](https://www.redhat.com/sysadmin/introduction-tmux-linux), you can maintain active terminal sessions, ensuring that your tasks continue running even if the SSH connection is interrupted. ```default tmux ``` Run the following to run hyper-parameter optimization in a Docker container. If you don’t yet have that image locally, the first time this runs it might take a few minutes to pull it. After that, startup should be very fast. Here’s what the arguments in the command below are doing: * `--gpus all` = make all GPUs on the system available to processes in the container * `--env EXTRA_CONDA_PACKAGES` = install `optuna` and `optuna-integration` conda packages - *the image already comes with all of the RAPIDS libraries and their dependencies installed* * `-p 8787:8787` = forward between port 8787 on the host and 8787 on the container - *navigate to `{public IP of box}:8787` to see the Dask dashboard!* * `-v / -w` = mount the current directory from the host machine into the container - *this allows processes in the container to read the data you downloaded to the `./data` directory earlier* - *it also means that any changes made to these files from inside the container will be reflected back on the host* Redirecting output to a file called `xgboost_hpo_logs.txt` is helpful, as it preserves all the logs for later inspection. ```ipython3 !docker run \ --gpus all \ --env EXTRA_CONDA_PACKAGES="optuna optuna-integration" \ -p 8787:8787 \ -v $(pwd):/home/rapids/xgboost-hpo-example \ -w /home/rapids/xgboost-hpo-example \ -it rapidsai/base:25.12a-cuda12-py3.13 \ /bin/bash -c "python ./hpo.py --model-type 'XGBoost' --target 'gpu'" \ > ./xgboost_hpo_logs.txt 2>&1 ``` **Try Some Modifications** Now that you’ve run this example, try some modifications! For example: * use `--model-type "RandomForest"` to see how a random forest model compares to XGBoost * use `--target "cpu"` to estimate the speedup from GPU-accelerated training * modify the pipeline in `hpo.py` with other customizations # index.html.md # Scaling up Hyperparameter Optimization with Kubernetes and XGBoost GPU Algorithm *January, 2023* Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune. In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on a Kubernetes cluster. # Prerequisites Please follow the instructions in [Dask Operator: Installation](../../tools/kubernetes/dask-operator.md) to install the Dask operator on top of a GPU-enabled Kubernetes cluster. (For the purpose of this example, you may ignore other sections of the linked document.) ## Optional: Kubeflow Kubeflow gives you a nice notebook environment to run this notebook within the k8s cluster.
Install Kubeflow by following the instructions in [Installing Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/). You may choose any method; we tested this example after installing Kubeflow from manifests. # Install system packages We’ll need extra Python packages. In particular, we need an unreleased version of Optuna: ```ipython3 !pip install dask_kubernetes optuna optuna-integration ``` # Set up Dask cluster Let us set up a Dask cluster using the `KubeCluster` class. Fill in the following variables, depending on the configuration of your Kubernetes cluster. Here is how you can get `n_workers`, assuming that you are using all the nodes in the Kubernetes cluster. Let `N` be the number of nodes. * On AWS Elastic Kubernetes Service (EKS): `n_workers = N - 2` * On Google Cloud Kubernetes: `n_workers = N - 1` ```ipython3 # Choose the same RAPIDS image you used for launching the notebook session rapids_image = "rapidsai/base:25.12a-cuda12-py3.13" # Use the number of worker nodes in your Kubernetes cluster. n_workers = 4 ``` ```ipython3 from dask_kubernetes.operator import KubeCluster cluster = KubeCluster( name="rapids-dask", image=rapids_image, worker_command="dask-cuda-worker", n_workers=n_workers, resources={"limits": {"nvidia.com/gpu": "1"}}, env={"EXTRA_PIP_PACKAGES": "optuna"}, ) ``` ```myst-ansi Unclosed client session client_session: ``` ```ipython3 cluster ``` ```ipython3 from dask.distributed import Client client = Client(cluster) ``` # Perform hyperparameter optimization with a toy example Now we can run hyperparameter optimization. The workers will run multiple training jobs in parallel. ```ipython3 def objective(trial): x = trial.suggest_uniform("x", -10, 10) return (x - 2) ** 2 ``` ```ipython3 import optuna from dask.distributed import wait # Number of hyperparameter combinations to try in parallel n_trials = 100 # Optimize in parallel on your Dask cluster backend_storage = optuna.storages.InMemoryStorage() dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) study = optuna.create_study(direction="minimize", storage=dask_storage) futures = [] for i in range(0, n_trials, n_workers * 4): iter_range = (i, min([i + n_workers * 4, n_trials])) futures.append( { "range": iter_range, "futures": [ client.submit(study.optimize, objective, n_trials=1, pure=False) for _ in range(*iter_range) ], } ) for partition in futures: iter_range = partition["range"] print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}") _ = wait(partition["futures"]) ``` ```myst-ansi /tmp/ipykernel_75/1194069379.py:9: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future. dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) ``` ```myst-ansi Testing hyperparameter combinations 0..16 Testing hyperparameter combinations 16..32 Testing hyperparameter combinations 32..48 Testing hyperparameter combinations 48..64 Testing hyperparameter combinations 64..80 Testing hyperparameter combinations 80..96 Testing hyperparameter combinations 96..100 ``` ```ipython3 study.best_params ``` ```ipython3 study.best_value ``` # Perform hyperparameter optimization with XGBoost GPU algorithm Now let’s try optimizing hyperparameters for an XGBoost model.
```ipython3 import xgboost as xgb from optuna.samplers import RandomSampler from sklearn.datasets import load_breast_cancer from sklearn.model_selection import KFold, cross_val_score def objective(trial): X, y = load_breast_cancer(return_X_y=True) params = { "n_estimators": 10, "verbosity": 0, "tree_method": "gpu_hist", # L2 regularization weight. "lambda": trial.suggest_float("lambda", 1e-8, 100.0, log=True), # L1 regularization weight. "alpha": trial.suggest_float("alpha", 1e-8, 100.0, log=True), # sampling according to each tree. "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0), "max_depth": trial.suggest_int("max_depth", 2, 10, step=1), # minimum child weight, larger the term more conservative the tree. "min_child_weight": trial.suggest_float( "min_child_weight", 1e-8, 100, log=True ), "learning_rate": trial.suggest_float("learning_rate", 1e-8, 1.0, log=True), # defines how selective algorithm is. "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True), "grow_policy": "depthwise", "eval_metric": "logloss", } clf = xgb.XGBClassifier(**params) fold = KFold(n_splits=5, shuffle=True, random_state=0) score = cross_val_score(clf, X, y, cv=fold, scoring="neg_log_loss") return score.mean() ``` ```ipython3 # Number of hyperparameter combinations to try in parallel n_trials = 250 # Optimize in parallel on your Dask cluster backend_storage = optuna.storages.InMemoryStorage() dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) study = optuna.create_study( direction="maximize", sampler=RandomSampler(seed=0), storage=dask_storage ) futures = [] for i in range(0, n_trials, n_workers * 4): iter_range = (i, min([i + n_workers * 4, n_trials])) futures.append( { "range": iter_range, "futures": [ client.submit(study.optimize, objective, n_trials=1, pure=False) for _ in range(*iter_range) ], } ) for partition in futures: iter_range = partition["range"] print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}") _ = wait(partition["futures"]) ``` ```myst-ansi /tmp/ipykernel_75/1634478960.py:6: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future. dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) ``` ```myst-ansi Testing hyperparameter combinations 0..16 Testing hyperparameter combinations 16..32 Testing hyperparameter combinations 32..48 Testing hyperparameter combinations 48..64 Testing hyperparameter combinations 64..80 Testing hyperparameter combinations 80..96 Testing hyperparameter combinations 96..112 Testing hyperparameter combinations 112..128 Testing hyperparameter combinations 128..144 Testing hyperparameter combinations 144..160 Testing hyperparameter combinations 160..176 Testing hyperparameter combinations 176..192 Testing hyperparameter combinations 192..208 Testing hyperparameter combinations 208..224 Testing hyperparameter combinations 224..240 Testing hyperparameter combinations 240..250 ``` ```ipython3 study.best_params ``` ```ipython3 study.best_value ``` Let’s visualize the progress made by hyperparameter optimization. ```ipython3 from optuna.visualization.matplotlib import ( plot_optimization_history, plot_param_importances, ) ``` ```ipython3 plot_optimization_history(study) ``` ```myst-ansi /tmp/ipykernel_75/3324289224.py:1: ExperimentalWarning: plot_optimization_history is experimental (supported from v2.2.0). The interface can change in the future. 
plot_optimization_history(study) ``` ```ipython3 plot_param_importances(study) ``` ```myst-ansi /tmp/ipykernel_75/3836449081.py:1: ExperimentalWarning: plot_param_importances is experimental (supported from v2.2.0). The interface can change in the future. plot_param_importances(study) ``` # index.html.md # Getting Started with cuML’s accelerator mode (cuml.accel) in Snowflake Notebooks *July, 2025* cuML is a Python GPU library for accelerating machine learning models using a scikit-learn-like API. cuML now has an accelerator mode (cuml.accel) which allows you to bring accelerated computing to existing workflows with zero code changes required. In addition to scikit-learn, cuml.accel also provides acceleration to algorithms found in umap-learn (UMAP) and hdbscan (HDBSCAN). This notebook is a brief introduction to cuml.accel. # ⚠️ Verify your setup First, we’ll verify that we are running on an NVIDIA GPU: ```ipython3 !nvidia-smi # this should display information about available GPUs ``` With classical machine learning, there is a wide range of interesting problems we can explore. In this tutorial we’ll examine 3 of the more popular use cases: classification, clustering, and dimensionality reduction. # Classification Let’s load a dataset and see how we can use scikit-learn to classify that data. For this example we’ll use the Coverage Type dataset, which contains a number of features that can be used to predict forest cover type, such as elevation, aspect, slope, and soil-type. More information on this dataset can be found at https://archive.ics.uci.edu/dataset/31/covertype. ```ipython3 import pandas as pd from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score, classification_report from sklearn.model_selection import train_test_split ``` ```ipython3 url = ( "https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz" ) # Column names for the dataset (from UCI Covertype description) columns = [ "Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points", "Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4", "Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4", "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9", "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14", "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19", "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24", "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29", "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34", "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39", "Soil_Type40", "Cover_Type", ] data = pd.read_csv(url, header=None) data.columns = columns ``` ```ipython3 data.shape ``` Next, we’ll separate out the classification variable (Cover_Type) from the rest of the data. This is what we will aim to predict with our classification model. We can also split our dataset into training and test data using the scikit-learn train_test_split function. ```ipython3 X, y = data.drop("Cover_Type", axis=1), data["Cover_Type"] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) ``` Now that we have our dataset split, we’re ready to run a model. 
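Before fitting anything, it can be useful to take a quick look at how the cover type classes are distributed; this is just an optional sanity check:

```ipython3
# Rough class distribution of the target (optional sanity check)
y_train.value_counts(normalize=True).sort_index()
```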
To start, we will just run the model using the sklearn library with a starting max depth of 5 and all of the features. Note that we can set n_jobs=-1 to utilize all available CPU cores for fitting the trees – this will ensure we get the best performance possible on our system’s CPU. ```ipython3 import time ``` ```ipython3 # Start timing cpu start_time_cpu = time.time() clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1) clf.fit(X_train, y_train) # End timing end_time_cpu = time.time() ``` ```ipython3 # Report CPU duration print(f"CPU Training completed in {end_time_cpu - start_time_cpu:.2f} seconds") ``` In about 38 seconds, we were able to fit our tree model using scikit-learn. This is not bad! Let’s use the model we just trained to predict coverage types in our test dataset and take a look at the accuracy of our model. ```ipython3 y_pred = clf.predict(X_test) accuracy_score(y_test, y_pred) ``` We can also print out a full classification report to better understand how well we predicted the different Cover_Type categories. ```ipython3 print(classification_report(y_test, y_pred)) ``` With scikit-learn, we built a model that could be trained in just under a minute. From the accuracy report, we can see that we predicted the correct class around 70% of the time, which is not bad but could certainly be improved. Now let’s load cuml.accel and try running the same code again to see what kind of acceleration we can get. ```ipython3 import cuml.accel cuml.accel.install() ``` **IMPORTANT:** After installing cuml.accel, we need to import the scikit-learn estimators we wish to use again. ```ipython3 from sklearn.ensemble import RandomForestClassifier ``` ```ipython3 # Start timing gpu start_time_gpu = time.time() clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1) clf.fit(X_train, y_train) # End timing end_time_gpu = time.time() ``` ```ipython3 # Report GPU duration print(f"GPU Training completed in {end_time_gpu - start_time_gpu:.2f} seconds") ``` That was much faster! Using cuML, we’re able to train this random forest model in just 3.5 seconds, which is more than a 10X speedup. One thing to note is that cuML’s implementation of `RandomForestClassifier` doesn’t utilize the `n_jobs` parameter like scikit-learn does, but it is still accepted, which makes it easier to use this accelerator with zero code changes. Let’s take a look at the same accuracy score and classification report to compare the model’s performance. ```ipython3 y_pred = clf.predict(X_test) cr = classification_report(y_test, y_pred) print(cr) ``` Out of the box, the model performed about the same as the scikit-learn implementation. Because this model ran so much faster, we can quickly iterate on the hyperparameter configuration and find a model that performs better with excellent speedups. ```ipython3 # Start timing gpu max_depth 30 start_time_gpu_md30 = time.time() clf = RandomForestClassifier( n_estimators=100, max_depth=30, max_features=1.0, n_jobs=-1 ) clf.fit(X_train, y_train) # End timing end_time_gpu_md30 = time.time() # Report GPU duration print( f"GPU Training with max_depth=30 completed in {end_time_gpu_md30 - start_time_gpu_md30:.2f} seconds" ) ``` ```ipython3 y_pred = clf.predict(X_test) print(classification_report(y_test, y_pred)) ``` We just ran a model in a few seconds and got better accuracy.
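Because each fit now takes only a few seconds, a small automated parameter sweep becomes practical. Below is a minimal sketch using scikit-learn’s `GridSearchCV` with the accelerated estimator; the grid values are illustrative only, not tuned recommendations:

```ipython3
from sklearn.model_selection import GridSearchCV

# A deliberately small grid; each combination is cross-validated, so keep it modest
param_grid = {
    "max_depth": [10, 20, 30],
    "n_estimators": [50, 100],
}

grid = GridSearchCV(
    RandomForestClassifier(max_features=1.0, n_jobs=-1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")
```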
With a model that runs in just seconds, we can perform hyperparameter optimization using a method like the grid search sketched above, and have results in just minutes instead of hours. ## Resources For more information on getting started with `cuml.accel`, check out [RAPIDS.ai](https://rapids.ai/cuml-accel/) or the [cuML Docs](https://docs.rapids.ai/api/cuml/stable/). Find more examples of usage in this [cuml_sklearn_demo](https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/cuml_sklearn_colab_demo.ipynb). # index.html.md # Getting Started with Optuna and RAPIDS for HPO *March, 2023* Hyperparameter optimization (HPO) automates the process of picking values for the hyperparameters of a machine learning algorithm to improve model performance. This can help boost the model accuracy, but can be resource-intensive, as it may require training the model for hundreds of hyperparameter combinations. Let’s take a look at how we can use Optuna and RAPIDS to make HPO less time-consuming. ## RAPIDS The RAPIDS framework provides a suite of libraries to execute end-to-end data science pipelines entirely on GPUs. One of the libraries in this framework is cuML, which implements common machine learning models with a scikit-learn-compatible API and a GPU-accelerated backend. You can learn more about RAPIDS [here](https://rapids.ai/about.html). ## Optuna [Optuna](https://optuna.readthedocs.io/en/stable/) is a lightweight framework for automatic hyperparameter optimization. It provides a define-by-run API, which makes it easy to adapt to existing code and enables high modularity along with the flexibility to construct hyperparameter spaces dynamically. By simply wrapping the objective function with Optuna, we can perform a parallel, distributed HPO search over a search space. In this notebook, we’ll use the BNP Paribas Cardif Claims Management dataset from Kaggle to predict whether a claim will receive accelerated approval or not. We’ll explore how to use Optuna with RAPIDS in combination with Dask to run multi-GPU HPO experiments that can yield results faster than on CPU. ```ipython3 ## Run this cell to install optuna #!pip install optuna optuna-integration ``` ```ipython3 import cudf import optuna from cuml import LogisticRegression from cuml.metrics import log_loss from cuml.model_selection import train_test_split from dask.distributed import Client, wait from dask_cuda import LocalCUDACluster ``` ## Set up CUDA Cluster We start a local cluster and keep it ready for running distributed tasks with Dask. The Dask scheduler can help leverage multiple nodes available on the cluster. [LocalCUDACluster](https://github.com/rapidsai/dask-cuda) launches one Dask worker for each GPU in the current system. It’s developed as part of the RAPIDS project.
Learn More: - [Setting up Dask](https://docs.dask.org/en/latest/setup.html) - [Dask Client](https://distributed.dask.org/en/latest/client.html) ```ipython3 # This will use all GPUs on the local host by default cluster = LocalCUDACluster(threads_per_worker=1, ip="", dashboard_address="8081") c = Client(cluster) # Query the client for all connected workers workers = c.has_what().keys() n_workers = len(workers) c ``` # Loading the Data ## Data Acquisition The dataset can be acquired from Kaggle: [BNP Paribas Cardif Claims Management](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data). To download the dataset: 1. Follow the instructions here to [set up the Kaggle API](https://github.com/Kaggle/kaggle-api) 2. Run the following to download the data: ```shell mkdir -p ./data kaggle competitions download \ -c bnp-paribas-cardif-claims-management \ --path ./data unzip \ -d ./data \ ./data/bnp-paribas-cardif-claims-management.zip ``` This is an anonymized dataset containing categorical and numerical values for claims received by BNP Paribas Cardif. The “target” column in the train set is the variable to predict. It is equal to 1 for claims suitable for an accelerated approval. The task is to predict whether a claim will be suitable for accelerated approval or not. We’ll only use the `train.csv.zip` file, as `test.csv.zip` does not have a target column. ```ipython3 import os file_name = "train.csv.zip" data_dir = "data/" INPUT_FILE = os.path.join(data_dir, file_name) ``` Set `N_TRIALS` to the number of HPO trials to run. ```ipython3 N_TRIALS = 150 df = cudf.read_csv(INPUT_FILE) # Drop ID column df = df.drop("ID", axis=1) # Drop non-numerical data and fill NaNs before passing to the cuML model CAT_COLS = list(df.select_dtypes("object").columns) df = df.drop(CAT_COLS, axis=1) df = df.fillna(0) df = df.astype("float32") X, y = df.drop(["target"], axis=1), df["target"].astype("int32") study_name = "dask_optuna_lr_log_loss_tpe" ``` # Training and Evaluation The `train_and_eval` function accepts the different parameters to try out. This function should look very similar to any ML workflow. We’ll use this function within the Optuna `objective` function to show how easily we can fit an existing workflow into an Optuna workflow. ```ipython3 def train_and_eval( X_param, y_param, penalty="l2", C=1.0, l1_ratio=None, fit_intercept=True ): """ Splits the given data into train and validation sets to train and evaluate the model with the given parameters. Params ______ X_param: DataFrame. The data to use for training and testing. y_param: Series. The label for training penalty, C, l1_ratio, fit_intercept: The parameter values for Logistic Regression. Returns score: log loss of the fitted model """ X_train, X_valid, y_train, y_valid = train_test_split( X_param, y_param, random_state=42 ) classifier = LogisticRegression( penalty=penalty, C=C, l1_ratio=l1_ratio, fit_intercept=fit_intercept, max_iter=10000, ) classifier.fit(X_train, y_train) y_pred = classifier.predict(X_valid) score = log_loss(y_valid, y_pred) return score ``` For a baseline number, let’s see what the default performance of the model is.
```ipython3 print("Score with default parameters : ", train_and_eval(X, y)) ``` ```myst-ansi [W] [09:34:11.132560] L-BFGS line search failed (code 3); stopping at the last valid step Score with default parameters : 8.24908383066997 ``` ## Objective Function We will optimize the objective function using [Optuna Study](https://optuna.readthedocs.io/en/stable/reference/study.html). The objective function tries out specified values for the parameters that we are tuning and returns the score obtained with those parameters. These results will be aggregated in `study.trials_dataframes()`. Let’s define the objective function for this HPO task by making use of the `train_and_eval()`. You can see that we simply choose a value for the parameters and call the `train_and_eval` method, making Optuna very easy to use in an existing workflow. The objective function does not need to be changed when switching to different [samplers](https://optuna.readthedocs.io/en/stable/reference/samplers.html), which are built-in options in Optuna to enable the selection of different sampling algorithms that optuna provides. Some of the available ones include - GridSampler, RandomSampler, TPESampler, etc. We’ll use TPESampler for this demo, but feel free to try different samplers to notice the changes in performance. [Tree-Structured Parzen Estimators](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.TPESampler.html#optuna.samplers.TPESampler) or TPE works by fitting two Gaussian Mixture Model during each trial - one to the set of parameter values associated with the best objective values, and another to the remaining parameter values. It chooses the parameter value that maximizes the ratio between the two GMMs ```ipython3 def objective(trial, X_param, y_param): C = trial.suggest_float("C", 0.01, 100.0, log=True) penalty = trial.suggest_categorical("penalty", ["none", "l1", "l2"]) fit_intercept = trial.suggest_categorical("fit_intercept", [True, False]) score = train_and_eval( X_param, y_param, penalty=penalty, C=C, fit_intercept=fit_intercept ) return score ``` ## HPO Trials and Study Optuna uses [studies](https://optuna.readthedocs.io/en/stable/reference/study.html) and [trials](https://optuna.readthedocs.io/en/stable/reference/trial.html) to keep track of the HPO experiments. Put simply, a trial is a single call of the objective function while a set of trials make up a study. We will pick the best observed trial from a study to get the best parameters that were used in that run. Here, `DaskStorage` class is used to set up a storage shared by all workers in the cluster. Learn more about what storages can be used [here](https://optuna.readthedocs.io/en/stable/reference/storages.html) `optuna.create_study` is used to set up the study. As you can see, it specifies the study name, sampler to be used, the direction of the study, and the storage. With just a few lines of code, we have set up a distributed HPO experiment. 
```ipython3 storage = optuna.integration.DaskStorage() study = optuna.create_study( sampler=optuna.samplers.TPESampler(seed=142), study_name=study_name, direction="minimize", storage=storage, ) # Optimize in parallel on your Dask cluster # # Submit `n_workers` optimization tasks, where each task runs about 40 optimization trials # for a total of about N_TRIALS trials in all futures = [ c.submit( study.optimize, lambda trial: objective(trial, X, y), n_trials=N_TRIALS // n_workers, pure=False, ) for _ in range(n_workers) ] wait(futures) print(f"Best params: {study.best_params}") print("Number of finished trials: ", len(study.trials)) ``` You should see logs like the following. ```text [I 2024-08-06 09:41:40,161] Trial 1 finished with value: 8.238207899472073 and parameters: {'C': 40.573838784392514, 'penalty': 'l2', 'fit_intercept': True}. Best is trial 1 with value: 8.238207899472073. ... [I 2024-08-06 09:41:58,423] Trial 143 finished with value: 8.210414278942531 and parameters: {'C': 0.3152731188939818, 'penalty': 'l1', 'fit_intercept': True}. Best is trial 52 with value: 8.205579602300705. Best params: {'C': 1.486491072441749, 'penalty': 'l2', 'fit_intercept': True} Number of finished trials: 144 ``` ## Visualization Optuna provides an easy way to visualize the trials via built-in graphs. Read more about visualizations [here](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/005_visualization.html). ## Concluding Remarks This notebook shows how RAPIDS and Optuna can be used along with Dask to run multi-GPU HPO jobs, and it can be used as a starting point for anyone wanting to get started with the framework. We have seen how, by adding just a few lines of code, we were able to integrate the libraries for multi-GPU HPO runs. This can also be scaled to multiple nodes. ## Next Steps This was done on a small dataset; you are encouraged to test on larger data with wider parameter ranges too. Such experiments can yield further performance improvements. Refer to other examples in the [rapidsai/cloud-ml-examples](https://github.com/rapidsai/cloud-ml-examples) repository. ## Resources [Hyperparameter Tuning in Python](https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624) [Overview of Hyperparameter tuning](https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview) [How to make your model awesome with Optuna](https://towardsdatascience.com/how-to-make-your-model-awesome-with-optuna-b56d490368af) # index.html.md # Training XGBoost with Dask RAPIDS in Databricks *January, 2024* This notebook shows how to deploy a Dask RAPIDS workflow in Databricks. We will focus on the HIGGS dataset, a moderately sized classification problem from the [UCI Machine Learning repository](https://archive.ics.uci.edu/dataset/280/higgs). In the following sections, we will begin by loading the dataset from [Delta Lake](https://delta.io/) and performing preprocessing with [Dask](https://github.com/dask/dask). Then we will train an [XGBoost](https://xgboost.readthedocs.io/en/stable/) model with various configurations and explore techniques for optimizing inference. ## Launch multi-node Dask Cluster This workflow example can be run on GPU, and you don’t even need to have a GPU locally, since Databricks can provide one for you. Dask then enables users to easily distribute or scale up computation tasks within a single GPU or across multiple GPUs.
Dask recently introduced [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) (available via [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). With this CLI tool, the `dask databricks run --cuda` command will launch a Dask scheduler in the driver node and [`cuda` workers](https://docs.rapids.ai/api/dask-cuda/nightly) in the remaining nodes. From a high level, we could break down this section into the following steps: * Create a new [init script](https://docs.databricks.com/en/init-scripts/index.html) that installs [RAPIDS](https://rapids.ai/) and runs `dask-databricks` * Create a new multi-node cluster that uses the init script * Once the cluster is running, upload this notebook to Databricks and continue running these cells there ## Import packages Once your cluster has launched, start by importing all necessary libraries and dependencies. ```ipython3 import os import dask_cudf import dask_databricks import dask_deltatable as ddt import numpy as np import xgboost as xgb from dask_ml.model_selection import train_test_split from distributed import wait from xgboost import dask as dxgb ``` ## Connect to Dask Client Connect to the client (and optionally the dashboard) to submit tasks. ```ipython3 client = dask_databricks.get_client() client ``` ## Download dataset First we download the dataset to the Databricks File System (DBFS). Alternatively, you could also use cloud storage ([S3](https://aws.amazon.com/s3/), [Google Cloud](https://cloud.google.com/storage?hl=en), [Azure Data Lake](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)). Refer to the [docs](https://docs.databricks.com/en/storage/index.html#:~:text=Databricks%20uses%20cloud%20object%20storage,storage%20locations%20in%20your%20account.) for more information. ```ipython3 import subprocess # Define the directory and file paths directory_path = "/dbfs/databricks/rapids" file_path = f"{directory_path}/HIGGS.csv.gz" # Check if directory already exists if not os.path.exists(directory_path): os.makedirs(directory_path) # Check if the file already exists if not os.path.exists(file_path): # If not, download dataset to the directory data_url = ( "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz" ) download_command = f"curl {data_url} --output {file_path}" subprocess.run(download_command, shell=True) # decompress the csv file decompress_command = f"gunzip {file_path}" subprocess.run(decompress_command, shell=True) ``` Next we load the data into GPUs. Because the data is loaded multiple times during parameter tuning, we convert the original CSV file into Parquet format for better performance. This can be easily done using Delta Lake, as shown in the next steps. ## Integrating Dask and Delta Lake [**Delta Lake**](https://docs.databricks.com/en/delta/index.html) is an optimized storage layer within the Databricks lakehouse that provides a foundational platform for storing data and tables. This open-source software extends Parquet data files by incorporating a file-based transaction log to support [ACID transactions](https://docs.databricks.com/en/lakehouse/acid.html) and scalable metadata handling. Delta Lake is the default storage format for all operations on Databricks, i.e., unless otherwise specified, all tables on Databricks are Delta tables. Check out the [tutorial for examples of basic Delta Lake operations](https://docs.databricks.com/en/delta/tutorial.html).
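For orientation, here is a minimal sketch of a basic Delta table round trip in a Databricks notebook (the `spark` session is provided by Databricks; the table name `demo_delta_table` is just an illustration):

```ipython3
# Write a tiny Spark DataFrame as a table (Delta is the default format) and read it back
df_demo = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df_demo.write.mode("overwrite").saveAsTable("demo_delta_table")

spark.table("demo_delta_table").show()
spark.sql("DESCRIBE DETAIL demo_delta_table").show()
```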
Let’s explore step-by-step how we can leverage Delta Lake tables with Dask to accelerate data pre-processing with RAPIDS. ## Read from Delta table with Dask With Dask’s [**dask-deltatable**](https://github.com/dask-contrib/dask-deltatable/tree/main), we can write the `.csv` file into a Delta table using [**Spark**](https://spark.apache.org/docs/latest/), then read and parallelize it with [**Dask**](https://docs.dask.org/en/stable/). ```ipython3 delta_table_name = "higgs_delta_table" # Check if the Delta table already exists if spark.catalog.tableExists(delta_table_name): # If it exists, print a message print(f"The Delta table '{delta_table_name}' already exists.") else: # If not, load the CSV file into a Spark dataframe, then # write the Spark dataframe into a Delta table data = spark.read.csv(file_path, header=True, inferSchema=True) data.write.saveAsTable(delta_table_name) print(f"The Delta table '{delta_table_name}' has been created.") ``` ```myst-ansi The Delta table 'higgs_delta_table' already exists. ``` ```ipython3 display(spark.sql("DESCRIBE DETAIL higgs_delta_table")) ``` Calling `dask_deltatable.read_deltalake()` will return a Dask DataFrame. However, our objective is to utilize GPU acceleration for the entire ML pipeline, including data processing, model training and inference. For this reason, we will read the Dask DataFrame into a Dask cuDF DataFrame using `dask_cudf.from_dask_dataframe()`. **Note** that these operations will automatically leverage the Dask client we created, ensuring an optimal performance boost through parallelism with Dask. ```ipython3 # Read the Delta Lake into a Dask DataFrame using `dask-deltatable` df = ddt.read_deltalake("/dbfs/user/hive/warehouse/higgs_delta_table") # Convert Dask DataFrame to Dask cuDF for GPU acceleration ddf = dask_cudf.from_dask_dataframe(df) ddf.head() ``` ```ipython3 colnames = ["label"] + [f"feature-{i:02d}" for i in range(1, 29)] ddf.columns = colnames ddf.head() ``` ## Split data In the preceding step, we used [**`dask-cudf`**](https://docs.rapids.ai/api/dask-cudf/nightly/) to load data from the Delta table; now we use the `train_test_split()` function from [**`dask-ml`**](https://ml.dask.org/modules/api.html) to split up the dataset. Most of the time, the GPU backend of Dask works seamlessly with utilities in `dask-ml`, so we can accelerate the entire ML pipeline as follows: ```ipython3 def load_higgs( ddf, ) -> tuple[ dask_cudf.core.DataFrame, dask_cudf.core.Series, dask_cudf.core.DataFrame, dask_cudf.core.Series, ]: y = ddf["label"] X = ddf[ddf.columns.difference(["label"])] X_train, X_valid, y_train, y_valid = train_test_split( X, y, test_size=0.33, random_state=42 ) X_train, X_valid, y_train, y_valid = client.persist( [X_train, X_valid, y_train, y_valid] ) wait([X_train, X_valid, y_train, y_valid]) return X_train, X_valid, y_train, y_valid ``` ```ipython3 X_train, X_valid, y_train, y_valid = load_higgs(ddf) ``` ```myst-ansi /databricks/python/lib/python3.10/site-packages/dask_ml/model_selection/_split.py:462: FutureWarning: The default value for 'shuffle' must be specified when splitting DataFrames. In the future DataFrames will automatically be shuffled within blocks prior to splitting. Specify 'shuffle=True' to adopt the future behavior now, or 'shuffle=False' to retain the previous behavior. warnings.warn( ``` ```ipython3 X_train.head() ``` ```ipython3 y_train.head() ``` ```myst-ansi Out[14]: 0 1.0 1 1.0 3 1.0 10 0.0 11 1.0 Name: label, dtype: float64 ``` ## Model training There are two things to notice here.
Firstly, we specify the number of rounds to trigger early stopping for training. [XGBoost](https://xgboost.readthedocs.io/en/release_1.7.0/) will stop the training process once the validation metric fails to improve in consecutive X rounds, where **X** is the number of rounds specified for early stopping. Secondly, we use a data type called `DaskDeviceQuantileDMatrix` for training but `DaskDMatrix` for validation. `DaskDeviceQuantileDMatrix` is a drop-in replacement of `DaskDMatrix` for GPU-based training inputs that avoids extra data copies. ```ipython3 def fit_model_es(client, X, y, X_valid, y_valid) -> dxgb.Booster: early_stopping_rounds = 5 Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y) Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid) # train the model booster = dxgb.train( client, { "objective": "binary:logistic", "eval_metric": "error", "tree_method": "gpu_hist", }, Xy, evals=[(Xy_valid, "Valid")], num_boost_round=1000, early_stopping_rounds=early_stopping_rounds, )["booster"] return booster ``` ```ipython3 booster = fit_model_es(client, X=X_train, y=y_train, X_valid=X_valid, y_valid=y_valid) booster ``` ```myst-ansi /databricks/python/lib/python3.10/site-packages/xgboost/dask.py:703: FutureWarning: Please use `DaskQuantileDMatrix` instead. warnings.warn("Please use `DaskQuantileDMatrix` instead.", FutureWarning) ``` ```myst-ansi Out[16]: ``` ## Train with Customized objective and evaluation metric In the example below the XGBoost model is trained using a custom logistic regression-based objective function (`logit`) and a custom evaluation metric (`error`) along with early stopping. Note that the function returns both gradient and hessian, which XGBoost uses to optimize the model. Also, the parameter named `metric_name` needs to be specified in our callback. It is used to inform XGBoost that the custom error function should be used for evaluating early stopping criteria. ```ipython3 def fit_model_customized_objective(client, X, y, X_valid, y_valid) -> dxgb.Booster: def logit(predt: np.ndarray, Xy: xgb.DMatrix) -> tuple[np.ndarray, np.ndarray]: predt = 1.0 / (1.0 + np.exp(-predt)) labels = Xy.get_label() grad = predt - labels hess = predt * (1.0 - predt) return grad, hess def error(predt: np.ndarray, Xy: xgb.DMatrix) -> tuple[str, float]: label = Xy.get_label() r = np.zeros(predt.shape) predt = 1.0 / (1.0 + np.exp(-predt)) gt = predt > 0.5 r[gt] = 1 - label[gt] le = predt <= 0.5 r[le] = label[le] return "CustomErr", float(np.average(r)) # Use early stopping with custom objective and metric. early_stopping_rounds = 5 # Specify the metric we want to use for early stopping. es = xgb.callback.EarlyStopping( rounds=early_stopping_rounds, save_best=True, metric_name="CustomErr" ) Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y) Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid) booster = dxgb.train( client, {"eval_metric": "error", "tree_method": "gpu_hist"}, Xy, evals=[(Xy_valid, "Valid")], num_boost_round=1000, obj=logit, # pass the custom objective feval=error, # pass the custom metric callbacks=[es], )["booster"] return booster ``` ```ipython3 booster_custom = fit_model_customized_objective( client, X=X_train, y=y_train, X_valid=X_valid, y_valid=y_valid ) booster_custom ``` ```myst-ansi /databricks/python/lib/python3.10/site-packages/xgboost/dask.py:703: FutureWarning: Please use `DaskQuantileDMatrix` instead. 
warnings.warn("Please use `DaskQuantileDMatrix` instead.", FutureWarning) ``` ```myst-ansi Out[18]: ``` ## Running inference After some tuning, we arrive at the final model for performing inference on new data. ```ipython3 def predict(client, model, X): predt = dxgb.predict(client, model, X) return predt ``` ```ipython3 preds = predict(client, booster, X_train) preds.head() ``` ```myst-ansi Out[20]: 0 0.843650 1 0.975618 3 0.378462 10 0.293985 11 0.966303 Name: 0, dtype: float32 ``` ## Clean up When finished, be sure to destroy your cluster to avoid incurring extra costs for idle resources. **Note** If you forget to destroy the cluster manually, it’s important to note that Databricks clusters will automatically time out after a period (specified during cluster creation). ```ipython3 client.close() ``` # index.html.md # Scaling up Hyperparameter Optimization with Multi-GPU Workload on Kubernetes *June, 2024* Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune. In this notebook, we will speed up hyperparameter optimization by running multiple training jobs in parallel on a Kubernetes cluster. We handle larger data sets by splitting the data into multiple GPU devices. ## Prerequisites Please follow instructions in [Dask Operator: Installation](../../tools/kubernetes/dask-operator.md) to install the Dask operator on top of a GPU-enabled Kubernetes cluster. (For the purpose of this example, you may ignore other sections of the linked document. ### Optional: Kubeflow Kubeflow gives you a nice notebook environment to run this notebook within the k8s cluster. Install Kubeflow by following instructions in [Installing Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/). You may choose any method; we tested this example after installing Kubeflow from manifests. ## Install extra Python modules We’ll need a few extra Python modules. 
```ipython3 !pip install dask_kubernetes optuna ``` ```myst-ansi Collecting dask_kubernetes Downloading dask_kubernetes-2024.5.0-py3-none-any.whl.metadata (4.2 kB) Collecting optuna Downloading optuna-3.6.1-py3-none-any.whl.metadata (17 kB) ... Installing collected packages: wsproto, python-jsonpath, python-box, pyasn1, oauthlib, Mako, iso8601, greenlet, colorlog, asyncache, sqlalchemy, rsa, requests-oauthlib, pykube-ng, pyasn1-modules, cryptography, kubernetes-asyncio, kopf, httpx-ws, google-auth, alembic, optuna, kubernetes, kr8s, dask_kubernetes Successfully installed Mako-1.3.3 alembic-1.13.1 asyncache-0.3.1 colorlog-6.8.2 cryptography-42.0.7 dask_kubernetes-2024.5.0 google-auth-2.29.0 greenlet-3.0.3 httpx-ws-0.6.0 iso8601-2.1.0 kopf-1.37.2 kr8s-0.14.4 kubernetes-29.0.0
kubernetes-asyncio-29.0.0 oauthlib-3.2.2 optuna-3.6.1 pyasn1-0.6.0 pyasn1-modules-0.4.0 pykube-ng-23.6.0 python-box-7.1.1 python-jsonpath-1.1.1 requests-oauthlib-2.0.0 rsa-4.9 sqlalchemy-2.0.30 wsproto-1.2.0 ``` ## Import Python modules ```ipython3 import threading import warnings import cupy as cp import cuspatial import dask_cudf import optuna from cuml.dask.common import utils as dask_utils from dask.distributed import Client, wait from dask_kubernetes.operator import KubeCluster from dask_ml.metrics import mean_squared_error from dask_ml.model_selection import KFold from xgboost import dask as dxgb ``` ## Set up multiple Dask clusters To run multi-GPU training jobs in parallel, we will create multiple Dask clusters, each controlling its own share of GPUs. It’s best to think of each Dask cluster as a portion of the compute resource of the Kubernetes cluster. Fill in the following variables: ```ipython3 # Number of nodes in the Kubernetes cluster. # Each node is assumed to have a single NVIDIA GPU attached n_nodes = 7 # Number of worker nodes to be assigned to each Dask cluster n_worker_per_dask_cluster = 2 # Number of nodes to be assigned to each Dask cluster # 1 is added since the Dask cluster's scheduler process needs to be mapped to its own node n_node_per_dask_cluster = n_worker_per_dask_cluster + 1 # Number of Dask clusters to be created # Subtract 1 to account for the notebook Pod (it requires its own node) n_clusters = (n_nodes - 1) // n_node_per_dask_cluster print(f"{n_clusters=}") if n_clusters == 0: raise ValueError( "No cluster can be created. Reduce `n_worker_per_dask_cluster` or create more compute nodes" ) print(f"{n_worker_per_dask_cluster=}") print(f"{n_node_per_dask_cluster=}") n_node_active = n_clusters * n_node_per_dask_cluster + 1 if n_node_active != n_nodes: n_idle = n_nodes - n_node_active warnings.warn(f"{n_idle} node(s) will not be used", stacklevel=2) ``` ```myst-ansi n_clusters=2 n_worker_per_dask_cluster=2 n_node_per_dask_cluster=3 ``` Once we’ve determined the number of Dask clusters and their size, we are ready to launch them: ```ipython3 # Choose the same RAPIDS image you used for launching the notebook session rapids_image = "" ``` ```ipython3 clusters = [] for i in range(n_clusters): print(f"Launching cluster {i}...") clusters.append( KubeCluster( name=f"rapids-dask{i}", image=rapids_image, worker_command="dask-cuda-worker", n_workers=2, resources={"limits": {"nvidia.com/gpu": "1"}}, env={"EXTRA_PIP_PACKAGES": "optuna"}, ) ) ``` ```myst-ansi Launching cluster 0... ``` ```myst-ansi Launching cluster 1... ``` ## Set up Hyperparameter Optimization Task with NYC Taxi data Anaconda has graciously made some of the NYC Taxi dataset available in a public Google Cloud Storage bucket. We’ll use our Dask clusters to process it and train a model that predicts the fare amount.
```ipython3 col_dtype = { "VendorID": "int32", "tpep_pickup_datetime": "datetime64[ms]", "tpep_dropoff_datetime": "datetime64[ms]", "passenger_count": "int32", "trip_distance": "float32", "pickup_longitude": "float32", "pickup_latitude": "float32", "RatecodeID": "int32", "store_and_fwd_flag": "int32", "dropoff_longitude": "float32", "dropoff_latitude": "float32", "payment_type": "int32", "fare_amount": "float32", "extra": "float32", "mta_tax": "float32", "tip_amount": "float32", "total_amount": "float32", "tolls_amount": "float32", "improvement_surcharge": "float32", } must_haves = { "pickup_datetime": "datetime64[ms]", "dropoff_datetime": "datetime64[ms]", "passenger_count": "int32", "trip_distance": "float32", "pickup_longitude": "float32", "pickup_latitude": "float32", "rate_code": "int32", "dropoff_longitude": "float32", "dropoff_latitude": "float32", "fare_amount": "float32", } def compute_haversine_distance(df): pickup = cuspatial.GeoSeries.from_points_xy( df[["pickup_longitude", "pickup_latitude"]].interleave_columns() ) dropoff = cuspatial.GeoSeries.from_points_xy( df[["dropoff_longitude", "dropoff_latitude"]].interleave_columns() ) df["haversine_distance"] = cuspatial.haversine_distance(pickup, dropoff) df["haversine_distance"] = df["haversine_distance"].astype("float32") return df def clean(ddf, must_haves): # replace the extraneous spaces in column names and lower the font type tmp = {col: col.strip().lower() for col in list(ddf.columns)} ddf = ddf.rename(columns=tmp) ddf = ddf.rename( columns={ "tpep_pickup_datetime": "pickup_datetime", "tpep_dropoff_datetime": "dropoff_datetime", "ratecodeid": "rate_code", } ) ddf["pickup_datetime"] = ddf["pickup_datetime"].astype("datetime64[ms]") ddf["dropoff_datetime"] = ddf["dropoff_datetime"].astype("datetime64[ms]") for col in ddf.columns: if col not in must_haves: ddf = ddf.drop(columns=col) continue if ddf[col].dtype == "object": # Fixing error: could not convert arg to str ddf = ddf.drop(columns=col) else: # downcast from 64bit to 32bit types # Tesla T4 are faster on 32bit ops if "int" in str(ddf[col].dtype): ddf[col] = ddf[col].astype("int32") if "float" in str(ddf[col].dtype): ddf[col] = ddf[col].astype("float32") ddf[col] = ddf[col].fillna(-1) return ddf def prepare_data(client): taxi_df = dask_cudf.read_csv( "https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-02.csv", dtype=col_dtype, ) taxi_df = taxi_df.map_partitions(clean, must_haves, meta=must_haves) ## add features taxi_df["hour"] = taxi_df["pickup_datetime"].dt.hour.astype("int32") taxi_df["year"] = taxi_df["pickup_datetime"].dt.year.astype("int32") taxi_df["month"] = taxi_df["pickup_datetime"].dt.month.astype("int32") taxi_df["day"] = taxi_df["pickup_datetime"].dt.day.astype("int32") taxi_df["day_of_week"] = taxi_df["pickup_datetime"].dt.weekday.astype("int32") taxi_df["is_weekend"] = (taxi_df["day_of_week"] >= 5).astype("int32") # calculate the time difference between dropoff and pickup. 
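    # Note: casting the datetime64[ms] columns to integers yields millisecond counts,
    # so the subtraction below gives the trip duration in milliseconds; dividing by 1000
    # converts it to seconds.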
taxi_df["diff"] = taxi_df["dropoff_datetime"].astype("int32") - taxi_df[ "pickup_datetime" ].astype("int32") taxi_df["diff"] = (taxi_df["diff"] / 1000).astype("int32") taxi_df["pickup_latitude_r"] = taxi_df["pickup_latitude"] // 0.01 * 0.01 taxi_df["pickup_longitude_r"] = taxi_df["pickup_longitude"] // 0.01 * 0.01 taxi_df["dropoff_latitude_r"] = taxi_df["dropoff_latitude"] // 0.01 * 0.01 taxi_df["dropoff_longitude_r"] = taxi_df["dropoff_longitude"] // 0.01 * 0.01 taxi_df = taxi_df.drop("pickup_datetime", axis=1) taxi_df = taxi_df.drop("dropoff_datetime", axis=1) taxi_df = taxi_df.map_partitions(compute_haversine_distance) X = ( taxi_df.drop(["fare_amount"], axis=1) .astype("float32") .to_dask_array(lengths=True) ) y = taxi_df["fare_amount"].astype("float32").to_dask_array(lengths=True) X._meta = cp.asarray(X._meta) y._meta = cp.asarray(y._meta) X, y = dask_utils.persist_across_workers(client, [X, y]) return X, y def train_model(params): cluster = get_cluster(threading.get_ident()) default_params = { "objective": "reg:squarederror", "eval_metric": "rmse", "verbosity": 0, "tree_method": "hist", "device": "cuda", } params = dict(default_params, **params) with Client(cluster) as client: X, y = prepare_data(client) wait([X, y]) scores = [] kfold = KFold(n_splits=5, shuffle=False) for train_index, test_index in kfold.split(X, y): dtrain = dxgb.DaskQuantileDMatrix(client, X[train_index, :], y[train_index]) dtest = dxgb.DaskQuantileDMatrix(client, X[test_index, :], y[test_index]) model = dxgb.train( client, params, dtrain, num_boost_round=10, verbose_eval=False, ) y_test_pred = dxgb.predict(client, model, dtest).to_backend("cupy") rmse_score = mean_squared_error(y[test_index], y_test_pred, squared=False) scores.append(rmse_score) return sum(scores) / len(scores) def objective(trial): params = { "n_estimators": trial.suggest_int("n_estimators", 2, 4), "learning_rate": trial.suggest_float("learning_rate", 0.5, 0.7), "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1), "colsample_bynode": trial.suggest_float("colsample_bynode", 0.5, 1), "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1), "reg_lambda": trial.suggest_float("reg_lambda", 0, 1), "max_depth": trial.suggest_int("max_depth", 1, 6), "max_leaves": trial.suggest_int("max_leaves", 0, 2), "max_cat_to_onehot": trial.suggest_int("max_cat_to_onehot", 1, 10), } return train_model(params) ``` To kick off multiple training jobs in parallel, we will launch multiple threads, so that each thread controls a Dask cluster. One important utility function is `get_cluster`, which returns the Dask cluster that’s mapped to a given thread. ```ipython3 # Map each thread's integer ID to a sequential number (0, 1, 2 ...) thread_id_map: dict[int, KubeCluster] = {} thread_id_map_lock = threading.Lock() def get_cluster(thread_id: int) -> KubeCluster: with thread_id_map_lock: try: return clusters[thread_id_map[thread_id]] except KeyError: seq_id = len(thread_id_map) thread_id_map[thread_id] = seq_id return clusters[seq_id] ``` Now we are ready to start hyperparameter optimization. ```ipython3 n_trials = ( 10 # set to a low number so that the demo finishes quickly. 
Feel free to adjust ) study = optuna.create_study(direction="minimize") ``` ```myst-ansi [I 2024-05-09 07:53:00,718] A new study created in memory with name: no-name-da830427-bce3-4e42-98e6-c98c0c3da0d7 ``` ```ipython3 # With n_jobs parameter, Optuna will launch [n_clusters] threads internally # Each thread will deploy a training job to a Dask cluster study.optimize(objective, n_trials=n_trials, n_jobs=n_clusters) ``` ```myst-ansi [I 2024-05-09 07:54:10,229] Trial 1 finished with value: 59.449462890625 and parameters: {'n_estimators': 4, 'learning_rate': 0.6399993857892183, 'colsample_bytree': 0.7020623988319513, 'colsample_bynode': 0.777468318546648, 'colsample_bylevel': 0.7890749134903386, 'reg_lambda': 0.4464953694744921, 'max_depth': 3, 'max_leaves': 0, 'max_cat_to_onehot': 9}. Best is trial 1 with value: 59.449462890625. [I 2024-05-09 07:54:19,507] Trial 0 finished with value: 57.77985763549805 and parameters: {'n_estimators': 4, 'learning_rate': 0.674087333032356, 'colsample_bytree': 0.557642421113256, 'colsample_bynode': 0.9719449711676733, 'colsample_bylevel': 0.6984302171973646, 'reg_lambda': 0.7201514298169174, 'max_depth': 4, 'max_leaves': 1, 'max_cat_to_onehot': 4}. Best is trial 0 with value: 57.77985763549805. [I 2024-05-09 07:54:59,524] Trial 2 finished with value: 57.77985763549805 and parameters: {'n_estimators': 2, 'learning_rate': 0.6894880267544121, 'colsample_bytree': 0.8171662437182604, 'colsample_bynode': 0.549527686217645, 'colsample_bylevel': 0.890212178266078, 'reg_lambda': 0.5847298606135033, 'max_depth': 2, 'max_leaves': 1, 'max_cat_to_onehot': 5}. Best is trial 0 with value: 57.77985763549805. [I 2024-05-09 07:55:22,013] Trial 3 finished with value: 55.01234817504883 and parameters: {'n_estimators': 4, 'learning_rate': 0.6597614733926671, 'colsample_bytree': 0.8437061126308156, 'colsample_bynode': 0.621479934699203, 'colsample_bylevel': 0.8330951489228277, 'reg_lambda': 0.7830102753448884, 'max_depth': 2, 'max_leaves': 2, 'max_cat_to_onehot': 2}. Best is trial 3 with value: 55.01234817504883. [I 2024-05-09 07:56:00,678] Trial 4 finished with value: 57.77985763549805 and parameters: {'n_estimators': 4, 'learning_rate': 0.5994587326401378, 'colsample_bytree': 0.9799078215504886, 'colsample_bynode': 0.9766955839079614, 'colsample_bylevel': 0.5088864363378924, 'reg_lambda': 0.18103184809548734, 'max_depth': 3, 'max_leaves': 1, 'max_cat_to_onehot': 4}. Best is trial 3 with value: 55.01234817504883. [I 2024-05-09 07:56:11,773] Trial 5 finished with value: 54.936126708984375 and parameters: {'n_estimators': 2, 'learning_rate': 0.5208827661289628, 'colsample_bytree': 0.866258912492528, 'colsample_bynode': 0.6368815844513638, 'colsample_bylevel': 0.9539603435186208, 'reg_lambda': 0.21390618865079458, 'max_depth': 4, 'max_leaves': 2, 'max_cat_to_onehot': 4}. Best is trial 5 with value: 54.936126708984375. [I 2024-05-09 07:56:48,737] Trial 6 finished with value: 57.77985763549805 and parameters: {'n_estimators': 2, 'learning_rate': 0.6137888371528442, 'colsample_bytree': 0.9621063205689744, 'colsample_bynode': 0.5306812468481084, 'colsample_bylevel': 0.8527827651989199, 'reg_lambda': 0.3315799968401767, 'max_depth': 6, 'max_leaves': 1, 'max_cat_to_onehot': 9}. Best is trial 5 with value: 54.936126708984375. 
[I 2024-05-09 07:56:59,261] Trial 7 finished with value: 55.204200744628906 and parameters: {'n_estimators': 3, 'learning_rate': 0.6831416027240611, 'colsample_bytree': 0.5311840770388268, 'colsample_bynode': 0.9572535535110238, 'colsample_bylevel': 0.6846894032354778, 'reg_lambda': 0.6091211134408249, 'max_depth': 3, 'max_leaves': 2, 'max_cat_to_onehot': 5}. Best is trial 5 with value: 54.936126708984375. [I 2024-05-09 07:57:37,674] Trial 8 finished with value: 54.93584442138672 and parameters: {'n_estimators': 4, 'learning_rate': 0.620742285616388, 'colsample_bytree': 0.7969398985157778, 'colsample_bynode': 0.9049707375663323, 'colsample_bylevel': 0.7209693969245297, 'reg_lambda': 0.6158847054585023, 'max_depth': 1, 'max_leaves': 0, 'max_cat_to_onehot': 10}. Best is trial 8 with value: 54.93584442138672. [I 2024-05-09 07:57:50,310] Trial 9 finished with value: 57.76123809814453 and parameters: {'n_estimators': 3, 'learning_rate': 0.5475197727057007, 'colsample_bytree': 0.5381502848057452, 'colsample_bynode': 0.8514705732161596, 'colsample_bylevel': 0.9139277684007088, 'reg_lambda': 0.5117732009332318, 'max_depth': 4, 'max_leaves': 0, 'max_cat_to_onehot': 5}. Best is trial 8 with value: 54.93584442138672. ``` # index.html.md # Multi-Node Multi-GPU XGBoost Example on Azure using dask-cloudprovider *November, 2023* [Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud integration library for Dask. It helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use this package to set-up an Azure cluster and run a multi-node multi-GPU (MNMG) example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries to accelerate data science pipelines on the GPU entirely. This can be scaled to multiple nodes using Dask as we will see in this notebook. For the purposes of this demo, we will use a part of the NYC Taxi Dataset (only the files of 2014 calendar year will be used here). The goal is to predict the fare amount for a given trip given the times and coordinates of the taxi trip. We will download the data from [Azure Open Datasets](https://docs.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets), where the dataset is publicly hosted by Microsoft. #### NOTE In this notebook, we will explore two possible ways to use `dask-cloudprovider` to run our workloads on Azure VM clusters: 1. [Option 1](#use-an-azure-marketplace-vm-image): Using an [Azure Marketplace image](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nvidia.ngc_azure_17_11?tab=overview) made available for free from NVIDIA. The RAPIDS container will be subsequently downloaded once the VMs start up. 2. [Option 2](#set-up-an-azure-customized-vm): Using [`packer`](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/build-image-with-packer) to create a custom VM image to be used in the cluster. This image will include the RAPIDS container, and having the container already inside the image should speed up the process of provisioning the cluster. **You can either use Option 1 or use Option 2** ## Step 0: Set up Azure credentials and CLI Before running the notebook, run the following commands in the terminal to setup Azure CLI ```default curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash az login ``` Then, follow the instructions on the prompt to finish setting up the account. If you are running the notebook from inside a Docker container, you can remove `sudo`. 
```ipython3 !az login ``` ```myst-ansi A web browser has been opened at https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize. Please continue the login in the web browser. If no web browser is available or if the web browser fails to open, use device code flow with `az login --use-device-code`. [ { "cloudName": "AzureCloud", "homeTenantId": "43083d15-7273-40c1-b7db-39efd9ccc17a", "id": "fc4f4a6b-4041-4b1c-8249-854d68edcf62", "isDefault": true, "managedByTenants": [ { "tenantId": "2f4a9838-26b7-47ee-be60-ccc1fdec5953" } ], "name": "NV-AI-Infra", "state": "Enabled", "tenantId": "43083d15-7273-40c1-b7db-39efd9ccc17a", "user": { "name": "skirui@nvidia.com", "type": "user" } } ] ``` ## Step 1: Import necessary packages ```ipython3 # # Uncomment the following and install some libraries at the beginning. # If adlfs is not present, install adlfs to read from Azure data lake. ! pip install adlfs ! pip install "dask-cloudprovider[azure]" --upgrade ``` ```ipython3 import json from timeit import default_timer as timer import dask import dask_cudf import numpy as np import xgboost as xgb from cuml.metrics import mean_squared_error from dask.distributed import Client, wait from dask_cloudprovider.azure import AzureVMCluster from dask_ml.model_selection import train_test_split ``` ## Step 2: Set up the Azure VM Cluster We will now set up a Dask cluster on Azure Virtual machines using `AzureVMCluster` from Dask Cloud Provider following these [instructions](https://docs.rapids.ai/deployment/stable/cloud/azure/azure-vm-multi/). To do this, you will first need to set up a Resource Group, a Virtual Network and a Security Group on Azure. [Learn more about how you can set this up](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups). Note that you can also set it up using the Azure portal. Once you have set it up, you can now plug in the names of the entities you have created in the cell below. We need to pass in the docker argument `docker_args = '--shm-size=256m'` to allow larger shared memory for successfully running multiple docker containers in the same VM. This is the case when each VM has more than one worker. Even if you don’t have such a case, there is no harm in having a larger shared memory. Finally, note that we use the RAPIDS docker image to build the VM and use the `dask_cuda.CUDAWorker` to run within the VM. This will run the worker docker image with GPU capabilities instead of CPU. ```ipython3 location = "West US 2" resource_group = "rapidsai-deployment" vnet = "rapidsai-deployment-vnet" security_group = "rapidsaiclouddeploymenttest-nsg" vm_size = "Standard_NC12s_v3" # or choose a different GPU enabled VM type docker_image = "rapidsai/base:25.12a-cuda12-py3.13" docker_args = "--shm-size=256m" worker_class = "dask_cuda.CUDAWorker" ``` ### Option 1: Use an Azure Marketplace VM image In this method, we can use an Azure marketplace VM provided by NVIDIA for free. These VM images contain all the necessary dependencies and NVIDIA drivers preinstalled. These images are made available by NVIDIA as an out-of-the-box solution to decrease the cluster setup time for data scientists. Fortunately for us, `dask-cloudprovider` has made it simple to pass in information of a marketplace VM, and it will use the selected VM image instead of a vanilla image. We will use the following image: [NVIDIA GPU-Optimized Image for AI and HPC](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nvidia.ngc_azure_17_11?tab=overview). 
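Before going further, it is worth confirming which `dask-cloudprovider` version is installed, since marketplace-image support depends on it (see the note below). A minimal check, assuming the package was installed with pip as in Step 1:

```ipython3
# Confirm the installed dask-cloudprovider version (see the version note below).
from importlib.metadata import version

print(version("dask-cloudprovider"))
```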
#### NOTE Please make sure you have [dask-cloudprovider](https://cloudprovider.dask.org/en/latest/) version 2021.6.0 or above. Marketplace VMs in Azure is not supported in older versions. #### Set up Marketplace VM information and clear default dask config. ```ipython3 dask.config.set( { "logging.distributed": "info", "cloudprovider.azure.azurevm.marketplace_plan": { "publisher": "nvidia", "name": "ngc-base-version-23_03_0", "product": "ngc_azure_17_11", "version": "23.03.0", }, } ) vm_image = "" config = dask.config.get("cloudprovider.azure.azurevm", {}) config ``` If necessary, you must uncomment and accept the Azure Marketplace image terms so that the image can be used to create VMs ```ipython3 ! az vm image terms accept --urn "nvidia:ngc_azure_17_11:ngc-base-version-23_03_0:23.03.0" --verbose ``` ```myst-ansi { "accepted": true, "id": "/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/providers/Microsoft.MarketplaceOrdering/offerTypes/Microsoft.MarketplaceOrdering/offertypes/publishers/nvidia/offers/ngc_azure_17_11/plans/ngc-base-version-23_03_0/agreements/current", "licenseTextLink": "https://mpcprodsa.blob.core.windows.net/legalterms/3E5ED_legalterms_NVIDIA%253a24NGC%253a5FAZURE%253a5F17%253a5F11%253a24NGC%253a2DBASE%253a2DVERSION%253a2D23%253a5F03%253a5F0%253a24KJVKRIWKTRQ3CIEPNL6YTG4AVORBHHPZCDQDVWX7JPPDEF6UM7R4XO76VDRHXCNTQYATKLGYYW3KA7DSIKTYXBZ3HJ2FMWYCINEY4WQ.txt", "marketplaceTermsLink": "https://mpcprodsa.blob.core.windows.net/marketplaceterms/3EDEF_marketplaceterms_VIRTUALMACHINE%253a24AAK2OAIZEAWW5H4MSP5KSTVB6NDKKRTUBAU23BRFTWN4YC2MQLJUB5ZEYUOUJBVF3YK34CIVPZL2HWYASPGDUY5O2FWEGRBYOXWZE5Y.txt", "name": "ngc-base-version-23_03_0", "plan": "ngc-base-version-23_03_0", "privacyPolicyLink": "https://www.nvidia.com/en-us/about-nvidia/privacy-policy/", "product": "ngc_azure_17_11", "publisher": "nvidia", "retrieveDatetime": "2023-10-02T08:17:40.3203275Z", "signature": "SWCKS7PPTL3XIBGBE2IZCMF43KBRDLSIZ7XLXXTLI6SXDCPCXY53BAISH6DNIELVV63GPZ44AOMMMZ6RV2AL5ARNM6XWHXRJ4HDNTJI", "systemData": { "createdAt": "2023-10-02T08:17:43.219827+00:00", "createdBy": "fc4f4a6b-4041-4b1c-8249-854d68edcf62", "createdByType": "ManagedIdentity", "lastModifiedAt": "2023-10-02T08:17:43.219827+00:00", "lastModifiedBy": "fc4f4a6b-4041-4b1c-8249-854d68edcf62", "lastModifiedByType": "ManagedIdentity" }, "type": "Microsoft.MarketplaceOrdering/offertypes" } Command ran in 7.879 seconds (init: 0.159, invoke: 7.720) ``` Now that you have set up the necessary configurations to use the NVIDIA VM image, directly move to [Step 2.1](#start-the-vm-cluster-in-azure) to start the AzureVMCluster. ### Option 2: Set up an Azure Customized VM If you already have a customized VM and you know its resource id, jump to [Step f. of Option 2](#set-up-customized-vm-information-and-clear-default-dask-config) In general, if we use a generic image to create a cluster, we would have to wait till the new VMs are provisioned fully with all dependencies. The provisioning step does several things such as set the VM up with required libraries, set up Docker, install the NVIDIA drivers and also pull and decompress the RAPIDS container etc. This usually takes around 10-15 minutes of time depending on the cloud provider. If the user wants to fire up a cluster quickly, setting up a VM from a generic image every time may not be optimal. Further, as detailed in Option 1, we can also choose to use a custom Marketplace VM from NVIDIA. However, we will still have to download and decompress the RAPIDS container. 
So the setup time to start the workers and the scheduler would still be around 8-10 minutes. Luckily we can improve on this. We can make our own customized VM bundled with all the necessary packages, drivers, containers and dependencies. This way, firing up the cluster using the customized VM will take minimal time. In this example, we will be using a tool called [packer](https://www.packer.io/) to create our customized virtual machine image. Packer automates the process of building and customizing VMs across all major cloud providers. Now, to create a customized VM image, follow steps *a.* to *f.* #### a. Install `packer` Follow the [getting started guide](https://learn.hashicorp.com/tutorials/packer/get-started-install-cli?in=packer/azure-get-started) to download the necessary binary according to your platform and install it. #### b. Authenticate `packer` with Azure There are several ways to authenticate `packer` to work with Azure (details provided [here](https://learn.hashicorp.com/tutorials/packer/get-started-install-cli?in=packer/azure-get-started)). However, since we already have installed Azure cli (`az`) at the beginning of the notebook, authenticating `packer` with `az` cli is the easiest option. We will let `packer` use the Azure credentials from `az` cli, and so, you do not have to do anything further in this step. #### c. Generate the cloud init script for customizing the VM image `packer` can use a [cloud-init](https://cloudinit.readthedocs.io/en/latest/) script to initialize a VM. The cloud init script contains the set of commands that will set up the environment of our customized VM. We will pass this as an external file to the `packer` command via a configuration script. The cloud init file [cloud_init.yaml.j2](./configs/cloud_init.yaml.j2) file is present in the `configs` folder. In case you want to add/modify any configuration, edit the [cloud_init.yaml.j2](./configs/cloud_init.yaml.j2) before proceeding to the next steps. #### d. Write packer configuration to a configuration file We now need to provide `packer` with a build file with platform related and cloud-init configurations. `packer` will use this to create the customized VM. In this example, we are creating a single custom VM image that will be accessible by the user only. We will use a Ubuntu Server 18.04 base image and customize it. Later on, we will instantiate all our VMs from this customized VM image. If you are curious about what else you can configure, take a look at all the available [Azure build parameters for `packer`](https://www.packer.io/docs/builders/azure/arm). #### NOTE Our resource group already exists in this example. Hence we simply pass in our resource group name in the required parameters `managed_image_resource_group_name` and `build_resource_group_name`. ```ipython3 custom_vm_image_name = "FILL-THIS-IN" packer_config = { "builders": [ { "type": "azure-arm", "use_azure_cli_auth": True, "managed_image_resource_group_name": resource_group, "managed_image_name": custom_vm_image_name, "custom_data_file": "./configs/cloud_init.yaml.j2", "os_type": "Linux", "image_publisher": "Canonical", "image_offer": "UbuntuServer", "image_sku": "18.04-LTS", "azure_tags": { "dept": "RAPIDS-CSP", "task": "RAPIDS Custom Image deployment", }, "build_resource_group_name": resource_group, "vm_size": vm_size, } ], "provisioners": [ { "inline": [ ( "echo 'Waiting for cloud-init'; " "while [ ! 
-f /var/lib/cloud/instance/boot-finished ]; " "do sleep 1; done; echo 'Done'" ) ], "type": "shell", } ], } with open("packer_config.json", "w") as fh: fh.write(json.dumps(packer_config)) ``` #### e. Run `packer` build and create the image ```ipython3 # # Uncomment the following line and run to create the custom image # ! packer build packer_config.json ``` This will take around 15 minutes. Grab a coffee or watch an episode of your favourite tv show and come back. But remember, you will only have to do this once, unless you want to update the packages in the VM. This means that you can make this custom image once, and then keep on using it for hundreds of times. While packer is building the image, you will see an output similar to what is shown below. ```console $ packer build packer_config.json azure-arm: output will be in this color. ==> azure-arm: Running builder ... ==> azure-arm: Getting tokens using Azure CLI ==> azure-arm: Getting tokens using Azure CLI azure-arm: Creating Azure Resource Manager (ARM) client ... ==> azure-arm: Using existing resource group ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> Location : ==> azure-arm: Validating deployment template ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> DeploymentName : 'pkrdp04rrahxkg9' ==> azure-arm: Deploying deployment template ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> DeploymentName : 'pkrdp04rrahxkg9' ==> azure-arm: ==> azure-arm: Getting the VM's IP address ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> PublicIPAddressName : 'pkrip04rrahxkg9' ==> azure-arm: -> NicName : 'pkrni04rrahxkg9' ==> azure-arm: -> Network Connection : 'PublicEndpoint' ==> azure-arm: -> IP Address : '40.77.62.118' ==> azure-arm: Waiting for SSH to become available... ==> azure-arm: Connected to SSH! ==> azure-arm: Provisioning with shell script: /tmp/packer-shell614221056 azure-arm: Waiting for cloud-init azure-arm: Done ==> azure-arm: Querying the machine's properties ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> ComputeName : 'pkrvm04rrahxkg9' ==> azure-arm: -> Managed OS Disk : '/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pkros04rrahxkg9' ==> azure-arm: Querying the machine's additional disks properties ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> ComputeName : 'pkrvm04rrahxkg9' ==> azure-arm: Powering off machine ... ==> azure-arm: -> ResourceGroupName : ==> azure-arm: -> ComputeName : 'pkrvm04rrahxkg9' ==> azure-arm: Capturing image ... ==> azure-arm: -> Compute ResourceGroupName : ==> azure-arm: -> Compute Name : 'pkrvm04rrahxkg9' ==> azure-arm: -> Compute Location : ==> azure-arm: -> Image ResourceGroupName : ==> azure-arm: -> Image Name : ==> azure-arm: -> Image Location : ==> azure-arm: ==> azure-arm: Deleting individual resources ... ==> azure-arm: Adding to deletion queue -> Microsoft.Compute/virtualMachines : 'pkrvm04rrahxkg9' ==> azure-arm: Adding to deletion queue -> Microsoft.Network/networkInterfaces : 'pkrni04rrahxkg9' ==> azure-arm: Adding to deletion queue -> Microsoft.Network/publicIPAddresses : 'pkrip04rrahxkg9' ==> azure-arm: Adding to deletion queue -> Microsoft.Network/virtualNetworks : 'pkrvn04rrahxkg9' ==> azure-arm: Attempting deletion -> Microsoft.Network/networkInterfaces : 'pkrni04rrahxkg9' ==> azure-arm: Waiting for deletion of all resources... 
==> azure-arm: Attempting deletion -> Microsoft.Network/publicIPAddresses : 'pkrip04rrahxkg9' ==> azure-arm: Attempting deletion -> Microsoft.Compute/virtualMachines : 'pkrvm04rrahxkg9' ==> azure-arm: Attempting deletion -> Microsoft.Network/virtualNetworks : 'pkrvn04rrahxkg9' . . . . . . . . ==> azure-arm: Deleting -> Microsoft.Compute/disks : '/subscriptions//resourceGroups//providers/Microsoft.Compute/disks/pkros04rrahxkg9' ==> azure-arm: Removing the created Deployment object: 'pkrdp04rrahxkg9' ==> azure-arm: ==> azure-arm: The resource group was not created by Packer, not deleting ... Build 'azure-arm' finished after 16 minutes 22 seconds. ==> Wait completed after 16 minutes 22 seconds ==> Builds finished. The artifacts of successful builds are: --> azure-arm: Azure.ResourceManagement.VMImage: OSType: Linux ManagedImageResourceGroupName: ManagedImageName: ManagedImageId: /subscriptions//resourceGroups//providers/Microsoft.Compute/images/ ManagedImageLocation: ``` --- When `packer` finishes, at the bottom of the output, you will see something similar to the following: ```default ManagedImageResourceGroupName: ManagedImageName: ManagedImageId: /subscriptions//resourceGroups//providers/Microsoft.Compute/images/ ManagedImageLocation: ``` Make note of the `ManagedImageId`. This is the resource id of the custom image we will use. As shown above the `ManagedImageId` will look something like : `/subscriptions/12345/resourceGroups/myown-rg/providers/Microsoft.Compute/images/myCustomImage` #### f. Set up customized VM information and clear default dask config Once you have the custom VM resource id, you should reset the default VM image information in `dask.config`. The default image value loaded in `dask.config` is that of a basic Ubuntu Server 18.04 LTS (the one that you already customized). If you do not reset it, `dask` will try to use that image instead of your custom made one. ```ipython3 # fill this in with the value from above # or the customized VM id if you already have resource id of the customized VM from a previous run. ManagedImageId = "FILL-THIS-IN" ``` ```ipython3 dask.config.set({"cloudprovider.azure.azurevm.vm_image": {}}) config = dask.config.get("cloudprovider.azure.azurevm", {}) print(config) vm_image = {"id": ManagedImageId} print(vm_image) ``` ### Step 2.1: Start the VM Cluster in Azure Here, if you have used Option 1, i.e., the NVIDIA VM image, pass an empty string for `vm_image` information. For Option 2, pass the `vm_image` information that you got from the output of `packer` run as a parameter to `AzureVMCluster`. Also turn off the bootstrapping of the VM by passing `bootstrap=False`. This will turn off installation of the dependencies in the VM while instantiating, since we already have them on our custom VM in either cases. #### NOTE The rest of the notebook should be the same irrespective of whether you chose Option 1 or Option 2. #### NOTE The number of actual workers that our cluster would have is not always equal to the number of VMs spawned i.e. the value of $n\_workers$ passed in. If the number of GPUs in the chosen `vm_size` is $G$ and number of VMs spawned is $n\_workers$, then we have then number of actual workers $W = n\_workers \times G$. For example, for `Standard_NC12s_v3` VMs that have 2 V100 GPUs per VM, for $n\_workers=2$, we have $W = 2 \times 2=4$. 
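As a quick sanity check of this formula, once the cluster and client have been created in the next cells you can compare the reported worker count against $n\_workers \times G$. A minimal sketch (it uses the `client` object defined below, so run it only after `client.wait_for_workers(...)` has returned):

```ipython3
# Expect n_workers x GPUs-per-VM workers here, e.g. 2 x 2 = 4 for two
# Standard_NC12s_v3 VMs. Run this after the client below is connected.
print(len(client.scheduler_info()["workers"]))
```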
```ipython3 %%time cluster = AzureVMCluster( location=location, resource_group=resource_group, vnet=vnet, security_group=security_group, vm_image=vm_image, vm_size=vm_size, disk_size=200, docker_image=docker_image, worker_class=worker_class, n_workers=2, security=True, docker_args=docker_args, debug=False, bootstrap=False, # This is to prevent the cloud init jinja2 script from running in the custom VM. ) ``` ```ipython3 client = Client(cluster) client ``` ```ipython3 %%time client.wait_for_workers(2) ``` ```myst-ansi CPU times: user 0 ns, sys: 6.1 ms, total: 6.1 ms Wall time: 29 ms ``` ```ipython3 # Uncomment if you only have the scheduler with n_workers=0 and want to scale the workers separately. # %%time # client.cluster.scale(n_workers) ``` Wait till all the workers are up. This will wait for `n_workers` number of VMs to be up. Before we start the training process, let us take a quick look at the details of the GPUs in the worker pods that we will be using. ```ipython3 import pprint pp = pprint.PrettyPrinter() pp.pprint( client.scheduler_info() ) # will show some information of the GPUs of the workers ``` ```myst-ansi {'address': 'tls://10.5.0.42:8786', 'id': 'Scheduler-3bae5a4d-29d1-4317-bbfc-931e97a077fb', 'services': {'dashboard': 8787}, 'started': 1696235012.5914223, 'type': 'Scheduler', 'workers': {'tls://10.5.0.43:36201': {'gpu': {'memory-total': 17179869184, 'name': 'Tesla V100-PCIE-16GB'}, 'host': '10.5.0.43', 'id': 'dask-92c5978e-worker-54f8d057-1', 'last_seen': 1696235778.2340653, 'local_directory': '/tmp/dask-scratch-space/worker-6bghw_yx', 'memory_limit': 118225670144, 'metrics': {'bandwidth': {'total': 100000000, 'types': {}, 'workers': {}}, 'cpu': 4.0, 'digests_total_since_heartbeat': {'latency': 0.004627227783203125, 'tick-duration': 0.5006744861602783}, 'event_loop_interval': 0.019985613822937013, 'gpu': {'memory-used': 598867968, 'utilization': 0}, 'gpu_memory_used': 598867968, 'gpu_utilization': 0, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 0.0}, 'host_net_io': {'read_bps': 612.42422993883, 'write_bps': 3346.3180145677247}, 'managed_bytes': 0, 'memory': 623116288, 'num_fds': 86, 'rmm': {'rmm-total': 0, 'rmm-used': 0}, 'spilled_bytes': {'disk': 0, 'memory': 0}, 'task_counts': {}, 'time': 1696235777.730071, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 0, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 0}}, 'name': 'dask-92c5978e-worker-54f8d057-1', 'nanny': 'tls://10.5.0.43:42265', 'nthreads': 1, 'resources': {}, 'services': {'dashboard': 44817}, 'status': 'running', 'type': 'Worker'}, 'tls://10.5.0.43:38107': {'gpu': {'memory-total': 17179869184, 'name': 'Tesla V100-PCIE-16GB'}, 'host': '10.5.0.43', 'id': 'dask-92c5978e-worker-54f8d057-0', 'last_seen': 1696235778.2329032, 'local_directory': '/tmp/dask-scratch-space/worker-ix8y4_eg', 'memory_limit': 118225670144, 'metrics': {'bandwidth': {'total': 100000000, 'types': {}, 'workers': {}}, 'cpu': 2.0, 'digests_total_since_heartbeat': {'latency': 0.004603147506713867, 'tick-duration': 0.4996976852416992}, 'event_loop_interval': 0.019999494552612306, 'gpu': {'memory-used': 598867968, 'utilization': 0}, 'gpu_memory_used': 598867968, 'gpu_utilization': 0, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 0.0}, 'host_net_io': {'read_bps': 611.5250712835996, 'write_bps': 3341.404964660714}, 'managed_bytes': 0, 'memory': 623882240, 'num_fds': 86, 'rmm': {'rmm-total': 0, 'rmm-used': 0}, 'spilled_bytes': {'disk': 0, 'memory': 0}, 'task_counts': {}, 'time': 1696235777.729443, 
'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 0, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 0}}, 'name': 'dask-92c5978e-worker-54f8d057-0', 'nanny': 'tls://10.5.0.43:33657', 'nthreads': 1, 'resources': {}, 'services': {'dashboard': 45421}, 'status': 'running', 'type': 'Worker'}, 'tls://10.5.0.44:34087': {'gpu': {'memory-total': 17179869184, 'name': 'Tesla V100-PCIE-16GB'}, 'host': '10.5.0.44', 'id': 'dask-92c5978e-worker-9f9a9c9b-1', 'last_seen': 1696235778.5268767, 'local_directory': '/tmp/dask-scratch-space/worker-1d7vbddw', 'memory_limit': 118225670144, 'metrics': {'bandwidth': {'total': 100000000, 'types': {}, 'workers': {}}, 'cpu': 0.0, 'digests_total_since_heartbeat': {'latency': 0.004075765609741211, 'tick-duration': 0.4998819828033447}, 'event_loop_interval': 0.02001068115234375, 'gpu': {'memory-used': 598867968, 'utilization': 0}, 'gpu_memory_used': 598867968, 'gpu_utilization': 0, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 12597732.652975753}, 'host_net_io': {'read_bps': 612.7208378808626, 'write_bps': 3347.938695871903}, 'managed_bytes': 0, 'memory': 624406528, 'num_fds': 86, 'rmm': {'rmm-total': 0, 'rmm-used': 0}, 'spilled_bytes': {'disk': 0, 'memory': 0}, 'task_counts': {}, 'time': 1696235778.023989, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 0, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 0}}, 'name': 'dask-92c5978e-worker-9f9a9c9b-1', 'nanny': 'tls://10.5.0.44:37979', 'nthreads': 1, 'resources': {}, 'services': {'dashboard': 36073}, 'status': 'running', 'type': 'Worker'}, 'tls://10.5.0.44:37791': {'gpu': {'memory-total': 17179869184, 'name': 'Tesla V100-PCIE-16GB'}, 'host': '10.5.0.44', 'id': 'dask-92c5978e-worker-9f9a9c9b-0', 'last_seen': 1696235778.528408, 'local_directory': '/tmp/dask-scratch-space/worker-7y8g_hu7', 'memory_limit': 118225670144, 'metrics': {'bandwidth': {'total': 100000000, 'types': {}, 'workers': {}}, 'cpu': 0.0, 'digests_total_since_heartbeat': {'latency': 0.003975629806518555, 'tick-duration': 0.4994323253631592}, 'event_loop_interval': 0.020001530647277832, 'gpu': {'memory-used': 598867968, 'utilization': 0}, 'gpu_memory_used': 598867968, 'gpu_utilization': 0, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 12589746.67130889}, 'host_net_io': {'read_bps': 612.3324205749067, 'write_bps': 3345.8163634027583}, 'managed_bytes': 0, 'memory': 623104000, 'num_fds': 86, 'rmm': {'rmm-total': 0, 'rmm-used': 0}, 'spilled_bytes': {'disk': 0, 'memory': 0}, 'task_counts': {}, 'time': 1696235778.0250378, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 0, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 0}}, 'name': 'dask-92c5978e-worker-9f9a9c9b-0', 'nanny': 'tls://10.5.0.44:36779', 'nthreads': 1, 'resources': {}, 'services': {'dashboard': 32965}, 'status': 'running', 'type': 'Worker'}}} ``` ## Step 3: Data Setup, Cleanup and Enhancement ### Step 3.a: Set up the workers for reading parquet files from Azure Data Lake endpoints We will now enable all the workers to read the `parquet` files directly from the Azure Data Lake endpoints. This requires the [`adlfs`](https://github.com/dask/adlfs) python library in the workers. We will pass in the simple function `installAdlfs` in `client.run` which will install the python package in all the workers. 
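The cell that follows actually uses Dask's built-in `PipInstall` worker plugin, which achieves the same result as the `installAdlfs` helper described above. For reference, a rough sketch of the `client.run` approach could look like the following (a hypothetical helper, not part of the original notebook):

```ipython3
# Hypothetical client.run-based alternative to the PipInstall plugin used below.
import subprocess
import sys


def installAdlfs():
    # Install adlfs into the Python environment of the worker this runs on.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "adlfs"])


# client.run(installAdlfs)  # would execute the helper on every connected worker
```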
```ipython3 from dask.distributed import PipInstall client.register_worker_plugin(PipInstall(packages=["adlfs"])) ``` ### Step 3.b: Data Cleanup, Enhancement and Persisting Scripts The data needs to be cleaned up first. We remove some columns that we are not interested in. We also define the datatypes each of the columns need to be read as. We also add some new features to our dataframe via some custom functions, namely: 1. Haversine distance: This is used for calculating the total trip distance. 2. Day of the week: This can be useful information for determining the fare cost. `add_features` function combines the two to produce a new dataframe that has the added features. #### NOTE In the function `persist_train_infer_split`, We will also persist the test dataset in the workers. If the `X_infer` i.e. the test dataset is small enough, we can call `compute()` on it to bring the test dataset to the local machine and then perform predict on it. But in general, if the `X_infer` is large, it may not fit in the GPU(s) of the local machine. Moreover, moving around a large amount of data will also add to the prediction latency. Therefore it is better to persist the test dataset on the dask workers, and then call the predict functionality on the individual workers. Finally we collect the prediction results from the dask workers. #### Adding features functions ```ipython3 import math from math import asin, cos, pi, sin, sqrt def haversine_distance_kernel( pickup_latitude_r, pickup_longitude_r, dropoff_latitude_r, dropoff_longitude_r, h_distance, radius, ): for i, (x_1, y_1, x_2, y_2) in enumerate( zip( pickup_latitude_r, pickup_longitude_r, dropoff_latitude_r, dropoff_longitude_r, strict=False, ) ): x_1 = pi / 180 * x_1 y_1 = pi / 180 * y_1 x_2 = pi / 180 * x_2 y_2 = pi / 180 * y_2 dlon = y_2 - y_1 dlat = x_2 - x_1 a = sin(dlat / 2) ** 2 + cos(x_1) * cos(x_2) * sin(dlon / 2) ** 2 c = 2 * asin(sqrt(a)) # radius = 6371 # Radius of earth in kilometers # currently passed as input arguments h_distance[i] = c * radius def day_of_the_week_kernel(day, month, year, day_of_week): for i, (_, _, _) in enumerate(zip(day, month, year, strict=False)): if month[i] < 3: shift = month[i] else: shift = 0 Y = year[i] - (month[i] < 3) y = Y - 2000 c = 20 d = day[i] m = month[i] + shift + 1 day_of_week[i] = (d + math.floor(m * 2.6) + y + (y // 4) + (c // 4) - 2 * c) % 7 def add_features(df): df["hour"] = df["tpepPickupDateTime"].dt.hour df["year"] = df["tpepPickupDateTime"].dt.year df["month"] = df["tpepPickupDateTime"].dt.month df["day"] = df["tpepPickupDateTime"].dt.day df["diff"] = ( df["tpepDropoffDateTime"] - df["tpepPickupDateTime"] ).dt.seconds # convert difference between pickup and dropoff into seconds df["pickup_latitude_r"] = df["startLat"] // 0.01 * 0.01 df["pickup_longitude_r"] = df["startLon"] // 0.01 * 0.01 df["dropoff_latitude_r"] = df["endLat"] // 0.01 * 0.01 df["dropoff_longitude_r"] = df["endLon"] // 0.01 * 0.01 df = df.drop("tpepDropoffDateTime", axis=1) df = df.drop("tpepPickupDateTime", axis=1) df = df.apply_rows( haversine_distance_kernel, incols=[ "pickup_latitude_r", "pickup_longitude_r", "dropoff_latitude_r", "dropoff_longitude_r", ], outcols=dict(h_distance=np.float32), kwargs=dict(radius=6371), ) df = df.apply_rows( day_of_the_week_kernel, incols=["day", "month", "year"], outcols=dict(day_of_week=np.float32), kwargs=dict(), ) df["is_weekend"] = df["day_of_week"] < 2 return df ``` Functions for cleaning and persisting the data in the workers. 
```ipython3 def persist_train_infer_split( client, df, response_dtype, response_id, infer_frac=1.0, random_state=42, shuffle=True, ): workers = client.has_what().keys() X, y = df.drop([response_id], axis=1), df[response_id].astype("float32") infer_frac = max(0, min(infer_frac, 1.0)) X_train, X_infer, y_train, y_infer = train_test_split( X, y, shuffle=True, random_state=random_state, test_size=infer_frac ) with dask.annotate(workers=set(workers)): X_train, y_train = client.persist(collections=[X_train, y_train]) if infer_frac != 1.0: with dask.annotate(workers=set(workers)): X_infer, y_infer = client.persist(collections=[X_infer, y_infer]) wait([X_train, y_train, X_infer, y_infer]) else: X_infer = X_train y_infer = y_train wait([X_train, y_train]) return X_train, y_train, X_infer, y_infer def clean(df_part, must_haves): """ This function performs the various clean up tasks for the data and returns the cleaned dataframe. """ # iterate through columns in this df partition for col in df_part.columns: # drop anything not in our expected list if col not in must_haves: df_part = df_part.drop(col, axis=1) continue # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler if df_part[col].dtype == "object" and col in [ "tpepPickupDateTime", "tpepDropoffDateTime", ]: df_part[col] = df_part[col].astype("datetime64[ms]") continue # if column was read as a string, recast as float if df_part[col].dtype == "object": df_part[col] = df_part[col].str.fillna("-1") df_part[col] = df_part[col].astype("float32") else: # downcast from 64bit to 32bit types # Tesla T4 are faster on 32bit ops if "int" in str(df_part[col].dtype): df_part[col] = df_part[col].astype("int32") if "float" in str(df_part[col].dtype): df_part[col] = df_part[col].astype("float32") df_part[col] = df_part[col].fillna(-1) return df_part def taxi_data_loader( client, adlsaccount, adlspath, response_dtype=np.float32, infer_frac=1.0, random_state=0, ): # create a list of columns & dtypes the df must have must_haves = { "tpepPickupDateTime": "datetime64[ms]", "tpepDropoffDateTime": "datetime64[ms]", "passengerCount": "int32", "tripDistance": "float32", "startLon": "float32", "startLat": "float32", "rateCodeId": "int32", "endLon": "float32", "endLat": "float32", "fareAmount": "float32", } workers = client.has_what().keys() response_id = "fareAmount" storage_options = {"account_name": adlsaccount} taxi_data = dask_cudf.read_parquet( adlspath, storage_options=storage_options, chunksize=25e6, npartitions=len(workers), ) taxi_data = clean(taxi_data, must_haves) taxi_data = taxi_data.map_partitions(add_features) # Drop NaN values and convert to float32 taxi_data = taxi_data.dropna() fields = [ "passengerCount", "tripDistance", "startLon", "startLat", "rateCodeId", "endLon", "endLat", "fareAmount", "diff", "h_distance", "day_of_week", "is_weekend", ] taxi_data = taxi_data.astype("float32") taxi_data = taxi_data[fields] taxi_data = taxi_data.reset_index() return persist_train_infer_split( client, taxi_data, response_dtype, response_id, infer_frac, random_state ) ``` ### Step 3.c: Get the split data and persist across workers We will make use of the data from November and December 2014 for the purposes of the demo. 
```ipython3 tic = timer() X_train, y_train, X_infer, y_infer = taxi_data_loader( client, adlsaccount="azureopendatastorage", adlspath="az://nyctlc/yellow/puYear=2014/puMonth=1*/*.parquet", infer_frac=0.1, random_state=42, ) toc = timer() print(f"Wall clock time taken for ETL and persisting : {toc-tic} s") ``` ```ipython3 X_train.shape[0].compute() ``` The size of our training dataset is around 49 million rows. Let’s look at the data locally to see what we’re dealing with. We see that there are columns for pickup and dropoff latitude and longitude, passenger count, trip distance, day of week etc. These are the information we’ll use to estimate the trip fare amount. ```ipython3 X_train.head() ``` ```ipython3 X_infer ``` ## Step 4: Train a XGBoost Model We are now ready to train a XGBoost model on the data and then predict the fare for each trip. ### Step 4.a: Set training Parameters In this training example, we will use RMSE as the evaluation metric. It is also worth noting that performing HPO will lead to a set of more optimal hyperparameters. Refer to the notebook [HPO-RAPIDS](../rapids-azureml-hpo/notebook.md) in this repository for how to perform HPO on Azure. ```ipython3 params = { "learning_rate": 0.15, "max_depth": 8, "objective": "reg:squarederror", "subsample": 0.7, "colsample_bytree": 0.7, "min_child_weight": 1, "gamma": 1, "silent": True, "verbose_eval": True, "booster": "gbtree", # 'gblinear' not implemented in dask "debug_synchronize": True, "eval_metric": "rmse", "tree_method": "gpu_hist", "num_boost_rounds": 100, } ``` ### Step 4.b: Train XGBoost Model Since the data is already persisted in the dask workers in the cluster, the next steps should not take a lot of time. ```ipython3 data_train = xgb.dask.DaskDMatrix(client, X_train, y_train) tic = timer() xgboost_output = xgb.dask.train( client, params, data_train, num_boost_round=params["num_boost_rounds"] ) xgb_gpu_model = xgboost_output["booster"] toc = timer() print(f"Wall clock time taken for this cell : {toc-tic} s") ``` ```myst-ansi Wall clock time taken for this cell : 9.483002611901611 s ``` ### Step 4.c: Save the trained model to disk locally ```ipython3 xgb_gpu_model ``` ```ipython3 model_filename = "trained-model_nyctaxi.xgb" xgb_gpu_model.save_model(model_filename) ``` ## Step 5: Predict & Score using vanilla XGBoost Predict Here we will use the `predict` and `inplace_predict` methods provided by the `xgboost.dask` library, out of the box. Later we will also use [Forest Inference Library (FIL)](https://docs.rapids.ai/api/cuml/nightly/api.html?highlight=forestinference#cuml.ForestInference) to perform prediction. 
```ipython3 _y_test = y_infer.compute() wait(_y_test) ``` ```ipython3 d_test = xgb.dask.DaskDMatrix(client, X_infer) tic = timer() y_pred = xgb.dask.predict(client, xgb_gpu_model, d_test) y_pred = y_pred.compute() wait(y_pred) toc = timer() print(f"Wall clock time taken for xgb.dask.predict : {toc-tic} s") ``` ```myst-ansi Wall clock time taken for xgb.dask.predict : 1.5550181320868433 s ``` ### Inference with the inplace predict method of dask XGBoost ```ipython3 tic = timer() y_pred = xgb.dask.inplace_predict(client, xgb_gpu_model, X_infer) y_pred = y_pred.compute() wait(y_pred) toc = timer() print(f"Wall clock time taken for inplace inference : {toc-tic} s") ``` ```myst-ansi Wall clock time taken for inplace inference : 1.8849179210374132 s ``` ```ipython3 tic = timer() print("Calculating MSE") score = mean_squared_error(y_pred, _y_test) print("Workflow Complete - RMSE: ", np.sqrt(score)) toc = timer() print(f"Wall clock time taken for this cell : {toc-tic} s") ``` ```myst-ansi Calculating MSE Workflow Complete - RMSE: 2.2968235 Wall clock time taken for this cell : 0.009336891933344305 s ``` ## Step 6: Predict & Score using FIL or Forest Inference Library [Forest Inference Library (FIL)](https://docs.rapids.ai/api/cuml/nightly/api.html?highlight=forestinference#cuml.ForestInference) provides GPU accelerated inference capabilities for tree models. We will import the FIL functionality from [cuML](https://github.com/rapidsai/cuml) library. It accepts a **trained** tree model in a treelite format (currently LightGBM, XGBoost and SKLearn GBDT and random forest models are supported). In general, using FIL allows for faster inference while using a large number of workers, and the latency benefits are more pronounced as the size of the dataset grows large. ### Step 6.a: Predict using `compute` on a single worker in case the test dataset is small. As noted in *Step 3.b*, in case the test dataset is huge, it makes sense to call predict individually on the dask workers instead of bringing the entire test dataset to the local machine. To perform prediction individually on the dask workers, each dask worker needs to load the XGB model using FIL. However, the dask workers are remote and do not have access to the locally saved model. Hence we need to send the locally saved XGB model to the dask workers. 
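For the small-test-set case mentioned in the section title (and in the note in Step 3.b), a minimal local sketch could look like the following, assuming `X_infer` fits comfortably in local GPU memory and the model file saved in Step 4.c is available locally:

```ipython3
# Rough local-prediction sketch for a small test set: bring the data to the
# local GPU with compute() and run FIL here instead of on the Dask workers.
from cuml import ForestInference

local_X_infer = X_infer.compute()  # assumes the test set fits on the local GPU
fil_local = ForestInference.load("trained-model_nyctaxi.xgb", model_type="xgboost")
local_pred = fil_local.predict(local_X_infer)
```

The rest of this section takes the distributed route, which scales better as the test set grows.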
#### Persist the local model in the remote dask workers ```ipython3 # the code below will read the locally saved xgboost model # in binary format and write a copy of it to all dask workers def read_model(path): """Read model file into memory.""" with open(path, "rb") as fh: return fh.read() def write_model(path, data): """Write model file to disk.""" with open(path, "wb") as fh: fh.write(data) return path model_data = read_model("trained-model_nyctaxi.xgb") # Tell all the workers to write the model to disk client.run(write_model, "/tmp/model.dat", model_data) # this code reads the binary file in worker directory # and loads the model via FIL for prediction def predict_model(input_df): from cuml import ForestInference # load xgboost model using FIL and make prediction fm = ForestInference.load("/tmp/model.dat", model_type="xgboost") print(fm) pred = fm.predict(input_df) return pred ``` #### Inference with distributed predict with FIL ```ipython3 tic = timer() predictions = X_infer.map_partitions( predict_model, meta="float" ) # this is like MPI reduce y_pred = predictions.compute() wait(y_pred) toc = timer() print(f"Wall clock time taken for this cell : {toc-tic} s") ``` ```ipython3 rows_csv = X_infer.iloc[:, 0].shape[0].compute() print( f"It took {toc-tic} seconds to predict on {rows_csv} rows using FIL distributedly on each worker" ) ``` ```myst-ansi It took 5.638823717948981 seconds to predict on 5426301 rows using FIL distributedly on each worker ``` ```ipython3 tic = timer() score = mean_squared_error(y_pred, _y_test) toc = timer() print("Final - RMSE: ", np.sqrt(score)) ``` ```myst-ansi Final - RMSE: 2.2968235 ``` ## Step 7: Clean up ```ipython3 client.close() cluster.close() ``` ```myst-ansi Terminated VM dask-92c5978e-worker-54f8d057 Terminated VM dask-92c5978e-worker-9f9a9c9b Removed disks for VM dask-92c5978e-worker-54f8d057 Removed disks for VM dask-92c5978e-worker-9f9a9c9b Deleted network interface Deleted network interface Terminated VM dask-92c5978e-scheduler Removed disks for VM dask-92c5978e-scheduler Deleted network interface Unassigned public IP ``` # index.html.md # Accelerating data analysis using cudf.pandas *April, 2025* This notebook was designed to be used on Coiled Notebooks to demonstrate how data scientists can quickly and easily leverage cloud GPU resources and dramatically accelerate their analysis workflows without modifying existing code. Using the NYC ride-share dataset—containing millions of trip records with detailed information about pickup/dropoff locations, fares, and ride durations—we demonstrate the seamless integration of GPU acceleration through RAPIDS’ cudf.pandas extension. By simply adding one import statement, analysts can continue using the familiar Pandas API while operations execute on NVIDIA GPUs in the background, reducing processing time from minutes to seconds. To use cudf.pandas, Load the cudf.pandas extension at the beginning of your notebook or IPython session. After that, just import pandas and operations will use the GPU. ```ipython3 %load_ext cudf.pandas import matplotlib.pyplot as plt import pandas as pd import seaborn as sns ``` # NYC Taxi Data Analysis This notebook analyzes taxi ride data from the NYC TLC ride share dataset. We’re using this dataset stored in S3 that contains information about rides including pickup/dropoff locations, fares, trip times, and other metrics. 
#### NOTE For more details about this notebook check out the accompanying blog post [Simplify Setup and Boost Data Science in the Cloud using NVIDIA CUDA-X and Coiled](https://developer.nvidia.com/blog/simplify-setup-and-boost-data-science-in-the-cloud-using-nvidia-cuda-x-and-coiled/). In the following cells, we: 1. Create an S3 filesystem connection 2. Load and concatenate multiple Parquet files from the dataset 3. Explore the data structure and prepare for analysis The dataset contains detailed ride information that will allow us to analyze patterns in taxi usage, pricing, and service differences between companies. ```ipython3 import s3fs fs = s3fs.S3FileSystem(anon=True) ``` ```ipython3 path_files = [] for i in range(660, 720): path_files.append( pd.read_parquet(f"s3://coiled-data/uber/part.{i}.parquet", filesystem=fs) ) data = pd.concat(path_files, ignore_index=True) len(data) ``` # Data Loading and Initial Exploration In the previous cells, we: 1. Set up AWS credentials to access S3 storage 2. Created an S3 filesystem connection 3. Loaded and concatenated multiple Parquet files (parts 660-720) from the ride-share dataset 4. Checked the dataset size (64,811,259 records) Now we’re examining the structure of our data by: - Viewing the first few rows with `head()` - Inspecting column names - Analyzing data types - Optimizing memory usage by converting data types (int32→int16, float64→float32, string→category) The dataset contains ride information from various ride-hailing services, which we’ll map to company names (Uber, Lyft, etc.) for better analysis. ```ipython3 data.head() ``` ```ipython3 data.columns ``` ```ipython3 data.dtypes ``` ```ipython3 for col in data.columns: if data[col].dtype == "int32": min_value = -32768 max_value = 32767 if data[col].min() >= min_value and data[col].max() <= max_value: data[col] = data[col].astype("int16") else: print( f"Column '{col}' cannot be safely converted to int16 due to value range." ) if data[col].dtype == "float64": data[col] = data[col].astype("float32") if data[col].dtype == "string" or data[col].dtype == "object": data[col] = data[col].astype("category") ``` ```myst-ansi Column 'trip_time' cannot be safely converted to int16 due to value range. ``` ```ipython3 data.dtypes ``` ```ipython3 # data = data.dropna() # Create a company mapping dictionary company_mapping = { "HV0002": "Juno", "HV0003": "Uber", "HV0004": "Via", "HV0005": "Lyft", } # Replace the hvfhs_license_num with company names data["company"] = data["hvfhs_license_num"].map(company_mapping) data.drop("hvfhs_license_num", axis=1, inplace=True) ``` # Data Transformation and Analysis In the next three cells, we’re performing several key data transformations and analyses: 1. **Cell 15**: We’re extracting the month from the pickup datetime and creating a new column. Then we’re calculating the total fare by summing various fare components. Finally, we’re grouping the data by company and month to analyze trip counts, revenue, average fares, and driver payments. 2. **Cell 16**: We’re calculating the profit for each company by month by subtracting the total driver payout from the total revenue. 3. **Cell 17**: We’re displaying the complete grouped dataset that includes all the metrics we’ve calculated (trip counts, revenue, average fares, driver payouts, and profits) for each company by month. These transformations help us understand the financial performance of different rideshare companies across different months. 
```ipython3 data["pickup_month"] = data["pickup_datetime"].dt.month data["total_fare"] = ( data["base_passenger_fare"] + data["tolls"] + data["bcf"] + data["sales_tax"] + data["congestion_surcharge"] + data["airport_fee"] ) grouped = ( data.groupby(["company", "pickup_month"]) .agg( { "company": "count", "total_fare": ["sum", "mean"], "driver_pay": "sum", "tips": "sum", } ) .reset_index() ) grouped.columns = [ "company", "pickup_month", "trip_count", "total_revenue", "avg_fare", "total_driver_pay", "total_tips", ] grouped["total_driver_payout"] = grouped["total_driver_pay"] + grouped["total_tips"] grouped = grouped[ [ "company", "pickup_month", "trip_count", "total_revenue", "avg_fare", "total_driver_payout", ] ] grouped = grouped.sort_values(["company", "pickup_month"]) grouped["profit"] = grouped["total_revenue"] - grouped["total_driver_payout"] grouped.head() ``` ```ipython3 grouped["profit"] = grouped["total_revenue"] - grouped["total_driver_payout"] ``` ```ipython3 grouped ``` # Trip Duration Analysis The next three cells are performing the following operations: 1. **Cell 19**: We’re defining a function called `categorize_trip` that categorizes trips based on their duration. - Trips less than 10 minutes (600 seconds) are categorized as short (0) - Trips between 10-20 minutes (600-1200 seconds) are categorized as medium (1) - Trips longer than 20 minutes (1200+ seconds) are categorized as long (2) This categorization helps us analyze how trip duration affects various metrics. User-Defined Functions (UDFs) like the one above. perform better with numerical values as compared to strings, hence we are using a numerical representation of trip types. 2. **Cell 20**: We’re applying the `categorize_trip` function to each row in our dataset, creating a new column called ‘trip_category’ that contains the category value (0, 1, or 2) for each trip. This transformation allows us to group and analyze trips by their duration categories. 3. **Cell 21**: We’re grouping the data by trip category and calculating statistics for each group: - The mean and sum of total fares - The count of trips in each category This analysis helps us understand how trip duration relates to fare amounts and trip frequency. ```ipython3 def categorize_trip(row): if row["trip_time"] < 600: # Less than 10 minutes return 0 elif row["trip_time"] < 1200: # 10-20 minutes return 1 else: # More than 20 minutes return 2 ``` ```ipython3 # Apply UDF data["trip_category"] = data.apply(categorize_trip, axis=1) ``` ```ipython3 # Create a mapping for trip categories trip_category_map = {0: "short", 1: "medium", 2: "long"} # Group by trip category category_stats = data.groupby("trip_category").agg( {"total_fare": ["mean", "sum"], "trip_time": "count"} ) # Rename the index with descriptive labels category_stats.index = category_stats.index.map(lambda x: f"{trip_category_map[x]}") category_stats ``` ## Location Data Analysis The TLC dataset has columns PULocationID and DOLocationID which indicate the zone and borough information according to the taxi zones of the New York TLC. You can download this information and look up the zones corresponding to the index in CSV format [here](https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv). The next few cells (23-32) are focused on: 1. **Cells 23-26**: Loading and preparing taxi zone data - Loading taxi zone information from a CSV file - Examining the data structure - Selecting only the relevant columns (LocationID, zone, borough) 2. 
**Cells 27-28**: Enriching our trip data with location information - Merging pickup location data using PULocationID - Creating a combined pickup_location field - Merging dropoff location data using DOLocationID - Creating a combined dropoff_location field 3. **Cell 29**: Analyzing popular routes - Grouping data by pickup and dropoff locations - Counting rides between each location pair - Identifying the top 10 most frequent routes (hotspots) ```ipython3 taxi_zones = pd.read_csv("taxi_zone_lookup.csv") ``` ```ipython3 taxi_zones.head() ``` ```ipython3 taxi_zones = taxi_zones[["LocationID", "zone", "borough"]] ``` ```ipython3 taxi_zones ``` ```ipython3 data = pd.merge( data, taxi_zones, left_on="PULocationID", right_on="LocationID", how="left" ) for col in ["zone", "borough"]: data[col] = data[col].fillna("NA") data["pickup_location"] = data["zone"] + "," + data["borough"] data.drop(["LocationID", "zone", "borough"], axis=1, inplace=True) ``` ```ipython3 data = pd.merge( data, taxi_zones, left_on="DOLocationID", right_on="LocationID", how="left" ) for col in ["zone", "borough"]: data[col] = data[col].fillna("NA") data["dropoff_location"] = data["zone"] + "," + data["borough"] data.drop(["LocationID", "zone", "borough"], axis=1, inplace=True) ``` ```ipython3 location_group = ( data.groupby(["pickup_location", "dropoff_location"]) .size() .reset_index(name="ride_count") ) location_group = location_group.sort_values("ride_count", ascending=False) # Identify top 10 hotspots top_hotspots = location_group.head(10) print("Top 10 Pickup and Dropoff Hotspots:") print(top_hotspots) ``` ```myst-ansi Top 10 Pickup and Dropoff Hotspots: pickup_location dropoff_location ride_count 29305 JFK Airport,Queens NA,NA 214629 17422 East New York,Brooklyn East New York,Brooklyn 204280 5533 Borough Park,Brooklyn Borough Park,Brooklyn 144201 31607 LaGuardia Airport,Queens NA,NA 130948 8590 Canarsie,Brooklyn Canarsie,Brooklyn 117952 13640 Crown Heights North,Brooklyn Crown Heights North,Brooklyn 99066 1068 Astoria,Queens Astoria,Queens 87116 2538 Bay Ridge,Brooklyn Bay Ridge,Brooklyn 87009 29518 Jackson Heights,Queens Jackson Heights,Queens 85413 50620 South Ozone Park,Queens JFK Airport,Queens 82798 ``` ```ipython3 data.drop(["pickup_month", "PULocationID", "DOLocationID"], axis=1, inplace=True) ``` ```ipython3 data.head() ``` # Time-Based Analysis and Visualization The next two cells analyze and visualize how ride patterns change throughout the day: 1. Cell 33 extracts the hour of the day from pickup timestamps and calculates the average trip time and cost for each hour. It handles missing hours by adding them with zero values, ensuring a complete 24-hour view. 2. Cell 34 displays the resulting dataframe, showing how trip duration and cost vary by hour of the day. This helps identify peak hours, pricing patterns, and potential opportunities for optimizing service. ```ipython3 # Find the volume per hour of the day and how much an average trip costs along with average trip time. 
data["pickup_hour"] = data["pickup_datetime"].dt.hour time_grouped = ( data.groupby("pickup_hour") .agg({"trip_time": "mean", "total_fare": "mean"}) .reset_index() ) time_grouped.columns = ["pickup_hour", "mean_trip_time", "mean_trip_cost"] hours = range(0, 24) missing_hours = [h for h in hours if h not in time_grouped["pickup_hour"].values] for hour in missing_hours: new_row = {"pickup_hour": hour, "mean_trip_time": 0.0, "mean_trip_cost": 0.0} time_grouped = pd.concat([time_grouped, pd.DataFrame([new_row])], ignore_index=True) time_grouped = time_grouped.sort_values("pickup_hour") ``` ```ipython3 time_grouped ``` # Time-Based Visualization The next cell creates a time series visualization that shows how average fares change over time for different ride-hailing companies: 1. It groups the data by company and day (using pd.Grouper with freq=’D’) 2. Calculates the mean total fare for each company-day combination 3. Creates a line plot using seaborn’s lineplot function, with: - Time on the x-axis - Average fare on the y-axis - Different colors for each company This visualization helps identify trends in pricing over time and compare fare patterns between companies (Uber vs. Lyft). ```ipython3 financial = ( data.groupby(["company", pd.Grouper(key="pickup_datetime", freq="D")])[ ["total_fare"] ] .mean() .reset_index() ) # Example visualization plt.figure(figsize=(10, 6)) sns.lineplot(x="pickup_datetime", y="total_fare", hue="company", data=financial) plt.title("Average Fare Over Time by Company") plt.show() ``` # Shared Ride and Accessibility Analysis The next cell analyzes two important service aspects of ride-hailing platforms: 1. **Shared Ride Metrics**: - Calculates average fare and trip time for shared vs. non-shared rides - Determines the acceptance rate of shared ride requests (when riders opt in but may not get matched) - Helps understand the economics and efficiency of ride-sharing features 2. **Wheelchair Accessibility Metrics**: - Analyzes average fare and trip time for wheelchair accessible vehicles (WAV) - Calculates the percentage of wheelchair accessible ride requests that were fulfilled - Provides insights into service equity and accessibility compliance The analysis prints summary statistics for both service types and their respective acceptance rates. ```ipython3 shared_grouped = ( data.groupby("shared_match_flag") .agg({"total_fare": "mean", "trip_time": "mean"}) .reset_index() ) shared_grouped.columns = ["shared_match_flag", "mean_fare_shared", "mean_time_shared"] shared_request_acceptance = ( data[data["shared_request_flag"] == "Y"] .groupby("shared_match_flag")["shared_request_flag"] .count() .reset_index() ) shared_request_acceptance.columns = ["shared_match_flag", "count"] shared_request_acceptance = shared_request_acceptance.set_index("shared_match_flag") total_shared_requests = shared_request_acceptance.sum() shared_acceptance_rate = ( shared_request_acceptance["count"]["Y"] / total_shared_requests * 100 ) print(f"Shared Ride Acceptance Rate: {float(shared_acceptance_rate)}%") wav_grouped = ( data.groupby("wav_match_flag") .agg({"total_fare": "mean", "trip_time": "mean"}) .reset_index() ) wav_grouped.columns = ["wav_match_flag", "mean_fare_wav", "mean_time_wav"] # 4. 
Calculate percentage of wheelchair accessible ride requests that were accepted wav_request_acceptance = ( data[data["wav_request_flag"] == "Y"] .groupby("wav_match_flag")["wav_request_flag"] .count() .reset_index() ) wav_request_acceptance.columns = ["wav_match_flag", "count"] wav_request_acceptance = wav_request_acceptance.set_index("wav_match_flag") total_wav_requests = wav_request_acceptance.sum() wav_acceptance_rate = wav_request_acceptance["count"]["Y"] / total_wav_requests * 100 print(f"Wheelchair Accessible Ride Acceptance Rate: {float(wav_acceptance_rate)}%") # Display the results print(shared_grouped) print(wav_grouped) ``` ```myst-ansi Shared Ride Acceptance Rate: 33.766986535707765% Wheelchair Accessible Ride Acceptance Rate: 99.99361674964892% shared_match_flag mean_fare_shared mean_time_shared 0 Y 25.189627 1770.353920 1 N 28.541140 1154.111679 wav_match_flag mean_fare_wav mean_time_wav 0 Y 24.208971 1064.793459 1 N 28.819339 1166.241749 ``` # Fare Per Mile Analysis In the next three cells, we: 1. Define a function `fare_per_mile()` that calculates the fare per mile for each trip by dividing the total fare by the trip miles. The function includes validation to handle edge cases where trip miles or trip time might be zero. 2. Apply this function to create a new column in our dataset called ‘fare_per_mile’, which represents the cost efficiency of each trip. 3. Calculate and display summary statistics for fare per mile grouped by trip category, showing the mean fare per mile and count of trips for each category. This helps us understand how cost efficiency varies across different trip types. This analysis provides insights into pricing efficiency and helps identify potential pricing anomalies across different trip categories. ```ipython3 def fare_per_mile(row): if row["trip_time"] > 0: if row["trip_miles"] > 0: return row["total_fare"] / row["trip_miles"] else: return 0 return 0 ``` ```ipython3 data["fare_per_mile"] = data.apply(fare_per_mile, axis=1) ``` ```ipython3 # Create a mapping for trip categories trip_category_map = {0: "short", 1: "medium", 2: "long"} # Calculate fare per mile statistics grouped by trip category fare_per_mile_stats = data.groupby("trip_category").agg( {"fare_per_mile": ["mean", "count"]} ) # Add a more descriptive index using the mapping fare_per_mile_stats.index = fare_per_mile_stats.index.map( lambda x: f"{trip_category_map[x]}" ) fare_per_mile_stats ``` # Conclusion This example showcases how data scientists can leverage GPU computing through RAPIDS cuDF.pandas to analyze transportation data at scale, gaining insights into pricing patterns, geographic hotspots, and service efficiency. 
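If you are adapting this workflow to your own notebooks, note that the acceleration typically comes from the `cudf.pandas` extension, which is loaded before importing pandas. A minimal sketch of that setup (the parquet file name is a placeholder, not part of this example):

```ipython3
%load_ext cudf.pandas  # enable GPU acceleration for pandas

import pandas as pd

# From here on, supported pandas operations run on the GPU via cuDF,
# with automatic fallback to CPU pandas for anything unsupported.
df = pd.read_parquet("my_trip_data.parquet")  # placeholder input file
print(df.groupby("company")["total_fare"].mean())
```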
For additional learning resources: * Blog: [Simplify Setup and Boost Data Science in the Cloud using NVIDIA CUDA-X and Coiled](https://developer.nvidia.com/blog/simplify-setup-and-boost-data-science-in-the-cloud-using-nvidia-cuda-x-and-coiled/) * [cuDF.pandas](https://rapids.ai/cudf-pandas/) - Accelerate pandas operations on GPUs with zero code changes, getting up to 150x performance improvements while maintaining compatibility with the pandas ecosystem * [RAPIDS workflow examples](https://docs.rapids.ai/deployment/stable/examples/) - Explore a comprehensive collection of GPU-accelerated data science workflows spanning cloud deployments, hyperparameter optimization, multi-GPU training, and integration with platforms like Kubernetes, Databricks, and Snowflake # index.html.md # Perform Time Series Forecasting on Google Kubernetes Engine with NVIDIA GPUs *October, 2023* In this example, we will look at a real-world **time series forecasting** problem with data from [the M5 Forecasting Competition](https://www.kaggle.com/competitions/m5-forecasting-accuracy). Walmart provides historical sales data from multiple stores in three states, and our job is to predict the sales in a future 28-day period. ## Prerequisites ### Prepare GKE cluster To run the example, you will need a working Google Kubernetes Engine (GKE) cluster with access to NVIDIA GPUs. 1. To run the example smoothly, make sure that your GPUs have ample memory. This notebook has been tested with NVIDIA A100. 2. Set up Dask-Kubernetes integration by following these guides: * [Install the Dask-Kubernetes operator](https://kubernetes.dask.org/en/latest/operator_installation.html) * [Install Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/) Kubeflow is not strictly necessary, but we highly recommend it, as Kubeflow gives you a nice notebook environment to run this notebook within the k8s cluster. (You may choose any method; we tested this example after installing Kubeflow from manifests.) When creating the notebook environment, use the following configuration: * 2 CPUs, 16 GiB of memory * 1 NVIDIA GPU * 40 GiB disk volume After uploading all the notebooks in the example, run this notebook (`notebook.ipynb`) in the notebook environment. Note: We will use the worker pods to speed up the training stage. The preprocessing steps will run solely on the scheduler node. ### Prepare a bucket in Google Cloud Storage Create a new bucket in Google Cloud Storage. Make sure that the worker pods in the k8s cluster have read/write access to this bucket. This can be done in one of the following ways: 1. Option 1: Specify an additional scope when provisioning the GKE cluster. When you are provisioning a new GKE cluster, add the `storage-rw` scope. This option is only available if you are creating a new cluster from scratch. If you are using an existing GKE cluster, see Option 2. Example: ```default gcloud container clusters create my_new_cluster --accelerator type=nvidia-tesla-t4 \ --machine-type n1-standard-32 --zone us-central1-c --release-channel stable \ --num-nodes 5 --scopes=gke-default,storage-rw ``` 2. Option 2: Grant bucket access to the associated service account. Find out which service account is associated with your GKE cluster. You can grant the service account access to the bucket as follows: Navigate to the Cloud Storage console, open the Bucket Details page for the bucket, open the Permissions tab, and click on Grant Access.
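If you prefer the command line to the Cloud Console, the same permission can be granted with `gsutil`. This is only a sketch; the bucket name and service account e-mail below are placeholders that you must replace with your own values:

```ipython3
# Grant the cluster's service account object read/write access to the bucket.
# Both the service account e-mail and the bucket name are placeholders.
!gsutil iam ch serviceAccount:YOUR_SERVICE_ACCOUNT@YOUR_PROJECT.iam.gserviceaccount.com:roles/storage.objectAdmin gs://your-bucket-name
```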
Enter the name of the bucket that your cluster has read-write access to: ```ipython3 bucket_name = "" ``` ### Install Python packages in the notebook environment ```ipython3 !pip install kaggle gcsfs dask-kubernetes optuna ``` ```ipython3 # Test if the bucket is accessible import gcsfs fs = gcsfs.GCSFileSystem() fs.ls(f"{bucket_name}/") ``` ## Obtain the time series data set from Kaggle If you do not yet have an account with Kaggle, create one now. Then follow instructions in [Public API Documentation of Kaggle](https://www.kaggle.com/docs/api) to obtain the API key. This step is needed to obtain the training data from the M5 Forecasting Competition. Once you have obtained the API key, fill in the following: ```ipython3 kaggle_username = "" kaggle_api_key = "" ``` Now we are ready to download the data set: ```ipython3 %env KAGGLE_USERNAME=$kaggle_username %env KAGGLE_KEY=$kaggle_api_key !kaggle competitions download -c m5-forecasting-accuracy ``` Let’s unzip the ZIP archive and see what’s inside. ```ipython3 import zipfile with zipfile.ZipFile("m5-forecasting-accuracy.zip", "r") as zf: zf.extractall(path="./data") ``` ```ipython3 !ls -lh data/*.csv ``` ```myst-ansi -rw-r--r-- 1 rapids conda 102K Sep 28 18:59 data/calendar.csv -rw-r--r-- 1 rapids conda 117M Sep 28 18:59 data/sales_train_evaluation.csv -rw-r--r-- 1 rapids conda 115M Sep 28 18:59 data/sales_train_validation.csv -rw-r--r-- 1 rapids conda 5.0M Sep 28 18:59 data/sample_submission.csv -rw-r--r-- 1 rapids conda 194M Sep 28 18:59 data/sell_prices.csv ``` ## Data Preprocessing We are now ready to run the preprocessing steps. ### Import modules and define utility functions ```ipython3 import gc import pathlib import cudf import gcsfs import numpy as np def sizeof_fmt(num, suffix="B"): for unit in ["", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"]: if abs(num) < 1024.0: return f"{num:3.1f}{unit}{suffix}" num /= 1024.0 return f"{num:.1f}Yi{suffix}" def report_dataframe_size(df, name): mem_usage = sizeof_fmt(df.memory_usage(index=True).sum()) print(f"{name} takes up {mem_usage} memory on GPU") ``` ### Load Data ```ipython3 TARGET = "sales" # Our main target END_TRAIN = 1941 # Last day in train set ``` ```ipython3 raw_data_dir = pathlib.Path("./data/") ``` ```ipython3 train_df = cudf.read_csv(raw_data_dir / "sales_train_evaluation.csv") prices_df = cudf.read_csv(raw_data_dir / "sell_prices.csv") calendar_df = cudf.read_csv(raw_data_dir / "calendar.csv").rename( columns={"d": "day_id"} ) ``` ```ipython3 train_df ``` The columns `d_1`, `d_2`, …, `d_1941` contain the sales on days 1, 2, …, 1941, counting from 2011-01-29. ```ipython3 prices_df ``` ```ipython3 calendar_df ``` ### Reformat sales time series data Pivot the columns `d_1`, `d_2`, …, `d_1941` into separate rows using `cudf.melt`.
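As a toy illustration of what `cudf.melt` does (the miniature table below is made up and not part of the M5 data), melting turns one column per day into one row per (id, day) pair:

```ipython3
import cudf

# A made-up wide table with two products and two days of sales
toy = cudf.DataFrame({"id": ["A", "B"], "d_1": [3, 0], "d_2": [1, 5]})

# Melt the d_* columns into long format: one row per (id, day_id)
toy_long = cudf.melt(toy, id_vars=["id"], var_name="day_id", value_name="sales")
print(toy_long)  # 4 rows: (A, d_1, 3), (B, d_1, 0), (A, d_2, 1), (B, d_2, 5)
```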
```ipython3 index_columns = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"] grid_df = cudf.melt( train_df, id_vars=index_columns, var_name="day_id", value_name=TARGET ) grid_df ``` For each time series, add 28 rows that corresponds to the future forecast horizon: ```ipython3 add_grid = cudf.DataFrame() for i in range(1, 29): temp_df = train_df[index_columns] temp_df = temp_df.drop_duplicates() temp_df["day_id"] = "d_" + str(END_TRAIN + i) temp_df[TARGET] = np.nan # Sales amount at time (n + i) is unknown add_grid = cudf.concat([add_grid, temp_df]) add_grid["day_id"] = add_grid["day_id"].astype( "category" ) # The day_id column is categorical, after cudf.melt grid_df = cudf.concat([grid_df, add_grid]) grid_df = grid_df.reset_index(drop=True) grid_df["sales"] = grid_df["sales"].astype( np.float32 ) # Use float32 type for sales column, to conserve memory grid_df ``` ### Free up GPU memory GPU memory is a precious resource, so let’s try to free up some memory. First, delete temporary variables we no longer need: ```ipython3 # Use xdel magic to scrub extra references from Jupyter notebook %xdel temp_df %xdel add_grid %xdel train_df # Invoke the garbage collector explicitly to free up memory gc.collect() ``` Second, let’s reduce the footprint of `grid_df` by converting strings into categoricals: ```ipython3 report_dataframe_size(grid_df, "grid_df") ``` ```myst-ansi grid_df takes up 5.2GiB memory on GPU ``` ```ipython3 grid_df.dtypes ``` ```ipython3 for col in index_columns: grid_df[col] = grid_df[col].astype("category") gc.collect() report_dataframe_size(grid_df, "grid_df") ``` ```myst-ansi grid_df takes up 802.6MiB memory on GPU ``` ```ipython3 grid_df.dtypes ``` ### Identify the release week of each product Each row in the `prices_df` table contains the price of a product sold at a store for a given week. ```ipython3 prices_df ``` Notice that not all products were sold over every week. Some products were sold only during some weeks. Let’s use the groupby operation to identify the first week in which each product went on the shelf. ```ipython3 release_df = ( prices_df.groupby(["store_id", "item_id"])["wm_yr_wk"].agg("min").reset_index() ) release_df.columns = ["store_id", "item_id", "release_week"] release_df ``` Now that we’ve computed the release week for each product, let’s merge it back to `grid_df`: ```ipython3 grid_df = grid_df.merge(release_df, on=["store_id", "item_id"], how="left") grid_df = grid_df.sort_values(index_columns + ["day_id"]).reset_index(drop=True) grid_df ``` ```ipython3 del release_df # No longer needed gc.collect() ``` ```ipython3 report_dataframe_size(grid_df, "grid_df") ``` ```myst-ansi grid_df takes up 1.2GiB memory on GPU ``` ### Filter out entries with zero sales We can further save space by dropping rows from `grid_df` that correspond to zero sales. Since each product doesn’t go on the shelf until its release week, its sale must be zero during any week that’s prior to the release week. To make use of this insight, we bring in the `wm_yr_wk` column from `calendar_df`: ```ipython3 grid_df = grid_df.merge(calendar_df[["wm_yr_wk", "day_id"]], on=["day_id"], how="left") grid_df ``` ```ipython3 report_dataframe_size(grid_df, "grid_df") ``` ```myst-ansi grid_df takes up 1.7GiB memory on GPU ``` The `wm_yr_wk` column identifies the week that contains the day given by the `day_id` column. 
Now let’s filter all rows in `grid_df` for which `wm_yr_wk` is less than `release_week`: ```ipython3 df = grid_df[grid_df["wm_yr_wk"] < grid_df["release_week"]] df ``` As we suspected, the sales amount is zero during weeks that come before the release week. ```ipython3 assert (df["sales"] == 0).all() ``` For the purpose of our data analysis, we can safely drop the rows with zero sales: ```ipython3 grid_df = grid_df[grid_df["wm_yr_wk"] >= grid_df["release_week"]].reset_index(drop=True) grid_df["wm_yr_wk"] = grid_df["wm_yr_wk"].astype( np.int32 ) # Convert wm_yr_wk column to int32, to conserve memory grid_df ``` ```ipython3 report_dataframe_size(grid_df, "grid_df") ``` ```myst-ansi grid_df takes up 1.2GiB memory on GPU ``` ### Assign weights for product items When we assess the accuracy of our machine learning model, we should assign a weight for each product item, to indicate the relative importance of the item. For the M5 competition, the weights are computed from the total sales amount (in US dollars) in the last 28 days. ```ipython3 # Convert day_id to integers grid_df["day_id_int"] = grid_df["day_id"].to_pandas().apply(lambda x: x[2:]).astype(int) # Compute the total sales over the latest 28 days, per product item last28 = grid_df[(grid_df["day_id_int"] >= 1914) & (grid_df["day_id_int"] < 1942)] last28 = last28[["item_id", "wm_yr_wk", "sales"]].merge( prices_df[["item_id", "wm_yr_wk", "sell_price"]], on=["item_id", "wm_yr_wk"] ) last28["sales_usd"] = last28["sales"] * last28["sell_price"] total_sales_usd = last28.groupby("item_id")[["sales_usd"]].agg(["sum"]).sort_index() total_sales_usd.columns = total_sales_usd.columns.map("_".join) total_sales_usd ``` To obtain weights, we normalize the sales amount for one item by the total sales for all items. ```ipython3 weights = total_sales_usd / total_sales_usd.sum() weights = weights.rename(columns={"sales_usd_sum": "weights"}) weights ``` ```ipython3 # No longer needed del grid_df["day_id_int"] ``` ### Generate price-related features Let us engineer additional features that are related to the sale price. We consider the distribution of the price of a given product over time and ask how the current price compares to the historical trend. ```ipython3 # Highest price over all weeks prices_df["price_max"] = prices_df.groupby(["store_id", "item_id"])[ "sell_price" ].transform("max") # Lowest price over all weeks prices_df["price_min"] = prices_df.groupby(["store_id", "item_id"])[ "sell_price" ].transform("min") # Standard deviation of the price prices_df["price_std"] = prices_df.groupby(["store_id", "item_id"])[ "sell_price" ].transform("std") # Mean (average) price over all weeks prices_df["price_mean"] = prices_df.groupby(["store_id", "item_id"])[ "sell_price" ].transform("mean") ``` We also consider the ratio of the current price to the max price. ```ipython3 prices_df["price_norm"] = prices_df["sell_price"] / prices_df["price_max"] ``` Some items have a very stable price, whereas other items respond to inflation quickly and rise in price. To capture the price elasticity, we count the number of unique price values for a given product over time. ```ipython3 prices_df["price_nunique"] = prices_df.groupby(["store_id", "item_id"])[ "sell_price" ].transform("nunique") ``` We also consider, for a given price, how many other items are being sold at the exact same price. 
```ipython3 prices_df["item_nunique"] = prices_df.groupby(["store_id", "sell_price"])[ "item_id" ].transform("nunique") ``` ```ipython3 prices_df ``` Another useful way to put prices in context is to compare the price of a product to its historical price a week ago, a month ago, or a year ago. ```ipython3 # Add "month" and "year" columns to prices_df week_to_month_map = calendar_df[["wm_yr_wk", "month", "year"]].drop_duplicates( subset=["wm_yr_wk"] ) prices_df = prices_df.merge(week_to_month_map, on=["wm_yr_wk"], how="left") # Sort by wm_yr_wk. The rows will also be sorted in ascending months and years. prices_df = prices_df.sort_values(["store_id", "item_id", "wm_yr_wk"]) ``` ```ipython3 # Compare with the price in the previous week prices_df["price_momentum"] = prices_df["sell_price"] / prices_df.groupby( ["store_id", "item_id"] )["sell_price"].shift(1) # Compare with the average price in the same month prices_df["price_momentum_m"] = prices_df["sell_price"] / prices_df.groupby( ["store_id", "item_id", "month"] )["sell_price"].transform("mean") # Compare with the average price in the same year prices_df["price_momentum_y"] = prices_df["sell_price"] / prices_df.groupby( ["store_id", "item_id", "year"] )["sell_price"].transform("mean") ``` ```ipython3 # Remove "month" and "year" columns, as we don't need them any more del prices_df["month"], prices_df["year"] # Convert float64 columns into float32 type to save memory columns = [ "sell_price", "price_max", "price_min", "price_std", "price_mean", "price_norm", "price_momentum", "price_momentum_m", "price_momentum_y", ] for col in columns: prices_df[col] = prices_df[col].astype(np.float32) ``` ```ipython3 prices_df.dtypes ``` ### Bring price-related features into `grid_df` ```ipython3 # After merging prices_df, keep columns id and day_id from grid_df and drop all other columns from grid_df original_columns = list(grid_df) grid_df_with_price = grid_df.copy() grid_df_with_price = grid_df_with_price.merge( prices_df, on=["store_id", "item_id", "wm_yr_wk"], how="left" ) columns_to_keep = ["id", "day_id"] + [ col for col in list(grid_df_with_price) if col not in original_columns ] grid_df_with_price = grid_df_with_price[columns_to_keep] grid_df_with_price ``` ### Generate date-related features We identify the date in each row of `grid_df` using information from `calendar_df`. ```ipython3 # Bring in the following columns from calendar_df into grid_df grid_df_id_only = grid_df[["id", "day_id"]].copy() icols = [ "date", "day_id", "event_name_1", "event_type_1", "event_name_2", "event_type_2", "snap_CA", "snap_TX", "snap_WI", ] grid_df_with_calendar = grid_df_id_only.merge( calendar_df[icols], on=["day_id"], how="left" ) grid_df_with_calendar ``` ```ipython3 # Convert columns into categorical type to save memory for col in [ "event_name_1", "event_type_1", "event_name_2", "event_type_2", "snap_CA", "snap_TX", "snap_WI", ]: grid_df_with_calendar[col] = grid_df_with_calendar[col].astype("category") # Convert "date" column into timestamp type grid_df_with_calendar["date"] = cudf.to_datetime(grid_df_with_calendar["date"]) ``` Using the `date` column, we can generate related features, such as day, week, or month.
```ipython3 import cupy as cp grid_df_with_calendar["tm_d"] = grid_df_with_calendar["date"].dt.day.astype(np.int8) grid_df_with_calendar["tm_w"] = ( grid_df_with_calendar["date"].dt.isocalendar().week.astype(np.int8) ) grid_df_with_calendar["tm_m"] = grid_df_with_calendar["date"].dt.month.astype(np.int8) grid_df_with_calendar["tm_y"] = grid_df_with_calendar["date"].dt.year grid_df_with_calendar["tm_y"] = ( grid_df_with_calendar["tm_y"] - grid_df_with_calendar["tm_y"].min() ).astype(np.int8) grid_df_with_calendar["tm_wm"] = cp.ceil( grid_df_with_calendar["tm_d"].to_cupy() / 7 ).astype( np.int8 ) # which week in the month? grid_df_with_calendar["tm_dw"] = grid_df_with_calendar["date"].dt.dayofweek.astype( np.int8 ) # which day of the week? grid_df_with_calendar["tm_w_end"] = (grid_df_with_calendar["tm_dw"] >= 5).astype( np.int8 ) # whether the day falls on a weekend del grid_df_with_calendar["date"] # no longer needed grid_df_with_calendar ``` ```ipython3 del grid_df_id_only # No longer needed gc.collect() ``` ### Generate lag features **Lag features** are the value of the target variable at prior timestamps. Lag features are useful because what happens in the past often influences what will happen in the future. In our example, we generate lag features by reading the sales amount at X days prior, where X = 28, 29, …, 42. ```ipython3 SHIFT_DAY = 28 LAG_DAYS = [col for col in range(SHIFT_DAY, SHIFT_DAY + 15)] # Need to first ensure that rows in each time series are sorted by day_id grid_df_lags = grid_df[["id", "day_id", "sales"]].copy() grid_df_lags = grid_df_lags.sort_values(["id", "day_id"]) grid_df_lags = grid_df_lags.assign( **{ f"sales_lag_{ld}": grid_df_lags.groupby(["id"])["sales"].shift(ld) for ld in LAG_DAYS } ) ``` ```ipython3 grid_df_lags ``` ### Compute rolling window statistics In the previous cell, we used the value of sales at a single timestamp to generate lag features. To capture richer information about the past, let us also get the distribution of the sales value over multiple timestamps, by computing **rolling window statistics**. Rolling window statistics are statistics (e.g. mean, standard deviation) over a time duration in the past. Rolling window statistics complement lag features and provide more information about the past behavior of the target variable. Read more about lag features and rolling window statistics in [Introduction to feature engineering for time series forecasting](https://medium.com/data-science-at-microsoft/introduction-to-feature-engineering-for-time-series-forecasting-620aa55fcab0). ```ipython3 # Shift by 28 days and apply windows of various sizes print(f"Shift size: {SHIFT_DAY}") for i in [7, 14, 30, 60, 180]: print(f" Window size: {i}") grid_df_lags[f"rolling_mean_{i}"] = ( grid_df_lags.groupby(["id"])["sales"] .shift(SHIFT_DAY) .rolling(i) .mean() .astype(np.float32) ) grid_df_lags[f"rolling_std_{i}"] = ( grid_df_lags.groupby(["id"])["sales"] .shift(SHIFT_DAY) .rolling(i) .std() .astype(np.float32) ) ``` ```myst-ansi Shift size: 28 Window size: 7 Window size: 14 Window size: 30 Window size: 60 Window size: 180 ``` ```ipython3 grid_df_lags.columns ``` ```ipython3 grid_df_lags.dtypes ``` ```ipython3 grid_df_lags ``` ### Target encoding Categorical variables present challenges to many machine learning algorithms such as XGBoost. One way to overcome the challenge is to use **target encoding**, where we encode categorical variables by replacing them with a statistic for the target variable.
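For instance, a store identifier can be replaced by the average of the target observed for that store. A small made-up illustration (not the actual sales data):

```ipython3
import cudf

# Made-up example: target-encode `store` with the per-store mean of `sales`
toy = cudf.DataFrame(
    {"store": ["CA_1", "CA_1", "TX_1", "TX_1"], "sales": [2.0, 4.0, 1.0, 3.0]}
)
toy["enc_store_mean"] = toy.groupby("store")["sales"].transform("mean")
print(toy)  # CA_1 rows get 3.0, TX_1 rows get 2.0
```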
In this example, we will use the mean and the standard deviation. Read more about target encoding in [Target-encoding Categorical Variables](https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-encoder-a0f1733a4c69). ```ipython3 icols = [["store_id", "dept_id"], ["item_id", "state_id"]] new_columns = [] grid_df_target_enc = grid_df[ ["id", "day_id", "item_id", "state_id", "store_id", "dept_id", "sales"] ].copy() grid_df_target_enc["sales"].fillna(value=0, inplace=True) for col in icols: print(f"Encoding columns {col}") col_name = "_" + "_".join(col) + "_" grid_df_target_enc["enc" + col_name + "mean"] = ( grid_df_target_enc.groupby(col)["sales"].transform("mean").astype(np.float32) ) grid_df_target_enc["enc" + col_name + "std"] = ( grid_df_target_enc.groupby(col)["sales"].transform("std").astype(np.float32) ) new_columns.extend(["enc" + col_name + "mean", "enc" + col_name + "std"]) ``` ```myst-ansi Encoding columns ['store_id', 'dept_id'] Encoding columns ['item_id', 'state_id'] ``` ```ipython3 grid_df_target_enc = grid_df_target_enc[["id", "day_id"] + new_columns] grid_df_target_enc ``` ```ipython3 grid_df_target_enc.dtypes ``` ### Filter by store and product department and create data segments After combining all columns produced in the previous notebooks, we filter the rows in the data set by `store_id` and `dept_id` and create a segment. Each segment is saved as a pickle file and then upload to Cloud Storage. ```ipython3 segmented_data_dir = pathlib.Path("./segmented_data/") segmented_data_dir.mkdir(exist_ok=True) STORES = [ "CA_1", "CA_2", "CA_3", "CA_4", "TX_1", "TX_2", "TX_3", "WI_1", "WI_2", "WI_3", ] DEPTS = [ "HOBBIES_1", "HOBBIES_2", "HOUSEHOLD_1", "HOUSEHOLD_2", "FOODS_1", "FOODS_2", "FOODS_3", ] grid2_colnm = [ "sell_price", "price_max", "price_min", "price_std", "price_mean", "price_norm", "price_nunique", "item_nunique", "price_momentum", "price_momentum_m", "price_momentum_y", ] grid3_colnm = [ "event_name_1", "event_type_1", "event_name_2", "event_type_2", "snap_CA", "snap_TX", "snap_WI", "tm_d", "tm_w", "tm_m", "tm_y", "tm_wm", "tm_dw", "tm_w_end", ] lag_colnm = [ "sales_lag_28", "sales_lag_29", "sales_lag_30", "sales_lag_31", "sales_lag_32", "sales_lag_33", "sales_lag_34", "sales_lag_35", "sales_lag_36", "sales_lag_37", "sales_lag_38", "sales_lag_39", "sales_lag_40", "sales_lag_41", "sales_lag_42", "rolling_mean_7", "rolling_std_7", "rolling_mean_14", "rolling_std_14", "rolling_mean_30", "rolling_std_30", "rolling_mean_60", "rolling_std_60", "rolling_mean_180", "rolling_std_180", ] target_enc_colnm = [ "enc_store_id_dept_id_mean", "enc_store_id_dept_id_std", "enc_item_id_state_id_mean", "enc_item_id_state_id_std", ] ``` ```ipython3 def prepare_data(store, dept=None): """ Filter and clean data according to stores and product departments Parameters ---------- store: Filter data by retaining rows whose store_id matches this parameter. dept: Filter data by retaining rows whose dept_id matches this parameter. This parameter can be set to None to indicate that we shouldn't filter by dept_id. 
""" if store is None: raise ValueError("store parameter must not be None") if dept is None: grid1 = grid_df[grid_df["store_id"] == store] else: grid1 = grid_df[ (grid_df["store_id"] == store) & (grid_df["dept_id"] == dept) ].drop(columns=["dept_id"]) grid1 = grid1.drop(columns=["release_week", "wm_yr_wk", "store_id", "state_id"]) grid2 = grid_df_with_price[["id", "day_id"] + grid2_colnm] grid_combined = grid1.merge(grid2, on=["id", "day_id"], how="left") del grid1, grid2 grid3 = grid_df_with_calendar[["id", "day_id"] + grid3_colnm] grid_combined = grid_combined.merge(grid3, on=["id", "day_id"], how="left") del grid3 lag_df = grid_df_lags[["id", "day_id"] + lag_colnm] grid_combined = grid_combined.merge(lag_df, on=["id", "day_id"], how="left") del lag_df target_enc_df = grid_df_target_enc[["id", "day_id"] + target_enc_colnm] grid_combined = grid_combined.merge(target_enc_df, on=["id", "day_id"], how="left") del target_enc_df gc.collect() grid_combined = grid_combined.drop(columns=["id"]) grid_combined["day_id"] = ( grid_combined["day_id"] .to_pandas() .astype("str") .apply(lambda x: x[2:]) .astype(np.int16) ) return grid_combined ``` ```ipython3 # First save the segment to the disk for store in STORES: print(f"Processing store {store}...") segment_df = prepare_data(store=store) segment_df.to_pandas().to_pickle( segmented_data_dir / f"combined_df_store_{store}.pkl" ) del segment_df gc.collect() for store in STORES: for dept in DEPTS: print(f"Processing (store {store}, department {dept})...") segment_df = prepare_data(store=store, dept=dept) segment_df.to_pandas().to_pickle( segmented_data_dir / f"combined_df_store_{store}_dept_{dept}.pkl" ) del segment_df gc.collect() ``` ```myst-ansi Processing store CA_1... Processing store CA_2... Processing store CA_3... Processing store CA_4... Processing store TX_1... Processing store TX_2... Processing store TX_3... Processing store WI_1... Processing store WI_2... Processing store WI_3... Processing (store CA_1, department HOBBIES_1)... Processing (store CA_1, department HOBBIES_2)... Processing (store CA_1, department HOUSEHOLD_1)... Processing (store CA_1, department HOUSEHOLD_2)... Processing (store CA_1, department FOODS_1)... Processing (store CA_1, department FOODS_2)... Processing (store CA_1, department FOODS_3)... Processing (store CA_2, department HOBBIES_1)... Processing (store CA_2, department HOBBIES_2)... Processing (store CA_2, department HOUSEHOLD_1)... Processing (store CA_2, department HOUSEHOLD_2)... Processing (store CA_2, department FOODS_1)... Processing (store CA_2, department FOODS_2)... Processing (store CA_2, department FOODS_3)... Processing (store CA_3, department HOBBIES_1)... Processing (store CA_3, department HOBBIES_2)... Processing (store CA_3, department HOUSEHOLD_1)... Processing (store CA_3, department HOUSEHOLD_2)... Processing (store CA_3, department FOODS_1)... Processing (store CA_3, department FOODS_2)... Processing (store CA_3, department FOODS_3)... Processing (store CA_4, department HOBBIES_1)... Processing (store CA_4, department HOBBIES_2)... Processing (store CA_4, department HOUSEHOLD_1)... Processing (store CA_4, department HOUSEHOLD_2)... Processing (store CA_4, department FOODS_1)... Processing (store CA_4, department FOODS_2)... Processing (store CA_4, department FOODS_3)... Processing (store TX_1, department HOBBIES_1)... Processing (store TX_1, department HOBBIES_2)... Processing (store TX_1, department HOUSEHOLD_1)... Processing (store TX_1, department HOUSEHOLD_2)... 
Processing (store TX_1, department FOODS_1)... Processing (store TX_1, department FOODS_2)... Processing (store TX_1, department FOODS_3)... Processing (store TX_2, department HOBBIES_1)... Processing (store TX_2, department HOBBIES_2)... Processing (store TX_2, department HOUSEHOLD_1)... Processing (store TX_2, department HOUSEHOLD_2)... Processing (store TX_2, department FOODS_1)... Processing (store TX_2, department FOODS_2)... Processing (store TX_2, department FOODS_3)... Processing (store TX_3, department HOBBIES_1)... Processing (store TX_3, department HOBBIES_2)... Processing (store TX_3, department HOUSEHOLD_1)... Processing (store TX_3, department HOUSEHOLD_2)... Processing (store TX_3, department FOODS_1)... Processing (store TX_3, department FOODS_2)... Processing (store TX_3, department FOODS_3)... Processing (store WI_1, department HOBBIES_1)... Processing (store WI_1, department HOBBIES_2)... Processing (store WI_1, department HOUSEHOLD_1)... Processing (store WI_1, department HOUSEHOLD_2)... Processing (store WI_1, department FOODS_1)... Processing (store WI_1, department FOODS_2)... Processing (store WI_1, department FOODS_3)... Processing (store WI_2, department HOBBIES_1)... Processing (store WI_2, department HOBBIES_2)... Processing (store WI_2, department HOUSEHOLD_1)... Processing (store WI_2, department HOUSEHOLD_2)... Processing (store WI_2, department FOODS_1)... Processing (store WI_2, department FOODS_2)... Processing (store WI_2, department FOODS_3)... Processing (store WI_3, department HOBBIES_1)... Processing (store WI_3, department HOBBIES_2)... Processing (store WI_3, department HOUSEHOLD_1)... Processing (store WI_3, department HOUSEHOLD_2)... Processing (store WI_3, department FOODS_1)... Processing (store WI_3, department FOODS_2)... Processing (store WI_3, department FOODS_3)... ``` ```ipython3 # Then copy the segment to Cloud Storage fs = gcsfs.GCSFileSystem() for e in segmented_data_dir.glob("*.pkl"): print(f"Uploading {e}...") basename = e.name fs.put_file(e, f"{bucket_name}/{basename}") ``` ```myst-ansi Uploading segmented_data/combined_df_store_CA_3_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_CA_3_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_TX_2_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_WI_3.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_CA_3_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_HOBBIES_1.pkl... 
Uploading segmented_data/combined_df_store_CA_3_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_CA_1.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_TX_2_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_WI_2.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_CA_3_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_WI_1.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_TX_3.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_CA_4.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_CA_3_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_CA_3.pkl... Uploading segmented_data/combined_df_store_TX_2.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_TX_2_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_HOBBIES_1.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_TX_1_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_WI_3_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_CA_2_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_WI_1_dept_FOODS_3.pkl... Uploading segmented_data/combined_df_store_TX_2_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_WI_2_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_CA_4_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_HOUSEHOLD_2.pkl... Uploading segmented_data/combined_df_store_TX_2_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_TX_2_dept_HOBBIES_1.pkl... 
Uploading segmented_data/combined_df_store_TX_2_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_FOODS_2.pkl... Uploading segmented_data/combined_df_store_CA_3_dept_HOUSEHOLD_1.pkl... Uploading segmented_data/combined_df_store_CA_1_dept_HOBBIES_2.pkl... Uploading segmented_data/combined_df_store_TX_1.pkl... Uploading segmented_data/combined_df_store_TX_3_dept_FOODS_1.pkl... Uploading segmented_data/combined_df_store_CA_2.pkl... ``` ```ipython3 # Also upload the product weights fs = gcsfs.GCSFileSystem() weights.to_pandas().to_pickle("product_weights.pkl") fs.put_file("product_weights.pkl", f"{bucket_name}/product_weights.pkl") ``` ## Training and Evaluation with Hyperparameter Optimization (HPO) Now that we finished processing the data, we are now ready to train a model to forecast future sales. We will leverage the worker pods to run multiple training jobs in parallel, speeding up the hyperparameter search. ### Import modules and define constants ```ipython3 import copy import gc import json import pickle import time import cudf import gcsfs import matplotlib import matplotlib.pyplot as plt import numpy as np import optuna import pandas as pd import xgboost as xgb from dask.distributed import Client, wait from dask_kubernetes.operator import KubeCluster from matplotlib.patches import Patch ``` ```ipython3 # Choose the same RAPIDS image you used for launching the notebook session rapids_image = "rapidsai/notebooks:25.12a-cuda12-py3.13" # Use the number of worker nodes in your Kubernetes cluster. n_workers = 2 # Bucket that contains the processed data pickles bucket_name = "" bucket_name = "phcho-m5-competition-hpo-example" # List of stores and product departments STORES = [ "CA_1", "CA_2", "CA_3", "CA_4", "TX_1", "TX_2", "TX_3", "WI_1", "WI_2", "WI_3", ] DEPTS = [ "HOBBIES_1", "HOBBIES_2", "HOUSEHOLD_1", "HOUSEHOLD_2", "FOODS_1", "FOODS_2", "FOODS_3", ] ``` ### Define cross-validation folds **[Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))** is a statistical method for estimating how well a machine learning model generalizes to an independent data set. The method is also useful for evaluating the choice of a given combination of model hyperparameters. To estimate the capacity to generalize, we define multiple cross-validation **folds** consisting of multiple pairs of `(training set, validation set)`. For each fold, we fit a model using the training set and evaluate its accuracy on the validation set. The “goodness” score for a given hyperparameter combination is the accuracy of the model on each validation set, averaged over all cross-validation folds. Great care must be taken when defining cross-validation folds for time-series data. We are not allowed to use the future to predict the past, so the training set must precede (in time) the validation set. Consequently, we partition the data set in the time dimension and assign the training and validation sets using time ranges: ```ipython3 # Cross-validation folds and held-out test set (in time dimension) # The held-out test set is used for final evaluation cv_folds = [ # (train_set, validation_set) ([0, 1114], [1114, 1314]), ([0, 1314], [1314, 1514]), ([0, 1514], [1514, 1714]), ([0, 1714], [1714, 1914]), ] n_folds = len(cv_folds) holdout = [1914, 1942] time_horizon = 1942 ``` It is helpful to visualize the cross-validation folds using Matplotlib. 
```ipython3 cv_cmap = matplotlib.colormaps["cividis"] plt.figure(figsize=(8, 3)) for i, (train_mask, valid_mask) in enumerate(cv_folds): idx = np.array([np.nan] * time_horizon) idx[np.arange(*train_mask)] = 1 idx[np.arange(*valid_mask)] = 0 plt.scatter( range(time_horizon), [i + 0.5] * time_horizon, c=idx, marker="_", capstyle="butt", s=1, lw=20, cmap=cv_cmap, vmin=-1.5, vmax=1.5, ) idx = np.array([np.nan] * time_horizon) idx[np.arange(*holdout)] = -1 plt.scatter( range(time_horizon), [n_folds + 0.5] * time_horizon, c=idx, marker="_", capstyle="butt", s=1, lw=20, cmap=cv_cmap, vmin=-1.5, vmax=1.5, ) plt.xlabel("Time") plt.yticks( ticks=np.arange(n_folds + 1) + 0.5, labels=[f"Fold {i}" for i in range(n_folds)] + ["Holdout"], ) plt.ylim([len(cv_folds) + 1.2, -0.2]) norm = matplotlib.colors.Normalize(vmin=-1.5, vmax=1.5) plt.legend( [ Patch(color=cv_cmap(norm(1))), Patch(color=cv_cmap(norm(0))), Patch(color=cv_cmap(norm(-1))), ], ["Training set", "Validation set", "Held-out test set"], ncol=3, loc="best", ) plt.tight_layout() ``` ### Launch a Dask client on Kubernetes Let us set up a Dask cluster using the `KubeCluster` class. ```ipython3 cluster = KubeCluster( name="rapids-dask", image=rapids_image, worker_command="dask-cuda-worker", n_workers=n_workers, resources={"limits": {"nvidia.com/gpu": "1"}}, env={"EXTRA_PIP_PACKAGES": "optuna gcsfs"}, ) ``` ```ipython3 cluster ``` ```ipython3 client = Client(cluster) client ``` ### Define the custom evaluation metric The M5 forecasting competition defines a custom metric called WRMSSE as follows: $$ WRMSSE = \sum w_i \cdot RMSSE_i $$ i.e. WRMSSE is a weighted sum of RMSSE for all product items $i$. RMSSE is in turn defined to be $$ RMSSE = \sqrt{\frac{1/h \cdot \sum_t{\left(Y_t - \hat{Y}_t\right)}^2}{1/(n-1)\sum_t{(Y_t - Y_{t-1})}^2}} $$ where the squared error of the prediction (forecast) is normalized by the speed at which the sales amount changes per unit time in the training data. Here is the implementation of the WRMSSE using cuDF. We use the product weights $w_i$ as computed in the preprocessing steps above.
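As a quick sanity check of the weighted sum with made-up numbers: two items with weights $w_1 = 0.7$ and $w_2 = 0.3$ and per-item errors $RMSSE_1 = 0.8$ and $RMSSE_2 = 1.2$ would give $WRMSSE = 0.7 \cdot 0.8 + 0.3 \cdot 1.2 = 0.92$.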
```ipython3 def wrmsse(product_weights, df, pred_sales, train_mask, valid_mask): """Compute WRMSSE metric""" df_train = df[(df["day_id"] >= train_mask[0]) & (df["day_id"] < train_mask[1])] df_valid = df[(df["day_id"] >= valid_mask[0]) & (df["day_id"] < valid_mask[1])] # Compute denominator: 1/(n-1) * sum( (y(t) - y(t-1))**2 ) diff = ( df_train.sort_values(["item_id", "day_id"]) .groupby(["item_id"])[["sales"]] .diff(1) ) x = ( df_train[["item_id", "day_id"]] .join(diff, how="left") .rename(columns={"sales": "diff"}) .sort_values(["item_id", "day_id"]) ) x["diff"] = x["diff"] ** 2 xx = x.groupby(["item_id"])[["diff"]].agg(["sum", "count"]).sort_index() xx.columns = xx.columns.map("_".join) xx["denominator"] = xx["diff_sum"] / xx["diff_count"] xx.reset_index() # Compute numerator: 1/h * sum( (y(t) - y_pred(t))**2 ) X_valid = df_valid.drop(columns=["item_id", "cat_id", "day_id", "sales"]) if "dept_id" in X_valid.columns: X_valid = X_valid.drop(columns=["dept_id"]) df_pred = cudf.DataFrame( { "item_id": df_valid["item_id"].copy(), "pred_sales": pred_sales, "sales": df_valid["sales"].copy(), } ) df_pred["diff"] = (df_pred["sales"] - df_pred["pred_sales"]) ** 2 yy = df_pred.groupby(["item_id"])[["diff"]].agg(["sum", "count"]).sort_index() yy.columns = yy.columns.map("_".join) yy["numerator"] = yy["diff_sum"] / yy["diff_count"] zz = yy[["numerator"]].join(xx[["denominator"]], how="left") zz = zz.join(product_weights, how="left").sort_index() # Filter out zero denominator. # This can occur if the product was never on sale during the period in the training set zz = zz[zz["denominator"] != 0] zz["rmsse"] = np.sqrt(zz["numerator"] / zz["denominator"]) return zz["rmsse"].multiply(zz["weights"]).sum() ``` ### Define the training and hyperparameter search pipeline using Optuna Optuna lets us define the training procedure iteratively, i.e. as if we were to write an ordinary function to train a single model. Instead of a fixed hyperparameter combination, the function now takes in a `trial` object which yields different hyperparameter combinations. In this example, we partition the training data according to the store and then fit a separate XGBoost model per data segment. 
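The Optuna pattern itself is compact. Before reading the full objective below, here is a minimal, self-contained sketch of the define-by-run API; the quadratic objective is purely illustrative and has nothing to do with the sales data:

```ipython3
import optuna


def toy_objective(trial):
    # Ask the trial object for a hyperparameter value ...
    x = trial.suggest_float("x", -10.0, 10.0)
    # ... and return the score that Optuna should minimize
    return (x - 2.0) ** 2


toy_study = optuna.create_study(direction="minimize")
toy_study.optimize(toy_objective, n_trials=20)
print(toy_study.best_params)  # expected to be close to {"x": 2.0}
```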
```ipython3 def objective(trial): fs = gcsfs.GCSFileSystem() with fs.open(f"{bucket_name}/product_weights.pkl", "rb") as f: product_weights = cudf.DataFrame(pd.read_pickle(f)) params = { "n_estimators": 100, "verbosity": 0, "learning_rate": 0.01, "objective": "reg:tweedie", "tree_method": "gpu_hist", "grow_policy": "depthwise", "predictor": "gpu_predictor", "enable_categorical": True, "lambda": trial.suggest_float("lambda", 1e-8, 100.0, log=True), "alpha": trial.suggest_float("alpha", 1e-8, 100.0, log=True), "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0), "max_depth": trial.suggest_int("max_depth", 2, 6, step=1), "min_child_weight": trial.suggest_float( "min_child_weight", 1e-8, 100, log=True ), "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True), "tweedie_variance_power": trial.suggest_float("tweedie_variance_power", 1, 2), } scores = [[] for store in STORES] for store_id, store in enumerate(STORES): print(f"Processing store {store}...") with fs.open(f"{bucket_name}/combined_df_store_{store}.pkl", "rb") as f: df = cudf.DataFrame(pd.read_pickle(f)) for train_mask, valid_mask in cv_folds: df_train = df[ (df["day_id"] >= train_mask[0]) & (df["day_id"] < train_mask[1]) ] df_valid = df[ (df["day_id"] >= valid_mask[0]) & (df["day_id"] < valid_mask[1]) ] X_train, y_train = ( df_train.drop( columns=["item_id", "dept_id", "cat_id", "day_id", "sales"] ), df_train["sales"], ) X_valid = df_valid.drop( columns=["item_id", "dept_id", "cat_id", "day_id", "sales"] ) clf = xgb.XGBRegressor(**params) clf.fit(X_train, y_train) pred_sales = clf.predict(X_valid) scores[store_id].append( wrmsse(product_weights, df, pred_sales, train_mask, valid_mask) ) del df_train, df_valid, X_train, y_train, clf gc.collect() del df gc.collect() # We can sum WRMSSE scores over data segments because data segments contain disjoint sets of time series return np.array(scores).sum(axis=0).mean() ``` Using the Dask cluster client, we execute multiple training jobs in parallel. Optuna keeps track of the progress in the hyperparameter search using in-memory Dask storage. 
```ipython3 ##### Number of hyperparameter combinations to try in parallel n_trials = 9 # Using a small n_trials so that the demo can finish quickly # n_trials = 100 # Optimize in parallel on your Dask cluster backend_storage = optuna.storages.InMemoryStorage() dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) study = optuna.create_study( direction="minimize", sampler=optuna.samplers.RandomSampler(seed=0), storage=dask_storage, ) futures = [] for i in range(0, n_trials, n_workers): iter_range = (i, min([i + n_workers, n_trials])) futures.append( { "range": iter_range, "futures": [ client.submit( # Work around bug https://github.com/optuna/optuna/issues/4859 lambda objective, n_trials: ( study.sampler.reseed_rng(), study.optimize(objective, n_trials), ), objective, n_trials=1, pure=False, ) for _ in range(*iter_range) ], } ) tstart = time.perf_counter() for partition in futures: iter_range = partition["range"] print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}") _ = wait(partition["futures"]) for fut in partition["futures"]: _ = fut.result() # Ensure that the training job was successful tnow = time.perf_counter() print( f"Best cross-validation metric: {study.best_value}, Time elapsed = {tnow - tstart}" ) tend = time.perf_counter() print(f"Total time elapsed = {tend - tstart}") ``` ```myst-ansi /tmp/ipykernel_1321/3389696366.py:7: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future. dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) ``` ```myst-ansi Testing hyperparameter combinations 0..2 Best cross-validation metric: 10.027767173304472, Time elapsed = 331.6198390149948 Testing hyperparameter combinations 2..4 Best cross-validation metric: 9.426913749927916, Time elapsed = 640.7606940959959 Testing hyperparameter combinations 4..6 Best cross-validation metric: 9.426913749927916, Time elapsed = 958.0816706369951 Testing hyperparameter combinations 6..8 Best cross-validation metric: 9.426913749927916, Time elapsed = 1295.700604706988 Testing hyperparameter combinations 8..9 Best cross-validation metric: 8.915009508695244, Time elapsed = 1476.1182343699911 Total time elapsed = 1476.1219055669935 ``` Once the hyperparameter search is complete, we fetch the optimal hyperparameter combination using the attributes of the `study` object. ```ipython3 study.best_params ``` ```ipython3 study.best_trial ``` ```ipython3 # Make a deep copy to preserve the dictionary after deleting the Dask cluster best_params = copy.deepcopy(study.best_params) best_params ``` ```ipython3 fs = gcsfs.GCSFileSystem() with fs.open(f"{bucket_name}/params.json", "w") as f: json.dump(best_params, f) ``` ### Train the final XGBoost model and evaluate Using the optimal hyperparameters found in the search, fit a new model using the whole training data. As in the previous section, we fit a separate XGBoost model per data segment. 
```ipython3 fs = gcsfs.GCSFileSystem() with fs.open(f"{bucket_name}/params.json", "r") as f: best_params = json.load(f) with fs.open(f"{bucket_name}/product_weights.pkl", "rb") as f: product_weights = cudf.DataFrame(pd.read_pickle(f)) ``` ```ipython3 def final_train(best_params): fs = gcsfs.GCSFileSystem() params = { "n_estimators": 100, "verbosity": 0, "learning_rate": 0.01, "objective": "reg:tweedie", "tree_method": "gpu_hist", "grow_policy": "depthwise", "predictor": "gpu_predictor", "enable_categorical": True, } params.update(best_params) model = {} train_mask = [0, 1914] for store in STORES: print(f"Processing store {store}...") with fs.open(f"{bucket_name}/combined_df_store_{store}.pkl", "rb") as f: df = cudf.DataFrame(pd.read_pickle(f)) df_train = df[(df["day_id"] >= train_mask[0]) & (df["day_id"] < train_mask[1])] X_train, y_train = ( df_train.drop(columns=["item_id", "dept_id", "cat_id", "day_id", "sales"]), df_train["sales"], ) clf = xgb.XGBRegressor(**params) clf.fit(X_train, y_train) model[store] = clf del df gc.collect() return model ``` ```ipython3 model = final_train(best_params) ``` ```myst-ansi Processing store CA_1... Processing store CA_2... Processing store CA_3... Processing store CA_4... Processing store TX_1... Processing store TX_2... Processing store TX_3... Processing store WI_1... Processing store WI_2... Processing store WI_3... ``` Let’s now evaluate the final model using the held-out test set: ```ipython3 test_wrmsse = 0 for store in STORES: with fs.open(f"{bucket_name}/combined_df_store_{store}.pkl", "rb") as f: df = cudf.DataFrame(pd.read_pickle(f)) df_test = df[(df["day_id"] >= holdout[0]) & (df["day_id"] < holdout[1])] X_test = df_test.drop(columns=["item_id", "dept_id", "cat_id", "day_id", "sales"]) pred_sales = model[store].predict(X_test) test_wrmsse += wrmsse( product_weights, df, pred_sales, train_mask=[0, 1914], valid_mask=holdout ) print(f"WRMSSE metric on the held-out test set: {test_wrmsse}") ``` ```myst-ansi WRMSSE metric on the held-out test set: 9.478942050051291 ``` ```ipython3 # Save the model to the Cloud Storage with fs.open(f"{bucket_name}/final_model.pkl", "wb") as f: pickle.dump(model, f) ``` ## Create an ensemble model using a different strategy for segmenting sales data It is common to create an ensemble model where multiple machine learning methods are used to obtain better predictive performance. Prediction is made from an ensemble model by averaging the prediction output of the constituent models. In this example, we will create a second model by segmenting the sales data in a different way. Instead of splitting by stores, we will split the data by both stores and product categories. 
```ipython3 def objective_alt(trial): fs = gcsfs.GCSFileSystem() with fs.open(f"{bucket_name}/product_weights.pkl", "rb") as f: product_weights = cudf.DataFrame(pd.read_pickle(f)) params = { "n_estimators": 100, "verbosity": 0, "learning_rate": 0.01, "objective": "reg:tweedie", "tree_method": "gpu_hist", "grow_policy": "depthwise", "predictor": "gpu_predictor", "enable_categorical": True, "lambda": trial.suggest_float("lambda", 1e-8, 100.0, log=True), "alpha": trial.suggest_float("alpha", 1e-8, 100.0, log=True), "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0), "max_depth": trial.suggest_int("max_depth", 2, 6, step=1), "min_child_weight": trial.suggest_float( "min_child_weight", 1e-8, 100, log=True ), "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True), "tweedie_variance_power": trial.suggest_float("tweedie_variance_power", 1, 2), } scores = [[] for i in range(len(STORES) * len(DEPTS))] for store_id, store in enumerate(STORES): for dept_id, dept in enumerate(DEPTS): print(f"Processing store {store}, department {dept}...") with fs.open( f"{bucket_name}/combined_df_store_{store}_dept_{dept}.pkl", "rb" ) as f: df = cudf.DataFrame(pd.read_pickle(f)) for train_mask, valid_mask in cv_folds: df_train = df[ (df["day_id"] >= train_mask[0]) & (df["day_id"] < train_mask[1]) ] df_valid = df[ (df["day_id"] >= valid_mask[0]) & (df["day_id"] < valid_mask[1]) ] X_train, y_train = ( df_train.drop(columns=["item_id", "cat_id", "day_id", "sales"]), df_train["sales"], ) X_valid = df_valid.drop( columns=["item_id", "cat_id", "day_id", "sales"] ) clf = xgb.XGBRegressor(**params) clf.fit(X_train, y_train) sales_pred = clf.predict(X_valid) scores[store_id * len(DEPTS) + dept_id].append( wrmsse(product_weights, df, sales_pred, train_mask, valid_mask) ) del df_train, df_valid, X_train, y_train, clf gc.collect() del df gc.collect() # We can sum WRMSSE scores over data segments because data segments contain disjoint sets of time series return np.array(scores).sum(axis=0).mean() ``` ```ipython3 ##### Number of hyperparameter combinations to try in parallel n_trials = 9 # Using a small n_trials so that the demo can finish quickly # n_trials = 100 # Optimize in parallel on your Dask cluster backend_storage = optuna.storages.InMemoryStorage() dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) study = optuna.create_study( direction="minimize", sampler=optuna.samplers.RandomSampler(seed=0), storage=dask_storage, ) futures = [] for i in range(0, n_trials, n_workers): iter_range = (i, min([i + n_workers, n_trials])) futures.append( { "range": iter_range, "futures": [ client.submit( # Work around bug https://github.com/optuna/optuna/issues/4859 lambda objective, n_trials: ( study.sampler.reseed_rng(), study.optimize(objective, n_trials), ), objective_alt, n_trials=1, pure=False, ) for _ in range(*iter_range) ], } ) tstart = time.perf_counter() for partition in futures: iter_range = partition["range"] print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}") _ = wait(partition["futures"]) for fut in partition["futures"]: _ = fut.result() # Ensure that the training job was successful tnow = time.perf_counter() print( f"Best cross-validation metric: {study.best_value}, Time elapsed = {tnow - tstart}" ) tend = time.perf_counter() print(f"Total time elapsed = {tend - tstart}") ``` ```myst-ansi /tmp/ipykernel_1321/491731696.py:7: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future. 
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client) ``` ```myst-ansi Testing hyperparameter combinations 0..2 Best cross-validation metric: 9.896445497438858, Time elapsed = 802.2191872399999 Testing hyperparameter combinations 2..4 Best cross-validation metric: 9.896445497438858, Time elapsed = 1494.0718872279976 Testing hyperparameter combinations 4..6 Best cross-validation metric: 9.835407407395302, Time elapsed = 2393.3159628150024 Testing hyperparameter combinations 6..8 Best cross-validation metric: 9.330048901795887, Time elapsed = 3092.471466117 Testing hyperparameter combinations 8..9 Best cross-validation metric: 9.330048901795887, Time elapsed = 3459.9082761530008 Total time elapsed = 3459.911843854992 ``` ```ipython3 # Make a deep copy to preserve the dictionary after deleting the Dask cluster best_params_alt = copy.deepcopy(study.best_params) best_params_alt ``` ```ipython3 fs = gcsfs.GCSFileSystem() with fs.open(f"{bucket_name}/params_alt.json", "w") as f: json.dump(best_params_alt, f) ``` Using the optimal hyperparameters found in the search, fit a new model using the whole training data. ```ipython3 def final_train_alt(best_params): fs = gcsfs.GCSFileSystem() params = { "n_estimators": 100, "verbosity": 0, "learning_rate": 0.01, "objective": "reg:tweedie", "tree_method": "gpu_hist", "grow_policy": "depthwise", "predictor": "gpu_predictor", "enable_categorical": True, } params.update(best_params) model = {} train_mask = [0, 1914] for _, store in enumerate(STORES): for _, dept in enumerate(DEPTS): print(f"Processing store {store}, department {dept}...") with fs.open( f"{bucket_name}/combined_df_store_{store}_dept_{dept}.pkl", "rb" ) as f: df = cudf.DataFrame(pd.read_pickle(f)) for train_mask, _ in cv_folds: df_train = df[ (df["day_id"] >= train_mask[0]) & (df["day_id"] < train_mask[1]) ] X_train, y_train = ( df_train.drop(columns=["item_id", "cat_id", "day_id", "sales"]), df_train["sales"], ) clf = xgb.XGBRegressor(**params) clf.fit(X_train, y_train) model[(store, dept)] = clf del df gc.collect() return model ``` ```ipython3 fs = gcsfs.GCSFileSystem() with fs.open(f"{bucket_name}/params_alt.json", "r") as f: best_params_alt = json.load(f) with fs.open(f"{bucket_name}/product_weights.pkl", "rb") as f: product_weights = cudf.DataFrame(pd.read_pickle(f)) ``` ```ipython3 model_alt = final_train_alt(best_params_alt) ``` ```myst-ansi Processing store CA_1, department HOBBIES_1... Processing store CA_1, department HOBBIES_2... Processing store CA_1, department HOUSEHOLD_1... Processing store CA_1, department HOUSEHOLD_2... Processing store CA_1, department FOODS_1... Processing store CA_1, department FOODS_2... Processing store CA_1, department FOODS_3... Processing store CA_2, department HOBBIES_1... Processing store CA_2, department HOBBIES_2... Processing store CA_2, department HOUSEHOLD_1... Processing store CA_2, department HOUSEHOLD_2... Processing store CA_2, department FOODS_1... Processing store CA_2, department FOODS_2... Processing store CA_2, department FOODS_3... Processing store CA_3, department HOBBIES_1... Processing store CA_3, department HOBBIES_2... Processing store CA_3, department HOUSEHOLD_1... Processing store CA_3, department HOUSEHOLD_2... Processing store CA_3, department FOODS_1... Processing store CA_3, department FOODS_2... Processing store CA_3, department FOODS_3... Processing store CA_4, department HOBBIES_1... Processing store CA_4, department HOBBIES_2... Processing store CA_4, department HOUSEHOLD_1... 
Processing store CA_4, department HOUSEHOLD_2... Processing store CA_4, department FOODS_1... Processing store CA_4, department FOODS_2... Processing store CA_4, department FOODS_3... Processing store TX_1, department HOBBIES_1... Processing store TX_1, department HOBBIES_2... Processing store TX_1, department HOUSEHOLD_1... Processing store TX_1, department HOUSEHOLD_2... Processing store TX_1, department FOODS_1... Processing store TX_1, department FOODS_2... Processing store TX_1, department FOODS_3... Processing store TX_2, department HOBBIES_1... Processing store TX_2, department HOBBIES_2... Processing store TX_2, department HOUSEHOLD_1... Processing store TX_2, department HOUSEHOLD_2... Processing store TX_2, department FOODS_1... Processing store TX_2, department FOODS_2... Processing store TX_2, department FOODS_3... Processing store TX_3, department HOBBIES_1... Processing store TX_3, department HOBBIES_2... Processing store TX_3, department HOUSEHOLD_1... Processing store TX_3, department HOUSEHOLD_2... Processing store TX_3, department FOODS_1... Processing store TX_3, department FOODS_2... Processing store TX_3, department FOODS_3... Processing store WI_1, department HOBBIES_1... Processing store WI_1, department HOBBIES_2... Processing store WI_1, department HOUSEHOLD_1... Processing store WI_1, department HOUSEHOLD_2... Processing store WI_1, department FOODS_1... Processing store WI_1, department FOODS_2... Processing store WI_1, department FOODS_3... Processing store WI_2, department HOBBIES_1... Processing store WI_2, department HOBBIES_2... Processing store WI_2, department HOUSEHOLD_1... Processing store WI_2, department HOUSEHOLD_2... Processing store WI_2, department FOODS_1... Processing store WI_2, department FOODS_2... Processing store WI_2, department FOODS_3... Processing store WI_3, department HOBBIES_1... Processing store WI_3, department HOBBIES_2... Processing store WI_3, department HOUSEHOLD_1... Processing store WI_3, department HOUSEHOLD_2... Processing store WI_3, department FOODS_1... Processing store WI_3, department FOODS_2... Processing store WI_3, department FOODS_3... ``` ```ipython3 # Save the model to the Cloud Storage with fs.open(f"{bucket_name}/final_model_alt.pkl", "wb") as f: pickle.dump(model_alt, f) ``` Now consider an ensemble consisting of the two models `model` and `model_alt`. We evaluate the ensemble by computing the WRMSSE metric for the average of the predictions of the two models. 
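To make the evaluation explicit: for each row of the hold-out period the ensemble prediction is simply the unweighted average of the two models' predictions, and it is this averaged series that is scored with WRMSSE. This restates the procedure implemented in the next cell, with the two per-model predictions written as $\hat{y}^{(1)}$ and $\hat{y}^{(2)}$:

$$
\hat{y}^{\mathrm{ens}}_i = \frac{\hat{y}^{(1)}_i + \hat{y}^{(2)}_i}{2}
$$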
```ipython3 test_wrmsse = 0 for store in STORES: print(f"Processing store {store}...") # Prediction from Model 1 with fs.open(f"{bucket_name}/combined_df_store_{store}.pkl", "rb") as f: df = cudf.DataFrame(pd.read_pickle(f)) df_test = df[(df["day_id"] >= holdout[0]) & (df["day_id"] < holdout[1])] X_test = df_test.drop(columns=["item_id", "dept_id", "cat_id", "day_id", "sales"]) df_test["pred1"] = model[store].predict(X_test) # Prediction from Model 2 df_test["pred2"] = [np.nan] * len(df_test) df_test["pred2"] = df_test["pred2"].astype("float32") for dept in DEPTS: with fs.open( f"{bucket_name}/combined_df_store_{store}_dept_{dept}.pkl", "rb" ) as f: df2 = cudf.DataFrame(pd.read_pickle(f)) df2_test = df2[(df2["day_id"] >= holdout[0]) & (df2["day_id"] < holdout[1])] X_test = df2_test.drop(columns=["item_id", "cat_id", "day_id", "sales"]) assert np.sum(df_test["dept_id"] == dept) == len(X_test) df_test["pred2"][df_test["dept_id"] == dept] = model_alt[(store, dept)].predict( X_test ) # Average prediction df_test["avg_pred"] = (df_test["pred1"] + df_test["pred2"]) / 2.0 test_wrmsse += wrmsse( product_weights, df, df_test["avg_pred"], train_mask=[0, 1914], valid_mask=holdout, ) print(f"WRMSSE metric on the held-out test set: {test_wrmsse}") ``` ```myst-ansi Processing store CA_1... Processing store CA_2... Processing store CA_3... Processing store CA_4... Processing store TX_1... Processing store TX_2... Processing store TX_3... Processing store WI_1... Processing store WI_2... Processing store WI_3... WRMSSE metric on the held-out test set: 10.69187847848366 ``` ```ipython3 # Close the Dask cluster to clean up cluster.close() ``` ## Conclusion We demonstrated an end-to-end workflow where we take a real-world time-series data and train a forecasting model using Google Kubernetes Engine (GKE). We were able to speed up the hyperparameter optimization (HPO) process by dispatching parallel training jobs to NVIDIA GPUs. # index.html.md # Running RAPIDS Hyperparameter Experiments at Scale on Amazon SageMaker *January, 2023* ## Import packages and create Amazon SageMaker and Boto3 sessions ```ipython3 import time import boto3 import sagemaker ``` ```ipython3 execution_role = sagemaker.get_execution_role() session = sagemaker.Session() region = boto3.Session().region_name account = boto3.client("sts").get_caller_identity().get("Account") ``` ```ipython3 account, region ``` ## Upload the higgs-boson dataset to s3 bucket ```ipython3 !mkdir -p ./dataset !if [ ! -f "dataset/HIGGS.csv" ]; then wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; fi !if [ ! -f "dataset/HIGGS.csv" ]; then gunzip dataset/HIGGS.csv.gz; fi ``` ```ipython3 s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset") ``` ```ipython3 s3_data_dir ``` ## Download latest RAPIDS container from DockerHub To build our RAPIDS Docker container compatible with Amazon SageMaker, you’ll start with base RAPIDS container, which the nice people at NVIDIA have already built and pushed to [DockerHub](https://hub.docker.com/r/rapidsai/base/tags). 
You will need to extend this container by creating a Dockerfile, copying the training script and installing [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) to makes RAPIDS compatible with SageMaker ```ipython3 estimator_info = { "rapids_container": "rapidsai/base:25.12a-cuda12-py3.13", "ecr_image": "sagemaker-rapids-higgs:latest", "ecr_repository": "sagemaker-rapids-higgs", } ``` ```ipython3 %%time !docker pull {estimator_info['rapids_container']} ``` ```ipython3 !cat Dockerfile ``` ```myst-ansi ARG RAPIDS_IMAGE FROM $RAPIDS_IMAGE as rapids # Installs a few more dependencies RUN conda install --yes -n base \ cupy \ flask \ protobuf \ 'sagemaker-python-sdk>=2.239.0' # Copies the training code inside the container COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py # Defines rapids-higgs.py as script entry point # ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html ENV SAGEMAKER_PROGRAM rapids-higgs.py # override entrypoint from the base image with one that accepts # 'train' and 'serve' (as SageMaker expects to provide) COPY entrypoint.sh /opt/entrypoint.sh ENTRYPOINT ["/opt/entrypoint.sh"] ``` ```ipython3 !docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} . ``` ```myst-ansi Sending build context to Docker daemon 7.68kB Step 1/7 : ARG RAPIDS_IMAGE Step 2/7 : FROM $RAPIDS_IMAGE as rapids ---> a80bdce0d796 Step 3/7 : RUN conda install --yes -n base cupy flask protobuf sagemaker ---> Running in f6522ce9b303 Channels: - rapidsai-nightly - dask/label/dev - pytorch - conda-forge - nvidia Platform: linux-64 Collecting package metadata (repodata.json): ...working... done Solving environment: ...working... done ## Package Plan ## environment location: /opt/conda added / updated specs: - cupy - flask - protobuf - sagemaker The following packages will be downloaded: package | build ---------------------------|----------------- blinker-1.8.2 | pyhd8ed1ab_0 14 KB conda-forge boto3-1.34.118 | pyhd8ed1ab_0 78 KB conda-forge botocore-1.34.118 |pyge310_1234567_0 6.8 MB conda-forge dill-0.3.8 | pyhd8ed1ab_0 86 KB conda-forge flask-3.0.3 | pyhd8ed1ab_0 79 KB conda-forge google-pasta-0.2.0 | pyh8c360ce_0 42 KB conda-forge itsdangerous-2.2.0 | pyhd8ed1ab_0 19 KB conda-forge jmespath-1.0.1 | pyhd8ed1ab_0 21 KB conda-forge multiprocess-0.70.16 | py310h2372a71_0 238 KB conda-forge openssl-3.3.1 | h4ab18f5_0 2.8 MB conda-forge pathos-0.3.2 | pyhd8ed1ab_1 52 KB conda-forge pox-0.3.4 | pyhd8ed1ab_0 26 KB conda-forge ppft-1.7.6.8 | pyhd8ed1ab_0 33 KB conda-forge protobuf-4.25.3 | py310ha8c1f0e_0 325 KB conda-forge protobuf3-to-dict-0.1.5 | py310hff52083_8 14 KB conda-forge s3transfer-0.10.1 | pyhd8ed1ab_0 61 KB conda-forge sagemaker-2.75.1 | pyhd8ed1ab_0 377 KB conda-forge smdebug-rulesconfig-1.0.1 | pyhd3deb0d_1 20 KB conda-forge werkzeug-3.0.3 | pyhd8ed1ab_0 237 KB conda-forge ------------------------------------------------------------ Total: 11.2 MB The following NEW packages will be INSTALLED: blinker conda-forge/noarch::blinker-1.8.2-pyhd8ed1ab_0 boto3 conda-forge/noarch::boto3-1.34.118-pyhd8ed1ab_0 botocore conda-forge/noarch::botocore-1.34.118-pyge310_1234567_0 dill conda-forge/noarch::dill-0.3.8-pyhd8ed1ab_0 flask conda-forge/noarch::flask-3.0.3-pyhd8ed1ab_0 google-pasta conda-forge/noarch::google-pasta-0.2.0-pyh8c360ce_0 itsdangerous conda-forge/noarch::itsdangerous-2.2.0-pyhd8ed1ab_0 jmespath conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0 multiprocess 
conda-forge/linux-64::multiprocess-0.70.16-py310h2372a71_0 pathos conda-forge/noarch::pathos-0.3.2-pyhd8ed1ab_1 pox conda-forge/noarch::pox-0.3.4-pyhd8ed1ab_0 ppft conda-forge/noarch::ppft-1.7.6.8-pyhd8ed1ab_0 protobuf conda-forge/linux-64::protobuf-4.25.3-py310ha8c1f0e_0 protobuf3-to-dict conda-forge/linux-64::protobuf3-to-dict-0.1.5-py310hff52083_8 s3transfer conda-forge/noarch::s3transfer-0.10.1-pyhd8ed1ab_0 sagemaker conda-forge/noarch::sagemaker-2.75.1-pyhd8ed1ab_0 smdebug-rulesconf~ conda-forge/noarch::smdebug-rulesconfig-1.0.1-pyhd3deb0d_1 werkzeug conda-forge/noarch::werkzeug-3.0.3-pyhd8ed1ab_0 The following packages will be UPDATED: openssl 3.3.0-h4ab18f5_3 --> 3.3.1-h4ab18f5_0 Downloading and Extracting Packages: ...working... done Preparing transaction: ...working... done Verifying transaction: ...working... done Executing transaction: ...working... done Removing intermediate container f6522ce9b303 ---> 883c682b36bc Step 4/7 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py ---> 2f6b3e0bec44 Step 5/7 : ENV SAGEMAKER_PROGRAM rapids-higgs.py ---> Running in df524941c02e Removing intermediate container df524941c02e ---> 4cf437176c8c Step 6/7 : COPY entrypoint.sh /opt/entrypoint.sh ---> 32d95ff5bd74 Step 7/7 : ENTRYPOINT ["/opt/entrypoint.sh"] ---> Running in c396fa9e98ad Removing intermediate container c396fa9e98ad ---> 39f900bfeba0 Successfully built 39f900bfeba0 Successfully tagged sagemaker-rapids-higgs:latest ``` ```ipython3 !docker images ``` ## Publish to Elastic Container Registry When running a large-scale training job either for distributed training or for independent experiments, you will need to make sure that datasets and training scripts are all replicated at each instance in your cluster. Thankfully, the more painful of the two — moving datasets — is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready, you simply need to push it to a container registry, and Amazon SageMaker will then pull it into each of the training compute instances in the cluster. Note: SageMaker does not support using training images from private docker registry (ie. DockerHub), so we need to push the SageMaker-compatible RAPIDS container to the Amazon Elastic Container Registry (Amazon ECR) to store your Amazon SageMaker compatible RAPIDS container and make it available for Amazon SageMaker. 
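A small practical aside before tagging and pushing: the `aws ecr get-login` helper used below exists only in AWS CLI v1 and was removed in CLI v2. If your environment has CLI v2, an equivalent login (adjust to your own account and region) would be:

```ipython3
# AWS CLI v2 equivalent of `aws ecr get-login` for authenticating Docker against ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com
```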
```ipython3 ECR_container_fullname = ( f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}" ) ``` ```ipython3 ECR_container_fullname ``` ```ipython3 !docker tag {estimator_info['ecr_image']} {ECR_container_fullname} ``` ```ipython3 print( f"source : {estimator_info['ecr_image']}\n" f"destination : {ECR_container_fullname}" ) ``` ```myst-ansi source : sagemaker-rapids-higgs:latest destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest ``` ```ipython3 !aws ecr create-repository --repository-name {estimator_info['ecr_repository']} !$(aws ecr get-login --no-include-email --region {region}) ``` ```ipython3 !docker push {ECR_container_fullname} ``` ```myst-ansi The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs] 3be3c6f4: Preparing a7112765: Preparing 5c05c772: Preparing bdce5066: Preparing 923ec1b3: Preparing 3fcfb3d4: Preparing bf18a086: Preparing f3ff1008: Preparing b6fb91b8: Preparing 7bf1eb99: Preparing 264186e1: Preparing 7d7711e0: Preparing ee96f292: Preparing e2a80b3f: Preparing 0a873d7a: Preparing bcc60d01: Preparing 1dcee623: Preparing 9a46b795: Preparing 5e83c163: Preparing c05c772: Pushed 643.1MB/637.1MB9Alatest: digest: sha256:c8172a0ad30cd39b091f5fc3f3cde922ceabb103d0a0ec90beb1a5c4c9c6c97c size: 4504 ``` ## Testing your Amazon SageMaker compatible RAPIDS container locally Before you go off and spend time and money on running a large experiment on a large cluster, you should run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have [SageMaker SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk) installed on your local machine. Define some default hyperparameters. Take your best guess, you can find the full list of RandomForest hyperparameters on the [cuML docs](https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest) page. ```ipython3 hyperparams = { "n_estimators": 15, "max_depth": 5, "n_bins": 8, "split_criterion": 0, # GINI:0, ENTROPY:1 "bootstrap": 0, # true: sample with replacement, false: sample without replacement "max_leaves": -1, # unlimited leaves "max_features": 0.2, } ``` Now, specify the instance type as `local_gpu`. This assumes that you have a GPU locally. If you don’t have a local GPU, you can test this on a Amazon SageMaker managed GPU instance — simply replace `local_gpu` with with a `p3` or `p2` GPU instance by updating the `instance_type` variable. ```ipython3 from sagemaker.estimator import Estimator rapids_estimator = Estimator( image_uri=ECR_container_fullname, role=execution_role, instance_count=1, instance_type="ml.p3.2xlarge", #'local_gpu' max_run=60 * 60 * 24, max_wait=(60 * 60 * 24) + 1, use_spot_instances=True, hyperparameters=hyperparams, metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}], ) ``` ```ipython3 %%time rapids_estimator.fit(inputs=s3_data_dir) ``` ```myst-ansi INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371 ``` ```myst-ansi 2024-06-05 02:14:30 Starting - Starting the training job... 2024-06-05 02:14:54 Starting - Preparing the instances for training... 2024-06-05 02:15:26 Downloading - Downloading input data.................. 2024-06-05 02:18:16 Downloading - Downloading the training image... 2024-06-05 02:18:47 Training - Training image download completed. 
Training in progress...@ entrypoint -> launching training script  2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029 2024-06-05 02:19:35 Completed - Training job completed Training seconds: 249 Billable seconds: 78 Managed Spot Training savings: 68.7% CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms Wall time: 5min 43s ``` Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel. ## Define hyperparameter ranges and run a large-scale search experiment There’s not a whole lot of code changes required to go from local training to training at scale. First, rather than define a fixed set of hyperparameters, you’ll define a range using the SageMaker SDK: ```ipython3 from sagemaker.tuner import ( CategoricalParameter, ContinuousParameter, HyperparameterTuner, IntegerParameter, ) hyperparameter_ranges = { "n_estimators": IntegerParameter(10, 200), "max_depth": IntegerParameter(1, 22), "n_bins": IntegerParameter(5, 24), "split_criterion": CategoricalParameter([0, 1]), "bootstrap": CategoricalParameter([True, False]), "max_features": ContinuousParameter(0.01, 0.5), } ``` Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 based GPU instance — `ml.p3.8xlarge`. If you have a training script that can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training. ```ipython3 from sagemaker.estimator import Estimator rapids_estimator = Estimator( image_uri=ECR_container_fullname, role=execution_role, instance_count=2, instance_type="ml.p3.8xlarge", max_run=60 * 60 * 24, max_wait=(60 * 60 * 24) + 1, use_spot_instances=True, hyperparameters=hyperparams, metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}], ) ``` Now you define a HyperparameterTuner object using the estimator you defined above. ```ipython3 tuner = HyperparameterTuner( rapids_estimator, objective_metric_name="test_acc", hyperparameter_ranges=hyperparameter_ranges, strategy="Bayesian", max_jobs=2, max_parallel_jobs=2, objective_type="Maximize", metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}], ) ``` ```ipython3 job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime()) tuner.fit({"dataset": s3_data_dir}, job_name=job_name) ``` ## Clean up - Delete S3 buckets and files you don’t need - Kill training jobs that you don’t want running - Delete container images and the repository you just created ```ipython3 !aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']} ``` # index.html.md # Autoscaling Multi-Tenant Kubernetes Deep-Dive *February, 2023* In this example we are going to take a deep-dive into launching an autoscaling multi-tenant RAPIDS environment on Kubernetes. Being able to scale out your workloads and only pay for the resources you use is a fantastic way to save costs when using RAPIDS. If you have many folks in your organization who all want to be able to do this you can get added benefits by pooling your resources into an autoscaling Kubernetes cluster. Let’s run through the steps required to launch a Kubernetes cluster on [Google Cloud](https://cloud.google.com), then simulate the workloads of many users sharing the cluster. 
Then we can explore what that experience was like both from a user perspective and also from a cost perspective. ## Prerequisites Before we get started you’ll need to ensure you have a few CLI tools installed. - [`gcloud`](https://cloud.google.com/sdk/gcloud) (and make sure you run [`gcloud auth login`](https://cloud.google.com/sdk/gcloud/reference/auth/login)) - [`kubectl`](https://kubernetes.io/docs/tasks/tools/) - [`helm`](https://helm.sh/docs/intro/install/) ## Get a Kubernetes Cluster For this example we are going to use [Google Cloud’s Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) to launch a cluster. ```ipython3 ! gcloud container clusters create multi-tenant-rapids \ --accelerator type=nvidia-tesla-t4,count=2 --machine-type n1-standard-4 \ --region us-central1 --node-locations us-central1-b,us-central1-c \ --release-channel stable \ --enable-autoscaling --autoscaling-profile optimize-utilization \ --num-nodes 1 --min-nodes 1 --max-nodes 20 \ --image-type="COS_CONTAINERD" --enable-image-streaming ``` ```myst-ansi Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the `--no-enable-ip-alias` flag Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the `--location-policy` flag. Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s). Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus Creating cluster multi-tenant-rapids in us-central1... Cluster is being configu red...⠼ Creating cluster multi-tenant-rapids in us-central1... Cluster is being deploye d...⠏ Creating cluster multi-tenant-rapids in us-central1... Cluster is being health- checked (master is healthy)...done. Created [https://container.googleapis.com/v1/projects/nv-ai-infra/zones/us-central1/clusters/multi-tenant-rapids]. To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/multi-tenant-rapids?project=nv-ai-infra kubeconfig entry generated for multi-tenant-rapids. NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS multi-tenant-rapids us-central1 1.23.14-gke.1800 104.197.37.225 n1-standard-4 1.23.14-gke.1800 2 RUNNING ``` Now that we have our cluster let’s [install the NVIDIA Drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers). ```ipython3 ! kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml ``` ```myst-ansi daemonset.apps/nvidia-driver-installer created ``` ## Observability Once we have run some workloads on our Kubernetes cluster we will want to be able to go back through the cluster telemetry data to see how our autoscaling behaved. To do this let’s install [Prometheus](https://prometheus.io/) so that we are recording cluster metrics and can explore them later. 
### Prometheus Stack

Let's start by installing the [Kubernetes Prometheus Stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), which includes everything we need to run Prometheus on our cluster. We need to add a couple of extra configuration options, which you will find in `prometheus-stack-values.yaml`, to ensure Prometheus collects data frequently enough for our later analysis.

```ipython3
! cat prometheus-stack-values.yaml
```

```myst-ansi
# prometheus-stack-values.yaml
serviceMonitorSelectorNilUsesHelmValues: false
prometheus:
  prometheusSpec:
    # Setting this to a high frequency so that we have richer data for analysis later
    scrapeInterval: 1s
```

```ipython3
! helm install --repo https://prometheus-community.github.io/helm-charts kube-prometheus-stack kube-prometheus-stack \
    --create-namespace --namespace prometheus \
    --values prometheus-stack-values.yaml
```

```myst-ansi
NAME: kube-prometheus-stack
LAST DEPLOYED: Tue Feb 21 09:19:39 2023
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
```

Now that we have Prometheus running and collecting data, we can move on to installing RAPIDS and running some workloads. We will come back to these tools later when we want to explore the data we have collected.

## Install RAPIDS

For this RAPIDS installation we are going to use a single [Jupyter Notebook Pod](../../platforms/kubernetes.md) and the [Dask Operator](../../tools/kubernetes/dask-operator.md). In a real deployment you would use something like [JupyterHub](https://jupyter.org/hub) or [Kubeflow Notebooks](https://www.kubeflow.org/docs/components/notebooks/) to create a notebook-spawning service with user authentication, but that is out of scope for this example.

### Image Streaming (optional)

In order to stream the container image to the GKE nodes, our image needs to be stored in [Google Cloud Artifact Registry](https://cloud.google.com/artifact-registry/) in the same region as our cluster.

```console
$ docker pull rapidsai/base:25.12a-cuda12-py3.13
$ docker tag rapidsai/base:25.12a-cuda12-py3.13 REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG
$ docker push REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG
```

Be sure to replace the image throughout the notebook with the one that you have pushed to your own Google Cloud project.

### Image Prepuller (optional)

If you know that many users are going to want to frequently pull a specific container image, I like to run a small `DaemonSet` which ensures that image starts streaming onto a node as soon as it joins the cluster. This is optional but can reduce wait time for users.

```ipython3
! cat ./image-prepuller.yaml
```

```myst-ansi
# image-prepuller.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-rapids
spec:
  selector:
    matchLabels:
      name: prepull-rapids
  template:
    metadata:
      labels:
        name: prepull-rapids
    spec:
      initContainers:
        - name: prepull-rapids
          image: us-central1-docker.pkg.dev/nv-ai-infra/rapidsai/rapidsai/base:example
          command: ["sh", "-c", "'true'"]
      containers:
        - name: pause
          image: gcr.io/google_containers/pause
```

```ipython3
! kubectl apply -f image-prepuller.yaml
```

```myst-ansi
daemonset.apps/prepull-rapids created
```

### RAPIDS Notebook Pod

Now let's launch a Notebook Pod.
#### NOTE From this Pod we are going to want to be able to spawn Dask cluster resources on Kubernetes, so we need to ensure the Pod has the appropriate permissions to interact with the Kubernetes API. ```ipython3 ! kubectl apply -f rapids-notebook.yaml ``` ```myst-ansi serviceaccount/rapids-dask created role.rbac.authorization.k8s.io/rapids-dask created rolebinding.rbac.authorization.k8s.io/rapids-dask created configmap/jupyter-server-proxy-config created service/rapids-notebook created pod/rapids-notebook created ``` ### Install the Dask Operator Lastly we need to install the Dask Operator so we can spawn RAPIDS Dask cluster from our Notebook session. ```ipython3 ! helm install --repo https://helm.dask.org dask-kubernetes-operator \ --generate-name --create-namespace --namespace dask-operator ``` ```myst-ansi NAME: dask-kubernetes-operator-1676971371 LAST DEPLOYED: Tue Feb 21 09:23:06 2023 NAMESPACE: dask-operator STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: Operator has been installed successfully. ``` ## Running Some Work Next let’s connect to the Jupyter session and run some work on our cluster. You can do this by port forwarding the Jupyter service to your local machine. ```console $ kubectl port-forward svc/rapids-notebook 8888:8888 Forwarding from 127.0.0.1:8888 -> 8888 Forwarding from [::1]:8888 -> 8888 ``` Then open http://localhost:8888 in your browser. #### NOTE If you are following along with this notebook locally you will also want to upload it to the Jupyter session and continue running the cells from there. ### Check Capabilities Let’s make sure our environment is all set up correctly by checking out our capabilities. We can start by running `nvidia-smi` to inspect our Notebook GPU. ```ipython3 ! nvidia-smi ``` ```myst-ansi Tue Feb 21 14:50:01 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 | | N/A 41C P8 14W / 70W | 0MiB / 15360MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ ``` Great we can see our notebook has an NVIDIA T4. Now let’s use `kubectl` to inspect our cluster. We won’t actually have `kubectl` installed in our remote Jupyter environment so let’s do that first. ```ipython3 ! mamba install --quiet -c conda-forge kubernetes-client -y ``` ```myst-ansi Preparing transaction: ...working... done Verifying transaction: ...working... done Executing transaction: ...working... done ``` ```ipython3 ! kubectl get pods ``` ```myst-ansi NAME READY STATUS RESTARTS AGE prepull-rapids-l5qgt 1/1 Running 0 3m24s prepull-rapids-w8xcj 1/1 Running 0 3m24s rapids-notebook 1/1 Running 0 2m54s ``` We can see our prepull Pods we created earlier alongside our `rapids-notebook` Pod that we are currently in. 
As we created the prepull Pod via a `DaemonSet` we also know that there are two nodes in our Kubernetes cluster because there are two prepull Pods. As our cluster scales we will see more of them appear. ```ipython3 ! kubectl get daskclusters ``` ```myst-ansi No resources found in default namespace. ``` We can also see that we currently have no `DaskCluster` resources, but this is good because we didn’t get a `server doesn't have a resource type "daskclusters"` error so we know the Dask Operator also installed successfully. ### Small Workload Let’s run a small RAPIDS workload that stretches our Kubernetes cluster a little and causes it to scale. We know that we have two nodes in our Kubernetes cluster and we selected a node type with 2 GPUs when we launched it on GKE. Our Notebook Pod is taking up one GPU so we have three remaining. If we launch a Dask Cluster we will need one GPU for the scheduler and one for each worker. So let’s create a Dask cluster with four workers which will cause our Kubernetes to add one more node. First let’s install `dask-kubernetes` so we can create our `DaskCluster` resources from Python. We will also install `gcsfs` so that our workload can read data from [Google Cloud Storage](https://cloud.google.com/storage). ```ipython3 ! mamba install --quiet -c conda-forge dask-kubernetes gcsfs -y ``` ```myst-ansi Preparing transaction: ...working... done Verifying transaction: ...working... done Executing transaction: ...working... done ``` ```ipython3 from dask_kubernetes.operator import KubeCluster cluster = KubeCluster( name="rapids-dask-1", image="rapidsai/base:25.12a-cuda12-py3.13", # Replace me with your cached image n_workers=4, resources={"limits": {"nvidia.com/gpu": "1"}}, env={"EXTRA_PIP_PACKAGES": "gcsfs"}, worker_command="dask-cuda-worker", ) ``` ```myst-ansi Unclosed client session client_session: Unclosed connection client_connection: Connection ``` Great our Dask cluster was created but right now we just have a scheduler with half of our workers. We can use `kubectl` to see what is happening. ```ipython3 ! kubectl get pods ``` ```myst-ansi NAME READY STATUS RESTARTS AGE prepull-rapids-l5qgt 1/1 Running 0 6m18s prepull-rapids-w8xcj 1/1 Running 0 6m18s rapids-dask-1-default-worker-5f59bc8e7a 0/1 Pending 0 68s rapids-dask-1-default-worker-88ab088b7c 0/1 Pending 0 68s rapids-dask-1-default-worker-b700343afe 1/1 Running 0 68s rapids-dask-1-default-worker-e0bb7fff2d 1/1 Running 0 68s rapids-dask-1-scheduler 1/1 Running 0 69s rapids-notebook 1/1 Running 0 5m48s ``` We see here that most of our Pods are `Running` but two workers are `Pending`. This is because we don’t have enough GPUs for them right now. We can look at the events on our pending pods for more information. ```ipython3 ! kubectl get event --field-selector involvedObject.name=rapids-dask-1-default-worker-5f59bc8e7a ``` ```myst-ansi LAST SEEN TYPE REASON OBJECT MESSAGE 50s Warning FailedScheduling pod/rapids-dask-1-default-worker-5f59bc8e7a 0/2 nodes are available: 2 Insufficient nvidia.com/gpu. 12s Normal TriggeredScaleUp pod/rapids-dask-1-default-worker-5f59bc8e7a pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/nv-ai-infra/zones/us-central1-b/instanceGroups/gke-multi-tenant-rapids-default-pool-3a6a793f-grp 1->2 (max: 20)}] ``` Here we can see that our Pod triggered the cluster to scale from one to two nodes. If we wait for our new node to come online we should see a few things happen. 
- First there will be a new prepull Pod scheduled on the new node which will start streaming the RAPIDS container image. - Other Pods in the `kube-system` namespace will be scheduled to install NVIDIA drivers and update the Kubernetes API. - Then once the GPU drivers have finished installing the worker Pods will be scheduled onto our new node - Then once the image is ready our Pods move into a `Running` phase. ```ipython3 ! kubectl get pods -w ``` ```myst-ansi NAME READY STATUS RESTARTS AGE prepull-rapids-l5qgt 1/1 Running 0 6m41s prepull-rapids-w8xcj 1/1 Running 0 6m41s rapids-dask-1-default-worker-5f59bc8e7a 0/1 Pending 0 91s rapids-dask-1-default-worker-88ab088b7c 0/1 Pending 0 91s rapids-dask-1-default-worker-b700343afe 1/1 Running 0 91s rapids-dask-1-default-worker-e0bb7fff2d 1/1 Running 0 91s rapids-dask-1-scheduler 1/1 Running 0 92s rapids-notebook 1/1 Running 0 6m11s prepull-rapids-69pbq 0/1 Pending 0 0s prepull-rapids-69pbq 0/1 Pending 0 0s prepull-rapids-69pbq 0/1 Init:0/1 0 4s rapids-dask-1-default-worker-88ab088b7c 0/1 Pending 0 2m3s prepull-rapids-69pbq 0/1 Init:0/1 0 9s prepull-rapids-69pbq 0/1 PodInitializing 0 15s rapids-dask-1-default-worker-5f59bc8e7a 0/1 Pending 0 2m33s prepull-rapids-69pbq 1/1 Running 0 3m7s rapids-dask-1-default-worker-5f59bc8e7a 0/1 Pending 0 5m13s rapids-dask-1-default-worker-88ab088b7c 0/1 Pending 0 5m13s rapids-dask-1-default-worker-5f59bc8e7a 0/1 ContainerCreating 0 5m14s rapids-dask-1-default-worker-88ab088b7c 0/1 ContainerCreating 0 5m14s rapids-dask-1-default-worker-5f59bc8e7a 1/1 Running 0 5m26s rapids-dask-1-default-worker-88ab088b7c 1/1 Running 0 5m26s ^C ``` Awesome we can now run some work on our Dask cluster. ```ipython3 from dask.distributed import Client, wait client = Client(cluster) client ``` Let’s load some data from GCS into memory on our GPUs. ```ipython3 %%time import dask.config import dask.dataframe as dd dask.config.set({"dataframe.backend": "cudf"}) df = dd.read_parquet( "gcs://anaconda-public-data/nyc-taxi/2015.parquet/part.1*", storage_options={"token": "anon"}, ).persist() wait(df) df ``` Now we can do some calculation. This can be whatever you want to do with your data, for this example let’s do something quick like calculating the haversine distance between the pickup and dropoff locations (yes calculating this on ~100M rows is a quick task for RAPIDS 😁). ```ipython3 import cuspatial def map_haversine(part): pickup = cuspatial.GeoSeries.from_points_xy( part[["pickup_longitude", "pickup_latitude"]].interleave_columns() ) dropoff = cuspatial.GeoSeries.from_points_xy( part[["dropoff_longitude", "dropoff_latitude"]].interleave_columns() ) return cuspatial.haversine_distance(pickup, dropoff) df["haversine_distance"] = df.map_partitions(map_haversine) ``` ```ipython3 %%time df["haversine_distance"].compute() ``` ```myst-ansi CPU times: user 1.44 s, sys: 853 ms, total: 2.29 s Wall time: 4.66 s ``` Great, so we now have a little toy workloads that opens some data, does some calculation and takes a bit of time. Let’s remove our single Dask cluster and switch to simulating many workloads running at once. ```ipython3 client.close() cluster.close() ``` ## Simulating Many Multi-Tenant Workloads Now we have a toy workload which we can use to represent one user on our multi-tenant cluster. Let’s now construct a larger graph to simulate lots of users spinning up Dask clusters and running workloads. First let’s create a function that contains our whole workload including our cluster setup. 
```ipython3 import dask.delayed @dask.delayed def run_haversine(*args): import uuid import dask.config import dask.dataframe as dd from dask.distributed import Client from dask_kubernetes.operator import KubeCluster dask.config.set({"dataframe.backend": "cudf"}) def map_haversine(part): from cuspatial import haversine_distance return haversine_distance( part["pickup_longitude"], part["pickup_latitude"], part["dropoff_longitude"], part["dropoff_latitude"], ) with KubeCluster( name="rapids-dask-" + uuid.uuid4().hex[:5], image="rapidsai/base:25.12a-cuda12-py3.13", # Replace me with your cached image n_workers=2, resources={"limits": {"nvidia.com/gpu": "1"}}, env={"EXTRA_PIP_PACKAGES": "gcsfs"}, worker_command="dask-cuda-worker", resource_timeout=600, ) as cluster: with Client(cluster) as client: client.wait_for_workers(2) df = dd.read_parquet( "gcs://anaconda-public-data/nyc-taxi/2015.parquet", storage_options={"token": "anon"}, ) client.compute(df.map_partitions(map_haversine)) ``` Now if we run this function we will launch a Dask cluster and run our workload. We will use context managers to ensure our Dask cluster gets cleaned up when the work is complete. Given that we have no active Dask clusters this function will be executed on the Notebook Pod. ```ipython3 %%time run_haversine().compute() ``` ```myst-ansi Unclosed client session client_session: Unclosed connection client_connection: Connection ``` ```myst-ansi CPU times: user 194 ms, sys: 30 ms, total: 224 ms Wall time: 23.6 s ``` Great that works, so we have a self contained RAPIDS workload that launches its own Dask cluster and performs some work. ### Simulating our Multi-Tenant Workloads To see how our Kubernetes cluster behaves when many users are sharing it we want to run our haversine workload a bunch of times. #### NOTE If you’re not interested in how we simulate this workload feel free to skip onto the analysis section. To do this we can create another Dask cluster which we will use to pilot our workloads. This cluster will be a proxy for the Jupyter sessions our users would be interacting with. Then we will construct a Dask graph which runs our haversine workload many times in various configurations to simulate different users submitting different workloads on an ad-hoc basis. ```ipython3 from dask_kubernetes.operator import KubeCluster, make_cluster_spec cluster_spec = make_cluster_spec( name="mock-jupyter-cluster", image="rapidsai/base:25.12a-cuda12-py3.13", # Replace me with your cached image n_workers=1, resources={"limits": {"nvidia.com/gpu": "1"}, "requests": {"cpu": "50m"}}, env={"EXTRA_PIP_PACKAGES": "gcsfs dask-kubernetes"}, ) cluster_spec["spec"]["worker"]["spec"]["serviceAccountName"] = "rapids-dask" cluster = KubeCluster(custom_cluster_spec=cluster_spec) cluster ``` ```myst-ansi Unclosed client session client_session: ``` We need to ensure our workers have the same dependencies as our Notebook session here so that it can spawn more Dask clusters so we install `gcsfs` and `dask-kubernetes`. ```ipython3 client = Client(cluster) client ``` Now lets submit our workload again but this time to our cluster. Our function will be sent to our “Jupyter” worker which will then spawn another Dask cluster to run the workload. We don’t have enough GPUs in our cluster to do this so it will trigger another scale operation. 
```ipython3 %%time run_haversine().compute() ``` ```myst-ansi CPU times: user 950 ms, sys: 9.1 ms, total: 959 ms Wall time: 27.1 s ``` Now let’s write a small function which we can use to build up arbitrarily complex workloads. We can define how many stages we have, how many concurrent Dask clusters their should be, how quickly to vary width over time, etc. ```ipython3 from random import randrange def generate_workload( stages=3, min_width=1, max_width=3, variation=1, input_workload=None ): graph = [input_workload] if input_workload is not None else [run_haversine()] last_width = min_width for _ in range(stages): width = randrange( max(min_width, last_width - variation), min(max_width, last_width + variation) + 1, ) graph = [run_haversine(*graph) for _ in range(width)] last_width = width return run_haversine(*graph) ``` ```ipython3 cluster.scale(3) # Let's also bump up our user cluster to show more users logging in. ``` To visualize our graphs let’s check that we have `graphviz` installed. ```ipython3 !mamba install -c conda-forge --quiet graphviz python-graphviz -y ``` ```myst-ansi Preparing transaction: ...working... done Verifying transaction: ...working... done Executing transaction: ...working... done ``` Let’s start with a small workload which will run a couple of stages and trigger a scale up. ```ipython3 workload = generate_workload(stages=2, max_width=2) workload.visualize() ``` This is great we have multiple stages where one or two users are running workloads at the same time. Now lets chain a bunch of these workloads together to simulate varying demands over a larger period of time. We will also track the start and end times of the run so that we can grab the right data from Prometheus later. ```ipython3 import datetime ``` #### WARNING The next cell will take around 1h to run. 
```ipython3 %%time start_time = (datetime.datetime.now() - datetime.timedelta(minutes=15)).strftime( "%Y-%m-%dT%H:%M:%SZ" ) try: # Start with a couple of concurrent workloads workload = generate_workload(stages=10, max_width=2) # Then increase demand as more users appear workload = generate_workload( stages=5, max_width=5, min_width=3, variation=5, input_workload=workload ) # Now reduce the workload for a longer period of time, this could be over a lunchbreak or something workload = generate_workload(stages=30, max_width=2, input_workload=workload) # Everyone is back from lunch and it hitting the cluster hard workload = generate_workload( stages=10, max_width=10, min_width=3, variation=5, input_workload=workload ) # The after lunch rush is easing workload = generate_workload( stages=5, max_width=5, min_width=3, variation=5, input_workload=workload ) # As we get towards the end of the day demand slows off again workload = generate_workload(stages=10, max_width=2, input_workload=workload) workload.compute() finally: client.close() cluster.close() end_time = (datetime.datetime.now() + datetime.timedelta(minutes=15)).strftime( "%Y-%m-%dT%H:%M:%SZ" ) ``` ```myst-ansi Task exception was never retrieved future: .wait() done, defined at /opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/client.py:2119> exception=AllExit()> Traceback (most recent call last): File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/client.py", line 2128, in wait raise AllExit() distributed.client.AllExit ``` ```myst-ansi CPU times: user 2min 43s, sys: 3.04 s, total: 2min 46s Wall time: 1h 18min 18s ``` Ok great, our large graph of workloads resulted in ~200 clusters launching throughout the run with varying capacity demands and took just over an hour to run. ## Analysis Let’s explore the data we’ve been collecting with Prometheus to see how our cluster perforumed during our simulated workload. We could do this in [Grafana](https://grafana.com/), but instead let’s stay in the notebook and use `prometheus-pandas`. ```ipython3 ! pip install prometheus-pandas ``` ```myst-ansi Collecting prometheus-pandas Downloading prometheus_pandas-0.3.2-py3-none-any.whl (6.1 kB) Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from prometheus-pandas) (1.23.5) Requirement already satisfied: pandas in /opt/conda/envs/rapids/lib/python3.9/site-packages (from prometheus-pandas) (1.5.2) Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->prometheus-pandas) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->prometheus-pandas) (2022.6) Requirement already satisfied: six>=1.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas->prometheus-pandas) (1.16.0) Installing collected packages: prometheus-pandas Successfully installed prometheus-pandas-0.3.2 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv  ``` Connect to the prometheus endpoint within our cluster. ```ipython3 from prometheus_pandas import query p = query.Prometheus("http://kube-prometheus-stack-prometheus.prometheus:9090") ``` ### Pending Pods First let’s see how long each of our Pods spent in a `Pending` phase. 
This is the amount of time users would have to wait for their work to start running when they create their Dask clusters. ```ipython3 pending_pods = p.query_range( 'kube_pod_status_phase{phase="Pending",namespace="default"}', start_time, end_time, "1s", ).sum() ``` ```ipython3 from dask.utils import format_time ``` Average time for Pod creation. ```ipython3 format_time(pending_pods.median()) ``` ```ipython3 format_time(pending_pods.mean()) ``` 99th percentile time for Pod creation. ```ipython3 format_time(pending_pods.quantile(0.99)) ``` These numbers seem great, the most common start time for a cluster is two seconds! With the average being around 20 seconds. If your cluster triggers Kubernetes to scale up you could be waiting for 5 minutes though. Let’s see how many users would end up in that situation. What percentage of users get workers in less than 2 seconds, 5 seconds, 60 seconds, etc? ```ipython3 from scipy import stats stats.percentileofscore(pending_pods, 2.01) ``` ```ipython3 stats.percentileofscore(pending_pods, 5.01) ``` ```ipython3 stats.percentileofscore(pending_pods, 60.01) ``` Ok this looks pretty reasonable. Nearly 75% of users get a cluster in less than 5 seconds, and over 90% get it in under a minute. But if you’re in the other 10% you may have to wait for 5 minutes. Let’s bucket this data to see the distribution of startup times visually. ```ipython3 ax = pending_pods.hist(bins=range(0, 600, 30)) ax.set_title("Dask Worker Pod wait times") ax.set_xlabel("Seconds") ax.set_ylabel("Pods") ``` ```ipython3 ax = pending_pods.hist(bins=range(0, 60, 2)) ax.set_title("Dask Worker Pod wait times (First minute)") ax.set_xlabel("Seconds") ax.set_ylabel("Pods") ``` Here we can see clearly that most users get their worker Pods scheduled in less than 5 seconds. ### Cluster scaling and efficiency Ok so our users are getting clusters nice and quick, that’s because there is some warm capacity in the Kubernetes cluster that they are able to grab. When the limit is reached GKE autoscales to add new nodes. When demand drops for a while capacity is released again to save cost. Lets query to see how many nodes there were during the run and combine that with the number of running GPU Pods there were to see how efficiently we were using our resources. ```ipython3 running_pods = p.query_range( 'kube_pod_status_phase{phase=~"Running|ContainerCreating",namespace="default"}', start_time, end_time, "1s", ) running_pods = running_pods[ running_pods.columns.drop(list(running_pods.filter(regex="prepull"))) ] nodes = p.query_range("count(kube_node_info)", start_time, end_time, "1s") nodes.columns = ["Available GPUs"] nodes["Available GPUs"] = ( nodes["Available GPUs"] * 2 ) # We know our nodes each had 2 GPUs nodes["Utilized GPUs"] = running_pods.sum(axis=1) ``` ```ipython3 nodes.plot() ``` Excellent so we can see our cluster adding and removing nodes as our workload demand changed. The space between the orange and blue lines is our warm capacity. Ideally we want this to be as small as possible. Let’s calculate what the gap is. How many GPU hours did our users utilize? ```ipython3 gpu_hours_utilized = nodes["Utilized GPUs"].sum() / 60 / 60 gpu_hours_utilized ``` How many GPU hours were we charged for? ```ipython3 gpu_hours_cost = nodes["Available GPUs"].sum() / 60 / 60 gpu_hours_cost ``` What was the overhead? 
```ipython3
overhead = (1 - (gpu_hours_utilized / gpu_hours_cost)) * 100
str(int(overhead)) + "% overhead"
```

Ok not bad, so on our interactive cluster we managed 64% utilization of our GPU resources. Compared to non-autoscaling workloads, where users interactively use long-running workstations and clusters, this is fantastic. If we measured batch workloads that ran for longer periods we would see this utilization climb much higher.

## Closing thoughts

By sharing a Kubernetes cluster between many users who are all launching many ephemeral Dask clusters to perform their work, we are able to balance cost against user time. Peaks in individual user demands get smoothed out over time in a multi-tenant model, and the overall peaks and troughs of the day are accommodated by the Kubernetes cluster autoscaler.

We managed to create a responsive experience for our users where they generally got Dask clusters in a few seconds. We also managed to hit 64% utilization of the GPUs in our cluster, a very respectable number for an interactive cluster. There are more things we could tune to increase utilization, but there are also some tradeoffs to be made here. If we scale down more aggressively then we would end up needing to scale back up more often, resulting in more users waiting longer for their clusters.

We can also see that there is some unused capacity between a node starting and our workload running on it. This is the time when image pulling happens, drivers get installed, etc. There are definitely things we could do to improve this so that nodes are ready to go as soon as they have booted. Compared to every user spinning up dedicated nodes for their individual workloads, and paying the driver-install and environment-pull wait time and overhead cost every time, we are pooling our resources and reusing our capacity effectively.

## Teardown

Finally, to clean everything up, we can delete our GKE cluster by running the following command locally.

```ipython3
! gcloud container clusters delete multi-tenant-rapids --region us-central1 --quiet
```

```myst-ansi
Deleting cluster multi-tenant-rapids...done.
Deleted [https://container.googleapis.com/v1/projects/nv-ai-infra/zones/us-central1/clusters/multi-tenant-rapids].
```

# index.html.md

# Multi-node Multi-GPU Example on AWS using dask-cloudprovider

*February, 2023*

[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) is a native cloud integration for Dask that helps manage Dask clusters on different cloud platforms. In this notebook, we will look at how we can use the package to set up an AWS cluster and run a multi-node multi-GPU (MNMG) example with [RAPIDS](https://rapids.ai/). RAPIDS provides a suite of libraries to accelerate data science pipelines entirely on the GPU. As we will see throughout this notebook, this can be scaled to multiple nodes using Dask.

## Create Your Cluster

#### NOTE

First follow the [full instructions](../../cloud/aws/ec2-multi.md) on launching a multi-node GPU cluster with Dask Cloud Provider. Once you have a `cluster` object up and running, head back here and continue.

```python
from dask_cloudprovider.aws import EC2Cluster

cluster = EC2Cluster(...)
```

## Client Set Up

Now we can create a [Dask Client](https://distributed.dask.org/en/latest/client.html) with the cluster we just defined.

```ipython3
from dask.distributed import Client

client = Client(cluster)
client
```

### Optionally:

We can wait for all workers to be up and running.
We do so by adding: ```python # n_workers is the number of GPUs your cluster will have client.wait_for_workers(n_workers) ``` ## Machine Learning Workflow Once workers become available, we can now run the rest of our workflow: - read and clean the data - add features - split into training and validation sets - fit a Random Forest model - predict on the validation set - compute RMSE Let’s import the rest of our dependencies. ```ipython3 import dask_cudf import numpy as np from cuml.dask.common import utils as dask_utils from cuml.dask.ensemble import RandomForestRegressor from cuml.metrics import mean_squared_error from dask_ml.model_selection import train_test_split ``` ### 1. Read and Clean Data The data needs to be cleaned up before it can be used in a meaningful way. We verify the columns have appropriate datatypes to make it ready for computation using cuML. ```ipython3 # create a list of all columns & dtypes the df must have for reading col_dtype = { "VendorID": "int32", "tpep_pickup_datetime": "datetime64[ms]", "tpep_dropoff_datetime": "datetime64[ms]", "passenger_count": "int32", "trip_distance": "float32", "pickup_longitude": "float32", "pickup_latitude": "float32", "RatecodeID": "int32", "store_and_fwd_flag": "int32", "dropoff_longitude": "float32", "dropoff_latitude": "float32", "payment_type": "int32", "fare_amount": "float32", "extra": "float32", "mta_tax": "float32", "tip_amount": "float32", "total_amount": "float32", "tolls_amount": "float32", "improvement_surcharge": "float32", } ``` ```ipython3 taxi_df = dask_cudf.read_csv( "https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-02.csv", dtype=col_dtype, ) ``` ```ipython3 # Dictionary of required columns and their datatypes must_haves = { "pickup_datetime": "datetime64[ms]", "dropoff_datetime": "datetime64[ms]", "passenger_count": "int32", "trip_distance": "float32", "pickup_longitude": "float32", "pickup_latitude": "float32", "rate_code": "int32", "dropoff_longitude": "float32", "dropoff_latitude": "float32", "fare_amount": "float32", } ``` ```ipython3 def clean(ddf, must_haves): # replace the extraneous spaces in column names and lower the font type tmp = {col: col.strip().lower() for col in list(ddf.columns)} ddf = ddf.rename(columns=tmp) ddf = ddf.rename( columns={ "tpep_pickup_datetime": "pickup_datetime", "tpep_dropoff_datetime": "dropoff_datetime", "ratecodeid": "rate_code", } ) ddf["pickup_datetime"] = ddf["pickup_datetime"].astype("datetime64[ms]") ddf["dropoff_datetime"] = ddf["dropoff_datetime"].astype("datetime64[ms]") for col in ddf.columns: if col not in must_haves: ddf = ddf.drop(columns=col) continue if ddf[col].dtype == "object": # Fixing error: could not convert arg to str ddf = ddf.drop(columns=col) else: # downcast from 64bit to 32bit types # Tesla T4 are faster on 32bit ops if "int" in str(ddf[col].dtype): ddf[col] = ddf[col].astype("int32") if "float" in str(ddf[col].dtype): ddf[col] = ddf[col].astype("float32") ddf[col] = ddf[col].fillna(-1) return ddf ``` ```ipython3 taxi_df = taxi_df.map_partitions(clean, must_haves, meta=must_haves) ``` ### 2. Add Features We’ll add new features to the dataframe: 1. We can split the datetime column to retrieve year, month, day, hour, day_of_week columns. Find the difference between pickup time and drop off time. 2. Haversine Distance between the pick-up and drop-off coordinates. 
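For reference, the haversine distance mentioned in step 2 above is the standard great-circle distance between two (longitude, latitude) points; with Earth radius $r$, latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians, it is

$$
d = 2r \arcsin\left(\sqrt{\sin^2\left(\tfrac{\varphi_2-\varphi_1}{2}\right) + \cos\varphi_1\,\cos\varphi_2\,\sin^2\left(\tfrac{\lambda_2-\lambda_1}{2}\right)}\right)
$$

cuSpatial's `haversine_distance` evaluates this for us on the GPU, so the code below only needs to assemble the coordinate pairs.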
```ipython3 ## add features taxi_df["hour"] = taxi_df["pickup_datetime"].dt.hour.astype("int32") taxi_df["year"] = taxi_df["pickup_datetime"].dt.year.astype("int32") taxi_df["month"] = taxi_df["pickup_datetime"].dt.month.astype("int32") taxi_df["day"] = taxi_df["pickup_datetime"].dt.day.astype("int32") taxi_df["day_of_week"] = taxi_df["pickup_datetime"].dt.weekday.astype("int32") taxi_df["is_weekend"] = (taxi_df["day_of_week"] >= 5).astype("int32") # calculate the time difference between dropoff and pickup. taxi_df["diff"] = taxi_df["dropoff_datetime"].astype("int32") - taxi_df[ "pickup_datetime" ].astype("int32") taxi_df["diff"] = (taxi_df["diff"] / 1000).astype("int32") taxi_df["pickup_latitude_r"] = taxi_df["pickup_latitude"] // 0.01 * 0.01 taxi_df["pickup_longitude_r"] = taxi_df["pickup_longitude"] // 0.01 * 0.01 taxi_df["dropoff_latitude_r"] = taxi_df["dropoff_latitude"] // 0.01 * 0.01 taxi_df["dropoff_longitude_r"] = taxi_df["dropoff_longitude"] // 0.01 * 0.01 taxi_df = taxi_df.drop("pickup_datetime", axis=1) taxi_df = taxi_df.drop("dropoff_datetime", axis=1) def haversine_dist(df): import cuspatial pickup = cuspatial.GeoSeries.from_points_xy( df[["pickup_longitude", "pickup_latitude"]].interleave_columns() ) dropoff = cuspatial.GeoSeries.from_points_xy( df[["dropoff_longitude", "dropoff_latitude"]].interleave_columns() ) df["h_distance"] = cuspatial.haversine_distance(pickup, dropoff) df["h_distance"] = df["h_distance"].astype("float32") return df taxi_df = taxi_df.map_partitions(haversine_dist) ``` ### 3. Split Data ```ipython3 # Split into training and validation sets X, y = taxi_df.drop(["fare_amount"], axis=1).astype("float32"), taxi_df[ "fare_amount" ].astype("float32") X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True) ``` ```ipython3 workers = client.has_what().keys() X_train, X_test, y_train, y_test = dask_utils.persist_across_workers( client, [X_train, X_test, y_train, y_test], workers=workers ) ``` ### 4. Create and fit a Random Forest Model ```ipython3 # create cuml.dask RF regressor cu_dask_rf = RandomForestRegressor(ignore_empty_partitions=True) ``` ```ipython3 # fit RF model cu_dask_rf = cu_dask_rf.fit(X_train, y_train) ``` ### 5. Predict on validation set ```ipython3 # predict on validation set y_pred = cu_dask_rf.predict(X_test) ``` ### 6. Compute RMSE ```ipython3 # compute RMSE score = mean_squared_error(y_pred.compute().to_numpy(), y_test.compute().to_numpy()) print("Workflow Complete - RMSE: ", np.sqrt(score)) ``` ### Resource Cleanup ```ipython3 # Clean up resources client.close() cluster.close() ``` #### Learn More - [Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/) # index.html.md # HPO with dask-ml and cuml *April, 2023* ## Introduction [Hyperparameter optimization](https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview) is the task of picking hyperparameters values of the model that provide the optimal results for the problem, as measured on a specific test dataset. This is often a crucial step and can help boost the model accuracy when done correctly. Cross-validation is often used to more accurately estimate the performance of the models in the search process. Cross-validation is the method of splitting the training set into complementary subsets and performing training on one of the subsets, then predicting the models performance on the other. This is a potential indication of how the model will generalise to data it has not seen before. 
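To make that concrete, here is a minimal, self-contained sketch of k-fold cross-validation using scikit-learn on synthetic data (purely illustrative; the estimator and dataset are placeholders, not the airline workflow used later in this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on one subset, then score on the complementary held-out subset
    model = LogisticRegression(max_iter=1_000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[valid_idx], y[valid_idx]))

# The mean across folds estimates how well the model generalises to unseen data
print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```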
Despite its theoretical importance, HPO has been difficult to implement in practical applications because of the resources needed to run so many distinct training jobs. The two approaches that we will be exploring in this notebook are:

### 1. GridSearch

As the name suggests, the “search” is done over each possible combination in a grid of parameters that the user provides. The user must manually define this grid. For each parameter that needs to be tuned, a set of values is given, and the final grid search is performed over tuples having one element from each set, thus resulting in a Cartesian product of the elements.

For example, assume we want to perform HPO on XGBoost. For simplicity, let’s tune only `n_estimators` and `max_depth`:

- `n_estimators: [50, 100, 150]`
- `max_depth: [6, 7, 8]`

The grid search will take place over |n_estimators| x |max_depth|, which is 3 x 3 = 9 combinations (a short illustrative sketch of this enumeration appears at the end of this introduction). As you have probably guessed, the grid size grows rapidly as the number of parameters and their search space increases.

### 2. RandomSearch

[Random Search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) replaces the exhaustive nature of the search from before with a random selection of parameters over the specified space. This method can outperform GridSearch in cases where the number of parameters affecting the model’s performance is small (low-dimension optimization problems). Since this does not pick every tuple from the Cartesian product, it tends to yield results faster, and the performance can be comparable to that of the Grid Search approach. It’s worth keeping in mind that the random nature of this search means the results may differ from run to run.

Some of the other methods used for HPO include:

1. Bayesian Optimization
2. Gradient-based Optimization
3. Evolutionary Optimization

To learn more about HPO, some papers are linked at the end of the notebook for further reading.

Now that we have a basic understanding of what HPO is, let’s discuss what we wish to achieve with this demo. The aim of this notebook is to show the importance of hyperparameter optimisation and the performance of dask-ml's GPU-backed search for XGBoost and cuML Random Forest (cuML-RF).

For this demo, we will be using the [Airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html). The aim of the problem is to predict the arrival delay. It has about 116 million entries with 13 attributes that are used to determine the delay for a given flight. We have modified this problem to serve as a binary classification problem to determine whether the flight will be delayed (True) or not.

Let’s get started!
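The illustrative sketch mentioned above: a tiny, standard-library-only example (not part of the notebook's code) of how the 3 × 3 grid from the XGBoost example is enumerated for grid search versus subsampled for random search:

```python
import itertools
import random

grid = {"n_estimators": [50, 100, 150], "max_depth": [6, 7, 8]}

# Grid search: the Cartesian product of all value lists -> 3 x 3 = 9 candidates
all_combos = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
print(len(all_combos), "grid-search candidates")

# Random search: evaluate only a fixed budget of randomly chosen candidates
random.seed(0)
print(random.sample(all_combos, k=4))
```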
```ipython3 import warnings warnings.filterwarnings("ignore") # Reduce number of messages/warnings displayed ``` ```ipython3 import os from urllib.request import urlretrieve import cudf import dask_ml.model_selection as dcv import numpy as np import pandas as pd import xgboost as xgb from cuml.ensemble import RandomForestClassifier from cuml.metrics.accuracy import accuracy_score from cuml.model_selection import train_test_split from dask.distributed import Client from dask_cuda import LocalCUDACluster from sklearn.metrics import make_scorer ``` ### Spinning up a CUDA Cluster This notebook is designed to run on a single node with multiple GPUs, you can get multi-GPU VMs from [AWS](https://docs.rapids.ai/deployment/stable/cloud/aws/ec2-multi.html), [GCP](https://docs.rapids.ai/deployment/stable/cloud/gcp/dataproc.html), [Azure](https://docs.rapids.ai/deployment/stable/cloud/azure/azure-vm-multi.html), [IBM](https://docs.rapids.ai/deployment/stable/cloud/ibm/virtual-server.html) and more. We start a [local cluster](../../tools/dask-cuda.md) and keep it ready for running distributed tasks with dask. Below, [LocalCUDACluster](https://github.com/rapidsai/dask-cuda) launches one Dask worker for each GPU in the current systems. It’s developed as a part of the RAPIDS project. Learn More: - [Setting up Dask](https://docs.dask.org/en/latest/setup.html) - [Dask Client](https://distributed.dask.org/en/latest/client.html) ```ipython3 cluster = LocalCUDACluster() client = Client(cluster) client ``` ## Data Preparation We download the Airline [dataset](https://s3.console.aws.amazon.com/s3/buckets/rapidsai-cloud-ml-sample-data?region=us-west-2&tab=objects) and save it to local directory specific by `data_dir` and `file_name`. In this step, we also want to convert the input data into appropriate dtypes. For this, we will use the `prepare_dataset` function. Note: To ensure that this example runs quickly on a modest machine, we default to using a small subset of the airline dataset. To use the full dataset, pass the argument `use_full_dataset=True` to the `prepare_dataset` function. ```ipython3 data_dir = "./rapids_hpo/data/" file_name = "airlines.parquet" parquet_name = os.path.join(data_dir, file_name) ``` ```ipython3 parquet_name ``` ```ipython3 def prepare_dataset(use_full_dataset=False): global file_path, data_dir if use_full_dataset: url = "https://data.rapids.ai/cloud-ml/airline_20000000.parquet" else: url = "https://data.rapids.ai/cloud-ml/airline_small.parquet" if os.path.isfile(parquet_name): print(f" > File already exists. 
Ready to load at {parquet_name}") else: # Ensure folder exists os.makedirs(data_dir, exist_ok=True) def data_progress_hook(block_number, read_size, total_filesize): if (block_number % 1000) == 0: print( f" > percent complete: { 100 * ( block_number * read_size ) / total_filesize:.2f}\r", end="", ) return urlretrieve( url=url, filename=parquet_name, reporthook=data_progress_hook, ) print(f" > Download complete {file_name}") input_cols = [ "Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime", "CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime", "Origin", "Dest", "Distance", "Diverted", ] dataset = cudf.read_parquet(parquet_name) # encode categoricals as numeric for col in dataset.select_dtypes(["object"]).columns: dataset[col] = dataset[col].astype("category").cat.codes.astype(np.int32) # cast all columns to int32 for col in dataset.columns: dataset[col] = dataset[col].astype(np.float32) # needed for random forest # put target/label column first [ classic XGBoost standard ] output_cols = ["ArrDelayBinary"] + input_cols dataset = dataset.reindex(columns=output_cols) return dataset ``` ```ipython3 df = prepare_dataset() ``` ```ipython3 import time from contextlib import contextmanager # Helping time blocks of code @contextmanager def timed(txt): t0 = time.time() yield t1 = time.time() print(f"{txt:>32} time: {t1-t0:8.5f}") ``` ```ipython3 # Define some default values to make use of across the notebook for a fair comparison N_FOLDS = 5 N_ITER = 25 ``` ```ipython3 label = "ArrDelayBinary" ``` ## Splitting Data We split the data randomnly into train and test sets using the [cuml train_test_split](https://docs.rapids.ai/api/cuml/nightly/api.html#cuml.model_selection.train_test_split) and create CPU versions of the data. ```ipython3 X_train, X_test, y_train, y_test = train_test_split(df, label, test_size=0.2) ``` ```ipython3 X_cpu = X_train.to_pandas() y_cpu = y_train.to_numpy() X_test_cpu = X_test.to_pandas() y_test_cpu = y_test.to_numpy() ``` ## Setup Custom cuML scorers The search functions (such as GridSearchCV) for scikit-learn and dask-ml expect the metric functions (such as accuracy_score) to match the “scorer” API. This can be achieved using the scikit-learn’s [make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) function. We will generate a `cuml_scorer` with the cuML `accuracy_score` function. You’ll also notice an `accuracy_score_wrapper` which primarily converts the y label into a `float32` type. This is because some cuML models only accept this type for now and in order to make it compatible, we perform this conversion. We also create helper functions for performing HPO in 2 different modes: 1. `gpu-grid`: Perform GPU based GridSearchCV 2. `gpu-random`: Perform GPU based RandomizedSearchCV ```ipython3 def accuracy_score_wrapper(y, y_hat): """ A wrapper function to convert labels to float32, and pass it to accuracy_score. Params: - y: The y labels that need to be converted - y_hat: The predictions made by the model """ y = y.astype("float32") # cuML RandomForest needs the y labels to be float32 return accuracy_score(y, y_hat, convert_dtype=True) accuracy_wrapper_scorer = make_scorer(accuracy_score_wrapper) cuml_accuracy_scorer = make_scorer(accuracy_score, convert_dtype=True) ``` ```ipython3 def do_HPO(model, gridsearch_params, scorer, X, y, mode="gpu-Grid", n_iter=10): """ Perform HPO based on the mode specified mode: default gpu-Grid. The possible options are: 1. gpu-grid: Perform GPU based GridSearchCV 2. 
gpu-random: Perform GPU based RandomizedSearchCV

    n_iter: specified with Random option for number of parameter settings sampled

    Returns the best estimator and the results of the search
    """
    if mode == "gpu-grid":
        print("gpu-grid selected")
        clf = dcv.GridSearchCV(model, gridsearch_params, cv=N_FOLDS, scoring=scorer)
    elif mode == "gpu-random":
        print("gpu-random selected")
        clf = dcv.RandomizedSearchCV(
            model, gridsearch_params, cv=N_FOLDS, scoring=scorer, n_iter=n_iter
        )
    else:
        print("Unknown Option, please choose one of [gpu-grid, gpu-random]")
        return None, None
    res = clf.fit(X, y)
    print(f"Best clf and score {res.best_estimator_} {res.best_score_}\n---\n")
    return res.best_estimator_, res
```

```ipython3
def print_acc(model, X_train, y_train, X_test, y_test, mode_str="Default"):
    """
    Trains a model on the train data provided, and prints the accuracy of the trained model.
    mode_str: User specifies what model it is to print the value
    """
    y_pred = model.fit(X_train, y_train).predict(X_test)
    score = accuracy_score(y_pred, y_test.astype("float32"), convert_dtype=True)
    print(f"{mode_str} model accuracy: {score}")
```

```ipython3
X_train.shape
```

## Launch HPO

We will first see the model’s performance without the grid search and then compare it with the performance after searching.

### XGBoost

To perform the Hyperparameter Optimization, we make use of the sklearn version of the [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn). We’re making use of this version to make it compatible and easily comparable to the scikit-learn version. The model takes a set of parameters that can be found in the documentation. We’re primarily interested in `max_depth`, `learning_rate`, `min_child_weight`, `reg_alpha` and `num_round`, as these affect the performance of XGBoost the most. Read more about what these parameters are useful for [here](https://xgboost.readthedocs.io/en/latest/parameter.html).

#### Default Performance

We first use the model with its default parameters and see the accuracy of the model. In this case, it is 84%.

```ipython3
model_gpu_xgb_ = xgb.XGBClassifier(tree_method="gpu_hist")

print_acc(model_gpu_xgb_, X_train, y_cpu, X_test, y_test_cpu)
```

#### Parameter Distributions

The way we define the grid to perform the search is by including ranges of parameters that need to be used for the search. In this example we make use of [np.arange](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html), which returns an ndarray of evenly spaced values, and [np.logspace](https://docs.scipy.org/doc/numpy/reference/generated/numpy.logspace.html#numpy.logspace), which returns a specified number of samples that are equally spaced on the log scale. We can also specify the values as lists or NumPy arrays, or make use of any random variate object that yields a sample when called; SciPy provides various distributions for this too.

```ipython3
# For xgb_model
model_gpu_xgb = xgb.XGBClassifier(tree_method="gpu_hist")

# More range
params_xgb = {
    "max_depth": np.arange(start=3, stop=12, step=3),  # Default = 6
    "alpha": np.logspace(-3, -1, 5),  # default = 0
    "learning_rate": [0.05, 0.1, 0.15],  # default = 0.3
    "min_child_weight": np.arange(start=2, stop=10, step=3),  # default = 1
    "n_estimators": [100, 200, 1000],
}
```

#### RandomizedSearchCV

We’ll now try [RandomizedSearchCV](https://ml.dask.org/modules/generated/dask_ml.model_selection.RandomizedSearchCV.html). `n_iter` specifies the number of parameter settings that the search needs to sample.
Here we will search `N_ITER` (defined earlier) points for the best performance.

```ipython3
mode = "gpu-random"

with timed("XGB-" + mode):
    res, results = do_HPO(
        model_gpu_xgb,
        params_xgb,
        cuml_accuracy_scorer,
        X_train,
        y_cpu,
        mode=mode,
        n_iter=N_ITER,
    )
num_params = len(results.cv_results_["mean_test_score"])
print(f"Searched over {num_params} parameters")
```

```ipython3
print_acc(res, X_train, y_cpu, X_test, y_test_cpu, mode_str=mode)
```

```ipython3
mode = "gpu-grid"

with timed("XGB-" + mode):
    res, results = do_HPO(
        model_gpu_xgb, params_xgb, cuml_accuracy_scorer, X_train, y_cpu, mode=mode
    )
num_params = len(results.cv_results_["mean_test_score"])
print(f"Searched over {num_params} parameters")
```

```ipython3
print_acc(res, X_train, y_cpu, X_test, y_test_cpu, mode_str=mode)
```

### Improved performance

There’s a 5% improvement in the performance. We notice that performing grid search and random search yields similar performance improvements, even though random search used just 25 combinations of parameters. We will stick to performing random search with RF for the rest of the notebook, with the assumption that there will not be a major difference in performance if the ranges are large enough.

### Visualizing the Search

Let’s plot some graphs to get an understanding of how the parameters affect the accuracy. The code for these plots is included in `cuml/experimental/hyperopt_utils/plotting_utils.py`.

#### Mean/Std of test scores

We fix all parameters except one for each of these graphs and plot the effect the parameter has on the mean test score, with the error bar indicating the standard deviation.

```ipython3
from cuml.experimental.hyperopt_utils import plotting_utils
```

```ipython3
plotting_utils.plot_search_results(results)
```

#### Heatmaps

- Between parameter pairs (we can do a combination of all possible pairs, but only one is shown in this notebook)
- This gives a visual representation of how the pair affects the test score

```ipython3
df_gridsearch = pd.DataFrame(results.cv_results_)

plotting_utils.plot_heatmap(df_gridsearch, "param_max_depth", "param_n_estimators")
```

## RandomForest

Let’s use the RandomForest Classifier to perform a hyper-parameter search. We’ll make use of the cuML RandomForestClassifier and visualize the results using a heatmap.

```ipython3
## Random Forest
model_rf_ = RandomForestClassifier()

params_rf = {
    "max_depth": np.arange(start=3, stop=15, step=2),  # Default = 6
    "max_features": [0.1, 0.50, 0.75, "auto"],  # default = 0.3
    "n_estimators": [100, 200, 500, 1000],
}

for col in X_train.columns:
    X_train[col] = X_train[col].astype("float32")
y_train = y_train.astype("int32")
```

```ipython3
print(
    "Default acc: ",
    accuracy_score(model_rf_.fit(X_train, y_train).predict(X_test), y_test),
)
```

```ipython3
mode = "gpu-random"
model_rf = RandomForestClassifier()

with timed("RF-" + mode):
    res, results = do_HPO(
        model_rf,
        params_rf,
        cuml_accuracy_scorer,
        X_train,
        y_cpu,
        mode=mode,
        n_iter=N_ITER,
    )
num_params = len(results.cv_results_["mean_test_score"])
print(f"Searched over {num_params} parameters")
```

```ipython3
print("Improved acc: ", accuracy_score(res.predict(X_test), y_test))
```

```ipython3
df_gridsearch = pd.DataFrame(results.cv_results_)
plotting_utils.plot_heatmap(df_gridsearch, "param_max_depth", "param_n_estimators")
```

## Conclusion and Next Steps

We notice improvements in the performance for a really basic version of the GridSearch and RandomizedSearch.
Generally, the more data we use, the better the model performs, so you are encouraged to try larger datasets and a broader range of parameters. This experiment can also be repeated with different classifiers and different ranges of parameters to see how HPO can help improve the performance metric. In this example, we have chosen a basic metric, accuracy, but you can use more interesting metrics that help in determining the usefulness of a model. You can even send a list of parameters to the scoring function. This makes HPO really powerful, and it can add a significant boost to the model that we generate.

### Further Reading

- [The 5 Classification Evaluation Metrics You Must Know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)
- [11 Important Model Evaluation Metrics for Machine Learning Everyone should know](https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/)
- [Algorithms for Hyper-Parameter Optimisation](http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)
- [Forward and Reverse Gradient-Based Hyperparameter Optimization](http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf)
- [Practical Bayesian Optimization of Machine Learning Algorithms](http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf)
- [Random Search for Hyper-Parameter Optimization](http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf)

# index.html.md

# GPU-Accelerated Land Use Land Cover Classification

Working with satellite imagery at scale quickly exposes the limitations of CPU-bound workflows. A single Sentinel-2 tile spans gigabytes, and assembling a season’s worth of acquisitions means streaming dozens of those tiles, masking pixels covered by clouds, and compositing them before you can even think about training a model. Processing data of such scale on CPUs means hours of preprocessing before training, making GPUs an attractive solution to accelerate the workflow.

Due to the parallel nature of satellite data, GPUs offer tremendous acceleration in every stage of a machine learning workflow. RAPIDS libraries like [cuDF](https://docs.rapids.ai/api/cudf/stable/) and [cuML](https://docs.rapids.ai/api/cuml/stable/), with the help of other libraries like [Dask](https://www.dask.org/) and [Xarray](https://docs.xarray.dev/en/stable/), map operations to CUDA kernels, so resampling, feature derivation, and tree training execute across thousands of cores at once. By loading data into GPU memory and keeping subsequent transformations local to the GPU, we avoid the I/O overhead of shuffling data between the GPU and host memory and thus sustain the throughput needed for year-scale land-cover modelling.

This notebook establishes an end-to-end workflow to train a classification model on satellite imagery. We start by streaming Sentinel-2 imagery from [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/) into a [Dask-CUDA](https://docs.rapids.ai/api/dask-cuda/stable/) cluster, using Dask/CuPy-backed xarray to clean and aggregate the rasters. We then keep the data on device to train a cuML random forest and finish by writing predictions straight back to Cloud-Optimized GeoTIFFs, ready for validation and sharing.

For this workflow, we use two open and freely available data sources as features and labels, downloaded from Microsoft Planetary Computer.
[Sentinel-2 Level-2A](https://dataspace.copernicus.eu/data-collections/copernicus-sentinel-data/sentinel-2) imagery supplies 10 m multispectral observations (B02, B03, B04, B08 are the 10 metre bands that correspond to blue, green, red and near-infrared wavelengths) annually with a 5 day revisit frequency, which we condense into cloud-free yearly composites and enrich with indices like NDVI and NDWI for the year 2022. [ESA WorldCover](https://esa-worldcover.org/en) provides annual 10 m land-cover labels, with its 2023 release reflecting the landscape from 2022, giving us labels for supervised training. Together they provide the coverage and scale to illustrate the benefits of using GPUs for this task. The machine learning use case illustrated in this notebook, Land use and land cover (LULC) classification, is the task of labelling each pixel in an image according to the surface type it represents. Typical labels include water, trees, crops, built areas, bare ground, and rangeland. These maps help planners monitor urban growth, estimate crop acreage, or track ecosystem change. ## Prerequisites - Access to an NVIDIA GPU (preferably multiple GPUs) with CUDA, RAPIDS, Dask, and the other libraries imported below. - A GeoJSON that defines your area of interest (AOI) and access to Microsoft Planetary Computer for Sentinel-2 and ESA WorldCover assets. - Optional: access to write Cloud-Optimized GeoTIFFs to your target S3 bucket. ```ipython3 import dataclasses import json import pickle from pathlib import Path import boto3 import cudf import cupy as cp import geopandas as gpd import matplotlib.pyplot as plt import numpy as np import pandas as pd import planetary_computer import pyproj import rasterio import seaborn as sns import stackstac import xarray as xr from cuml.ensemble import RandomForestClassifier from cuml.metrics import accuracy_score from dask.distributed import Client as DaskClient from dask.distributed import progress, wait from dask_cuda import LocalCUDACluster from matplotlib.colors import BoundaryNorm, ListedColormap from pystac_client import Client from rio_cogeo.cogeo import cog_translate from rio_cogeo.profiles import cog_profiles from shapely.ops import transform from sklearn.metrics import classification_report, confusion_matrix from stackstac.raster_spec import RasterSpec ``` ## Stage 1 · Ingest and Prepare Training Data We begin by getting the raw ingredients into GPU-friendly form. That means streaming Sentinel-2 scenes and WorldCover labels straight from the Planetary Computer, reprojecting them onto a common grid, and reshaping them into chunk sizes that Dask can scatter across every GPU. The goal is to build clean yearly composites and companion label layers once, persist them as [Zarr](https://zarr.dev/) stores, and avoid recomputing expensive preprocessing later in the notebook. ### 1. Set Workspace Paths and Parameters Run this cell to lock in the project-wide constants the pipeline needs: where to find your AOI GeoJSON, which Sentinel-2 bands and WorldCover assets to request, how to chunk raster stacks, and where to stage intermediate outputs. Update the paths and `S3_BUCKET`/`S3_PREFIX` now so the rest of the notebook writes to the right locations. Confirm the directories exist (the code creates any missing output folders for you) and keep the date range and cloud filter aligned with the scenes you plan to process. In this example, we are using an area of interest of 1209 sqKM over the Boston Metropolitan area from 2022 as the bounds for the training data. 
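If you do not already have an AOI file, one way to create a simple rectangular AOI programmatically is sketched below. This is an illustration only: the bounding box approximates the Boston-area extent used in this example, and the output path should match the `AOI_GEOJSON` value you set in the next cell.

```python
# Illustrative sketch: write a rectangular AOI to GeoJSON with geopandas/shapely.
# The lon/lat bounds below approximate the Boston-area AOI used in this notebook;
# replace them (and the output path) with your own values.
import geopandas as gpd
from shapely.geometry import box

aoi = box(-71.3577, 42.1887, -70.9733, 42.5154)  # (minx, miny, maxx, maxy) in EPSG:4326
gpd.GeoDataFrame(geometry=[aoi], crs="EPSG:4326").to_file(
    "aoi.geojson", driver="GeoJSON"
)
```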
```ipython3 AOI_GEOJSON = Path("") FEATURES_ZARR = Path("") MODEL_PATH = Path("") INFERENCE_OUTPUT_DIR = Path("") AWS_REGION = "" S3_BUCKET = "" S3_PREFIX = "lulc" MODEL_PATH = Path("") MODEL_PATH.parent.mkdir(parents=True, exist_ok=True) ``` ```ipython3 # General configuration / outputs DATE_RANGE = ("2022-01-01", "2022-12-31") MAX_CLOUD_FILTER = ( 50 # Set this value higher to fetch more scenes with higher cloud cover ) S2_ASSETS = ["B02", "B03", "B04", "B08", "SCL"] TARGET_RESOLUTION = 10 STACK_CHUNKS = {"time": 1, "band": 1, "y": 2048, "x": 2048} WORLDCOVER_COLLECTION = "io-lulc-annual-v02" WORLDCOVER_ASSET = "data" FEATURES_ZARR.mkdir(parents=True, exist_ok=True) INFERENCE_OUTPUT_DIR.mkdir(parents=True, exist_ok=True) MODEL_PATH.parent.mkdir(parents=True, exist_ok=True) CATALOG_URL = "https://planetarycomputer.microsoft.com/api/stac/v1" BANDS = ["B02", "B03", "B04", "B08"] ALL_FEATURES = ["B02", "B03", "B04", "B08", "NDVI", "NDWI"] NODATA_VALUE = 0 VALID_CLASSES = [1, 2, 4, 5, 7, 8, 11] COG_PROFILE = cog_profiles.get("deflate") s3_client = boto3.client("s3", region_name=AWS_REGION) ``` ### 2. Launch a Dask-CUDA Cluster Start the local Dask-CUDA cluster now so every downstream step can submit GPU workloads through the distributed client. Run this cell and confirm the dashboard link appears. If you need to pin workers to specific GPUs or change memory limits, modify the arguments for `LocalCUDACluster` before proceeding. ```ipython3 cluster = LocalCUDACluster() client = DaskClient(cluster) client ``` ### 3. Load and Reproject the Area of Interest Use this cell to validate and prepare your AOI geometry. It reads the GeoJSON configured earlier, checks that a CRS is declared, merges all features into one geometry, and chooses an appropriate UTM zone based on the centroid. Run it once to produce `aoi_geom`, its projected bounds, and a GeoJSON payload that the STAC search will reuse. If your AOI spans multiple UTM zones, replace the automated EPSG selection with the desired one before executing. To create an AOI in GeoJSON format by drawing polygons over a map in an interactive way, you can use the [geojson.io](https://geojson.io/#map=2/0/20) website. ```ipython3 aoi_gdf = gpd.read_file(AOI_GEOJSON) if aoi_gdf.crs is None: raise ValueError("AOI GeoJSON must declare a CRS.") aoi_geom = aoi_gdf.geometry.union_all() # choose UTM zone from AOI centroid centroid = aoi_geom.centroid utm_zone = int((centroid.x + 180) // 6) + 1 target_epsg = ( int(f"326{utm_zone:02d}") if centroid.y >= 0 else int(f"327{utm_zone:02d}") ) aoi_geom_geojson = json.loads(gpd.GeoSeries([aoi_geom]).to_json())["features"][0][ "geometry" ] project = pyproj.Transformer.from_crs(4326, target_epsg, always_xy=True).transform aoi_geom_proj = transform(project, aoi_geom) aoi_bounds = tuple(aoi_geom_proj.bounds) ``` ```ipython3 print(aoi_bounds) print(target_epsg) print(aoi_geom_geojson) ``` ```myst-ansi (305317.1002981191, 4672613.712909488, 337897.39514006017, 4709696.839024326) 32619 {'type': 'Polygon', 'coordinates': [[[-71.35765504252703, 42.515400835973594], [-71.35765504252703, 42.18871065844891], [-70.97331384909891, 42.18871065844891], [-70.97331384909891, 42.515400835973594], [-71.35765504252703, 42.515400835973594]]]} ``` ### 4. Fetch Sentinel-2 tiles for the AOI Search the Planetary Computer STAC API for Sentinel-2 Level-2A scenes that overlap your AOI, match the configured date window, and meet the cloud threshold. 
Run this cell to pull the items, guard against empty results, and build a lazily-loaded raster stack with `stackstac`. The stack is clipped to your AOI bounds, resampled to the target resolution and chunked for GPU friendly processing. ```ipython3 stac = Client.open( "https://planetarycomputer.microsoft.com/api/stac/v1", modifier=planetary_computer.sign_inplace, ) search = stac.search( collections=["sentinel-2-l2a"], intersects=aoi_geom_geojson, datetime=f"{DATE_RANGE[0]}/{DATE_RANGE[1]}", query={"eo:cloud_cover": {"lt": MAX_CLOUD_FILTER}}, ) items = list(search.items()) if not items: raise ValueError("No Sentinel-2 scenes found for AOI/year.") stack = ( stackstac.stack( items, assets=S2_ASSETS, bounds=aoi_bounds, resolution=TARGET_RESOLUTION, epsg=target_epsg, chunksize=(200, 1, 1024, 1024), fill_value=np.nan, rescale=False, properties=["datetime"], ) .astype("float32") .sortby("time") .persist() ) stack = stack.assign_coords(band=S2_ASSETS[: stack.sizes["band"]]) stack ``` ```ipython3 progress([stack]) ``` Persisting the stack up front forces Dask to scatter chunks across every available GPU before we start feature engineering. If we wait until the first compute call, the scheduler will add most of the chunks onto a single worker to prevent shuffle overhead, which in turn runs the risk of running out of memory. By persisting the data now onto the GPU memory, we keep the data evenly distributed for the rest of the pipeline. ### 5. Prepare Daily Spectral Composites Satellite imagery is usually captured in the form of tiles, a specific image captured by the satellite corresponding to a certain area, specified in the tile extent. In the previous step, we captured all the tiles that intersect with our AOI for all dates in 2022. However, there might be multiple tiles that intersect with our AOI captured on a specific date. These tiles also occasionally do intersect in area, so before calculating an annual median value for each pixel, we group this data by unique capture dates and use the median value for band-wise reflectance where tiles overlap. Sentinel-2 tiles also include a special band called the Sentinel-2 Scene Classification Layer (SCL), which tags each pixel with a surface or atmospheric class inferred from the Level-2A processing chain. It distinguishes nodata (0), saturated or defective pixels (1), dark areas (2), cloud shadow (3), vegetation (4), bare soils (5), water (6), cloud probabilities from low to high (7–9), thin cirrus (10), and snow or ice (11). Using SCL lets you mask out cloudy or otherwise unreliable observations before aggregating daily composites, so only the clear-sky land pixels (classes 4–6 and 11) contribute to the summaries. In the following steps, we assign a daily date coordinate, split out the Sentinel-2 [Scene Classification Layer](https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/scene-classification/)(`SCL`), and keep only the spectral bands used for features. We then apply the clear-sky mask (classes 3, 4, 5, 6, 11) so cloudy pixels become `NaN`, then group by day mosaic acquisitions on the same day. Run this cell to define the lazy Dask graph for daily composites; no computation occurs yet. ```ipython3 dates = stack["time"].dt.floor("D").values stack = stack.assign_coords(date=("time", dates)) scl = stack.sel(band="SCL") spectral = stack.drop_sel(band="SCL") clear = scl.isin([3, 4, 5, 6, 11]) spectral = spectral.where(clear, np.nan) daily_stack = spectral.groupby("date").median(dim="time", skipna=True) daily_stack ``` ### 6. 
Rechunk Daily Composites for GPU Throughput Adjust the chunk structure before any heavy computation runs. This cell groups dates in batches of 10, limits each task to two bands, and uses 1,024×1,024 pixels in the spatial dimension so Dask can stream work evenly across GPUs without each chunk being too small. Run it once to update the delayed graph. No data is computed yet, but downstream feature calculations will inherit this layout. ```ipython3 daily_stack = daily_stack.chunk({"date": 10, "band": 2, "y": 1024, "x": 1024}) daily_stack ``` ### 7. Aggregate by Year and Engineer Spectral Indices With daily composites defined, run this cell to collapse each pixel into an annual median and derive GPU-friendly features. The next cell stamps a `year` coordinate, groups by year to reduce seasonal noise, and then computes `NDVI` and `NDWI` alongside the raw bands. The dataset is rechunked so each task covers one year, two bands, and 1,024×1,024 tiles, matching the earlier layout and keeping downstream sampling efficient. Execution remains lazy here; you will trigger compute later when you materialize training data. **Notes on the spectral indices** - `NDVI = (NIR − Red) / (NIR + Red)` gauges vegetation vigor. High values indicate dense photosynthetically active biomass, while bare ground or urban areas trend toward zero or negative. - `NDWI = (Green − NIR) / (Green + NIR)` emphasizes surface water and moist vegetation. Positive values mark water bodies or saturated soils, whereas dry ground returns negatives. ```ipython3 daily_stack = daily_stack.assign_coords( year=("date", pd.DatetimeIndex(daily_stack["date"].values).year) ) yearly_stack = daily_stack.groupby("year").median(dim="date", skipna=True) red = yearly_stack.sel(band="B04") nir = yearly_stack.sel(band="B08") green = yearly_stack.sel(band="B03") feature_ds = xr.Dataset( { "bands": yearly_stack, "NDVI": (nir - red) / (nir + red), "NDWI": (green - nir) / (green + nir), } ) feature_ds = feature_ds.chunk({"year": 1, "band": 2, "y": 1024, "x": 1024}) feature_ds ``` ### 8. Retrieve and Distribute WorldCover Labels Search the Planetary Computer catalogue for ESA WorldCover tiles that intersect the AOI and cover the year 2023. ESA publishes each annual WorldCover release on January 1 to describe the land cover of the preceding year, so the 2023 layer is the right match for the 2022 imagery we processed above. Run this cell to download the overlapping items, guard against empty results, and build a raster stack that matches your AOI bounds, resolution, and projection. As with the imagery, `persist()` spreads the label blocks across GPUs right away so later sampling and joins avoid a single-worker bottleneck. ```ipython3 label_items = list( stac.search( collections=[WORLDCOVER_COLLECTION], intersects=aoi_geom_geojson, datetime="2023-01-01/2023-12-31", ).items() ) if not label_items: raise ValueError("No WorldCover tiles overlap the AOI/year.") label_stack = ( stackstac.stack( label_items, assets=[WORLDCOVER_ASSET], bounds=aoi_bounds, resolution=TARGET_RESOLUTION, epsg=target_epsg, chunksize=(1, 1, 1024, 1024), rescale=False, fill_value=np.nan, ) .astype("float32") .persist() ) label_stack = label_stack.assign_coords(band=["map"]) label_stack ``` ### 9. Align the Label Mosaic with Feature Grids Collapse the WorldCover stack into a single mosaic, reproject it to the same grid as the feature rasters, and expand it into a year-aligned cube. 
Run this cell after the data is persisted in the previous cell so the label mosaic matches the band layout and spatial axes of your feature dataset. The result is a `labels_cube` with one layer per feature year, ready to be matched to the Sentinel-2 data. ```ipython3 label_mosaic = stackstac.mosaic(label_stack, dim="time").squeeze("band", drop=True) template = feature_ds["bands"].sel(year=feature_ds.year[0], band="B02") label_mosaic = label_mosaic.rio.reproject_match( template, resampling=rasterio.enums.Resampling.nearest ) labels_cube = xr.DataArray( label_mosaic.values[None, :, :], dims=("year", "y", "x"), coords={"year": feature_ds.year, "y": label_mosaic.y, "x": label_mosaic.x}, name="worldcover", ) labels_cube = labels_cube.chunk({"year": 1, "y": 1024, "x": 1024}) labels_cube ``` ### 10. Strip JSON Metadata Before Writing to Zarr Before you write features and labels to Zarr files, sanitize the RasterSpec metadata so it contains only plain Python types. Zarr cannot serialize the nested JSON-like objects that `stackstac` attaches by default, so this utility walks through each variable’s attributes and rewrites any `RasterSpec` entries into dictionaries and tuples the store can handle. Run it once; the cleaned `feature_ds` and `labels_ds` will be ready for disk writes in the next step. ```ipython3 def raster_spec_to_plain(value): if isinstance(value, RasterSpec): if dataclasses.is_dataclass(value): data = dataclasses.asdict(value) else: data = {} for k, v in value.__dict__.items(): if hasattr(v, "to_gdal"): data[k] = tuple(v) elif isinstance(v, tuple): data[k] = list(v) else: data[k] = v return data return value for var in feature_ds.variables.values(): var.attrs = {k: raster_spec_to_plain(v) for k, v in var.attrs.items()} labels_ds = labels_cube.to_dataset(name="worldcover") for var in labels_ds.variables.values(): var.attrs = {k: raster_spec_to_plain(v) for k, v in var.attrs.items()} ``` ### 11. Materialize Features and Labels to Zarr Write the cleaned feature and label cubes to disk now so later stages can reload them without recomputation. This cell enqueues the `.to_zarr()` operations on the Dask cluster (without triggering them locally), hands the futures to the scheduler, and waits for both writes to finish. When it completes, you have consolidated Zarr stores under `FEATURES_ZARR` that store the data computed in the previous steps. ```ipython3 feature_path = FEATURES_ZARR / "sentinel2_2022_annual.zarr" labels_path = FEATURES_ZARR / "worldcover_2022_annual.zarr" feature_future = client.compute( feature_ds.to_zarr( feature_path, consolidated=True, mode="w", compute=False, ) ) labels_future = client.compute( labels_cube.to_dataset(name="worldcover").to_zarr( labels_path, consolidated=True, mode="w", compute=False, ) ) wait([feature_future, labels_future]) ``` ## Stage 2 · Train and Evaluate the Model Now that the data has no cloudy/no-data pixels, composited to an annual median over the entire AOI and stored in Zarr stores, we will focus on training a model for LULC classification. To keep with the theme of using the GPU for all aspects of this workflow, we will use cuML’s Random Forest model as the classifier. The following steps focus on loading data from the already prepared Zarr stores using Xarray, filtering relevant label classes from this data, flattening the data and sending the data to the GPU using cupy-xarray and finally training and evaluating the Random Forest model. The trained model is then saved to a pickle file for easy inference. ### 1. 
Define Training Targets and Class Metadata

Set the random seed, the split ratio, and the WorldCover classes that remain valid after filtering when you reload the data. Adjust `target_year` if you want a different label slice, tweak `train_fraction` to control how many pixels feed the model versus evaluation, and customize the `worldcover_classes` mapping to reflect the categories you plan to predict. In this example, we ignore the snow/ice and cloud classes from the `worldcover_classes` variable below, as we are calculating an annual median reflectance for each pixel and seasonal/ephemeral features will not be represented appropriately in this training data. Keeping the valid class list here ensures downstream sampling can discard nodata pixels and pixels with labels representing other classes.

```ipython3
target_year = 2022
random_state = 42
train_fraction = 0.8

worldcover_classes = {
    1: "Water",
    2: "Trees",
    4: "Flooded vegetation",
    5: "Crops",
    7: "Built area",
    8: "Bare ground",
    11: "Rangeland",
}
nodata_value = 0
valid_classes = list(worldcover_classes.keys())
```

### 2. Reload Zarr Stores and Build Feature Stacks

Open the feature and label Zarr stores and gather the bands and indices you need for modeling. Because our data has been reduced to a single annual median for each pixel, we can work on a single GPU now. Using the [cupy-xarray](https://cupy-xarray.readthedocs.io/latest/) library, we convert the Dask-backed Xarrays from the Zarr store into CuPy-backed Xarrays.

```ipython3
feature_ds = xr.open_zarr(feature_path, consolidated=True, chunks=None).load()
label_ds = xr.open_zarr(labels_path, consolidated=True, chunks=None).load()

spectral = (
    feature_ds["bands"]
    .sel(year=target_year)
    .assign_coords(band=[str(b) for b in feature_ds.band.values])
)
ndvi = feature_ds["NDVI"].sel(year=target_year).expand_dims(band=["NDVI"])
ndwi = feature_ds["NDWI"].sel(year=target_year).expand_dims(band=["NDWI"])

features = xr.concat([spectral, ndvi, ndwi], dim="band")
labels = label_ds["worldcover"].sel(year=target_year)

features = features.cupy.as_cupy()
labels = labels.cupy.as_cupy()

band_names = list(features.band.values)
band_names
```

### 3. Spot-Check Feature and Label Rasters

Plot the NDVI, NDWI, and WorldCover layers side by side to make sure the composites and labels look reasonable before sampling. Run this cell to render quicklooks with consistent color ranges for the indices and a discrete palette for the classes. Scan the maps to verify cloud masking, feature contrasts, and class coverage; if something looks off, revisit the preprocessing before training.
```ipython3 ndvi_da = feature_ds["NDVI"].sel(year=target_year).squeeze(drop=True) ndwi_da = feature_ds["NDWI"].sel(year=target_year).squeeze(drop=True) worldcover_da = label_ds["worldcover"].sel(year=target_year).squeeze(drop=True) fig, axes = plt.subplots(1, 3, figsize=(18, 6)) # NDVI quicklook (values already in [-1, 1]) ndvi_da.plot.imshow( ax=axes[0], vmin=-1, vmax=1, cmap="RdYlGn", add_colorbar=True, cbar_kwargs={"label": "NDVI"}, ) axes[0].set_title(f"NDVI for {target_year}") axes[0].set_xlabel("x") axes[0].set_ylabel("y") # NDWI quicklook (same range) ndwi_da.plot.imshow( ax=axes[1], vmin=-1, vmax=1, cmap="BrBG", add_colorbar=True, cbar_kwargs={"label": "NDWI"}, ) axes[1].set_title(f"NDWI for {target_year}") axes[1].set_xlabel("x") axes[1].set_ylabel("y") # WorldCover visualization with discrete colors worldcover_colors = { 1: "#419bdf", 2: "#397d49", 4: "#7a87c6", 5: "#e49635", 7: "#c4281b", 8: "#a59b8f", 11: "#e3e2c3", } classes = list(worldcover_colors.keys()) cmap = ListedColormap([worldcover_colors[k] for k in classes]) norm = BoundaryNorm(classes + [classes[-1] + 1], cmap.N) worldcover_da.plot.imshow( ax=axes[2], cmap=cmap, norm=norm, add_colorbar=False, ) axes[2].set_title("ESA WorldCover") axes[2].set_xlabel("x") axes[2].set_ylabel("y") # Legend for WorldCover handles = [ plt.Line2D( [0], [0], marker="s", color="none", markerfacecolor=worldcover_colors[k], markersize=10, ) for k in classes ] axes[2].legend( handles, [f"{k}: {worldcover_classes[k]}" for k in classes], loc="upper right", frameon=True, ) plt.tight_layout() plt.show() ``` ### 3. Filter Valid Pixels and Move Samples to GPU Memory Run these cells together to filter usable training samples, flatten the rasters into tabular form, and stage the data on the GPU. First, build a mask that requires finite feature values, finite labels, and is a part of our valid class list. Then we apply that mask in place so cloudy or nodata pixels drop out. Next, rasters are stacked into a `(samples × bands)` layout and any rows with missing values are discarded. ```ipython3 valid_mask_cp = ( cp.isfinite(features.data).all(axis=0) & cp.isfinite(labels.data) & (labels.data != nodata_value) & cp.isin(labels.data, cp.asarray(valid_classes)) ) # Broadcast the mask over the band dimension features_data = cp.where(valid_mask_cp[None, :, :], features.data, cp.nan) labels_data = cp.where(valid_mask_cp, labels.data, nodata_value) features = xr.DataArray( features_data, coords=features.coords, dims=features.dims, name=features.name, ) labels = xr.DataArray( labels_data, coords=labels.coords, dims=labels.dims, name=labels.name, ) ``` ```ipython3 stacked_features = features.stack(sample=("y", "x")).transpose("sample", "band") stacked_labels = labels.stack(sample=("y", "x")) flat_features = stacked_features.data.astype(cp.float32, copy=False) flat_labels = stacked_labels.data.astype(cp.int32, copy=False) valid_rows = cp.isfinite(flat_labels) valid_rows &= cp.isfinite(flat_features).all(axis=1) flat_features = flat_features[valid_rows] flat_labels = flat_labels[valid_rows] ``` ### 4. Split Samples and Convert to cuDF Tables Next we shuffle the GPU-resident samples with the configured random seed, carve out the training/test split, and wrap the arrays in cuDF structures. Run these two cells to partition the CuPy features and labels according to `train_fraction`, then convert each subset into cuDF DataFrames and Series labeled with the band names. cuML estimators consume these cuDF objects directly, so you’re ready to fit the model next. 
```ipython3 num_samples = flat_features.shape[0] perm = cp.random.RandomState(random_state).permutation(num_samples) train_size = int(train_fraction * num_samples) train_idx = perm[:train_size] test_idx = perm[train_size:] X_train_cp = flat_features[train_idx] y_train_cp = flat_labels[train_idx] X_test_cp = flat_features[test_idx] y_test_cp = flat_labels[test_idx] ``` ```ipython3 X_train_cudf = cudf.DataFrame(X_train_cp, columns=band_names) y_train_cudf = cudf.Series(y_train_cp, name="worldcover") X_test_cudf = cudf.DataFrame(X_test_cp, columns=band_names) y_test_cudf = cudf.Series(y_test_cp, name="worldcover") ``` ### 5. Inspect Class Balance Before Training Plot the per-class pixel counts for the training and validation splits to confirm you carried enough samples forward for each label. Run this cell to visualize the distributions side by side with annotated totals. The strong skew you see here comes from working with a single AOI over an urban area (Boston in this example) in which most pixels fall into the “Built Area” class, which is class 7 on our index. When you broaden the footprint or add a variety of different scenes, other classes accumulate more support to reduce data bias. ```ipython3 train_counts = y_train_cudf.value_counts().sort_index().to_pandas() test_counts = y_test_cudf.value_counts().sort_index().to_pandas() fig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True) train_counts.plot(kind="bar", ax=ax[0], color="steelblue", edgecolor="black") ax[0].set_title("Training Set Class Counts") ax[0].set_xlabel("WorldCover Class") ax[0].set_ylabel("Pixels") test_counts.plot(kind="bar", ax=ax[1], color="darkorange", edgecolor="black") ax[1].set_title("Validation Set Class Counts") ax[1].set_xlabel("WorldCover Class") # Annotate bars for axis, counts in zip(ax, [train_counts, test_counts], strict=False): for patch, val in zip(axis.patches, counts.values, strict=False): axis.annotate( f"{int(val):,}", (patch.get_x() + patch.get_width() / 2, patch.get_height()), ha="center", va="bottom", fontsize=9, xytext=(0, 4), textcoords="offset points", ) plt.tight_layout() plt.show() ``` ### 6. Train a cuML Random Forest on the GPU Instantiate the cuML `RandomForestClassifier` with your chosen tree count, histogram bins, and streams, then fit it on the cuDF training table. Run this cell to launch GPU-accelerated training; the estimator consumes the cuDF inputs directly and keeps the model resident on device for rapid evaluation in the following steps. ```ipython3 rf = RandomForestClassifier( n_estimators=300, n_bins=256, n_streams=4, bootstrap=True, split_criterion="gini", random_state=random_state, ) rf.fit(X_train_cudf, y_train_cudf) ``` While the model training is in progress, if you want to visualize the system hardware metrics and understand GPU and memory consumption within your Jupyterlab environment, consider using the [NVDashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) Jupyterlab extension. ### 7. Score the Model and Build a Confusion Matrix Evaluate the trained forest on the validation split by predicting in GPU memory and computing accuracy with cuML. Convert the predictions to Pandas dataframes for diagnostics, then build a confusion matrix aligned with your class list. Run this cell to print the headline accuracy and produce an `xarray.DataArray` you can visualize in the next step. 
```ipython3 pred_gpu = rf.predict(X_test_cudf) val_acc = accuracy_score(y_test_cudf, pred_gpu) print(f"Validation accuracy: {val_acc:.3f}") pred_cpu = pred_gpu.to_pandas() test_cpu = y_test_cudf.to_pandas() cm = confusion_matrix(test_cpu, pred_cpu, labels=valid_classes) cm_da = xr.DataArray( cm, coords={"actual": valid_classes, "predicted": valid_classes}, dims=("actual", "predicted"), ) cm_da ``` ```myst-ansi Validation accuracy: 0.799 ``` ### 8. Visualize Confusion Matrix and Interpret Class Coverage Plot the confusion matrix to see where the model succeeds and where the sparse classes fall short. Run this cell to render the heatmap. #### NOTE Note how the undersampled types (flooded vegetation, bare ground) attract almost no predicted pixels. That’s a direct consequence of the skewed training data; expand the AOI or add more scenes when you need reliable performance on those categories. ```ipython3 cm = confusion_matrix(test_cpu, pred_cpu, labels=valid_classes) labels_pretty = [worldcover_classes[c] for c in valid_classes] plt.figure(figsize=(6, 5)) sns.heatmap( cm, annot=True, fmt="d", cmap="Blues", xticklabels=labels_pretty, yticklabels=labels_pretty, ) plt.title("Validation Confusion Matrix") plt.xlabel("Predicted") plt.ylabel("Actual") plt.tight_layout() plt.show() ``` ### 9. Review Precision/Recall by Class Summarize model performance with per-class precision and recall. Run this cell to chart the scores; it highlights that classes without any predicted pixels (flooded vegetation, crops, bare ground) drop to zero precision and recall. Use this metric to identify subclasses that need more samples to balance the dataset ```ipython3 report = classification_report( test_cpu, pred_cpu, labels=valid_classes, output_dict=True, ) pr_table = pd.DataFrame(report).loc[["precision", "recall"], map(str, valid_classes)].T pr_table.index = labels_pretty ax = pr_table.plot(kind="bar", linewidth=0.8, edgecolor="black", figsize=(7, 4)) ax.set_title("Precision & Recall per Class") ax.set_ylabel("Score") ax.set_ylim(0, 1) ax.set_xticklabels(labels_pretty, rotation=45, ha="right") ax.legend(loc="lower right") for i, container in enumerate(ax.containers): shift = -0.05 if i == 0 else 0.05 # adjust per container for patch, val in zip(container.patches, container.datavalues, strict=False): ax.annotate( f"{val:.2f}", ( patch.get_x() + patch.get_width() / 2 + shift, patch.get_height(), ), ha="center", va="bottom", fontsize=9, xytext=(0, 3), textcoords="offset points", ) plt.tight_layout() plt.show() ``` ```myst-ansi /raid/jjayabaskar/gis12/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1731: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0]) /raid/jjayabaskar/gis12/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1731: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0]) /raid/jjayabaskar/gis12/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1731: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0]) ``` ### 10. 
Persist the Trained Model Save the fitted Random Forest so you can reload it for inference. Run this cell to create the model directory if needed and pickle the cuML estimator. Keeping a serialized copy lets you deploy predictions without retraining or rerunning the feature pipeline. ```ipython3 with open(str(MODEL_PATH), "wb") as f: pickle.dump(rf, f) ``` ## Stage 3 · Run Inference and Publish the classification tile Now that we have finished training the model, in this stage we load the saved Random Forest model, pull a fresh Sentinel-2 tile, compute the familiar spectral features on the GPU, and classify each pixel. After reshaping the predictions into a raster, we compare them against the tile’s true-color composite to compare the two, then write the result to a Cloud-Optimized GeoTIFF (locally or to S3). ### 1. Reload the Trained Model Start by bringing the serialized Random Forest back into memory. Run this cell to unpickle the estimator you saved in Stage 2; the returned object keeps its GPU parameters and is ready to score new tiles without retraining. ```ipython3 with open(str(MODEL_PATH), "rb") as f: rf = pickle.load(f) rf ``` ### 2. Locate the Sentinel-2 Tile to Score Connect to the Planetary Computer STAC endpoint, supply the Sentinel-2 tile ID you want to score, and stage the COG output path. This example targets `S2B_MSIL2A_20251102T154319_R011_T18TWL_20251102T193858`, a Sentinel-2 tile captured over New York in 2025, so you can inspect the model on unseen data. Run these cells to verify the tile exists and fetch its STAC item before you start feature extraction. ```ipython3 client = Client.open(CATALOG_URL, modifier=planetary_computer.sign_inplace) tile_id = "S2B_MSIL2A_20251102T154319_R011_T18TWL_20251102T193858" # replace as needed UPLOAD_TO_S3 = False cog_path = INFERENCE_OUTPUT_DIR / f"lulc_{tile_id}.tif" ``` ```ipython3 search = client.search( collections=["sentinel-2-l2a"], ids=[tile_id], ) items = list(search.items()) if not items: raise ValueError(f"No Sentinel-2 L2A items found for '{tile_id}'") item = items[0] item ``` ### 3. Fetch and Stack the Target Tile Derive the tile’s EPSG code from the STAC metadata, then build a `stackstac` raster for the bands the model expects. Run this cell to fetch the single-scene cube (one time slice, four spectral bands) at 10 m resolution with 2,048×2,048 chunks. Any missing metadata raises an error so you can choose another tile. The result is ready for GPU feature engineering and matches the training band order. ```ipython3 epsg_code = item.properties.get("proj:epsg") if epsg_code is None: proj_code = item.properties.get("proj:code") if proj_code: epsg_code = int(proj_code.split(":")[-1]) else: raise ValueError("No proj:epsg/proj:code in item") stack = ( stackstac.stack( [item], assets=BANDS, resolution=10, epsg=epsg_code, fill_value=np.nan, chunksize=(1, 1, 2048, 2048), rescale=False, ) .squeeze("time") .assign_coords(band=BANDS) .astype("float32") ) stack ``` ### 4. Inspect the Tile with a Quick True-Color Preview Render a stretched RGB composite so you can sanity-check the tile before running inference. Execute this cell to pull the red, green, and blue bands, normalized to a 99th-percentile stretch, and display a true-color image for the tile corresponding to the `tile_id`. Use the preview to confirm the scene is cloud-free and matches the area you expect. If not, pick a different Sentinel-2 tile ID before proceeding. 
```ipython3 red_np = stack.sel(band="B04").data.compute().astype(np.float32) green_np = stack.sel(band="B03").data.compute().astype(np.float32) blue_np = stack.sel(band="B02").data.compute().astype(np.float32) rgb_np = np.stack([red_np, green_np, blue_np], axis=0) stretch = np.nanpercentile(rgb_np, 99) rgb_np = np.clip(rgb_np / stretch, 0, 1) rgb_img = np.moveaxis(rgb_np, 0, -1) plt.figure(figsize=(6, 6)) plt.imshow(rgb_img) plt.title(f"Sentinel-2 True Color for {tile_id}") plt.axis("off") plt.show() ``` ### 5. Engineer Features on the GPU and Run Inference Compute the same band stack, NDVI, and NDWI features the model saw during training, flatten them to per-pixel rows, and drop any null values before prediction. Run these cells to convert the Sentinel tile into a cuDF table, call `rf.predict`, and inspect the label counts. The output shows the model heavily favors classes 7, 2, and 11, exactly the bias learned from the training AOI. ```ipython3 b02 = cp.asarray(stack.sel(band="B02").data) b03 = cp.asarray(stack.sel(band="B03").data) b04 = cp.asarray(stack.sel(band="B04").data) b08 = cp.asarray(stack.sel(band="B08").data) ndvi = (b08 - b04) / (b08 + b04) ndwi = (b03 - b08) / (b03 + b08) y_coords = stack.y.values x_coords = stack.x.values feature_stack = cp.stack([b02, b03, b04, b08, ndvi, ndwi], axis=0) flat = feature_stack.reshape(len(ALL_FEATURES), -1).T mask = cp.isfinite(flat).all(axis=1) flat_valid = flat[mask] ``` ```ipython3 features_df = cudf.DataFrame(flat_valid, columns=ALL_FEATURES) preds = rf.predict(features_df) preds ``` ```ipython3 preds.value_counts() ``` ### 6. Reshape Predictions Back to the Tile Grid Restore the flat predictions to their native image layout so you can visualize and export them. This cell fills a nodata-initialized array, drops the predictions into the valid-pixel slots, reshapes everything to the tile’s `y×x` grid, and wraps the result in an `xarray.DataArray` with coordinates and metadata (tile ID, model name, acquisition datetime). Upon running it, the returned `pred_da` matches the raster geometry needed for plotting and COG creation. ```ipython3 full = cp.full(mask.shape[0], NODATA_VALUE, dtype=cp.int16) full[mask] = preds.astype(cp.int16) h, w = len(y_coords), len(x_coords) grid = full.reshape(h, w) pred_da = xr.DataArray( cp.asnumpy(grid), coords={"y": y_coords, "x": x_coords}, dims=("y", "x"), name="worldcover_prediction", ).where(cp.asnumpy(mask.reshape(h, w))) pred_da.attrs.update( { "tile_id": item.id, "model": MODEL_PATH.name, "datetime": item.properties["datetime"], } ) pred_da ``` ### 7. Compare the True-Color Tile with Model Predictions Plot the Sentinel-2 RGB composite alongside the inferred land-cover map to sanity-check the output. Run this cell to render both views with matching coordinates and a legend keyed to the WorldCover classes. Use the side-by-side comparison to see where the classifier follows the imagery and where it inherits the training bias. Urban areas (class 7) dominate, while scarcely sampled categories remain rare. However, notice that the trained LULC classification model did a good job in distinguishing water and vegetation classes from built area. 
```ipython3 # Build an RGB DataArray using the tile’s coordinates rgb_da = xr.DataArray( rgb_img, dims=("y", "x", "band"), coords={"y": stack.y.values, "x": stack.x.values, "band": ["R", "G", "B"]}, ) worldcover_colors = { 1: "#419bdf", 2: "#397d49", 4: "#7a87c6", 5: "#e49635", 7: "#c4281b", 8: "#a59b8f", 11: "#e3e2c3", } classes = list(worldcover_colors.keys()) cmap = ListedColormap([worldcover_colors[i] for i in classes]) norm = BoundaryNorm(classes + [classes[-1] + 1], cmap.N) fig, axes = plt.subplots(1, 2, figsize=(14, 6), constrained_layout=True) # Left: true-color tile (coordinates from stack) rgb_da.plot.imshow( ax=axes[0], rgb="band", add_colorbar=False, ) axes[0].set_title("Sentinel-2 True Color") axes[0].set_aspect("equal") axes[0].axis("off") # Right: predicted classes pred_da.plot.imshow( ax=axes[1], cmap=cmap, norm=norm, add_colorbar=False, ) axes[1].set_title("Model Prediction (WorldCover classes)") axes[1].set_aspect("equal") axes[1].axis("off") legend_handles = [ plt.Line2D( [0], [0], marker="s", color="none", markerfacecolor=worldcover_colors[c], markersize=12, linestyle="", label=f"{c}: {worldcover_classes[c]}", ) for c in classes ] axes[1].legend( handles=legend_handles, loc="lower center", bbox_to_anchor=(0.5, -0.05), ncol=3, frameon=False, ) plt.show() ``` ### 8. Export Predictions as a Cloud-Optimized GeoTIFF Write the labeled raster to disk so you can share or publish the map. This cell stamps CRS and transform metadata on `pred_da`, writes a temporary GeoTIFF, and uses `cog_translate` to produce the final COG either locally or directly to S3 (toggle `UPLOAD_TO_S3` as needed). Run it to generate the `lulc_.tif` output; the temporary file is removed automatically when saving locally. ```ipython3 INFERENCE_OUTPUT_DIR.mkdir(parents=True, exist_ok=True) temp_tif = cog_path.with_suffix(".tmp.tif") pred_da.rio.write_crs(stack.rio.crs, inplace=True) pred_da.rio.write_transform(stack.rio.transform(), inplace=True) pred_da.rio.to_raster(temp_tif, dtype="int16") if UPLOAD_TO_S3: s3_vsis = f"/vsis3/{S3_BUCKET}/{S3_PREFIX.rstrip('/')}/{cog_path.name}" cog_translate(temp_tif, s3_vsis, COG_PROFILE, in_memory=False, quiet=False) else: cog_translate(temp_tif, cog_path, COG_PROFILE, in_memory=False, quiet=False) temp_tif.unlink(missing_ok=True) print("Saved COG:", cog_path) ``` ```myst-ansi /raid/jjayabaskar/gis12/lib/python3.12/site-packages/rioxarray/raster_writer.py:301: RuntimeWarning: invalid value encountered in cast data = encode_cf_variable(out_data.variable).values.astype( Reading input: /raid/jjayabaskar/full-outputs/inference/lulc_S2B_MSIL2A_20251102T154319_R011_T18TWL_20251102T193858.tmp.tif Adding overviews... Updating dataset tags... Writing output to: /raid/jjayabaskar/full-outputs/inference/lulc_S2B_MSIL2A_20251102T154319_R011_T18TWL_20251102T193858.tif ``` ```myst-ansi Saved COG: /raid/jjayabaskar/full-outputs/inference/lulc_S2B_MSIL2A_20251102T154319_R011_T18TWL_20251102T193858.tif ``` ## Summary We have successfully built a end-to-end ML workflow on Sentinel‑2 and ESA WorldCover imagery from acquisition to generating a LULC classification map on an unseen tile, with all processing occurring exclusively on the GPU. We also touched upon a host of libraries within the RAPIDS ecosystem, with the workflow streaming scenes into the GPU using Dask, cleaning and compositing them with Dask and CuPy backed xarray, training a cuML random forest model, and generating predictions into a Cloud-Optimized GeoTIFF. 
This shows how the RAPIDS ecosystem can accelerate all aspects of a typical ML workflow, including geospatial preprocessing, model training, and inference.

## Future Steps

If you are interested in going further, the next step to improve the classification model is to correct the class imbalance that surfaced in evaluation and inference. This can be done by expanding the AOI, adding more tiles or seasons, or using stratified sampling so that crops, flooded vegetation, and bare ground gain enough pixels. Re-run the GPU pipeline on that richer dataset, track the class histograms, and confirm the confusion matrix closes the gap.

# index.html.md

# Deploying End-to-End Kafka Streaming SI Detection Pipeline with cuDF, Morpheus, and Triton on EKS

*June, 2025*

In this example workflow, we demonstrate how to deploy an NVIDIA GPU-accelerated streaming pipeline for Sensitive Information (SI) detection using [Morpheus](https://docs.nvidia.com/morpheus/), [cuDF](https://docs.rapids.ai/api/cudf/stable/), and [Triton Inference Server](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/) on [Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html). We build upon the existing Morpheus [NLP SI Detection example](https://docs.nvidia.com/morpheus/examples/nlp_si_detection/readme.html) and enhance it to showcase a production-style end-to-end deployment integrated with [Apache Kafka](https://kafka.apache.org/) for data streaming.

The pipeline, under `pipeline-dockerfile/run_pipeline_kafka.py` in the side panel, includes the following components:

- **Kafka Data Streaming Source Stage**: We introduce Apache Kafka for streaming data. A custom Kafka producer was created to continuously publish network data to a Kafka topic. Code under `producer-dockerfile/producer.py` in the side panel.
- **cuDF Message Filtering Stage**: The data stream first flows through a message filtering stage that leverages `cuDF` to preprocess and filter messages based on custom logic. Code under `pipeline-dockerfile/message_filter_stage.py` in the side panel.
- **SI Detection with Morpheus and Triton**: The filtered data passes through multiple stages to prepare data for inference, perform the inference, and classify the data. We use Morpheus’ provided NLP SI Detection model to identify potentially sensitive information in the network packet data. For more details on the model, check the original example in the [Morpheus documentation](https://docs.nvidia.com/morpheus/examples/nlp_si_detection/readme.html#background).
- **cuDF Network Traffic Analysis Stage**: We incorporate an additional analysis stage using `cuDF` to perform some network traffic analytics for enriched context and anomaly detection. Code under `pipeline-dockerfile/network_traffic_analyzer_stage.py` in the side panel.
- **Kafka Output Sink**: Finally, the processed and enriched data, with SI detection results and traffic insights, is published to a downstream Kafka topic for further processing, alerting, or storage.

The entire pipeline is containerized and deployed on **Amazon EKS**, leveraging Kubernetes for orchestration, scalability, and resiliency in a cloud-native environment.

## Deployment Components

The pipeline is deployed on Amazon EKS using several Kubernetes manifests:

### Kafka Deployment (`k8s/kafka`)

The Kafka cluster is deployed using the [Strimzi Operator](https://strimzi.io/), which simplifies Kafka deployment and management on Kubernetes.
See the instructions in the [Deploying on EKS](#deploying-on-eks) section.

The deployment configuration includes:

- Kafka cluster setup `kafka-single-node.yaml`, a modification of the file [https://strimzi.io/examples/latest/kafka/kafka-single-node.yaml](https://strimzi.io/examples/latest/kafka/kafka-single-node.yaml) where we modify:
  - The cluster name to `kafka-cluster`.
  - The volume to use `type: ephemeral` with `sizeLimit: 5Gi` (instead of the `size: 100Gi` that corresponded to `type: persistent-claim`).
- Kafka topics setup.
- [Kafka UI](https://github.com/provectus/kafka-ui).

### Kafka Producer Deployment (`k8s/kafka-producer`)

The Kafka producer is deployed as a separate Pod using the `kafka-producer.yaml` manifest. It continuously generates and publishes network data to the Kafka topic.

- Uses `kafka-python` for message production.
- Contains the producer script for generating network data.

This producer script is containerized using a custom Docker image that is already built and public. But if you want to build and push this image yourself, you need to:

- Log in to Docker with `docker login`.
- Download the `scripts` directory from the sidebar.
- Navigate to the `producer-dockerfile` directory and run:

```default
docker build -t /kafka-producer-image:latest .
```

- Push the image to Docker.
- Replace the image link in `kafka-producer/kafka-producer.yaml`.

### Triton-Morpheus Deployment (`k8s/triton`)

The inference server is deployed using the NVIDIA Morpheus-Triton Inference Server docker image `nvcr.io/nvidia/morpheus/morpheus-tritonserver-models:25.02`.

### Morpheus Pipeline Deployment (`k8s/morpheus-pipeline`)

The core processing pipeline is deployed as a separate Pod that uses a custom image we created for this purpose.

- Runs the Morpheus nightly conda build.
- Contains all pipeline and stage scripts `scripts/pipeline-dockerfile/*.py`.
- Processes the streaming data through the various stages.

This image is already built and public. But if you want to build and push this image yourself, you need to:

- Log in to Docker with `docker login`.
- Download the `scripts` directory from the sidebar.
- Navigate to the `pipeline-dockerfile` directory and run:

```default
docker build -t /morpheus-pipeline-image:latest .
```

- Push the image to Docker.
- Replace the image link in `morpheus-pipeline/morpheus-pipeline-deployment.yaml`.

## Deploying on EKS

### Prerequisites

You need to have the [`aws` CLI tool](https://aws.amazon.com/cli/) and [`eksctl` CLI tool](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) installed, along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/) for managing Kubernetes.

### Launch GPU-enabled EKS cluster

We launch a GPU-enabled EKS cluster with `eksctl`.

#### NOTE

1. You will need to create or import a public SSH key to be able to execute the following command. In your AWS console, under `EC2` in the side panel under **Network & Security** > **Key Pairs**, you can create a key pair or import one you’ve created locally (see the “Actions” dropdown).
2. If you are not using your default AWS profile, add `--profile ` to the following command.

```console
$ eksctl create cluster morpheus-rapids \
  --version 1.32 \
  --nodes 2 \
  --node-type=g4dn.xlarge \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key \ # Name assigned during creation of your key in aws console\
  --region us-east-1 \
  --zones=us-east-1c,us-east-1b,us-east-1d \
  --auto-kubeconfig
```

To access the cluster, we need to pull down the credentials.
Add `--profile ` if you are not using the default profile.

```console
$ aws eks --region us-east-1 update-kubeconfig --name morpheus-rapids
```

### Deploy the Strimzi Operator

[Strimzi](https://strimzi.io/) is an open-source project that provides a way to run Apache Kafka on Kubernetes. It simplifies the deployment and management of Kafka clusters by providing a Kubernetes operator that handles the complex tasks of setting up and maintaining Kafka.

We use `kubectl` to deploy the operator. In our case we are deploying everything in the default namespace, and the entire pipeline is designed for that.

```console
$ kubectl create -f 'https://strimzi.io/install/latest?namespace=default'
```

### Deploy the pipeline

Get all the files in the `k8s` directory; you should be able to download them from the sidebar, or you can find them in [https://github.com/rapidsai/deployment/source/examples/rapids-morpheus-pipeline/k8s](https://github.com/rapidsai/deployment/source/examples/rapids-morpheus-pipeline/k8s)

```console
$ kubectl apply -f k8s --recursive
```

It will take around 15 minutes to get all the Pods up and running; for a while you will see the `morpheus-pipeline` Pod fail and try to reconcile. This happens because the Triton inference Pod takes a while to get up and running.

### Kafka UI: checking the pipeline results

Once all the Pods are running, you can check the input topic and the results topic in the Kafka UI by forwarding the port to your local host:

```console
$ kubectl port-forward svc/kafka-ui 8080:80
```

In your browser go to `http://localhost:8080/` and you will see:

![Kafka UI demo](images/morpheus-pipeline-KafkaUI_9MB.gif)

## Conclusion

This example demonstrates how to build and deploy a production-like, GPU-accelerated streaming pipeline for sensitive information detection using NVIDIA RAPIDS, Morpheus, and Triton Inference Server on Amazon EKS, while integrating Apache Kafka for data streaming capabilities.

This architecture showcases how modern streaming technologies combine with GPU-accelerated inference to create efficient, production-grade solutions for sensitive information detection.

# index.html.md

# Train and Hyperparameter-Tune with RAPIDS on AzureML

*August, 2023*

Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune. In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on the [Azure Machine Learning (AzureML)](https://azure.microsoft.com/en-us/products/machine-learning) service.

# Prerequisites

# Initialize Workspace

Initialize the `MLClient` [class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.mlclient?view=azure-python) to handle the workspace you created in the prerequisites step. You can manually provide the workspace details or call `MLClient.from_config(credential, path)` to create a workspace object from the details stored in `config.json`.

```ipython3
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Get a handle to the workspace.
#
# Azure ML places the workspace config at the default working
# directory for notebooks by default.
#
# If it isn't found, open a shell and look in the
# directory indicated by 'echo ${JUPYTER_SERVER_ROOT}'.
ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path="./config.json",
)
```

# Access Data from Datastore URI

In this example, we will use 20 million rows of the airline dataset.
The [datastore uri](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview) below references a data storage location (path) containing the Parquet files.

```ipython3
datastore_name = "workspaceartifactstore"
dataset = "airline_20000000.parquet"

# Datastore uri format:
data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}"

print("data uri:", "\n", data_uri)
```

# Create AML Compute

You will need to create an Azure ML managed compute target ([AmlCompute](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?view=azureml-api-2&tabs=python)) to serve as the environment for training your model.

This notebook will use 10 nodes for hyperparameter optimization; you can modify `max_instances` based on the available quota in the desired region. Similar to other Azure ML services, there are limits on AmlCompute; this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota.

`size` describes the virtual machine type and size that will be used in the cluster. See “System Requirements” in the RAPIDS docs ([link](https://docs.rapids.ai/install#system-req)) and “GPU optimized virtual machine sizes” in the Azure docs ([link](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu)) to identify an instance type.

Let’s create an `AmlCompute` cluster of `Standard_NC12s_v3` (Tesla V100) GPU VMs:

```ipython3
from azure.ai.ml.entities import AmlCompute
from azure.ai.ml.exceptions import MlException

# specify aml compute name.
target_name = "rapids-cluster"

try:
    # let's see if the compute target already exists
    gpu_target = ml_client.compute.get(target_name)
    print(f"found compute target. Will use {gpu_target.name}")
except MlException:
    print("Creating a new gpu compute target...")

    gpu_target = AmlCompute(
        name=target_name,
        type="amlcompute",
        size="STANDARD_NC12S_V3",
        max_instances=5,
        idle_time_before_scale_down=300,
    )
    ml_client.compute.begin_create_or_update(gpu_target).result()

    print(
        f"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}"
    )
```

# Prepare training script

Make sure the current directory contains your code to run on the remote resource. This includes the training script and all of its dependency files. In this example, the training script is provided:

`train_rapids.py`: the entry script for the RAPIDS environment, which loads the dataset into a cuDF dataframe, trains a Random Forest model, and runs inference with cuML.

We will log some parameters and metrics, including the highest accuracy, using mlflow within the training script:

```console
import mlflow

mlflow.log_metric('Accuracy', float(global_best_test_accuracy))
```

These run metrics will become particularly important when we begin hyperparameter tuning our model in the ‘Tune model hyperparameters’ section.

# Train Model on Remote Compute

## Setup Environment

We’ll be using a custom RAPIDS docker image to [set up the environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image). This is available in the `rapidsai/base` repo on [DockerHub](https://hub.docker.com/r/rapidsai/base/).
```ipython3
%%bash
# create a Dockerfile defining the image the code will run in
# (pick a current tag for the rapidsai/base image from DockerHub)
cat > ./Dockerfile <<EOF
FROM rapidsai/base:<tag>

RUN conda install --yes -c conda-forge 'dask-ml>=2024.4.4' \
    && pip install azureml-mlflow
EOF
```

Make sure the path to the Docker build context, passed below as `os.getcwd()`, is correct.

```ipython3
import os

from azure.ai.ml.entities import BuildContext, Environment

env_docker_image = Environment(
    build=BuildContext(path=os.getcwd()),
    name="rapids-hpo",
    description="RAPIDS environment with azureml-mlflow",
)

ml_client.environments.create_or_update(env_docker_image)
```

## Submit the Training Job

We will configure and run a training job using the `command` class. The [command](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) can be used to run standalone jobs or as a function inside pipelines. `inputs` is a dictionary of command-line arguments to pass to the training script.

```ipython3
from azure.ai.ml import Input, command

command_job = command(
    environment=f"{env_docker_image.name}:{env_docker_image.version}",
    experiment_name="test_rapids_aml_hpo_cluster",
    code=os.getcwd(),
    inputs={
        "data_dir": Input(type="uri_file", path=data_uri),
        "n_bins": 32,
        "compute": "single-GPU",  # multi-GPU for algorithms via Dask
        "cv_folds": 5,
        "n_estimators": 100,
        "max_depth": 6,
        "max_features": 0.3,
    },
    command="python train_rapids.py \
                    --data_dir ${{inputs.data_dir}} \
                    --n_bins ${{inputs.n_bins}} \
                    --compute ${{inputs.compute}} \
                    --cv_folds ${{inputs.cv_folds}} \
                    --n_estimators ${{inputs.n_estimators}} \
                    --max_depth ${{inputs.max_depth}} \
                    --max_features ${{inputs.max_features}}",
    compute=gpu_target.name,
)

# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)

# get a URL for the status of the job
returned_job.studio_url
```

# Tune Model Hyperparameters

We can optimize our model’s hyperparameters and improve the accuracy using Azure Machine Learning’s hyperparameter tuning capabilities.

## Start a Hyperparameter Sweep

Let’s define the hyperparameter space to sweep over. We will tune the `n_estimators`, `max_depth`, and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`.

```ipython3
from azure.ai.ml.sweep import Choice, Uniform

command_job_for_sweep = command_job(
    n_estimators=Choice(values=range(50, 500)),
    max_depth=Choice(values=range(5, 19)),
    max_features=Uniform(min_value=0.2, max_value=1.0),
)

# apply sweep parameter to obtain the sweep_job
sweep_job = command_job_for_sweep.sweep(
    compute=gpu_target.name,
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
)

# Relax these limits to run more trials
sweep_job.set_limits(
    max_total_trials=5, max_concurrent_trials=5, timeout=18000, trial_timeout=3600
)

# Specify your experiment details
sweep_job.display_name = "RF-rapids-sweep-job"
sweep_job.description = "Run RAPIDS hyperparameter sweep job"
```

This will launch the RAPIDS training script with the parameters that were specified in the cell above.
```ipython3 # submit the hpo job returned_sweep_job = ml_client.create_or_update(sweep_job) ``` ## Monitor runs ```ipython3 print(f"Monitor your job at {returned_sweep_job.studio_url}") ``` ## Find and Register Best Model Download the best trial model output ```ipython3 ml_client.jobs.download(returned_sweep_job.name, output_name="model") ``` # Delete Cluster ```ipython3 ml_client.compute.begin_delete(gpu_target.name).wait() ``` # index.html.md # Deep Dive into Running Hyper Parameter Optimization on AWS SageMaker *February, 2023* [Hyper Parameter Optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization) (HPO) improves model quality by searching over hyperparameters, parameters not typically learned during the training process but rather values that control the learning process itself (e.g., model size/capacity). This search can significantly boost model quality relative to default settings and non-expert tuning; however, HPO can take a very long time on a non-accelerated platform. In this notebook, we containerize a RAPIDS workflow and run Bring-Your-Own-Container SageMaker HPO to show how we can overcome the computational complexity of model search. We accelerate HPO in two key ways: * by *scaling within a node* (e.g., multi-GPU where each GPU brings a magnitude higher core count relative to CPUs), and * by *scaling across nodes* and running parallel trials on cloud instances. By combining these two powers HPO experiments that feel unapproachable and may take multiple days on CPU instances can complete in just hours. For example, we find a **12x** speedup in wall clock time (6 hours vs 3+ days) and a **4.5x** reduction in cost when comparing between GPU and CPU [EC2 Spot instances](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) on 100 XGBoost HPO trials using 10 parallel workers on 10 years of the Airline Dataset (~63M flights) hosted in a S3 bucket. For additional details refer to the end of the notebook. With all these powerful tools at our disposal, every data scientist should feel empowered to up-level their model before serving it to the world! ## Preamble To get things rolling let’s make sure we can query our AWS SageMaker execution role and session as well as our account ID and AWS region. ```ipython3 !docker images ``` ```ipython3 %pip install --upgrade boto3 ``` ```ipython3 import os import sagemaker from helper_functions import ( download_best_model, new_job_name_from_config, recommend_instance_type, summarize_choices, summarize_hpo_results, validate_dockerfile, ) ``` ```ipython3 execution_role = sagemaker.get_execution_role() session = sagemaker.Session() account = !(aws sts get-caller-identity --query Account --output text) region = !(aws configure get region) ``` ```ipython3 account, region ``` ### Key Choices Let’s go ahead and choose the configuration options for our HPO run. Below are two reference configurations showing a small and a large scale HPO (sized in terms of total experiments/compute). The default values in the notebook are set for the small HPO configuration, however you are welcome to scale them up. > **small HPO**: 1_year, XGBoost, 3 CV folds, singleGPU, max_jobs = 10, max_parallel_jobs = 2 > **large HPO**: 10_year, XGBoost, 10 CV folds, multiGPU, max_jobs = 100, max_parallel_jobs = 10 #### Dataset We offer free hosting for several demo datasets that you can try running HPO with, or alternatively you can bring your own dataset (BYOD). 
By default we leverage the `Airline` dataset, a large public tracker of US domestic flight logs, which we offer in various sizes (1 year, 3 year, and 10 year) and in Parquet (compressed column storage) format. The machine learning objective with this dataset is to predict whether flights will be more than 15 minutes late arriving at their destination ([dataset link](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=), additional details in Section 1.1).

As an alternative we also offer the `NYC Taxi` dataset, which captures yellow cab trip details in New York in January 2020, stored in CSV format without any compression. The machine learning objective with this dataset is to predict whether a trip had an above average tip (>$2.20).

We host the demo datasets in public S3 demo buckets in both the **us-east-1** (N. Virginia) and **us-west-2** (Oregon) regions (i.e., `sagemaker-rapids-hpo-us-east-1`, and `sagemaker-rapids-hpo-us-west-2`). You should run the SageMaker HPO workflow in either of these two regions if you wish to leverage the demo datasets, since SageMaker requires that the S3 dataset and the compute you’ll be renting are co-located.

Lastly, if you plan to use your own dataset, refer to the BYOD checklist in the Appendix to help integrate it into the workflow.

| dataset                | data_bucket | dataset_directory | # samples | storage type | time span    |
|------------------------|-------------|-------------------|-----------|--------------|--------------|
| Airline Stats Small    | demo        | 1_year            | 6.3M      | Parquet      | 2019         |
| Airline Stats Medium   | demo        | 3_year            | 18M       | Parquet      | 2019-2017    |
| Airline Stats Large    | demo        | 10_year           | 63M       | Parquet      | 2019-2010    |
| NYC Taxi               | demo        | NYC_taxi          | 6.3M      | CSV          | 2020 January |
| Bring Your Own Dataset | custom      | custom            | custom    | Parquet/CSV  | custom       |

```ipython3
# please choose dataset S3 bucket and directory
data_bucket = "sagemaker-rapids-hpo-" + region[0]
dataset_directory = "10_year"  # '1_year', '3_year', '10_year', 'NYC_taxi'

# please choose output bucket for trained model(s)
model_output_bucket = session.default_bucket()
```

```ipython3
s3_data_input = f"s3://{data_bucket}/{dataset_directory}"
s3_model_output = f"s3://{model_output_bucket}/trained-models"

best_hpo_model_local_save_directory = os.getcwd()
```

#### Algorithm

From an ML/algorithm perspective, we offer [XGBoost](https://xgboost.readthedocs.io/en/latest/#), [RandomForest](https://docs.rapids.ai/api/cuml/nightly/cuml_blogs.html#tree-and-forest-models) and [KMeans](https://docs.rapids.ai/api/cuml/nightly/api.html?highlight=kmeans#cuml.KMeans). You are free to switch between these algorithm choices and everything in the example will continue to work.

```ipython3
# please choose learning algorithm
algorithm_choice = "XGBoost"

assert algorithm_choice in ["XGBoost", "RandomForest", "KMeans"]
```

We can also optionally increase robustness via reshuffles of the train-test split (i.e., [cross-validation folds](https://scikit-learn.org/stable/modules/cross_validation.html)). Typical values here are between 3 and 10 folds.

```ipython3
# please choose cross-validation folds
cv_folds = 10

assert cv_folds >= 1
```

#### ML Workflow Compute Choice

We enable the option of running different code variations that unlock increasing amounts of parallelism in the compute workflow.
* `singleCPU` = [pandas](https://pandas.pydata.org/) + [sklearn](https://scikit-learn.org/stable/)
* `multiCPU` = [dask](https://dask.org/) + [pandas](https://pandas.pydata.org/) + [sklearn](https://scikit-learn.org/stable/)
* RAPIDS `singleGPU` = [cudf](https://github.com/rapidsai/cudf) + [cuml](https://github.com/rapidsai/cuml)
* RAPIDS `multiGPU` = [dask](https://dask.org/) + [cudf](https://github.com/rapidsai/cudf) + [cuml](https://github.com/rapidsai/cuml)

All of these code paths are available in the `/workflows` directory for your reference.

> **Note** that the single-CPU option will leverage multiple cores in the model training portion of the workflow; however, to unlock full parallelism in each stage of the workflow we use [Dask](https://dask.org/).

```ipython3
# please choose code variant
ml_workflow_choice = "multiGPU"

assert ml_workflow_choice in ["singleCPU", "singleGPU", "multiCPU", "multiGPU"]
```

#### Search Ranges and Strategy

One of the most important choices when running HPO is setting the bounds of the hyperparameter search process. Below we’ve set the ranges of the hyperparameters to allow for interesting variation; you are of course welcome to revise these ranges based on domain knowledge, especially if you plan to plug in your own dataset.

> Note that we support additional algorithm-specific parameters (refer to the `parse_hyper_parameter_inputs` function in `HPOConfig.py`), but for demo purposes have limited our choice to the three parameters that overlap between the XGBoost and RandomForest algorithms. For more details see the documentation for [XGBoost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) and [RandomForest parameters](https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest). Since KMeans uses different parameters, we adjust accordingly.

```ipython3
# please choose HPO search ranges
hyperparameter_ranges = {
    "max_depth": sagemaker.parameter.IntegerParameter(5, 15),
    "n_estimators": sagemaker.parameter.IntegerParameter(100, 500),
    "max_features": sagemaker.parameter.ContinuousParameter(0.1, 1.0),
}  # see note above for adding additional parameters
```

```ipython3
if "XGBoost" in algorithm_choice:
    # number of trees parameter name difference b/w XGBoost and RandomForest
    hyperparameter_ranges["num_boost_round"] = hyperparameter_ranges.pop("n_estimators")
```

```ipython3
if "KMeans" in algorithm_choice:
    hyperparameter_ranges = {
        "n_clusters": sagemaker.parameter.IntegerParameter(2, 20),
        "max_iter": sagemaker.parameter.IntegerParameter(100, 500),
    }
```

We can also choose between a Random and Bayesian search strategy for picking parameter combinations.

**Random Search**: Choose a random combination of values from within the ranges for each training job it launches. The choice of hyperparameters doesn’t depend on previous results, so you can run the maximum number of concurrent workers without affecting the performance of the search.

**Bayesian Search**: Make a guess about which hyperparameter combinations are likely to get the best results. After testing the first set of hyperparameter values, hyperparameter tuning uses regression to choose the next set of hyperparameter values to test.

```ipython3
# please choose HPO search strategy
search_strategy = "Random"

assert search_strategy in ["Random", "Bayesian"]
```

#### Experiment Scale

We also need to decide how many total experiments to run, and how many should run in parallel.
Below we have a very conservative number of maximum jobs to run so that you don’t accidentally spawn large computations when starting out, however for meaningful HPO searches this number should be much higher (e.g., in our experiments we often run 100 max_jobs). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for additional `max_parallel_jobs` parallel workers. ```ipython3 # please choose total number of HPO experiments[ we have set this number very low to allow for automated CI testing ] max_jobs = 100 ``` ```ipython3 # please choose number of experiments that can run in parallel max_parallel_jobs = 10 ``` Let’s also set the max duration for an individual job to 24 hours so we don’t have run-away compute jobs taking too long. ```ipython3 max_duration_of_experiment_seconds = 60 * 60 * 24 ``` #### Compute Platform Based on the dataset size and compute choice we will try to recommend an instance choice\*, you are of course welcome to select alternate configurations. > e.g., For the 10_year dataset option, we suggest ml.p3.8xlarge instances (4 GPUs) and ml.m5.24xlarge CPU instances ( we will need upwards of 200GB CPU RAM during model training). ```ipython3 # we will recommend a compute instance type, feel free to modify instance_type = recommend_instance_type(ml_workflow_choice, dataset_directory) ``` ```myst-ansi recommended instance type : ml.p3.8xlarge instance details : 4x GPUs [ V100 ], 64GB GPU memory, 244GB CPU memory ``` In addition to choosing our instance type, we can also enable significant savings by leveraging [AWS EC2 Spot Instances](https://aws.amazon.com/ec2/spot/). We **highly recommend** that you set this flag to `True` as it typically leads to 60-70% cost savings. Note, however that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) to enable Spot instances in SageMaker. ```ipython3 # please choose whether spot instances should be used use_spot_instances_flag = True ``` ## Validate ```ipython3 summarize_choices( s3_data_input, s3_model_output, ml_workflow_choice, algorithm_choice, cv_folds, instance_type, use_spot_instances_flag, search_strategy, max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds, ) ``` ```myst-ansi s3 data input = s3://sagemaker-rapids-hpo-us-west-2/10_year s3 model output = s3://sagemaker-us-west-2-561241433344/trained-models compute = multiGPU algorithm = XGBoost, 10 cv-fold instance = ml.p3.8xlarge spot instances = True hpo strategy = Random max_experiments = 100 max_parallel = 10 max runtime = 86400 sec ``` **1. ML Workflow** ![](_static/images/examples/rapids-sagemaker-hpo/ml_workflow.png) ### Dataset The default settings for this demo are built to utilize the Airline dataset (Carrier On-Time Performance 1987-2020, available from the [Bureau of Transportation Statistics](https://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)). Below are some additional details about this dataset, we plan to offer a companion notebook that does a deep dive on the data science behind this dataset. Note that if you are using an alternate dataset (e.g., NYC Taxi or BYOData) these details are not relevant. 
The public dataset contains logs/features about flights in the United States (17 airlines) including:

* Locations and distance (`Origin`, `Dest`, `Distance`)
* Airline / carrier (`Reporting_Airline`)
* Scheduled departure and arrival times (`CRSDepTime` and `CRSArrTime`)
* Actual departure and arrival times (`DpTime` and `ArrTime`)
* Difference between scheduled & actual times (`ArrDelay` and `DepDelay`)
* Binary encoded version of lateness, aka our target variable (`ArrDelay15`)

Using these features, we will build a classifier model to predict whether a flight is going to be more than 15 minutes late on arrival as it prepares to depart.

### Python ML Workflow

To build a RAPIDS-enabled SageMaker HPO workflow, we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be triggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data.

In order to work with SageMaker HPO, the entrypoint logic should parse hyperparameters (supplied by AWS SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting. We’ve already built multiple variations of this code.

If you would like to make changes by adding your custom model logic, feel free to modify **train.py** and/or the specific workflow files in the `workflows` directory. You are also welcome to uncomment the cells below to load and review the code.

First, let’s switch our working directory to the location of the Estimator entrypoint and library code.

```ipython3
# %load train.py
```

```ipython3
# %load workflows/MLWorkflowSingleGPU.py
```

## Build Estimator

As we’ve already mentioned, the SageMaker Estimator represents the containerized software stack that AWS SageMaker will replicate to each worker node.

The first step in building our Estimator is to augment a RAPIDS container with our ML workflow code from above and push this image to Amazon Elastic Container Registry so it is available to SageMaker.

### Containerize and Push to ECR

Now let’s turn to building our container so that it can integrate with the AWS SageMaker HPO API.

Our container can be built on top of either the latest RAPIDS [ nightly ] image or the RAPIDS stable image as a starting layer.

```ipython3
rapids_base_container = "rapidsai/base:25.12a-cuda12-py3.13"
```

Let’s also decide on the full name of our container.

```ipython3
image_base = "sagemaker-rapids-mnmg-100"
image_tag = rapids_base_container.split(":")[1]
```

```ipython3
ecr_fullname = (
    f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"
)
```

#### Write Dockerfile

We write out the Dockerfile to disk, and in a few cells execute the docker build command.

Let’s now write our selected RAPIDS image layer as the first FROM statement in the Dockerfile.

```ipython3
with open("Dockerfile", "w") as dockerfile:
    dockerfile.writelines(
        f"FROM {rapids_base_container} \n\n"
        f'ENV AWS_DATASET_DIRECTORY="{dataset_directory}"\n'
        f'ENV AWS_ALGORITHM_CHOICE="{algorithm_choice}"\n'
        f'ENV AWS_ML_WORKFLOW_CHOICE="{ml_workflow_choice}"\n'
        f'ENV AWS_CV_FOLDS="{cv_folds}"\n'
    )
```

Next, let’s append the remaining pieces of the Dockerfile, namely adding dependencies and our Python code.
```ipython3
%%writefile -a Dockerfile

# ensure printed output/log-messages retain correct order
ENV PYTHONUNBUFFERED=True

# install a few more dependencies
RUN conda install --yes -n base \
        -c rapidsai-nightly -c conda-forge -c nvidia \
        cupy \
        dask-ml \
        flask \
        protobuf \
        rapids-dask-dependency=${{ rapids_version }} \
        'sagemaker-python-sdk>=2.239.0'

# path where SageMaker looks for code when container runs in the cloud
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container
COPY . $CLOUD_PATH
WORKDIR $CLOUD_PATH

ENTRYPOINT ["./entrypoint.sh"]
```

```myst-ansi
Appending to Dockerfile
```

Lastly, let’s ensure that our Dockerfile correctly captured our base image selection.

```ipython3
validate_dockerfile(rapids_base_container)
!cat Dockerfile
```

```myst-ansi
ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

ENV AWS_DATASET_DIRECTORY="10_year"
ENV AWS_ALGORITHM_CHOICE="XGBoost"
ENV AWS_ML_WORKFLOW_CHOICE="multiGPU"
ENV AWS_CV_FOLDS="10"

# ensure printed output/log-messages retain correct order
ENV PYTHONUNBUFFERED=True

# install a few more dependencies
RUN conda install --yes -n base \
        cupy \
        flask \
        protobuf \
        'sagemaker-python-sdk>=2.239.0'

# path where SageMaker looks for code when container runs in the cloud
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container
COPY . $CLOUD_PATH
WORKDIR $CLOUD_PATH

ENTRYPOINT ["./entrypoint.sh"]
```

#### Build and Tag

The build step will be dominated by the download of the RAPIDS image (base layer). If it’s already been downloaded, the build will take less than 1 minute.

```ipython3
!docker pull $rapids_base_container
```

```ipython3
!docker images
```

```ipython3
%%time
!docker build -t $ecr_fullname .
```

```ipython3
!docker images
```

#### Publish to Elastic Container Registry (ECR)

Now that we’ve built and tagged our container, it’s time to push it to Amazon’s container registry (ECR). Once in ECR, AWS SageMaker will be able to leverage our image to build Estimators and run experiments.

Docker login to ECR:

```ipython3
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)
```

```ipython3
!{docker_login_str[0]}
```

Create the ECR repository [ if it doesn’t already exist ]:

```ipython3
repository_query = !(aws ecr describe-repositories --repository-names $image_base)
if repository_query[0] == "":
    !(aws ecr create-repository --repository-name $image_base)
```

Let’s now actually push the container to ECR.

> Note the first push to ECR may take some time (hopefully less than 10 minutes).

```ipython3
!docker push $ecr_fullname
```

### Create Estimator

Having built our container [ +custom logic ] and pushed it to ECR, we can finally compile all of our efforts into an Estimator instance.

```ipython3
!docker images
```

```ipython3
# 'volume_size' - EBS volume size in GB, default = 30
estimator_params = {
    "image_uri": ecr_fullname,
    "role": execution_role,
    "instance_type": instance_type,
    "instance_count": 2,
    "input_mode": "File",
    "output_path": s3_model_output,
    "use_spot_instances": use_spot_instances_flag,
    "max_run": max_duration_of_experiment_seconds,  # 24 hours
    "sagemaker_session": session,
}

if use_spot_instances_flag:
    estimator_params.update({"max_wait": max_duration_of_experiment_seconds + 1})
```

```ipython3
estimator = sagemaker.estimator.Estimator(**estimator_params)
```

### Test Estimator

Now we are ready to test by asking SageMaker to run the BYOContainer logic inside our Estimator.
This is a useful step if you’ve made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. > Note: This verification step will use the default hyperparameter values declared in our custom train code, as SageMaker HPO will not be orchestrating a search for this single run. ```ipython3 summarize_choices( s3_data_input, s3_model_output, ml_workflow_choice, algorithm_choice, cv_folds, instance_type, use_spot_instances_flag, search_strategy, max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds, ) ``` ```myst-ansi s3 data input = s3://sagemaker-rapids-hpo-us-west-2/10_year s3 model output = s3://sagemaker-us-west-2-561241433344/trained-models compute = multiGPU algorithm = XGBoost, 10 cv-fold instance = ml.p3.8xlarge spot instances = True hpo strategy = Random max_experiments = 100 max_parallel = 10 max runtime = 86400 sec ``` ```ipython3 job_name = new_job_name_from_config( dataset_directory, region, ml_workflow_choice, algorithm_choice, cv_folds, instance_type, ) ``` ```myst-ansi generated job name : air-mGPU-XGB-10cv-31d03d8b015bfc ``` ```ipython3 estimator.fit(inputs=s3_data_input, job_name=job_name.lower()) ``` ## Run HPO With a working SageMaker Estimator in hand, the hardest part is behind us. In the key choices section we already defined our search strategy and hyperparameter ranges, so all that remains is to choose a metric to evaluate performance on. For more documentation check out the AWS SageMaker [Hyperparameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html). ![](_static/images/examples/rapids-sagemaker-hpo/run_hpo.png) ### Define Metric We only focus on a single metric, which we call ‘final-score’, that captures the accuracy of our model on the test data unseen during training. You are of course welcome to add additional metrics, see [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric we provide a regular expression (i.e., string parsing rule) to extract the key metric from the output of each Estimator/worker. ```ipython3 metric_definitions = [{"Name": "final-score", "Regex": "final-score: (.*);"}] ``` ```ipython3 objective_metric_name = "final-score" ``` ### Define Tuner Finally we put all of the elements we’ve been building up together into a HyperparameterTuner declaration. ```ipython3 hpo = sagemaker.tuner.HyperparameterTuner( estimator=estimator, metric_definitions=metric_definitions, objective_metric_name=objective_metric_name, objective_type="Maximize", hyperparameter_ranges=hyperparameter_ranges, strategy=search_strategy, max_jobs=max_jobs, max_parallel_jobs=max_parallel_jobs, ) ``` ### Run HPO ```ipython3 summarize_choices( s3_data_input, s3_model_output, ml_workflow_choice, algorithm_choice, cv_folds, instance_type, use_spot_instances_flag, search_strategy, max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds, ) ``` ```myst-ansi s3 data input = s3://sagemaker-rapids-hpo-us-west-2/10_year s3 model output = s3://sagemaker-us-west-2-561241433344/trained-models compute = multiGPU algorithm = XGBoost, 10 cv-fold instance = ml.p3.8xlarge spot instances = True hpo strategy = Random max_experiments = 100 max_parallel = 10 max runtime = 86400 sec ``` Let’s be sure we take a moment to confirm before launching all of our HPO experiments. Depending on your configuration options running this cell can kick off a massive amount of computation! 
> Once this process begins, we recommend that you use the SageMaker UI to keep track of the health of the HPO process and the individual workers. ```ipython3 # tuning_job_name = new_job_name_from_config(dataset_directory, region, ml_workflow_choice, # algorithm_choice, cv_folds, # # instance_type) # hpo.fit( inputs=s3_data_input, # job_name=tuning_job_name, # wait=True, # logs='All') # hpo.wait() # block until the .fit call above is completed ``` ### Results and Summary Once your job is complete there are multiple ways to analyze the results. Below we display the performance of the best job, as well printing each HPO trial/job as a row of a dataframe. ```ipython3 tuning_job_name = "air-mGPU-XGB-10cv-527fd372fa4d8d" ``` ```ipython3 hpo_results = summarize_hpo_results(tuning_job_name) ``` ```myst-ansi INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole ``` ```myst-ansi best score: 0.9203665256500244 best params: {'max_depth': '7', 'max_features': '0.29751893065195945', 'num_boost_round': '346'} best job-name: air-mGPU-XGB-10cv-527fd372fa4d8d-042-ed1ff13b ``` ```ipython3 sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe() ``` ```myst-ansi INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole ``` For a more in depth look at the HPO process we invite you to check out the HPO_Analyze_TuningJob_Results.ipynb notebook which shows how we can explore interesting things like the impact of each individual hyperparameter on the performance metric. ### Getting the Best Model Next let’s download the best trained model from our HPO runs. ```ipython3 local_filename, s3_path_to_best_model = download_best_model( model_output_bucket, s3_model_output, hpo_results, best_hpo_model_local_save_directory, ) ``` ```myst-ansi INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole ``` ```myst-ansi Successfully downloaded best model > filename: /home/ec2-user/SageMaker/cloud-ml-examples/aws/best_model.tar.gz > local directory : /home/ec2-user/SageMaker/cloud-ml-examples/aws full S3 path : s3://sagemaker-us-west-2-561241433344/trained-models/air-mGPU-XGB-10cv-527fd372fa4d8d-042-ed1ff13b/output/model.tar.gz ``` ### Model Serving With your best model in hand, you can now move on to [serving this model on SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). In the example below we show you how to build a [RealTimePredictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) using the best model found during the HPO search. We will add a lightweight Flask server to our RAPIDS Estimator (a.k.a., container) which will handle the incoming requests and pass them along to the trained model for inference. If you are curious about how this works under the hood check out the [Use Your Own Inference Server](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html) documentation and reference the code in `serve.py`. If you are interested in additional serving options (e.g., large batch with batch-transform), we plan to add a companion notebook that will provide additional details. 
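For orientation, the serving contract that SageMaker expects from a bring-your-own container is small: the container must answer `GET /ping` health checks and `POST /invocations` prediction requests on port 8080. Below is a minimal, illustrative Flask sketch of that contract. It is not the repository's `serve.py`; the model file name and payload handling here are assumptions made just for this example.

```python
# Minimal sketch of a SageMaker-compatible inference server (illustrative only;
# see serve.py in this example's code for the real implementation).
import json

import flask
import numpy as np
import xgboost as xgb

app = flask.Flask(__name__)

# SageMaker unpacks model.tar.gz into /opt/ml/model inside the serving container.
# The artifact name below is a hypothetical placeholder.
booster = xgb.Booster()
booster.load_model("/opt/ml/model/xgboost-model")


@app.route("/ping", methods=["GET"])
def ping():
    # Health check: SageMaker expects HTTP 200 once the container is ready.
    return flask.Response(status=200)


@app.route("/invocations", methods=["POST"])
def invocations():
    # Parse a JSON-style list of feature rows, like the example_payload used below.
    rows = json.loads(flask.request.data.decode("utf-8"))
    preds = booster.predict(xgb.DMatrix(np.asarray(rows, dtype=np.float32)))
    # Threshold probabilities into 0/1 labels before returning them.
    labels = [float(p > 0.5) for p in preds]
    return flask.Response(
        response=json.dumps(labels), status=200, mimetype="application/json"
    )


if __name__ == "__main__":
    # SageMaker routes traffic to port 8080 inside the container.
    app.run(host="0.0.0.0", port=8080)
```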
#### GPU serving

```ipython3
endpoint_model = sagemaker.model.Model(
    image_uri=ecr_fullname, role=execution_role, model_data=s3_path_to_best_model
)
```

Kick off an instance for prediction [ recommend ‘ml.g4dn.2xlarge’ ].

```ipython3
DEMO_SERVING_FLAG = True

if DEMO_SERVING_FLAG:
    endpoint_model.deploy(
        initial_instance_count=1, instance_type="ml.g4dn.2xlarge"
    )  #'ml.p3.2xlarge'
```

```myst-ansi
INFO:sagemaker:Creating model with name: rapids-sagemaker-mnmg-100-2023-01-23-22-24-22-008
INFO:sagemaker:Creating endpoint-config with name rapids-sagemaker-mnmg-100-2023-01-23-22-24-22-498
INFO:sagemaker:Creating endpoint with name rapids-sagemaker-mnmg-100-2023-01-23-22-24-22-498
```

```myst-ansi
---------!
```

Perform the prediction and return the result(s). Below we’ve compiled examples to sanity test the trained model performance on the Airline dataset.

> The first example is from a 2019 flight that departed nine minutes early;
> the second example is from a 2018 flight that was more than two hours late to depart.

When we run these samples we expect to see **b'[0.0, 1.0]'** as the printed result. We encourage you to modify the queries below, especially if you plug in your own dataset.

```ipython3
if DEMO_SERVING_FLAG:
    predictor = sagemaker.predictor.Predictor(
        endpoint_name=str(endpoint_model.endpoint_name), sagemaker_session=session
    )

    if dataset_directory in ["1_year", "3_year", "10_year"]:
        on_time_example = [
            2019.0,
            4.0,
            12.0,
            2.0,
            3647.0,
            20452.0,
            30977.0,
            33244.0,
            1943.0,
            -9.0,
            0.0,
            75.0,
            491.0,
        ]  # 9 minutes early departure
        late_example = [
            2018.0,
            3.0,
            9.0,
            5.0,
            2279.0,
            20409.0,
            30721.0,
            31703.0,
            733.0,
            123.0,
            1.0,
            61.0,
            200.0,
        ]
        example_payload = str(list([on_time_example, late_example]))
    else:
        example_payload = ""  # fill in a sample payload

    result = predictor.predict(example_payload)
    print(result)
```

```myst-ansi
b'[0.0, 1.0]'
```

Once we are finished with the serving example, we should be sure to clean up and delete the endpoint.

```ipython3
# if DEMO_SERVING_FLAG:
#     predictor.delete_endpoint()
```

## Summary

We’ve now successfully built a RAPIDS ML workflow, containerized it (as a SageMaker Estimator), and launched a set of HPO experiments to find the best hyperparameters for our model. If you are curious to go further, we invite you to plug in your own dataset and tweak the configuration settings to find your champion model!

**HPO Experiment Details**

As mentioned in the introduction, we find a **12x** speedup in wall clock time and a **4.5x** reduction in cost when comparing between GPU and CPU instances on 100 HPO trials using 10 parallel workers on 10 years of the Airline Dataset (~63M flights). In these experiments we used the XGBoost algorithm with multi-GPU vs. multi-CPU Dask clusters and 10 cross-validation folds. Below we offer a table with additional details.

![](_static/images/examples/rapids-sagemaker-hpo/results.png)

In the case of the CPU runs, 12 jobs were stopped since they exceeded the 24-hour limit we set.

*CPU Job Summary Image*

In the case of the GPU runs, no jobs were stopped.

*GPU Job Summary Image*

Note that in both cases 1 job failed because a spot instance was terminated. But 1 failed job out of 100 is a minimal tradeoff for the significant cost savings.

## Appendix

### Bring Your Own Dataset Checklist

If you plan to use your own dataset (BYOD), here is a checklist to help you integrate it into the workflow:

> - [ ] Dataset should be in either CSV or Parquet format.
> - [ ] Dataset is already pre-processed (and all feature-engineering is done).
> - [ ] Dataset is uploaded to S3 and `data_bucket` and `dataset_directory` have been set to the location of your data.
> - [ ] Dataset feature and target columns have been enumerated in `/HPODataset.py`.

### RAPIDS References

> [More Cloud Deployment Workflow Examples](https://docs.rapids.ai/deployment/stable/examples/)

> [RAPIDS HPO](https://rapids.ai/hpo)

> [cuML Documentation](https://docs.rapids.ai/api/cuml/nightly/)

### SageMaker References

> [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)

> [Estimator Parameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

> Spot Instances [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), and [blog](https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/)

# index.html.md

# Getting Started with cudf.pandas and Snowflake

*February, 2025*

## RAPIDS in Snowflake

[RAPIDS](https://rapids.ai/) is a suite of libraries to execute end-to-end data science pipelines entirely on GPUs. If you have data in a [Snowflake](https://www.snowflake.com/) table that you want to explore with RAPIDS, you can deploy RAPIDS in Snowflake using Snowpark Container Services.

## NYC Parking Tickets `cudf.pandas` Example

If you have data in a Snowflake table, you can accelerate your ETL workflow with `cudf.pandas`. With `cudf.pandas` you can accelerate the `pandas` ecosystem, with zero code changes. Just load `cudf.pandas` and you will have the benefits of GPU acceleration, with automatic CPU fallback if needed.

For this example, we have a Snowflake table with the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y/about_data) dataset from NYC Open Data.

### Get data into a Snowflake table

To follow along, you will need to have the NYC Parking Violations data in your Snowflake account, and make sure that this data is accessible from the RAPIDS notebook Snowpark Service Container that you deployed following the [Run RAPIDS on Snowflake](../../platforms/snowflake.md) guide.
In a Snowflake SQL sheet, and with the `ACCOUNTADMIN` role, run:

```sql
-- Create a database where the table would live --
CREATE DATABASE CUDF_SNOWFLAKE_EXAMPLE;

USE DATABASE CUDF_SNOWFLAKE_EXAMPLE;

CREATE OR REPLACE FILE FORMAT my_parquet_format TYPE = 'PARQUET';

CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://rapidsai-data/datasets/nyc_parking/'
  FILE_FORMAT = my_parquet_format;

-- Infer schema from parquet file to use when creating table later --
SELECT COLUMN_NAME, TYPE
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_s3_stage',
    FILE_FORMAT => 'my_parquet_format',
    FILES => ('nyc_parking_violations_2022.parquet')
  )
);

-- Create table using the inferred schema in the previous step --
CREATE OR REPLACE TABLE NYC_PARKING_VIOLATIONS
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@my_s3_stage',
      FILE_FORMAT => 'my_parquet_format',
      FILES => ('nyc_parking_violations_2022.parquet')
    )
  ));

-- Get data from the stage into the table --
COPY INTO NYC_PARKING_VIOLATIONS
FROM @my_s3_stage
FILES = ('nyc_parking_violations_2022.parquet')
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

### Ensure access from container

During the process of deploying RAPIDS in Snowflake, you created a `CONTAINER_USER_ROLE`, and we need to make sure this role has access to the database, schema, and table where the data lives so that it can query it.

```sql
-- Ensure the role has USAGE permissions on the database and schema
GRANT USAGE ON DATABASE CUDF_SNOWFLAKE_EXAMPLE TO ROLE CONTAINER_USER_ROLE;
GRANT USAGE ON SCHEMA CUDF_SNOWFLAKE_EXAMPLE.PUBLIC TO ROLE CONTAINER_USER_ROLE;

-- Ensure the role has SELECT permission on the table
GRANT SELECT ON TABLE CUDF_SNOWFLAKE_EXAMPLE.PUBLIC.NYC_PARKING_VIOLATIONS TO ROLE CONTAINER_USER_ROLE;
```

### Read data and play around

Now that you have the data in a Snowflake table, and the RAPIDS Snowpark container up and running, create a new notebook in the `workspace` directory (anything that is added to this directory will persist), and follow the instructions below.

![](images/snowflake_jupyter.png)

### Load cudf.pandas

In the first cell of your notebook, load the `cudf.pandas` extension:

```ipython3
%load_ext cudf.pandas
```

### Connect to Snowflake and create a Snowpark session

```ipython3
import os
from pathlib import Path

from snowflake.snowpark import Session

connection_parameters = {
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "host": os.getenv("SNOWFLAKE_HOST"),
    "token": Path("/snowflake/session/token").read_text(),
    "authenticator": "oauth",
    "database": "CUDF_SNOWFLAKE_EXAMPLE",  # the created database
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}

session = Session.builder.configs(connection_parameters).create()

# Check the session
print(
    f"Current session info: Warehouse: {session.get_current_warehouse()} "
    f"Database: {session.get_current_database()} "
    f"Schema: {session.get_current_schema()} "
    f"Role: {session.get_current_role()}"
)
```

```ipython3
# Get some interesting columns from the table
table = session.table("NYC_PARKING_VIOLATIONS").select(
    "Registration State",
    "Violation Description",
    "Vehicle Body Type",
    "Issue Date",
    "Summons Number",
)
table
```

Notice that up to this point, we have a Snowpark dataframe. To get a pandas dataframe, we use `.to_pandas()`.

#### WARNING

At the moment, there is a known issue that is preventing us from accelerating the following step with cuDF, and we hope to solve this issue soon.
In the meantime, we need a workaround to get the data into a pandas dataframe that `cudf.pandas` can understand.

```ipython3
from cudf.pandas.module_accelerator import disable_module_accelerator

with disable_module_accelerator():
    df = table.to_pandas()

import pandas as pd

df = pd.DataFrame(df)  # this will take a few seconds
```

In the future, the cell above will reduce to simply doing `df = table.to_pandas()`. But for now, we are ready to see `cudf.pandas` in action.

For the record, this dataset has `len(df) = 15435607`, and you should see the following operations take on the order of milliseconds to run.

**Which parking violation is most commonly committed by vehicles from various U.S. states?**

Each record in our dataset contains the state of registration of the offending vehicle, and the type of parking offence. Let’s say we want to get the most common type of offence for vehicles registered in different states. We can do:

```ipython3
%%time
(
    df[["Registration State", "Violation Description"]]  # get only these two columns
    .value_counts()  # get the count of offences per state and per type of offence
    .groupby("Registration State")  # group by state
    .head(1)  # get the first row in each group (the type of offence with the largest count)
    .sort_index()  # sort by state name
    .reset_index()
)
```

**Which vehicle body types are most frequently involved in parking violations?**

We can also investigate which vehicle body types most commonly appear in parking violations:

```ipython3
%%time
(
    df.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)
```

**How do parking violations vary across days of the week?**

```ipython3
%%time
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()
```

## Conclusion

With `cudf.pandas` you can GPU-accelerate workflows that involve data in a Snowflake table by just reading it into a pandas dataframe. When things start to get a little slow, just load `cudf.pandas` and run your existing code on a GPU!

To learn more, we encourage you to visit [rapids.ai/cudf-pandas](https://rapids.ai/cudf-pandas/).