Deploying GPUs with QHub

Updated: Mar 18

At Quansight, we have developed QHub, a framework for deploying data science stacks that simplifies their initialization and maintenance. Quansight uses QHub actively in multiple client projects as a tool for running data science and machine learning workloads. QHub uses Terraform to make the deployment of JupyterHub, JupyterLab, Conda environments, and Dask on a Kubernetes cluster declarative. It provides Linux permissioning to facilitate easy collaboration among multiple users and groups. QHub also has a Jitsi videoconference plugin enabled, which makes it an excellent platform for live collaboration and training. Deployment is a one-step process powered by GitHub Actions: you change a single configuration file, merge or push it to the GitHub repository, and you are done. Currently, we have good support for QHub on AWS, GCP, and Digital Ocean.


Problem statement


We needed to get GPUs up and running for a recent client training engagement highlighting PyTorch and OpenCV. We summarize here our approach to enabling GPU support for QHub on GCP.


Solution


We decompose the problem into smaller, more manageable subproblems:

  1. Assigning GPU quotas on GCP

  2. Adding GPUs to the Kubernetes nodes

  3. Installing Nvidia drivers on each node

  4. Adding a profile to JupyterHub

  5. Managing scheduling issues

  6. Verifying the Conda environment

Assigning GPU quotas on GCP


Setting GPU quotas on GCP is simple, and all of the required changes can be made from the Google Cloud console: under the IAM & Admin menu, open the Quotas page and request an increase for the relevant GPU quota.
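If the gcloud CLI is configured, the current accelerator quota for a region can also be checked from the command line (us-central1 below is just an example region; the grep picks out the NVIDIA GPU metrics with their limit and usage):


$ gcloud compute regions describe us-central1 | grep -B 1 -A 1 NVIDIA  # region is illustrative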


QHub Architecture on GCP


The architecture of QHub comprises three major components: the Terraform-state, the Kubernetes-cluster, and the Service account.

  • Terraform-state: This is a file stored in a GCP storage bucket that keeps track of the Terraform deployment and the state of the deployed resources.

  • Kubernetes-cluster: This consists of all the Kubernetes components, i.e., nodes running pods, services, ingress, egress, persistent volumes, etc. JupyterHub, Dask, and KubeSSH are deployed on top of this.

  • Service Account: This is the account through which all of the GCP resources are accessed and managed.

Deploying GPUs requires digging deeper into the Kubernetes-cluster component, namely the Kubernetes-cluster/GKE.


Adding GPUs to the Kubernetes nodes


This requires editing the Terraform config file to define appropriate node groups for the node pool; an example is shown here.


node_groups = [
    # jupyterhub => hub, userscheduler and other pods
    {
      name          = "general"
      instance_type = "n1-standard-2"
      min_size      = 1
      max_size      = 1
    },
    # jupyterlab pods
    {
      name          = "user"
      instance_type = "n1-standard-2"
      min_size      = 0
      max_size      = 5
      guest_accelerators = [
        {
          type  = "nvidia-tesla-p4"
          count = 1
        }
      ]
    },
    # dask-worker pods
    {
      name          = "worker"
      instance_type = "n1-standard-2"
      min_size      = 0
      max_size      = 6
    }
]

The Terraform documentation provides more information about configuring a node pool. We also have to specify a guest_accelerator block in qhub-terraform-modules/modules/gcp/kubernetes/main.tf, as shown below.


resource "google_container_node_pool" "main" {
  ...
  node_config {
    ...
    dynamic "guest_accelerator" {
      for_each = local.merged_node_groups[count.index].guest_accelerators
      content {
        type  = guest_accelerator.value.type
        count = guest_accelerator.value.count
      }
    }
  }
}


Installing Nvidia drivers on each node


To install the Nvidia drivers, we create a daemonset, which ensures that all (or some) nodes run a copy of a driver-installer pod. As nodes are added to the cluster, pods are added to them; as nodes are removed from the cluster, those pods are garbage collected. The Terraform code for daemonset deployment on QHub can be found here.
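For reference, the equivalent manual step on GKE (outside of the QHub Terraform modules) is to apply the driver-installer daemonset that Google publishes in its container-engine-accelerators repository; for Container-Optimized OS nodes that is roughly:


$ # manual sketch; QHub performs the equivalent step through Terraform
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml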


Adding a profile to JupyterHub


The next step is to create a QHub/JupyterHub profile providing an option to create a JupyterLab session with a GPU guarantee. For this, we modify the config.yaml/jupyterhub.yaml file to override JupyterHub's default configuration (as specified in the Helm chart). Each profile is a set of options for Kubespawner, which defines how Kubernetes should launch a new user server pod. Any configuration options passed to the profileList configuration will overwrite the defaults in Kubespawner (or any configuration added elsewhere in the Helm chart). For more information, check the documentation for Zero to JupyterHub with Kubernetes.


QHUB_PROFILES = [{
    'display_name': 'GPU Instance',
    'description': 'Stable environment with 4 cpu / 8 GB ram and 1 GPU',
    'kubespawner_override': {
        'cpu_limit': 4,
        'cpu_guarantee': 4,
        'mem_limit': '8G',
        'mem_guarantee': '8G',
        'image': 'gcr.io/qhub-training/qhub-quansight-training/qhub-jupyterlab:b2f154a6de5b08a8a09ad85c24321ad931e11372',
        'extra_resource_limits': {
            'nvidia.com/gpu': 1  # request exactly one GPU for the spawned pod
        }
    }
}]
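The extra_resource_limits entry is what ends up as an nvidia.com/gpu resource limit on the spawned user pod. Once a GPU session is running, this can be checked with kubectl (the pod name and namespace below are placeholders):


$ # the limits on the user container should include nvidia.com/gpu: 1
$ kubectl get pod <jupyter-user-pod> -n <namespace> -o jsonpath='{.spec.containers[0].resources.limits}'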


Managing scheduling issues


Our initial attempts to launch the spawner failed due to some JupyterHub customizations in QHub. Specifically, our default QHub configuration runs the user-scheduler pods on the user nodes. However, after adding GPUs to the user nodes, those nodes acquire an nvidia.com/gpu taint. Taints are a property of nodes that lets a node repel a set of pods; they ensure that pods are not scheduled onto inappropriate nodes.
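The taint can be inspected directly with kubectl (the node name below is a placeholder); GKE applies it automatically when a node pool has accelerators attached:


$ # GPU node pools carry a taint similar to nvidia.com/gpu=present:NoSchedule
$ kubectl describe node <gpu-user-node> | grep Taints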


Tolerations are applied to pods and allow (but do not require) the pods to schedule onto nodes with matching taints. Because the user-scheduler pods have no toleration for nvidia.com/gpu, they became unschedulable on the user nodes. We could either add a toleration to the user-scheduler pods or make them run on the general node; we chose the latter option. With scheduling resolved, we can confirm from a terminal inside a spawned user pod that the daemonset installed the driver correctly:


$ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
$ export PATH=${PATH}:/usr/local/nvidia/bin
$ nvidia-smi


Verifying the Conda environment


With the GPUs in place, we need to add libnvidiacuda-dev to install other tools such as nvcc, the cuBLAS libraries, and other useful GPU-dependent packages. We had previously installed libnvidiacuda but were still unable to use, for example, PyTorch or TensorFlow on our GPU-enabled system.


We use the environment management tool Conda to manage the installation of various libraries and packages on QHub (using, in particular, a custom open-source package, Conda-Store). Conda environments enable distinct versions of packages (including ones requiring pre-built binaries and their associated dependencies) to co-exist cleanly on the same system. This enables using, for example, custom applications that rely on completely distinct versions of packages from the Python data science stack (e.g., NumPy, Pandas, Scikit-Learn, PyTorch, TensorFlow, etc.), or even distinct versions of Python. Enabling custom, ad-hoc, user-specified software environments is a very difficult problem in general; Conda-Store supports building Conda environments on QHub declaratively and cleanly.
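As a quick sketch (the package list and CUDA toolkit version are illustrative, not the exact environment used for the training engagement), a GPU-enabled PyTorch environment can be created with Conda roughly as follows; on QHub itself, Conda-Store builds such environments from a declarative specification rather than from ad-hoc commands:


$ # illustrative only: PyTorch + OpenCV with a CUDA runtime from the pytorch and conda-forge channels
$ conda create -n gpu-training -c pytorch -c conda-forge pytorch torchvision cudatoolkit=10.2 opencv
$ conda activate gpu-training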


Here's an example script, run from a custom Conda environment that includes PyTorch, that provides a quick test that the GPUs are in fact enabled correctly.


panand@quansight.com@jupyter-panand-40quansight-2ecom:~$ cat torch_cuda_test.py
import torch, time
result = torch.cuda.is_available()
print('CUDA available (T/F):', result)
result = torch.cuda.device_count()
print('Number of CUDA devices available:', result)
a = torch.ones(10)
print(f'a.device: {a.device}')
N = 60000000
a = torch.ones((N))
print(f'a.device: {a.device}')
start_time = time.time()
print(a.sum().item())
end_time = time.time()
print('It took {} seconds'.format(end_time - start_time))
print('a is sitting on', a.device)
is_cuda = torch.cuda.is_available()
if not is_cuda:
    print('Nothing to do here')
else:
    a_ = a.cuda()
    print(f'a_.device: {a_.device}')
    start_time = time.time()
    print(a_.sum())
    end_time = time.time()
    print('It took {} seconds'.format(end_time - start_time))
    print('a is sitting on', a_.device)
panand@quansight.com@jupyter-panand-40quansight-2ecom:~$ python torch_cuda_test.py
CUDA available (T/F): True
Number of CUDA devices available: 1
a.device: cpu
a.device: cpu
60000000.0
It took 0.02083444595336914 seconds
a is sitting on cpu
a_.device: cuda:0
tensor(60000000., device='cuda:0')
It took 0.07117819786071777 seconds
a is sitting on cuda:0

Summary

QHub is still a relatively new open-source project, so some glitches are to be expected. With the process outlined above, however, you can enable GPU use with a QHub deployment. Quansight is actively developing new features; you are welcome to get involved by contributing to QHub directly or through any related upstream projects:

  • https://qhub.dev

  • https://github.com/Quansight/qhub

  • https://gitter.im/Quansight/qhub

If you would like help installing, supporting, or building on top of QHub or JupyterHub in your organization, please reach out to Quansight for a free consultation by sending an email to connect@quansight.com.

