GPU Access

Preface

NREC has kindly provided us with vGPU machines, which we are now starting to make available to RAIL users.

The magic happens when we enable the GPU operator in Elastisys’ welkin-apps, which then configures the nvidia-container-toolkit and NVIDIA’s k8s-device-plugin. The toolkit contains a special container runtime that lets pods running on the node access GPU resources when they request them.
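
In practice, a container asks for GPU access through its resource limits. A rough sketch is shown below; the image name is just a placeholder, and the exact resource name depends on the cluster and whether the GPU is time sliced (see the resource list at the end of this page):

containers:
- name: my-gpu-workload
  image: my-image # placeholder, substitute your own image
  resources:
    limits:
      nvidia.com/gpu: 1 # or nvidia.com/gpu.shared on a time-sliced GPU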

Access

Ask in the RAIL Teams chat for permission to request GPU resources in your namespace(s).

Permissions

On the RAIL platform, containers are as locked down as possible. Nothing prevents you from using the GPU in a locked-down container, but the container image may not have been built with that in mind. Building the image to run unprivileged is preferred, but you can also reduce the capabilities of a pod like this:

securityContext:
  privileged: false
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

Taints

We have dedicated VMs for GPU workloads, so most pods should not be scheduled on these nodes unless required. To achieve this the nodes are tainted, and GPU workloads need to add a toleration (shown below) to be scheduled on a node with GPU access. Adding this toleration may become automated at some point.
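
The taint uses the key elastisys.io/node-type with the value gpu, so the matching toleration looks like this (it is the same toleration used in the full pod example further down):

tolerations:
  - effect: NoSchedule
    key: elastisys.io/node-type
    operator: Equal
    value: gpu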

Time slicing

For a lot of use cases a dedicated GPU is overkill, so we enable time slicing in the NVIDIA device plugin. This splits the GPU into multiple resources on the Kubernetes side. The drawback is the lack of memory isolation, so be careful not to use too much memory. Eventually we want to add an out-of-memory daemon, but for now we ask you to avoid using more memory than VRAM / time slice count. For example, with TensorFlow there is a memory_limit option you can set.

When using a sliced GPU, the number in the resource request does not indicate how much of the GPU you can make use of, only whether you can access it at all; for this reason, manifests requesting more than one slice will be rejected. Also keep in mind that the GPU switches evenly among all the processes running on it.
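
In other words, a pod on a time-sliced GPU should request exactly one slice. A minimal sketch of the container resources, assuming the nvidia.com/gpu.shared resource name from the resource list below:

resources:
  limits:
    nvidia.com/gpu.shared: 1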

See NVIDIA’s documentation for details about the drawbacks and upsides of time slicing.

Full pod example

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: adm-it-yournamespace
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    securityContext:
      privileged: false
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
        nvidia.com/gpu.shared: 1 # one slice of the shared GPU, see the resource list below
  tolerations:
    - effect: NoSchedule
      key: elastisys.io/node-type
      operator: Equal
      value: gpu

This pod executes a very simple CUDA payload, which lets you verify that the GPU works.

Resource list

GPU Table bgo1-test

GPU           Resource Name           Count   Time slices   VRAM
NVIDIA L40S   nvidia.com/gpu.shared   1       8             25GB

GPU Table bgo1-prod

GPU           Resource Name           Count   Time slices   VRAM
NVIDIA L40S   nvidia.com/gpu.shared   5       5             25GB
NVIDIA L40S   nvidia.com/gpu          3       1             25GB