GPU Access¶
Preface¶
NREC has kindly provided us with vGPU machines, which we are now starting to make available to RAIL users.
The magic happens when we enable the GPU operator in Elastisys’ welkin-apps, which then configures the nvidia-container-toolkit and NVIDIA’s k8s-device-plugin. This toolkit contains a special container runtime that lets pods running on the node access GPU resources when they request them.
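In practice, a container gets GPU access by requesting one of the GPU resources the device plugin advertises. A minimal sketch of such a request follows; the container name and image are placeholders, and the resource name is taken from the resource list at the bottom of this page:

containers:
  - name: gpu-example            # placeholder container name
    image: registry.example/app  # placeholder image
    resources:
      limits:
        nvidia.com/gpu.shared: 1 # one time slice of a shared GPU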
Access¶
Ask in the RAIL teams chat for permission to request GPU resources in your namespace(s).
Permissions¶
On the RAIL platform, containers are as locked down as possible. Nothing prevents you from using the GPU in a locked-down container, but the container image may not have been built with that in mind. If you can build the image to run unprivileged, that is preferred; otherwise, you can also reduce the capabilities of a pod like this:
securityContext:
  privileged: false
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
Taints¶
We have dedicated VMs for GPU workloads, and most pods should not be scheduled on these nodes unless required. To achieve this, the nodes are tainted, and GPU workloads need to add a toleration (shown below) to be scheduled on a node with GPU access. Adding this toleration may become automated at some point.
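For reference, this is the toleration used in the full pod example further down:

tolerations:
  - key: elastisys.io/node-type  # taint on the dedicated GPU nodes
    operator: Equal
    value: gpu
    effect: NoSchedule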
Time slicing¶
For a lot of use cases a dedicated GPU is overkill, so we enable time slicing in the NVIDIA device plugin. This splits the GPU into multiple resources on the k8s side. The drawback is the lack of memory isolation, so be careful not to use too much memory. Eventually we want to add an out-of-memory daemon, but for now we ask you to avoid using more memory than VRAM / time slice count; on an L40S with 25GB of VRAM split into 8 slices, that works out to roughly 3GB per workload. With TensorFlow, for example, there is a memory_limit you can set.
When using a sliced GPU, the number in the resource request does not indicate how much of the GPU you can make use of, only whether you can access it at all; for this reason, manifests requesting more than one slice will be rejected. Also keep in mind that the GPU switches evenly among all the processes running on it.
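As a sketch, the resources section of a container using one slice of the shared GPU could look like this (the resource name comes from the resource list at the bottom of this page, and the CPU/memory limits mirror the full example below):

resources:
  limits:
    cpu: 100m                # ordinary CPU/memory limits still apply
    memory: 200Mi
    nvidia.com/gpu.shared: 1 # exactly one slice; requests for more are rejected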
See NVIDIA’s documentation for details about the drawbacks and upsides of time slicing.
Full pod example¶
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: adm-it-yournamespace
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      securityContext:
        privileged: false
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
          nvidia.com/mig-3g.20gb.shared: 1
  tolerations:
    - effect: NoSchedule
      key: elastisys.io/node-type
      operator: Equal
      value: gpu
This pod executes a very simple CUDA payload, which lets you verify that the GPU works. Apply it with kubectl and check the pod’s logs to see whether the test succeeded.
Resource list¶
| GPU | Resource Name | Count | Time slices | VRAM |
| --- | --- | --- | --- | --- |
| NVIDIA L40S | nvidia.com/gpu.shared | 1 | 8 | 25GB |

| GPU | Resource Name | Count | Time slices | VRAM |
| --- | --- | --- | --- | --- |
| NVIDIA L40S | nvidia.com/gpu.shared | 5 | 5 | 25GB |
| NVIDIA L40S | nvidia.com/gpu | 3 | 1 | 25GB |