From 5f3c3b0e91cda664c100365f95422059f8d3131b Mon Sep 17 00:00:00 2001
From: savagebidoof
Date: Fri, 15 Dec 2023 04:42:23 +0100
Subject: [PATCH] Notes update
---
 Migrations/Say_HI_to_Proxmox/README.md | 368 ++++++++++++++++++++++++-
 1 file changed, 353 insertions(+), 15 deletions(-)

diff --git a/Migrations/Say_HI_to_Proxmox/README.md b/Migrations/Say_HI_to_Proxmox/README.md
index 54b9cea..09103b9 100644
--- a/Migrations/Say_HI_to_Proxmox/README.md
+++ b/Migrations/Say_HI_to_Proxmox/README.md
@@ -127,43 +127,381 @@ Made this Ansible script:

## Deploy remaining services + their NFS mounts

- [x] Jellyfin
- [x] QBitTorrent
- [x] Filebrowser

## [EXTRA] Deploy new slave node on the Proxmox server (slave04)

Decided to add ANOTHER VM as a slave to allow some flexibility between x64 nodes.

- [x] Created the VM and installed the OS
- [x] Set up GPU passthrough for the newly created VM
- [x] Created a Kubernetes Node
- [x] Done

References used for the GPU passthrough setup:

- https://www.wundertech.net/how-to-set-up-gpu-passthrough-on-proxmox/
- https://www.virtualizationhowto.com/2023/10/proxmox-gpu-passthrough-step-by-step-guide/

## Set up the GPU available in the Kubernetes Node

Very much what the title says. Steps below.

- [x] Done

### Install nvidia drivers

> **Note:**
> - Steps were performed in the VM instance (Slave04).
> - Snapshots were taken on the Proxmox node, snapshotting the affected VM.
> - `kubectl` commands were run from a computer of mine external to the Kubernetes Cluster/Nodes to interact with the Kubernetes Cluster.

#### Take snapshot

- [x] Done

#### Repo thingies

Enable the `non-free` repo for Debian, i.e. add it to the repository entries in `/etc/apt/sources.list`.

`non-free` and `non-free-firmware` are different things, so if `non-free-firmware` is already listed but `non-free` is not, slap `non-free` in there, together with `contrib`.

```text
FROM:
deb http://ftp.au.debian.org/debian/ buster main
TO:
deb http://ftp.au.debian.org/debian/ buster main non-free contrib
```

In my case that was already enabled during the installation.

Once the repos are set up, use:

```shell
apt update && apt install nvidia-detect -y
```

##### [Error] Unable to locate package nvidia-detect

Ensure both `non-free` and `contrib` are in the repo file.

(File: /etc/apt/sources.list)

#### Run nvidia-detect

```shell
nvidia-detect
```
```text
Detected NVIDIA GPUs:
00:10.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)

Checking card: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
Your card is supported by all driver versions.
Your card is also supported by the Tesla drivers series.
Your card is also supported by the Tesla 470 drivers series.
It is recommended to install the
    nvidia-driver
package.
```

### Install nvidia driver

```shell
apt install nvidia-driver
```

We might receive a complaint regarding "conflicting modules" (typically the in-tree `nouveau` driver that is still loaded).

Just restart the VM.
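If you want to see what is actually conflicting before rebooting, a quick look at the loaded kernel modules usually answers it. An optional sanity check (nothing here is required for the procedure):

```shell
# Is the in-tree nouveau driver still loaded? (the usual conflict)
lsmod | grep -i nouveau

# The packaged driver normally drops a nouveau blacklist under /etc/modprobe.d/
grep -ri nouveau /etc/modprobe.d/
```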
+ +#### Reboot VM + +```shell +reboot +``` + +#### nvidia-smi + +VM has access to the Nvidia drivers/GPU + +```shell +nvidia-smi +``` + +```text +Fri Dec 15 00:00:36 2023 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|===============================+======================+======================| +| 0 NVIDIA GeForce ... On | 00000000:00:10.0 Off | N/A | +| 0% 38C P8 11W / 160W | 1MiB / 4096MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ + ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| No running processes found | ++-----------------------------------------------------------------------------+ +``` + +### Install Nvidia Container Runtime + +#### Take snapshot + +- [x] Done + +#### Install curl + +```shell +apt-get install curl +``` + +#### Add repo + +https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt + +```shell +curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ + && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ + sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ + sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list ``` ```shell -echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf - - - +sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit ``` +### Update Containerd config + +#### Select nvidia-container-runtime as new runtime for Containerd + +> No clue if this is a requirement! as afterward also did more changes to the configuration. + +```shell +sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml +``` + +#### Reboot Containerd service + +```shell +sudo systemctl restart containerd +``` + +#### Check status from Containerd + +Check if Containerd has initialized correctly after restarting the service. + +```shell +sudo systemctl status containerd +``` + +### Test nvidia runtime + +#### Pull nvidia cuda image + +I used the Ubuntu based container since I didn't find one specific for Debian. 
+ +```shell +sudo ctr images pull docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04 +``` + +```text +docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04: resolved |++++++++++++++++++++++++++++++++++++++| +index-sha256:0654b44e2515f03b811496d0e2d67e9e2b81ca1f6ed225361bb3e3bb67d22e18: done |++++++++++++++++++++++++++++++++++++++| +manifest-sha256:7d8fdd2a5e96ec57bc511cda1fc749f63a70e207614b3485197fd734359937e7: done |++++++++++++++++++++++++++++++++++++++| +layer-sha256:25ad149ed3cff49ddb57ceb4418377f63c897198de1f9de7a24506397822de3e: done |++++++++++++++++++++++++++++++++++++++| +layer-sha256:1698c67699a3eee2a8fc185093664034bb69ab67c545ab6d976399d5500b2f44: done |++++++++++++++++++++++++++++++++++++++| +config-sha256:d13839a3c4fbd332f324c135a279e14c432e90c8a03a9cedc43ddf3858f882a7: done |++++++++++++++++++++++++++++++++++++++| +layer-sha256:ba7b66a9df40b8a1c1a41d58d7c3beaf33a50dc842190cd6a2b66e6f44c3b57b: done |++++++++++++++++++++++++++++++++++++++| +layer-sha256:c5f2ffd06d8b1667c198d4f9a780b55c86065341328ab4f59d60dc996ccd5817: done |++++++++++++++++++++++++++++++++++++++| +layer-sha256:520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8: done |++++++++++++++++++++++++++++++++++++++| +elapsed: 4.2 s +``` + +#### Start container + +Containerd already has access to the nvidia gpu/drivers + +```shell +sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi nvidia-smi +``` + +```text +Thu Dec 14 23:18:55 2023 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.3 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|===============================+======================+======================| +| 0 NVIDIA GeForce ... On | 00000000:00:10.0 Off | N/A | +| 0% 41C P8 11W / 160W | 1MiB / 4096MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ + ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| No running processes found | ++-----------------------------------------------------------------------------+ +``` + +### Set the GPU available in the Kubernetes Node + +We `still` don't have the GPU added/available in the Node. + +```shell +kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t +``` + +```text +Node Available(GPUs) Used(GPUs) +pi4.filter.home 0 0 +slave01.filter.home 0 0 +slave02.filter.home 0 0 +slave03.filter.home 0 0 +slave04.filter.home 0 0 +``` + +#### Update + +Set Containerd config with the following settings. + +Obv do a backup of the config before proceeding to modify the file. 
+ +```toml +# /etc/containerd/config.toml +version = 2 +[plugins] + [plugins."io.containerd.grpc.v1.cri"] + [plugins."io.containerd.grpc.v1.cri".containerd] + default_runtime_name = "nvidia" + + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] + privileged_without_host_devices = false + runtime_engine = "" + runtime_root = "" + runtime_type = "io.containerd.runc.v2" + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] + BinaryName = "/usr/bin/nvidia-container-runtime" +``` +#### Restart containerd (again) + +```shell +sudo systemctl restart containerd +``` + +#### Check status from Containerd + +Check if Containerd has initialized correctly after restarting the service. + +```shell +sudo systemctl status containerd +``` + +#### Set some labels to avoid spread + +We will deploy Nvidia CRDs so will tag the Kubernetes nodes that **won't** have a GPU available to avoid running GPU related stuff on them. + +```shell +kubectl label nodes slave0{1..3}.filter.home nvidia.com/gpu.deploy.operands=false +``` + +#### Deploy nvidia operators + +"Why this `--set` flags?" + +- Cause that's what worked out for me. Don't like it? Want to explore? Just try which combination works for you idk. + +```shell +helm install --wait --generate-name \ + nvidia/gpu-operator \ + --set operator.defaultRuntime="containerd"\ + -n gpu-operator \ + --set driver.enabled=false \ + --set toolkit.enabled=false +``` + +### Check running pods + +Check all the pods are running (or have completed) + +```shell +kubectl get pods -n gpu-operator -owide +``` +```text +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +gpu-feature-discovery-4nctr 1/1 Running 0 9m34s 172.16.241.67 slave04.filter.home +gpu-operator-1702608759-node-feature-discovery-gc-79d6bb94h6fht 1/1 Running 0 9m57s 172.16.176.63 slave03.filter.home +gpu-operator-1702608759-node-feature-discovery-master-64c5nwww4 1/1 Running 0 9m57s 172.16.86.110 pi4.filter.home +gpu-operator-1702608759-node-feature-discovery-worker-72wqk 1/1 Running 0 9m57s 172.16.106.5 slave02.filter.home +gpu-operator-1702608759-node-feature-discovery-worker-7snt4 1/1 Running 0 9m57s 172.16.86.111 pi4.filter.home +gpu-operator-1702608759-node-feature-discovery-worker-9ngnw 1/1 Running 0 9m56s 172.16.176.5 slave03.filter.home +gpu-operator-1702608759-node-feature-discovery-worker-csnfq 1/1 Running 0 9m56s 172.16.241.123 slave04.filter.home +gpu-operator-1702608759-node-feature-discovery-worker-k6dxf 1/1 Running 0 9m57s 172.16.247.8 slave01.filter.home +gpu-operator-fcbd9bbd7-fv5kb 1/1 Running 0 9m57s 172.16.86.116 pi4.filter.home +nvidia-cuda-validator-xjfkr 0/1 Completed 0 5m37s 172.16.241.126 slave04.filter.home +nvidia-dcgm-exporter-q8kk4 1/1 Running 0 9m35s 172.16.241.125 slave04.filter.home +nvidia-device-plugin-daemonset-vvz4c 1/1 Running 0 9m35s 172.16.241.127 slave04.filter.home +nvidia-operator-validator-8899m 1/1 Running 0 9m35s 172.16.241.124 slave04.filter.home +``` + +### Done! 
+ +```shell +kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t +``` + +```text +Node Available(GPUs) Used(GPUs) +pi4.filter.home 0 0 +slave01.filter.home 0 0 +slave02.filter.home 0 0 +slave03.filter.home 0 0 +slave04.filter.home 1 0 +``` + +### vGPU + +I could use vGPU and split my GPU among multiple VMs, but, it would also mean that the GPU no longer posts to the Physical Monitor attached to the Proxmox PC/Server, which I would like to avoid. + +Meanwhile, it's certainly not a requirement (and I only use the monitor on emergencies/whenever I need to touch the BIOS/Install a new OS), I **still** don't own a Serial connector, therefore I will consider making the change to use vGPU **in the future** (whenever I receive the package from Aliexpress, and I confirm it works). + + + +[//]: # (```shell) + +[//]: # (kubectl events pods --field-selector status.phase!=Running -n gpu-operator) + +[//]: # (```) + +[//]: # () +[//]: # (```shell) + +[//]: # (kubectl get pods --field-selector status.phase!=Running -n gpu-operator | awk '{print $1}' | tail -n +2 | xargs kubectl events -n gpu-operator pods) +[//]: # (```) + ## Jellyfin GPU Acceleration -- [ ] Configured Jellyfin with GPU acceleration -- [ ] Apply the same steps for the VM01 previously deployed +- [x] Configured Jellyfin with GPU acceleration ## Deploy master node on the Proxmox server - +2 Cores + 4GB RAM ## Update rest of the stuff/configs as required to match the new Network distribution