Notes update

## Deploy remaining services + their NFS mounts
- [x] Jellyfin + architecture selector
- [x] Jellyfin
- [x] QBitTorrent
- [x] Filebrowser
## [EXTRA] Deploy new slave node on the Proxmox server (slave04)
Decided to add ANOTHER VM as a slave to allow some flexibility between x64 nodes.
- [ ] Configure the VM to have Hardware Acceleration [0] [1]
- [x] Created the VM and installed the OS
- [ ] ?
- [ ] Done
- [x] Set up GPU passthrough for the newly created VM (see the sketch below)
- [x] Created a Kubernetes Node
- [x] Done
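For reference, passing the GPU through to the VM from the Proxmox host CLI boils down to something like the sketch below; see [0] and [1] for the full walkthrough. The VM ID `104` and PCI address `01:00` are placeholders, and `pcie=1` assumes the VM uses the q35 machine type.
```shell
# Find the GPU's PCI address on the Proxmox host
lspci | grep -i nvidia
# Attach it to the VM (placeholder VM ID 104, placeholder address 01:00)
qm set 104 --hostpci0 01:00,pcie=1
```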
## Make the GPU available in the Kubernetes Node
Very much what the title says. Steps below.
- [x] Done
### Install nvidia drivers
> **Note:**
> - Steps were performed in the VM instance (slave04). \
> - Snapshots were taken on the Proxmox node, targeting the affected VM. \
> - `kubectl` commands were run from a computer of mine external to the Kubernetes cluster/nodes, to interact with the cluster.
#### Take snapshot
- [x] Done
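Snapshots can be taken from the Proxmox web UI or from the host CLI; a minimal sketch (VM ID and snapshot name are made up):
```shell
# Placeholder VM ID 104 and snapshot name
qm snapshot 104 pre-nvidia-driver
```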
#### Repo thingies
Enable the `non-free` repo for Debian.
`non-free` and `non-free-firmware` are different things, so if `non-free-firmware` is already listed but `non-free` isn't, add `non-free` plus `contrib`.
```md
FROM:
deb http://ftp.au.debian.org/debian/ buster main
TO:
deb http://ftp.au.debian.org/debian/ buster main non-free contrib
```
[0] https://www.wundertech.net/how-to-set-up-gpu-passthrough-on-proxmox/
[1] https://www.virtualizationhowto.com/2023/10/proxmox-gpu-passthrough-step-by-step-guide/
In my case that was enabled during the installation.
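If the components still need to be added, a one-liner along these lines can do it (just a sketch; review `/etc/apt/sources.list` afterwards):
```shell
# Append "non-free contrib" to plain "deb ... main" lines, keeping a backup of the file
sudo sed -i.bak '/^deb .* main$/ s/$/ non-free contrib/' /etc/apt/sources.list
```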
Once the repos are set up, run:
```shell
apt update && apt install nvidia-detect -y
```
##### [Error] Unable to locate package nvidia-detect
Ensure both `non-free` and `contrib` are present in the repo file (`/etc/apt/sources.list`).
#### Run nvidia-detect
```shell
nvidia-detect
```
```text
Detected NVIDIA GPUs:
00:10.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
Checking card: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
Your card is supported by all driver versions.
Your card is also supported by the Tesla drivers series.
Your card is also supported by the Tesla 470 drivers series.
It is recommended to install the
nvidia-driver
package.
```
### Install nvidia driver
```shell
apt install nvidia-driver
```
We might receive a complaint regarding "conflicting modules" (usually the in-use `nouveau` driver); just restart the VM.
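To see which of the conflicting modules is currently loaded (typically `nouveau` before the reboot, `nvidia` after):
```shell
lsmod | grep -e nouveau -e nvidia
```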
#### Reboot VM
```shell
reboot
```
#### nvidia-smi
Confirm that the VM has access to the NVIDIA driver/GPU.
```shell
nvidia-smi
```
```text
Fri Dec 15 00:00:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:00:10.0 Off | N/A |
| 0% 38C P8 11W / 160W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
### Install Nvidia Container Runtime
#### Take snapshot
- [x] Done
#### Install curl
```shell
apt-get install curl
```
#### Add repo
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt
```shell
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
```shell
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```
### Update Containerd config
#### Select nvidia-container-runtime as the new runtime for Containerd
> No clue whether this is actually a requirement, since I made further changes to the configuration afterwards.
```shell
sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml
```
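For reference, the container toolkit also ships a helper that can patch `/etc/containerd/config.toml` itself; I did not use it here, but it is an alternative to the `sed` above:
```shell
# Registers the nvidia runtime in the containerd config and marks it as default
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
```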
#### Restart the Containerd service
```shell
sudo systemctl restart containerd
```
#### Check status from Containerd
Check if Containerd has initialized correctly after restarting the service.
```shell
sudo systemctl status containerd
```
### Test nvidia runtime
#### Pull nvidia cuda image
I used the Ubuntu-based image since I didn't find one specific to Debian.
```shell
sudo ctr images pull docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04
```
```text
docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:0654b44e2515f03b811496d0e2d67e9e2b81ca1f6ed225361bb3e3bb67d22e18: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:7d8fdd2a5e96ec57bc511cda1fc749f63a70e207614b3485197fd734359937e7: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:25ad149ed3cff49ddb57ceb4418377f63c897198de1f9de7a24506397822de3e: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:1698c67699a3eee2a8fc185093664034bb69ab67c545ab6d976399d5500b2f44: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:d13839a3c4fbd332f324c135a279e14c432e90c8a03a9cedc43ddf3858f882a7: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:ba7b66a9df40b8a1c1a41d58d7c3beaf33a50dc842190cd6a2b66e6f44c3b57b: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:c5f2ffd06d8b1667c198d4f9a780b55c86065341328ab4f59d60dc996ccd5817: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 4.2 s
```
#### Start container
Confirm that Containerd already has access to the NVIDIA GPU/drivers.
```shell
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi nvidia-smi
```
```text
Thu Dec 14 23:18:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:00:10.0 Off | N/A |
| 0% 41C P8 11W / 160W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
### Make the GPU available in the Kubernetes Node
We **still** don't have the GPU exposed as an allocatable resource in the Node.
```shell
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```
```text
Node Available(GPUs) Used(GPUs)
pi4.filter.home 0 0
slave01.filter.home 0 0
slave02.filter.home 0 0
slave03.filter.home 0 0
slave04.filter.home 0 0
```
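The long pipeline above is just for pretty output; a shorter way to check the allocatable GPU count per node is something like:
```shell
# Nodes without the resource show <none>
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUs:.status.allocatable.nvidia\.com/gpu'
```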
#### Update the Containerd config
Set the Containerd config with the following settings.
Obviously, back up the config file before modifying it.
```toml
# /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```
#### Restart containerd (again)
```shell
sudo systemctl restart containerd
```
#### Check status from Containerd
Check if Containerd has initialized correctly after restarting the service.
```shell
sudo systemctl status containerd
```
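Optionally, dump the merged config to confirm the `nvidia` runtime is actually registered:
```shell
sudo containerd config dump | grep -A 6 'runtimes.nvidia'
```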
#### Set some labels to avoid spread
We will deploy the NVIDIA operator CRDs, so we tag the Kubernetes nodes that **won't** have a GPU available, to avoid running GPU-related workloads on them.
```shell
kubectl label nodes slave0{1..3}.filter.home nvidia.com/gpu.deploy.operands=false
```
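To double-check where the label landed:
```shell
kubectl get nodes -L nvidia.com/gpu.deploy.operands
```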
#### Deploy nvidia operators
"Why this `--set` flags?"
- Cause that's what worked out for me. Don't like it? Want to explore? Just try which combination works for you idk.
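If the `nvidia` Helm repo isn't registered yet on the machine running `helm`, it can be added first (this is the standard NVIDIA chart repo):
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```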
```shell
helm install --wait --generate-name \
nvidia/gpu-operator \
--set operator.defaultRuntime="containerd"\
-n gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
```
### Check running pods
Check all the pods are running (or have completed)
```shell
kubectl get pods -n gpu-operator -owide
```
```text
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-4nctr 1/1 Running 0 9m34s 172.16.241.67 slave04.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-gc-79d6bb94h6fht 1/1 Running 0 9m57s 172.16.176.63 slave03.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-master-64c5nwww4 1/1 Running 0 9m57s 172.16.86.110 pi4.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-72wqk 1/1 Running 0 9m57s 172.16.106.5 slave02.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-7snt4 1/1 Running 0 9m57s 172.16.86.111 pi4.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-9ngnw 1/1 Running 0 9m56s 172.16.176.5 slave03.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-csnfq 1/1 Running 0 9m56s 172.16.241.123 slave04.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-k6dxf 1/1 Running 0 9m57s 172.16.247.8 slave01.filter.home <none> <none>
gpu-operator-fcbd9bbd7-fv5kb 1/1 Running 0 9m57s 172.16.86.116 pi4.filter.home <none> <none>
nvidia-cuda-validator-xjfkr 0/1 Completed 0 5m37s 172.16.241.126 slave04.filter.home <none> <none>
nvidia-dcgm-exporter-q8kk4 1/1 Running 0 9m35s 172.16.241.125 slave04.filter.home <none> <none>
nvidia-device-plugin-daemonset-vvz4c 1/1 Running 0 9m35s 172.16.241.127 slave04.filter.home <none> <none>
nvidia-operator-validator-8899m 1/1 Running 0 9m35s 172.16.241.124 slave04.filter.home <none> <none>
```
### Done!
```shell
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```
```text
Node Available(GPUs) Used(GPUs)
pi4.filter.home 0 0
slave01.filter.home 0 0
slave02.filter.home 0 0
slave03.filter.home 0 0
slave04.filter.home 1 0
```
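As an extra sanity check (not part of the steps above), a throwaway pod that requests `nvidia.com/gpu` should get scheduled on slave04 and run `nvidia-smi` successfully; the pod name below is made up:
```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
# Once it completes, the nvidia-smi output should appear in the logs
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test
```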
### vGPU
I could use vGPU and split my GPU among multiple VMs, but that would also mean the GPU no longer outputs to the physical monitor attached to the Proxmox PC/server, which I would like to avoid.
Meanwhile, it's certainly not a requirement (I only use the monitor in emergencies, or whenever I need to touch the BIOS or install a new OS), but I **still** don't own a serial connector, so I will consider switching to vGPU **in the future** (whenever I receive the package from AliExpress and confirm it works).
[//]: # (```shell)
[//]: # (kubectl events pods --field-selector status.phase!=Running -n gpu-operator)
[//]: # (```)
[//]: # ()
[//]: # (```shell)
[//]: # (kubectl get pods --field-selector status.phase!=Running -n gpu-operator | awk '{print $1}' | tail -n +2 | xargs kubectl events -n gpu-operator pods)
[//]: # (```)
## Jellyfin GPU Acceleration
- [x] Configured Jellyfin with GPU acceleration (sketch below)
- [ ] Apply the same steps for the previously deployed VM01
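For reference, the Kubernetes side of it is mostly requesting the GPU resource on the Jellyfin container (deployment, namespace and container names below are hypothetical), plus enabling NVENC under Jellyfin's playback settings:
```shell
# Hypothetical names: deployment "jellyfin", namespace "jellyfin", container "jellyfin"
kubectl -n jellyfin patch deployment jellyfin --patch \
  '{"spec":{"template":{"spec":{"containers":[{"name":"jellyfin","resources":{"limits":{"nvidia.com/gpu":1}}}]}}}}'
```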
## Deploy master node on the Proxmox server
2 Cores + 4GB RAM
## Update the rest of the stuff/configs as required to match the new network distribution