Setting up an H100 GPU TEE (Trusted Execution Environment) from scratch, while experiencing the joy of Artifact Evaluation. This post documents the pitfalls I ran into and defers to the official docs for anything they already cover well. My knowledge of VMs and GPUs was essentially zero going in. After two days, I ended up stuck at a point that requires NVIDIA support.
References
System Setting
My server has one H100 GPU and two EPYC 9144 CPUs, housed in an ASUS ESC8000A-E12 chassis.
BIOS
Most modern BIOS firmware already supports SEV (SEV-SNP). Just follow the docs to enable the relevant settings. One gotcha not clearly stated in the docs: you also need to enable IOMMU to allow PCIe device passthrough to VMs.
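Once the host boots, a quick way to confirm the IOMMU setting actually took effect is to count the IOMMU groups the kernel exposes in sysfs. This is a minimal sketch, not something from the docs: the function name iommu_group_count is my own, and it assumes the standard /sys/kernel/iommu_groups layout. Zero groups usually means the IOMMU is still disabled in the BIOS or on the kernel command line.

```python
from pathlib import Path


def iommu_group_count(root: Path = Path("/sys/kernel/iommu_groups")) -> int:
    """Count IOMMU groups under the given sysfs directory (hypothetical helper)."""
    if not root.is_dir():
        # Directory is absent when the kernel has no IOMMU support at all.
        return 0
    # Each active IOMMU group appears as one numbered subdirectory.
    return sum(1 for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    count = iommu_group_count()
    if count == 0:
        print("No IOMMU groups found; check the BIOS IOMMU setting")
    else:
        print(f"IOMMU is active with {count} groups")
```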
Host Side
Basically follow the docs to build a customized 5.19 kernel. This step is straightforward, but one thing may be missing once you boot the new kernel: the vfio-pci driver is not loaded.
It’s not a huge issue, but the docs don’t mention it, and I spent a lot of time figuring out the fix. TL;DR: just run sudo modprobe vfio-pci.
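To check whether vfio-pci is available before moving on, you can look for the driver's directory in sysfs. The sketch below is my own illustration (the helper name vfio_pci_available is made up), and it assumes the driver registers itself under /sys/bus/pci/drivers/vfio-pci, which is where a loaded or built-in PCI driver normally shows up.

```python
from pathlib import Path


def vfio_pci_available(sys_root: Path = Path("/sys")) -> bool:
    """Return True if a vfio-pci driver directory exists on the PCI bus (hypothetical helper)."""
    # A loaded module or a built-in driver both register this directory.
    return (sys_root / "bus" / "pci" / "drivers" / "vfio-pci").is_dir()


if __name__ == "__main__":
    if vfio_pci_available():
        print("vfio-pci is registered")
    else:
        print("vfio-pci is missing; try: sudo modprobe vfio-pci")
```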
Many online VFIO setup guides suggest adding amd_iommu=on iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. That’s not entirely wrong, but here’s the catch: SNP initialization does not allow the IOMMU to run in pt (passthrough) mode, so leave iommu=pt out. Use dmesg liberally to diagnose issues.
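If you have already copied settings from one of those guides, it is worth checking what the running kernel was actually booted with. Here is a minimal sketch that scans the kernel command line for iommu=pt; the function name is hypothetical, and the only system detail it relies on is that /proc/cmdline holds the boot parameters separated by whitespace.

```python
from pathlib import Path


def iommu_passthrough_requested(cmdline: str) -> bool:
    """Return True if the exact parameter iommu=pt appears on the command line."""
    # Split on whitespace so that amd_iommu=on is not mistaken for iommu=pt.
    return "iommu=pt" in cmdline.split()


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.is_file():
        if iommu_passthrough_requested(proc_cmdline.read_text()):
            print("iommu=pt is set; remove it before enabling SNP")
        else:
            print("iommu=pt is not set")
```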
Another oddity: the H100’s device ID is listed as 2336 in the docs, but my unit reported 2331. This is likely the difference between engineering and production samples. Don’t worry about it, but remember to substitute the ID your own card reports in subsequent steps.
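To see which device ID your own card reports, you can read it straight from sysfs instead of trusting the docs. The sketch below is an illustration under a few assumptions: the helper name nvidia_device_ids is mine, it expects the usual /sys/bus/pci/devices layout with vendor and device files, and it filters on 0x10de, NVIDIA's PCI vendor ID. It lists every NVIDIA function on the bus, not only GPUs.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # NVIDIA's PCI vendor ID


def nvidia_device_ids(pci_root: Path = Path("/sys/bus/pci/devices")) -> dict:
    """Map PCI address -> device ID (hex, no 0x prefix) for NVIDIA devices."""
    found = {}
    if not pci_root.is_dir():
        return found
    for dev in sorted(pci_root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            device = (dev / "device").read_text().strip()
        except OSError:
            # Skip entries that do not expose vendor/device files.
            continue
        if vendor == NVIDIA_VENDOR_ID:
            found[dev.name] = device[2:] if device.startswith("0x") else device
    return found


if __name__ == "__main__":
    for address, device_id in nvidia_device_ids().items():
        print(f"{address}: 10de:{device_id}")
```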
Guest Side
Here’s where the docs are most frustrating. They specify Ubuntu 22.04.2 LTS and provide a download link. However, the link is completely broken, and finding that specific version on Ubuntu’s site is difficult since the current release (as of 2023-08-18) is 22.04.3 LTS.
I figured one minor version shouldn’t matter. I was dead wrong. After installing the driver a million times in the guest, the GPU simply wouldn’t show up. When I finally realized the version mismatch might be the issue, I switched to 22.04.2 LTS and tried again.
Still didn’t work. The installer apparently upgraded the kernel during installation, so by the time I booted into the guest it was running 6.2 instead of the required 5.19. After more wrangling, I finally got it working.
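A check like the following, run inside the guest before installing the driver, would have saved me a few rounds. It is only a sketch: the function name and the REQUIRED constant are my own, with 5.19 taken from the requirement mentioned above, and it compares just the major and minor numbers of the running kernel.

```python
import platform
import re

REQUIRED = (5, 19)  # kernel series the setup expects, per this post


def kernel_major_minor(release: str) -> tuple:
    """Extract (major, minor) from a release string such as 6.2.0-26-generic."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    running = kernel_major_minor(platform.release())
    if running == REQUIRED:
        print("Kernel series matches 5.19")
    else:
        print(f"Kernel is {running[0]}.{running[1]}, expected 5.19")
```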
A helpful colleague shared these lifesaver links:
NVIDIA Firmware
Just when I thought everything was sorted, the firmware version turned out to be wrong. I probably bought the GPU too early, and the docs reference a newer firmware version. Waiting for NVIDIA’s response. RIP.
Complaints
- NVIDIA’s docs on GitHub and the official website docs are different!!!!