I recently had a customer complaining that they had less available framebuffer on their NVIDIA vGPU than the specs indicated.
As this is a question I frequently get from customers who invest in NVIDIA vGPUs, I wanted to write a quick blog about it.
Some additional details: the customer bought NVIDIA T4 GPUs with 16 GB of framebuffer.
They were installed in HPE ProLiant DL360s running vSphere ESXi 7.0 U3.
The issue was noticed when they tried to spin up four VMs using the T4-4A profile, which failed with an error that no suitable hosts were present.
From a VM configuration standpoint, the following specifications are configured:
As you can see in the image above, the VM is configured with an NVIDIA vGPU T4-4A profile, so each T4 should be able to host four VMs (4 × 4 GB = 16 GB). But when looking at the available graphics resources on the ESXi hosts themselves, we could see that only 15 GB was available.
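To make the mismatch concrete, a quick bit of arithmetic with the numbers from the scenario above shows why the fourth VM failed to find a suitable host:

```shell
# Numbers from the scenario above: 15 GB (15360 MB) of framebuffer actually
# available per T4, and 4 GB (4096 MB) required per T4-4A vGPU profile.
available_mb=15360
per_vm_mb=4096

# Integer division: how many 4 GB vGPU VMs actually fit on one card.
echo $((available_mb / per_vm_mb))   # prints 3, not the expected 4
```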
To make sure everything was correctly configured, I validated that the drivers on both the VM and the hypervisor were working correctly, using nvidia-smi:
This all looked fine. Generally, if something is wrong with the drivers, the nvidia-smi output will be something like "unable to initialize graphics card".
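For reference, a minimal version of that sanity check looks like this (assuming nvidia-smi is on the PATH, as it is on an ESXi host with the NVIDIA vGPU manager installed; the fallback echo is mine, added so the snippet also runs on machines without the driver):

```shell
# On a healthy host, nvidia-smi prints a table listing each GPU, its driver
# version and framebuffer usage; a broken driver typically prints an error
# such as "unable to initialize graphics card" instead.
nvidia-smi 2>/dev/null || echo "nvidia-smi failed: driver not loaded or tool not installed"
```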
So the next thing I double-checked was the settings on ESXi and on the hardware platform itself:
Within vCenter, you need to set the host graphics setting to "Shared Direct" in order for vGPU to work correctly.
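If you prefer the command line over the vCenter UI, the same setting can be inspected and changed with esxcli on the host. This is a hedged sketch based on the esxcli graphics namespace as documented for ESXi 6.5 and later, where "Shared Direct" corresponds to the SharedPassthru default graphics type; the guard clause is mine, so the snippet also runs on non-ESXi machines:

```shell
# Check/set the host default graphics type via esxcli (ESXi 6.5+).
# "Shared" (vSGA) is the default; "SharedPassthru" is what the vCenter UI
# calls "Shared Direct" and is required for vGPU.
if command -v esxcli >/dev/null 2>&1; then
  esxcli graphics host get                                # show current default type
  esxcli graphics host set --default-type SharedPassthru  # switch to "Shared Direct"
  /etc/init.d/xorg restart                                # apply without a full reboot
else
  echo "esxcli not found: run this on the ESXi host"
fi
```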
NOTE: if you are using a Dell server, validate that the IOMMU setting is configured correctly within the BIOS:
set the "Memory Mapped IO Base" (IOMMU) to a lower value than currently configured.
With all settings validated, I double-checked whether the customer had changed the default ECC memory configuration that ships with non-density-based GPUs like the T4.
And there it was: they had forgotten to disable ECC memory checking, which causes a portion of the framebuffer (roughly 6.25%, which accounts for the missing 1 GB on a 16 GB T4) to be reserved for ECC:
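The numbers line up: with ECC enabled, roughly 1/16th (6.25%) of the framebuffer is reserved, which is exactly the gigabyte the customer was missing. A quick sketch of the arithmetic, assuming that common 6.25% reservation (the exact figure can vary per GPU model):

```shell
total_mb=16384                        # T4: 16 GB of framebuffer
reserved_mb=$((total_mb / 16))        # ~6.25% ECC reservation = 1024 MB
usable_mb=$((total_mb - reserved_mb))
echo "$usable_mb MB usable"           # 15360 MB = the observed 15 GB
```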
More details can be found here: https://docs.nvidia.com/grid/9.0/grid-software-quick-start-guide/index.html#disabling-enabling-ecc-memory
So, to disable ECC, we ran nvidia-smi -e 0 on the ESXi hypervisor and rebooted the server.
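Put together, the fix on each host looks like this (flags as per the nvidia-smi documentation: -q queries the full GPU state, -e 0 disables ECC on all GPUs, and the change only takes effect after a reboot; the guard clause is mine so the snippet also runs on machines without the driver):

```shell
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q | grep -A2 -i "ecc mode"   # show current and pending ECC mode
  nvidia-smi -e 0                          # disable ECC on all GPUs
  echo "ECC disabled: reboot the host for the change to take effect"
else
  echo "nvidia-smi not found: run this on the ESXi host"
fi
```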
After rebooting, the customer and I validated that 16 GB of framebuffer was available, which indeed was the case!
If you still have issues with your NVIDIA vGPU, maybe one of my other blogs might help you!
- Nvidia GRID – Square resolution screen
- Nvidia GRID Troubleshooting
- Nvidia GRID Troubleshooting – basics
I hope this blog has helped you. In any case, thank you for reading this far!