NVIDIA vGPU less framebuffer available

Solving NVIDIA ECC issues

I recently had a customer who complained that their NVIDIA vGPU had less framebuffer available than the specs indicated.
As this is a question I frequently get from customers who invest in NVIDIA vGPUs, I wanted to write a quick blog post about it.

Some additional details: the customer bought NVIDIA T4 GPUs with 16GB of framebuffer.
They were installed in HPE ProLiant DL360 servers running vSphere ESXi 7.0 U3.

The issue was noticed when they tried to spin up 4 VMs using the T4-4A profile, which failed with an error that no suitable hosts were present.
From a VM configuration standpoint, the following specifications were configured:

As you can see in the image above, the VM is configured with the NVIDIA vGPU 4A profile, so each T4 should be able to host 4 VMs (4 x 4GB = 16GB). But when looking at the available graphics resources on the ESXi nodes themselves, we could see that only 15GB was available.
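
To make the arithmetic explicit, here is a minimal sketch in Python. The function name `vgpus_that_fit` is made up for this post, and it deliberately simplifies things: the real placement decision is made by the vGPU manager, not by a plain division.

```python
def vgpus_that_fit(available_gb, profile_gb):
    """Whole vGPUs of one profile size that fit in the available framebuffer."""
    return int(available_gb // profile_gb)


# Figures from this post: a 16GB T4 and the 4GB (4A) profile.
print(vgpus_that_fit(16, 4))  # full 16GB usable -> 4
print(vgpus_that_fit(15, 4))  # only 15GB reported -> 3
```

With only 15GB reported, the fourth 4GB VM no longer fits, which matches the "no suitable hosts" error the customer ran into.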
To make sure everything was configured correctly, I validated that the driver in the VM and on the hypervisor was working, using nvidia-smi:

nvidia-smi output

This all looked fine. Generally, if something is wrong with the drivers, the nvidia-smi output will be something like "unable to initialize graphics card".
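
If you want to script this check instead of reading the nvidia-smi table by eye, something along these lines could work. It is only a sketch: it assumes Python is available wherever you run nvidia-smi, the helper names `parse_gpu_lines` and `query_gpus` are made up, and the sample line is illustrative rather than taken from the customer's host.

```python
import subprocess

# Standard nvidia-smi query fields: GPU name, total memory (MiB), current ECC mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,ecc.mode.current",
    "--format=csv,noheader,nounits",
]


def parse_gpu_lines(text):
    """Turn the CSV lines printed by the query above into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem_mib, ecc = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_mib": int(mem_mib), "ecc": ecc})
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output (needs the NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_lines(result.stdout)


# Illustrative sample line, not real output from the customer's environment.
sample = "Tesla T4, 15109, Enabled\n"
for gpu in parse_gpu_lines(sample):
    print(gpu["name"], gpu["memory_mib"], "MiB, ECC:", gpu["ecc"])
```

On a machine with the driver installed, calling `query_gpus()` replaces the sample string with live data.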

Next, I double-checked all settings on ESXi and on the hardware platform itself:
Within vCenter, you need to set the host graphics settings to "Shared Direct" in order to get vGPU working correctly.

ESXi host graphics settings shared direct

NOTE: if you are using a Dell server, validate that the memory-mapped I/O (MMIO) setting is configured correctly within the BIOS:

Set the "Memory Mapped IO base" to a lower value than the one currently configured.

Solution: 

With all settings validated, I checked whether the customer had changed the default ECC memory configuration that ships with non-density-based GPUs like the T4.

nvidia ecc enabled
And there it was: they had forgotten to disable ECC memory. With ECC enabled, part of the framebuffer is reserved for ECC, in this case about 1GB of the 16GB:

More details can be found here: https://docs.nvidia.com/grid/9.0/grid-software-quick-start-guide/index.html#disabling-enabling-ecc-memory

So to disable ECC, we ran nvidia-smi -e 0 on the ESXi hypervisor and rebooted the server.

nvidia-smi ecc command
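
If you have more than a handful of hosts, you may want to wrap the command in a small script. Below is a rough sketch: the helper names and the `dry_run` switch are my own invention, and `-i` is the standard nvidia-smi option to target a single GPU by index. By default it only prints the command, so nothing changes until you set `dry_run=False`.

```python
import subprocess


def ecc_disable_command(gpu_index=None):
    """Build the nvidia-smi command that disables ECC (all GPUs if no index is given)."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    cmd += ["-e", "0"]
    return cmd


def disable_ecc(gpu_index=None, dry_run=True):
    """Return the command as text; only execute it when dry_run is False."""
    cmd = ecc_disable_command(gpu_index)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


print(disable_ecc())             # nvidia-smi -e 0
print(disable_ecc(gpu_index=0))  # nvidia-smi -i 0 -e 0
```

Keep in mind that the new ECC mode only takes effect after the reboot.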

After rebooting, the customer and I verified that 16GB of framebuffer was available, which was indeed the case!

NVIDIA 16GB framebuffer

If you still have issues with your NVIDIA vGPU, one of my other blogs might help you!

I hope this blog has helped you. Either way, thank you for reading this far!
