I was recently involved in an escalated support case where a customer was experiencing issues with their vSAN stretched cluster. In combination with Horizon, the issue was that appstacks could not be mounted in a VDI during maintenance of the cluster.
Some insights into the setup to better understand the architecture of the environment:
10-node vSAN stretched cluster (vSphere 7 U3)
Horizon 2212 ESB utilizing Windows 10 VDIs
App Volumes 2212 providing core applications
The default and core vSAN SPBM policy was configured as follows:
PFTT = 1 (site mirroring)
SFTT = RAID-5 erasure coding
No specific advanced settings: thin provisioning
The issue the customer was facing: when they placed multiple hosts into maintenance mode, the VDIs in the other fault domain were unable to mount appstacks.
The fault domain where the two nodes were put into maintenance mode then has insufficient hosts available to meet the SPBM requirements.
This would indeed cause an issue for objects that are not site mirrored, but the appstacks were using the default SPBM policy as well, so they were site mirrored and a full copy remained available in the healthy fault domain.
The error we got was the following: Failed to add disk scsi0:2. Out of resources Cannot open the disk
That being said, the appstacks should still be able to be mounted to the VDIs. Unfortunately, this was not the case.
Both fault domains/sides were impacted by the single maintenance operation, resulting in a failure of the App Volumes services and impacting the overall Horizon solution.
Finding the eventual root cause took a lot of time and investigation, mainly to pinpoint at what moment the issue was triggered.
TL;DR: the issue was caused by the SPBM policy applied to the appstack template, which meant the appstacks themselves inherited a bad SPBM policy.
So how did we even find this? Well, it all came down to validating the vSAN objects with the esxcli command (see the VMware documentation):
esxcli vsan debug object list --all > /tmp/objout1457.txt
As you can see in the output, the SPBM policy is set to “force provisioning = 0”, and this was the culprit of our issue.
Why? Because the appstacks were using the default SPBM policy, they also used the force provisioning = 0 setting.
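To avoid scrolling through a large dump by hand, the check above can be scripted. The following is a minimal sketch that lists every object whose policy has forceProvisioning set to 0. It assumes the dump is laid out as indented "key: value" lines with a "Policy:" section under each "Object UUID:" line; the SAMPLE text, the UUIDs and the host names are made up for illustration, and the function names are my own.

```python
from typing import Dict, List

# Hand-written sample that mimics the assumed layout of the debug dump.
# UUIDs and host names are invented for illustration only.
SAMPLE = """\
Object UUID: 11111111-aaaa-bbbb-cccc-000000000001
   Version: 15
   Health: healthy
   Owner: esx01.example.local
   Policy:
      stripeWidth: 1
      hostFailuresToTolerate: 1
      forceProvisioning: 0
      spbmProfileName: vSAN Default Storage Policy
   Configuration:
      RAID_1

Object UUID: 11111111-aaaa-bbbb-cccc-000000000002
   Version: 15
   Health: healthy
   Owner: esx02.example.local
   Policy:
      stripeWidth: 1
      hostFailuresToTolerate: 1
      forceProvisioning: 1
      spbmProfileName: vSAN Default Storage Policy
   Configuration:
      RAID_1
"""


def parse_objects(text: str) -> List[Dict[str, str]]:
    """Return one dict per object, holding its UUID and its Policy entries."""
    objects: List[Dict[str, str]] = []
    current = None
    policy_indent = None  # indentation of the "Policy:" line while inside it
    for line in text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped.startswith("Object UUID:"):
            current = {"uuid": stripped.split(":", 1)[1].strip()}
            objects.append(current)
            policy_indent = None
        elif current is None or not stripped:
            continue
        elif stripped == "Policy:":
            policy_indent = indent
        elif policy_indent is not None:
            if indent <= policy_indent:
                policy_indent = None  # left the Policy section
            elif ":" in stripped:
                key, value = stripped.split(":", 1)
                current[key.strip()] = value.strip()
    return objects


def objects_without_force_provisioning(text: str) -> List[str]:
    """UUIDs of all objects whose policy reports forceProvisioning: 0."""
    return [o["uuid"] for o in parse_objects(text)
            if o.get("forceProvisioning") == "0"]


if __name__ == "__main__":
    for uuid in objects_without_force_provisioning(SAMPLE):
        print(uuid)
```

In practice you would read the file written by the esxcli command instead of SAMPLE, and adjust the parsing if your build formats the output differently.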
Thus, when a single vSAN fault domain becomes non-compliant (putting two nodes in maintenance mode on a five-node fault domain leaves three hosts, while RAID-5 needs at least four), all RAID-5 objects become non-compliant.
So, what does this have to do with mounting my appstacks on the VDI machines?
Well, mounting the VMDKs in the VDI triggers a reconfiguration of the VDI, on which vSAN always applies the SPBM policy, and that policy currently cannot be satisfied.
This results in an error and the inability to mount the appstack VMDK, even though it is still perfectly accessible in the secondary fault domain.
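The reasoning above can be captured in a small toy model. This is only a sketch of the behaviour as we observed it in this case, not vSAN's actual placement logic: the class and function names are hypothetical, the site names are made up, and the rule "a reconfigure is accepted if every site can satisfy the policy, or if force provisioning is on" is my simplification. The minimum host counts per fault domain are the usual ones for RAID-1, RAID-5 and RAID-6 with the original storage architecture.

```python
from dataclasses import dataclass
from typing import Dict

# Minimum hosts needed inside one fault domain for the local protection level.
MIN_HOSTS = {"RAID-1": 3, "RAID-5": 4, "RAID-6": 6}


@dataclass
class Policy:
    local_protection: str      # e.g. "RAID-5"
    force_provisioning: bool   # True corresponds to "force provisioning = 1"


def site_is_compliant(policy: Policy, available_hosts: int) -> bool:
    """A site can satisfy the policy only with enough hosts left."""
    return available_hosts >= MIN_HOSTS[policy.local_protection]


def reconfigure_accepted(policy: Policy, hosts_per_site: Dict[str, int]) -> bool:
    """Toy rule (assumption): accept when every site is compliant,
    or when force provisioning overrides the compliance check."""
    all_ok = all(site_is_compliant(policy, n) for n in hosts_per_site.values())
    return policy.force_provisioning or all_ok


if __name__ == "__main__":
    # Two of five hosts in maintenance mode in one fault domain.
    during_maintenance = {"site-a": 3, "site-b": 5}
    print(reconfigure_accepted(Policy("RAID-5", False), during_maintenance))
    print(reconfigure_accepted(Policy("RAID-5", True), during_maintenance))
```

With force provisioning off, the degraded site blocks the reconfigure even though the other site is healthy, which matches the mount failures we saw on both sides.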
So the solution was quite easy: change the default force provisioning value from 0 to 1 for all the appstacks.
Unfortunately, the execution is a bit more challenging than it sounds.
The App Volumes template, like all templates located on a vSAN datastore, receives its SPBM policy at the moment it is created or converted to the template format.
As a result, the applied SPBM policy is frozen in time the moment you upload the template, so modifying the default SPBM policy will not trigger a reconfigure of the template.vmdk.
So the only option is removing the template and re-uploading it through App Volumes Manager.
You can always validate whether the SPBM policy was correctly applied with the esxcli vsan debug object command.
So the tricky part in this issue was: what about the existing appstacks? The customer was not really eager to recreate all their applications and go through the necessary DTAP again.
After some investigation, I eventually found a solution that allowed the appstacks to be modified with the correct SPBM policy without the need for recreation.
Note: App Volumes needs to be put into maintenance for this to succeed, meaning no appstack can be mounted at a certain point in the procedure.
So how did we do this?
Well, to better understand how appstacks and vSAN work together: the moment a VDI gets an appstack mounted within the OS, some sort of pointer is created to the vSAN datastore and the explicit VMDK files.
These VMDK files get the REDO format, which is some sort of “snapshot” of the VMDK.
As long as a VDI, and thus a live mount of the appstack, is present, this REDO file will continue to exist.
So how did I change the SPBM policy on the base VMDK of all the appstacks?
Well, the answer is quite easy: we just need to attach the VMDK as an “existing disk” to a temporary machine without an App Volumes agent.
This triggers the SPBM reconfiguration on the VMDK, because vSAN / vSphere thinks it is a VM VMDK. The operation itself will fail, but the SPBM policy is applied in the first step of the mount process.
Step 1: Modify the existing vSAN default storage policy configured on that cluster with the “Force provisioning = 1” setting.
Step 2: Mount the appstack VMDK as an “existing disk” to a “jumphost” without any App Volumes agent.
Step 3: Wait until the mount process fails and gives an error.
But as shown in the vSAN debug output above, the base VMDK file we just attached did correctly get the new vSAN default storage policy applied.
It changed from force provisioning 0 to 1.
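If you save one debug dump before attaching the disk and one after, the change can be confirmed per object instead of by eye. Below is a minimal sketch of such a comparison. It assumes each object in the dump starts with an "Object UUID:" line and contains a "forceProvisioning:" line; the BEFORE and AFTER texts and the UUIDs are invented, and the function names are hypothetical.

```python
import re
from typing import Dict, Tuple

# Invented sample dumps, reduced to the two fields the comparison needs.
BEFORE = """\
Object UUID: 22222222-aaaa-bbbb-cccc-000000000001
   Policy:
      forceProvisioning: 0
Object UUID: 22222222-aaaa-bbbb-cccc-000000000002
   Policy:
      forceProvisioning: 0
"""

AFTER = """\
Object UUID: 22222222-aaaa-bbbb-cccc-000000000001
   Policy:
      forceProvisioning: 1
Object UUID: 22222222-aaaa-bbbb-cccc-000000000002
   Policy:
      forceProvisioning: 0
"""


def force_provisioning_by_uuid(dump: str) -> Dict[str, int]:
    """Map each object UUID to the forceProvisioning value found below it."""
    result: Dict[str, int] = {}
    uuid = None
    for line in dump.splitlines():
        match = re.match(r"\s*Object UUID:\s*(\S+)", line)
        if match:
            uuid = match.group(1)
            continue
        match = re.match(r"\s*forceProvisioning:\s*(\d+)", line)
        if match and uuid is not None:
            result[uuid] = int(match.group(1))
    return result


def changed_objects(before: str, after: str) -> Dict[str, Tuple[int, int]]:
    """Objects present in both dumps whose forceProvisioning value differs."""
    old = force_provisioning_by_uuid(before)
    new = force_provisioning_by_uuid(after)
    return {u: (old[u], new[u]) for u in old if u in new and old[u] != new[u]}


if __name__ == "__main__":
    for uuid, (was, now) in changed_objects(BEFORE, AFTER).items():
        print(f"{uuid}: forceProvisioning {was} -> {now}")
```

Any appstack object that does not show up in the result still carries the old setting and needs another attach attempt.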
Step 4: Get rid of the REDO VMDKs on the vSAN datastore.
Unfortunately, this is the step that causes the most impact to the environment, as we need to schedule a moment where all appstack mounts are removed.
Only after all VDI sessions, and thus the appstack mounts, have been freed up will the REDO VMDK be deleted.
The new REDO object after we logged off all users (removing all active appstack mounts) and logged in with one test user.
Step 5: Roll back the default SPBM policy change by setting force provisioning back to 0.
With the SPBM policy modified on the appstacks, all remaining VDIs in both the unhealthy and the healthy fault domain will still be able to mount appstacks when a vSAN fault domain becomes non-compliant. This avoids business impact on App Volumes and keeps the applications available within the VDI sessions.
I will be following up on this topic with a best-practice App Volumes and vSAN stretched architecture post to give a quick summary of the important takeaways.
I hope this blog has helped you, like and comment!