Recently I upgraded a Horizon environment from 6.2.4 to 7.7 that uses Cloud Pod Architecture. Each pod had its own vCenter with a 6-host vSAN cluster, which hosted both the Horizon backend and the VDI desktops. The plan was to take one pod down for maintenance and move all active sessions to the other, still active pod.
Note: Unfortunately this is not a one-click-fix blog post, or even a post containing the solution. I still want to share the details of the entire troubleshooting process and the decisions we made.
The in-place upgrade went without any major issues until we started validating all functionality of the upgraded pod. When I opened the global entitlements, the following error presented itself:
Troubleshooting:
I began troubleshooting where everybody starts: what do the logs say?
ERROR (0EF0-1D48) <pool-20-thread-12> [FaultUtil] EntityNotFound(entityId: UserOrGroup/Uy0xLTUtMjEtMTgwNzg4OTk2Mi0xMTY3MzU3NjM2LTUwOTE5MDI1M): Could not find user or group in AD
I tried using PowerCLI and the Horizon REST API to find any reference to an object containing that UserOrGroup ID, but didn’t find anything.
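One detail worth noting about that ID: the part after `UserOrGroup/` appears to be Base64 text. The following is a minimal Python sketch for decoding it, not something taken from the Horizon tooling. The function name `decode_entity_id` is my own, and because the ID in the log excerpt above is cut off, the sketch drops the incomplete trailing characters before decoding.

```python
import base64


def decode_entity_id(entity_id: str) -> str:
    """Decode the Base64 part of an ID such as 'UserOrGroup/<base64>'."""
    # Keep only the part after the last '/', if there is one.
    encoded = entity_id.rsplit("/", 1)[-1]
    # Base64 decodes in groups of four characters; drop an incomplete
    # trailing group (the ID in the log excerpt is cut off).
    usable = len(encoded) - (len(encoded) % 4)
    return base64.b64decode(encoded[:usable]).decode("ascii", errors="replace")


# The ID exactly as it appears (truncated) in the log line above.
entity_id = "UserOrGroup/Uy0xLTUtMjEtMTgwNzg4OTk2Mi0xMTY3MzU3NjM2LTUwOTE5MDI1M"
print(decode_entity_id(entity_id))
```

For the ID above this prints `S-1-5-21-1807889962-1167357636-50919025`, which is the start of a Windows SID (the last block is incomplete because of the truncation). Searching AD for the full SID instead of the encoded ID may be an easier way to find out which user or group the entry refers to.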
I eventually started thinking about possible issues with the LDS replication between the two Connection Servers in the pod and the inter-pod replication of the global ADAM database. Both were running without any issues and replication was still happening. I tested this by creating and removing a global entitlement on the Horizon 6 pod; the change was replicated automatically to the other pod.
Because the GUI error prevented me from finding out which entries were present and which were not, I needed to validate the entries with lmvutil.exe through the command line. There I was able to retrieve the content of the global assignment and entitlement objects in the database, but they were the same on both sides, so this gave me no additional information.
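Comparing two exports by eye is error-prone, so a small script can help. This is a minimal Python sketch, assuming you have saved the command-line output from each pod as plain text. The function name `diff_exports` and the sample lines are hypothetical and do not represent real lmvutil output.

```python
import difflib


def diff_exports(pod_a_text: str, pod_b_text: str) -> list[str]:
    """Return unified-diff lines between two text exports, ignoring line order."""
    # Sort and strip lines so that ordering or trailing-whitespace
    # differences between the pods do not show up as changes.
    a = sorted(line.strip() for line in pod_a_text.splitlines() if line.strip())
    b = sorted(line.strip() for line in pod_b_text.splitlines() if line.strip())
    return list(
        difflib.unified_diff(a, b, fromfile="pod-a", tofile="pod-b", lineterm="")
    )


# Hypothetical sample data, not real lmvutil output.
pod_a = "entitlement: Sales-Desktops\nentitlement: HR-Desktops\n"
pod_b = "entitlement: HR-Desktops\nentitlement: Sales-Desktops\n"

differences = diff_exports(pod_a, pod_b)
print("identical" if not differences else "\n".join(differences))
```

An empty result means both exports contain the same lines, which matches what I saw: no difference between the two pods.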
To be 100% sure, I validated this again with ADSI Edit, connecting to both pods, and confirmed it a second time. The entries were identical, as seen in the image below.
Unjoining and rejoining the pod was the next step, but it gave the exact same error. I even tried reinstalling the Connection Servers after the unjoin (a rolling reinstall, so the local ADAM database was not wiped), with the same result.
So in the end I bit the bullet and decided to redeploy the entire pod from scratch. I started with new VMs, then installed the first and second Connection Server. All went well until the moment of truth: joining the new pod to the existing Cloud Pod Architecture. You can imagine what the outcome was... Exactly the same error as before.
During my troubleshooting on day one, we opened a VMware support case to get some feedback on the root cause and a possible solution. After hours of troubleshooting, sending over logs, …
VMware’s conclusion was unfortunately the following:
“We tried to recover the current environment which has errors to access the global entitlement. However, we notice that there are stale entries in the Global/Local Adam database which is unable to retrieve to correspond the global entitlement.”
Conclusion
In the end we decided to create a new Cloud Pod Architecture and copy over all entitlements. This went without any real issues and the new Horizon 7 pod is working like a charm.
Therefore I would like to end this post by saying that, thanks to the robust design of VMware Horizon Cloud Pod Architecture, it is sometimes easier to redeploy an entirely new pod than to try to fix the existing one.
If you have any remarks or even a possible solution, please let me know in the comments!
Also check out my UEM printer blog post.