vCenter error 400 failed to connect to VMware Lookup service

In this blog post, we will take a look at an issue I encountered during a VCSA migration and PSC cleanup (see my other blog post).
After rebooting a migrated vCenter appliance, we got the error "400 failed to connect to VMware Lookup service" when connecting with the browser.

[Screenshot: VCSA error 400, lookup service failed to start]

But first, let me give you some context on how we got here in the first place.
I was tasked with migrating multiple VCSAs with external PSCs from 6.7 to 7. The job also included separating some workloads, which meant deploying an additional VCSA in the SSO domain. We encountered the issue when we joined this new VCSA to the SSO domain.

The join completed successfully, but after a clean reboot we received the error "400 failed to connect to VMware Lookup service".
I quickly checked whether the VAMI interface (https://VCSA:5480) was online. It was, and the health was OK, but the SSO domain showed Status: unknown.
[Screenshot: VCSA SSO domain status unknown]
So the first things I checked were DNS and NTP. Perhaps I had made a typo during the deployment that resulted in some strange behavior during boot.
But a quick check showed that both settings were correct. The VCSA had the correct time and could resolve both forward and reverse DNS queries.
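The forward and reverse DNS check can also be scripted. Below is a minimal Python sketch using only the standard library. The function name check_dns and the hostname vcsa.example.local are hypothetical and only serve as illustration; the sketch covers DNS only, not NTP.

```python
import socket


def check_dns(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Run a forward and a reverse lookup and compare the results.

    The two resolver functions are parameters so that the check can be
    exercised without touching a real DNS server.
    Returns (ip, reverse_name, consistent).
    """
    ip = forward(fqdn)               # forward lookup: name -> IPv4 address
    reverse_name = reverse(ip)[0]    # reverse lookup: address -> primary name
    # Compare case-insensitively and ignore a trailing dot.
    consistent = reverse_name.rstrip(".").lower() == fqdn.rstrip(".").lower()
    return ip, reverse_name, consistent
```

Calling check_dns("vcsa.example.local") on a machine that uses the same DNS servers as the appliance returns the resolved address, the name the address points back to, and whether the two agree. A lookup that fails raises an exception from the socket module instead of returning.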

To continue troubleshooting, we opened an SSH session to the vCenter.

A quick check of the running services showed that only the following services had started successfully during boot.

root@VCSA [ /var/log/vmware/vmdird ]# service-control --status --all
Running:
lwsmd vmafdd
Stopped:
applmgmt lookupsvc observability observability-vapi pschealth vlcm vmcad vmcam vmdird vmonapi vmware-analytics vmware-certificateauthority vmware-certificatemanagement vmware-cis-license vmware-content-library vmware-eam vmware-envoy vmware-hvc vmware-imagebuilder vmware-infraprofile vmware-netdumper vmware-perfcharts vmware-pod vmware-postgres-archiver vmware-rbd-watchdog vmware-rhttpproxy vmware-sca vmware-sps vmware-statsmonitor vmware-stsd vmware-topologysvc vmware-trustmanagement vmware-updatemgr vmware-vapi-endpoint vmware-vcha vmware-vdtc vmware-vmon vmware-vpostgres vmware-vpxd vmware-vpxd-svcs vmware-vsan-health vmware-vsm vsphere-ui vstats vtsdb wcp
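With this many services, it can help to turn the status output into a structure you can query. The following Python sketch assumes the layout shown above: a header line such as "Running:" or "Stopped:" followed by one or more lines of space-separated service names. The function name parse_service_status is my own and not part of any VMware tooling.

```python
def parse_service_status(output):
    """Group service names by the state header that precedes them."""
    states = {}
    current = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # A single word ending in a colon starts a new state section.
        if line.endswith(":") and " " not in line:
            current = line[:-1]
            states.setdefault(current, [])
        elif current is not None:
            states[current].extend(line.split())
    return states
```

Feeding it the captured output gives a dictionary with the keys "Running" and "Stopped", each holding a list of service names, so a membership test such as `"vmdird" in states["Stopped"]` becomes a one-liner.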

A quick Google search showed me that the next service to start was the vmdird service (thanks to David Pasek for his blog article):

  1. lwsmd (Likewise Service Manager)
  2. vmafdd (VMware Authentication Framework)
  3. vmdird (VMware Directory Service)
  4. vmcad (VMware Certificate Service)
  5. vmware-sts-idmd (VMware Identity Management Service)
  6. vmware-stsd (VMware Security Token Service)
  7. vmdnsd (VMware Domain Name Service)
  8. vmware-psc-client (VMware Platform Services Controller Client)
  9. vmon (VMware Service Lifecycle Manager)
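Given this start order and the list of running services, finding the first service that failed to come up is a simple lookup. Here is a small Python sketch; the function name first_not_running is hypothetical, and the service names are copied from the list above, so they may need adjusting to the exact names that service-control reports on your appliance.

```python
# Start order as listed above (names taken from the list, not verified
# against the names service-control uses).
START_ORDER = [
    "lwsmd",
    "vmafdd",
    "vmdird",
    "vmcad",
    "vmware-sts-idmd",
    "vmware-stsd",
    "vmdnsd",
    "vmware-psc-client",
    "vmon",
]


def first_not_running(order, running):
    """Return the first service in start order that is not running, or None."""
    running = set(running)
    for name in order:
        if name not in running:
            return name
    return None
```

With only lwsmd and vmafdd running, as in the status output earlier, `first_not_running(START_ORDER, ["lwsmd", "vmafdd"])` returns "vmdird".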

So the next thing to try was getting the vmdird service started, or at least getting a glimpse of why it was failing to start.
I tried multiple commands, service-control --start --all and service-control --start vmdird, but both gave me the same error:

2021-02-23T08:58:42.857Z {
    "detail": [
        {
            "id": "install.ciscommon.command.errinvoke",
            "translatable": "An error occurred while invoking external command : '%(0)s'",
            "args": [
                "Stderr: Job for vmdird.service failed because the control process exited with error code.\nSee \"systemctl status vmdird.service\" and \"journalctl -xe\" for details.\n"
            ],
            "localized": "An error occurred while invoking external command : 'Stderr: Job for vmdird.service failed because the control process exited with error code.\nSee \"systemctl status vmdird.service\" and \"journalctl -xe\" for details.\n'"
        }
    ],
    "componentKey": null,
    "problemId": null,
    "resolution": null
}
Error executing start on service vmdird. Details {
    "detail": [
        {
            "id": "install.ciscommon.service.failstart",
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vmdird"
            ],
            "localized": "An error occurred while starting service 'vmdird'"
        }
    ],
    "componentKey": null,
    "problemId": null,
    "resolution": null
}
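The readable part of this output is buried in JSON objects that are mixed with plain text (a timestamp and an "Error executing start" prefix). As a rough sketch, the Python snippet below pulls out the "localized" messages. It assumes the output was captured from the terminal with plain ASCII quotes; the function names are my own.

```python
import json


def extract_json_objects(text):
    """Return every JSON object embedded in a blob of mixed text."""
    decoder = json.JSONDecoder()
    objects = []
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace, try the next one.
            pos = text.find("{", pos + 1)
            continue
        objects.append(obj)
        pos = text.find("{", end)
    return objects


def localized_messages(text):
    """Collect the 'localized' strings from all 'detail' entries."""
    messages = []
    for obj in extract_json_objects(text):
        for detail in obj.get("detail", []):
            if "localized" in detail:
                messages.append(detail["localized"])
    return messages
```

Run against the output above, localized_messages returns two strings: the "invoking external command" message that points to systemctl and journalctl, and the "starting service 'vmdird'" message.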

The error did not really give me any indication of the cause, but it referred to systemctl and journalctl.
The first only shows the current status of that service, so it was not really helpful.

The second, journalctl, shows the journal that captures the messages produced by the kernel, services, etc.
After a quick look there, we found the issue:
[Screenshot: journalctl output, old VCSA entry]

The log referred to an old PSC entry that the customer had removed some time ago. The entry was indeed unreachable, as it had long been deleted from the environment.
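If the journal is long, filtering it for the hostname of a decommissioned node saves some scrolling. The Python sketch below is one way to do that, under a few assumptions: it runs on the appliance itself, Python 3.7 or later is available, and the unit is called vmdird.service as in the error message. The function names and any hostname you pass in are hypothetical.

```python
import subprocess


def read_unit_journal(unit="vmdird.service"):
    """Return the journal lines of one systemd unit for the current boot."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "-b", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,  # keep whatever output we got, even on a non-zero exit
    )
    return result.stdout.splitlines()


def matching_lines(lines, needles):
    """Keep the lines that contain any of the given strings (case-insensitive)."""
    lowered = [needle.lower() for needle in needles]
    return [line for line in lines if any(n in line.lower() for n in lowered)]
```

For example, `matching_lines(read_unit_journal(), ["old-psc.example.local"])` would list only the vmdird entries that mention that (made-up) hostname.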

Solution

With the issue identified, we had to somehow trick the VCSA into skipping the LDAP communication.
Thanks to GSS engineer Michael O’Sullivan for the temporary solution: disconnect the NIC from the VCSA VM and restart the services once more.
[Screenshot: offline boot of the VCSA, stale PSC entry]
Success! The VCSA was now able to boot in an offline mode. All services started successfully and the web interface came up without any further issues.
After the boot, we reconnected the NIC of the VCSA, and Enhanced Linked Mode between the VCSAs worked again.

Of course, if we rebooted the vCenter again, we would face the same issue for as long as the stale PSC entry remains in the SSO domain. If you would like to know how to resolve the root cause, head over to my blog post: Resolving stale PSC entries from your vSphere environment

Thanks for reading!
