Tuesday, September 4, 2012

vSphere 4.0.0 and vmInventory.xml

Yesterday, one of my ESXi 4.0 hosts decided to seg fault.  Fortunately, this was during labor day when almost everyone (except us SysAdmins) was enjoying a hard earned day off.  Also, I am fortunate that no production services happened to be running on that host.

After bringing the server back up, I discovered a few items needed to be fixed:

  • I had a corrupted software iSCSI configuration.
  • I had a corrupted vSwitch (the one I am using for vMotions).
  • I now had a corrupted vmInventory.xml

As I have HA configured on the cluster, the VMs somehow evacuated the host, and while the running states were migrated over, the VMs now showed as orphaned in vCenter.  Shortly after that, the VMs disappeared entirely from the inventory.  Thankfully, all services somehow managed to stay up.  I could ping and SSH all day long on the VMs  that were supposedly down.

I used the instructions in the following VMware KB to attempt to repair vmInventory.xml:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007541

Unfortunately this did not work.  After a few hours of repairing software iSCSI, and restoring and rebuilding by vSwitch (in addition to checking the integrity of other vSwitches), it dawned on me that I could perhaps just brows through my LUNs and re-add orphaned VMs to the appropriate host.  Well, some of the vmx files were locked, and it was fun getting those unlocked.  Finally, after all files were unlocked, I successfully added one of my VMs to the inventory...

... only, it stated the VM was invalid.  Damn it!  So close.  I verified all critical VMs were up and running, then I went home to see my wife and daughter, ate a very belated dinner, and then went to sleep.

This afternoon, it occurred to me that the object of my quest resided in vmware.log.  On a hunch, I browsed to the datastore of the VM, pulled down a copy of vmware.log, and I was not surprised to find the log contained information about the last running ESXi host, and the details of the vmx (in case it had to be rebuilt).  I then attempted to re-add the VM back to the proper host, and vCenter indicated that I had indeed placed the VM on the right host.

As a side effect, I learned a few things about my environment that I need to adjust, and ways I can improve my monitoring and my infrastructure, not the least of which is scheduling some time to upgrade ESXi and vCenter.  Now that the "vRAM Tax" has been eliminated in vSphere 5.1, I am seriously considering renewing my service contract instead of picking a new hypervisor platform.

No comments:

Post a Comment