[ZBX-9779] VMWare Host status Gray after update to 2.2.10 Created: 2015 Aug 14  Updated: 2020 Nov 27  Resolved: 2015 Oct 16

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: API (A), Proxy (P), Server (S)
Affects Version/s: 2.2.10
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Tobias Wigand Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: vmware
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.04 LTS, MySQL


Attachments: File vmware-status_2.sh    
Issue Links:
Duplicate
duplicates ZBX-7446 vmware.hv.status does not report hard... Closed

 Description   

Hi,

we have just updated our Zabbix installation from 2.2.9 to 2.2.10. After that, 10 random ESX 5.1 and 6.0 hosts suddenly show their status as Gray. There is nothing to find in the vCenter, though; all is OK, as it was before the update.
We tried deleting the hosts and also one of the vCenter servers, but the same hosts reappear in Gray.

Server Parameters changed from Default:

StartVMwareCollectors=4
VMwareCacheSize=256M
VMwareTimeout=30

The template is a modified Template Virt VMware without the "Discover VMware VMs" discovery item, as we do not need that.



 Comments   
Comment by richlv [ 2015 Aug 14 ]

something like that was sort of supposed to be fixed by ZBX-7446

Comment by Tobias Wigand [ 2015 Aug 14 ]

Saw that bug, too. But for us the problems appeared with 2.2.10; 2.2.9 was OK. Is there anything we can do to safely reverse the fix from ZBX-7446 and keep using 2.2.10? We would rather not downgrade to our 2.2.9 backup and lose all data collected with 2.2.10 so far.

Comment by richlv [ 2015 Aug 14 ]

you don't have to restore from the backup - database in 2.2.9 and 2.2.10 is exactly the same, so you can just downgrade zabbix server to 2.2.9 and see whether that helps (it would also be a useful thing to test)

note that you can keep 2.2.10 frontend, agents and proxies - downgrading the server only is perfectly fine

Comment by Tobias Wigand [ 2015 Aug 14 ]

Great, thank you! We have installed the 2.2.9 debs for the server and also for one sqlite3 proxy used for monitoring, and all problematic hosts almost instantly switched to Green again.
How can we help to debug this further?

Comment by Oleksii Zagorskyi [ 2015 Sep 16 ]

I'm almost sure this one is a duplicate of ZBX-7446.
I had a user case almost the same as yours. What I heard is:

Those that give us NOTOK value, do not show the hardware information on vcenter.
Others show hardware information and have OK status.

So actually, in 2.2.9 the returned status was incorrect: always OK.
Now, in 2.2.10, the returned value is correct, including the "gray" one.

Comment by Tobias Wigand [ 2015 Sep 17 ]

I just double-checked that; for us it is not true. Hardware Status for those hosts is displayed, and every item (e.g. Processor, Memory, PCI) shows a green checkmark and status "Normal".
System summary says: "No alerts or warnings out of 221 sensors".

Comment by Oleksii Zagorskyi [ 2015 Sep 17 ]

Ok, reopening this issue.

Comment by Oleksii Zagorskyi [ 2015 Sep 17 ]

See the script in attachments.
It was used for investigation during ZBX-7446 development.

Notes about the script:

1. It requires curl utility.
2. Please open it and change variables:
# VMware URL, for example https://1.2.3.4/sdk
URL=???
# VMware login user/password
USER=???
PASSWORD=???
3. Leave uncommented # vCenter settings in case of vCenter
4. Uncomment # hypervisor settings in case of hypervisor.

If you are using vCenter, do not forget to replace HOSTSYSTEM with the correct value for your VMware host system.
Run it once for a green host status and once for gray host status.

Redirect output to files, then attach them here.
If attaching those files is not allowed for you, you could send them to $upp0rt@ mail box, if you like.

Comment by Tobias Wigand [ 2015 Sep 17 ]

OK, the script seems to be working; I had to install curl. I used option 3 as we are running vCenter Servers.
LoginResponse looks OK to me, but then it says something like "object has been deleted or has not been fully created yet" (roughly translated from German), no matter what I insert into HOSTSYSTEM. I have tried the DNS name (the way it looks in vSphere Client), the hostname without the domain suffix, and the IP address.
What would I need to enter there?
I tried two hosts that are part of a cluster that resides in one of our datacenters.

Comment by Oleksii Zagorskyi [ 2015 Sep 23 ]

The only thing I can help with is how to get the hypervisor names, to make sure you are using the correct ones.
You need to use the following data in a script similar to the attached one, in the VMWARE_GETHV variable.
Be prepared to work out all the details of the script yourself.

From devs' internal notes:
Retrieves the list of available hypervisors.
For vSphere it returns the hardcoded hypervisor name “ha-host”, while for vCenter the list of hypervisors is retrieved by the following request:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:vim25" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <soapenv:Header/>
   <soapenv:Body>
      <urn:RetrievePropertiesEx>
          <urn:_this type="PropertyCollector">propertyCollector</urn:_this>
         <urn:specSet>
            <urn:propSet>
               <urn:type>HostSystem</urn:type>
            </urn:propSet>
            <urn:objectSet>
               <urn:obj type="Folder">group-d1</urn:obj>
               <urn:skip>false</urn:skip>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>visitFolders</urn:name>
                  <urn:type>Folder</urn:type>
                  <urn:path>childEntity</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>visitFolders</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>dcToHf</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>dcToVmf</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>crToH</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>crToRp</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>dcToDs</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>hToVm</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>rpToVm</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>dcToVmf</urn:name>
                  <urn:type>Datacenter</urn:type>
                  <urn:path>vmFolder</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>visitFolders</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>dcToDs</urn:name>
                  <urn:type>Datacenter</urn:type>
                  <urn:path>datastore</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>visitFolders</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>dcToHf</urn:name>
                  <urn:type>Datacenter</urn:type>
                  <urn:path>hostFolder</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>visitFolders</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>crToH</urn:name>
                  <urn:type>ComputeResource</urn:type>
                  <urn:path>host</urn:path>
                  <urn:skip>false</urn:skip>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>crToRp</urn:name>
                  <urn:type>ComputeResource</urn:type>
                  <urn:path>resourcePool</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>rpToRp</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>rpToVm</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>rpToRp</urn:name>
                  <urn:type>ResourcePool</urn:type>
                  <urn:path>resourcePool</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>rpToRp</urn:name>
                  </urn:selectSet>
                  <urn:selectSet>
                     <urn:name>rpToVm</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>hToVm</urn:name>
                  <urn:type>HostSystem</urn:type>
                  <urn:path>vm</urn:path>
                  <urn:skip>false</urn:skip>
                  <urn:selectSet>
                     <urn:name>visitFolders</urn:name>
                  </urn:selectSet>
               </urn:selectSet>
               <urn:selectSet xsi:type="urn:TraversalSpec">
                  <urn:name>rpToVm</urn:name>
                  <urn:type>ResourcePool</urn:type>
                  <urn:path>vm</urn:path>
                  <urn:skip>false</urn:skip>
               </urn:selectSet>
            </urn:objectSet>
         </urn:specSet>
         <urn:options/>
       </urn:RetrievePropertiesEx>
   </soapenv:Body>
</soapenv:Envelope>
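
The response to this request lists each hypervisor as an obj element of type HostSystem. A minimal Python sketch of extracting those IDs follows. The sample response is trimmed and hypothetical (the IDs host-101 and host-202 are made up), and the sketch assumes that the HOSTSYSTEM placeholder in the script expects such a managed object ID rather than a DNS name or IP address, which this thread does not confirm.

```python
import xml.etree.ElementTree as ET
from typing import List

VIM25_NS = "urn:vim25"

# Hypothetical, heavily trimmed RetrievePropertiesEx response.
# Real responses carry more elements; the host IDs are invented.
SAMPLE_RESPONSE = """\
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <RetrievePropertiesExResponse xmlns="urn:vim25">
      <returnval>
        <objects><obj type="HostSystem">host-101</obj></objects>
        <objects><obj type="HostSystem">host-202</obj></objects>
      </returnval>
    </RetrievePropertiesExResponse>
  </soapenv:Body>
</soapenv:Envelope>"""


def host_system_ids(response_xml: str) -> List[str]:
    """Return the text of every <obj type="HostSystem"> element."""
    root = ET.fromstring(response_xml)
    return [
        obj.text.strip()
        for obj in root.iter("{%s}obj" % VIM25_NS)
        if obj.get("type") == "HostSystem" and obj.text
    ]


if __name__ == "__main__":
    # Print one candidate HOSTSYSTEM value per line.
    for host_id in host_system_ids(SAMPLE_RESPONSE):
        print(host_id)
```

To use it on real data, save the curl output of the request above to a file and pass its contents to host_system_ids() instead of SAMPLE_RESPONSE.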
Comment by Tobias Wigand [ 2015 Sep 23 ]

Many thanks for your help. I was not able to adapt the script you attached, but I remembered our old VMware API install on an ancient Nagios host and gave that a shot.
While the overall status is reported as green on both the problematic and the good host, there seems to be a health issue on the problematic host. That issue turns out to be a bug in the old vCenter install our VMware guys are running. They'll try to update in the next few days.
But as the VMware API shows overall status green on both hosts, maybe Zabbix has become too picky in the latest release? The host status item should reflect the overall status, and that should be green in this case, shouldn't it?

# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host1 -f credentials-file -l runtime 
CHECK_VMWARE_API.PL OK - 2/2 VMs up, overall status=green, connection state=connected, maintenance=no, 1 health issue(s), no config issues | vmcount=2units;; health_issues=1;; config_issues=0;;

# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host2 -f credentials-file -l runtime 
CHECK_VMWARE_API.PL OK - 1/1 VMs up, overall status=green, connection state=connected, maintenance=no, All 138 health checks are Green, no config issues | vmcount=1units;; health_issues=0;; config_issues=0;;

		
# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host1 -f credentials-file -l runtime -s health
CHECK_VMWARE_API.PL OK - 1 health issue(s) found in 138 checks:
1) UNKNOWN[system] Status of VMware Rollup Health State: Über den aktuellen Zustand des Elements kann nicht berichtet werden | Alerts=1;;

# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host2 -f credentials-file -l runtime -s health
CHECK_VMWARE_API.PL OK - All 138 health checks are GREEN: fan (1x); system (1x); CPU (2x); Processors (6x); Software Components (108x); Memory (1x); Storage (5x); power (1x); Management Subsystem Health (3x); temperature (10x); | Alerts=0;;

Related VCenter Bug:
http://forums.veeam.com/veeam-one-f28/vmware-rollup-health-state-unknown-after-latest-5-1-patch-t18976.html

Comment by Oleksii Zagorskyi [ 2015 Sep 24 ]

In the output of the 3rd command we see: "1 health issue(s) found in 138 checks".
It mentions the "VMware Rollup Health State" parameter with the "UNKNOWN" value.
The "VMware Rollup Health State" is exactly what Zabbix server looks at to return the key value.

Google translation from German to English:
"Über den aktuellen Zustand des Elements kann nicht berichtet werden"
"About the current state of the element can not be reported"

For Unknown state (label) on English vSphere I can see this description (summary):
"Cannot report on the current health state of the element"

I'm not sure we need to look into how the Nagios plugin estimates "overall status".
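
The behaviour described in this comment can be sketched in Python as follows. This illustrates the reasoning in the thread and is not Zabbix's actual code: the function name, the (name, state) tuple layout, and the choice to treat a missing rollup sensor as gray are all assumptions.

```python
from typing import Iterable, Tuple

ROLLUP_SENSOR = "VMware Rollup Health State"
KNOWN_STATES = {"green", "yellow", "red"}


def host_status(sensors: Iterable[Tuple[str, str]]) -> str:
    """Derive the host status from (sensor name, state) pairs.

    Only the rollup sensor is consulted. Every other sensor is ignored,
    which is why a host whose hardware sensors are all green can still
    come out as gray.
    """
    for name, state in sensors:
        if name == ROLLUP_SENSOR:
            state = state.lower()
            return state if state in KNOWN_STATES else "gray"
    # Assumption: a host without a rollup sensor is treated as unknown.
    return "gray"


# Made-up sensor lists modelled on the two hosts compared above.
sensors_host1 = [
    (ROLLUP_SENSOR, "unknown"),
    ("CPU1", "green"),
    ("Memory", "green"),
]
sensors_host2 = [
    (ROLLUP_SENSOR, "green"),
    ("CPU1", "green"),
]

if __name__ == "__main__":
    print("Host1:", host_status(sensors_host1))
    print("Host2:", host_status(sensors_host2))
```

A check that aggregates over all sensors in some other way, as the Nagios plugin's "overall status" may do, can legitimately disagree with a check that reads only the rollup sensor.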

Comment by Oleksii Zagorskyi [ 2015 Oct 16 ]

The discussion looks finished.
My conclusion - zabbix is ok.
Closed again as duplicate.

Comment by Thomas Lohmüller [ 2016 Jan 27 ]

We just upgraded our Zabbix from 2.2.3 to the current 2.4.7 and had the same issue. 36 of our 55 hypervisor hosts changed from "green" to "grey". All of them showed "green" in VCenter.

So we started to dig deeper and used this tool to inspect data from the API:
https://github.com/BaldMansMojo/check_vmware_esx

First we let it list all the sensors from one specific host which was reported as "grey":

$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health --listsensors
WARNING: [Unknown] [Type: system] [Name: VMware Rollup Health State] [Label: Unbekannt] [Summary: Über den aktuellen Zustand des Elements kann nicht berichtet werden]
[Ok] [Type: System] [Name: System Board 0 SUPER_CAP_FLT - Predictive failure deasserted] [Label: Grün] [Summary: Sensor wird unter normalen Bedingungen betrieben]
[Ok] [Type: Platform Alert] [Name: System Board 0 POWER_ON_FAIL - Predictive failure deasserted] [Label: Grün] [Summary: Sensor wird unter normalen Bedingungen betrieben]
[Ok] [Type: CPU] [Name: CPU1] [Label: ] [Summary: Physisches Element funktioniert wie erwartet]
... lots of [OK] lines ...

The troublemaker is the first line. As one of the 117 checks is "Unknown" (aka "grey"), Zabbix (and also this check_vmware_esx.pl script) reports the host as grey.
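
To spot such a line without reading through all the [Ok] entries, the --listsensors output can be filtered. The following Python sketch assumes the line format shown above stays as it is; the regular expression and the function name are my own and are not part of check_vmware_esx.pl.

```python
import re
from typing import Dict, List

# Matches one sensor line; any prefix such as "WARNING: " is skipped
# because the pattern is applied with search() rather than match().
SENSOR_RE = re.compile(
    r"\[(?P<state>[^\]]+)\] "
    r"\[Type: (?P<type>[^\]]*)\] "
    r"\[Name: (?P<name>[^\]]*)\] "
    r"\[Label: (?P<label>[^\]]*)\] "
    r"\[Summary: (?P<summary>[^\]]*)\]"
)

# Two lines copied from the output above.
SAMPLE_OUTPUT = """\
WARNING: [Unknown] [Type: system] [Name: VMware Rollup Health State] [Label: Unbekannt] [Summary: Über den aktuellen Zustand des Elements kann nicht berichtet werden]
[Ok] [Type: CPU] [Name: CPU1] [Label: ] [Summary: Physisches Element funktioniert wie erwartet]
"""


def non_ok_sensors(output: str) -> List[Dict[str, str]]:
    """Return the parsed fields of every sensor whose state is not 'Ok'."""
    found = []
    for line in output.splitlines():
        match = SENSOR_RE.search(line)
        if match and match.group("state") != "Ok":
            found.append(match.groupdict())
    return found


if __name__ == "__main__":
    for sensor in non_ok_sensors(SAMPLE_OUTPUT):
        print(sensor["state"], "-", sensor["name"])
```

Lines that do not match the pattern, such as the "... lots of [OK] lines ..." placeholder, are skipped silently.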

$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health
WARNING: 1 health issue(s) found in 117 checks: 1) [Unknown] [Type: system] [Name: VMware Rollup Health State] [Label: Unbekannt] [Summary: Über den aktuellen Zustand des Elements kann nicht berichtet werden]

This page from the VMware KnowledgeBase describes exactly this problem: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1037330

So we issued the following PowerShell command:

(Get-View (Get-VMHost -Name host.fqdn | Get-View).ConfigManager.HealthStatusSystem).RefreshHealthStatusSystem()

Then we ran the same check_vmware_esx.pl command as before. This time, the result is correct:

$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health
OK: All 117 health checks are GREEN: System (1x), system (1x), Platform Alert (1x), CPU (2x), voltage (31x), Processors (18x), Memory (1x), other (17x), Storage (14x), power (11x), temperature (20x)

Zabbix now also correctly reports this hypervisor host as "green".

So it looks like this is a caching problem on the vCenter itself. We have this issue on vCenter 5.5 U3 and also on vCenter 6.0. VMware labels it as a feature, not a bug, so I don't think they will resolve this "issue". Is there a chance to implement this refresh (as in the PowerShell command above) in Zabbix?
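
For reference, the PowerShell call above corresponds to the RefreshHealthStatusSystem method of the vSphere API's HostHealthStatusSystem object. A minimal Python sketch of the SOAP request such a refresh would need, in the same style as the request earlier in this thread, might look like this. The function name and the example object ID are hypothetical, the real ID would first have to be read from the host's configManager.healthStatusSystem property, and nothing here is taken from Zabbix's code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# SOAP envelope for the refresh call; {moid} is filled in per host.
REFRESH_TEMPLATE = """\
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:vim25">
   <soapenv:Header/>
   <soapenv:Body>
      <urn:RefreshHealthStatusSystem>
         <urn:_this type="HostHealthStatusSystem">{moid}</urn:_this>
      </urn:RefreshHealthStatusSystem>
   </soapenv:Body>
</soapenv:Envelope>"""


def build_refresh_request(health_system_id: str) -> str:
    """Build the request body for one host's health status system.

    health_system_id is the managed object ID of the host's
    HostHealthStatusSystem (hypothetical example: "healthStatusSystem-42").
    """
    body = REFRESH_TEMPLATE.format(moid=escape(health_system_id))
    ET.fromstring(body)  # raises ParseError if the result is not well-formed
    return body


if __name__ == "__main__":
    print(build_refresh_request("healthStatusSystem-42"))
```

Sending this body would still require the session cookie from a prior Login call, and it says nothing about how much load a periodic refresh would put on vCenter.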

Comment by Andris Zeila [ 2016 Jan 27 ]

Thanks for investigating this issue!

We also have found that on some systems VMware Rollup Health State sensor has unknown state with a message "Cannot report on the current health state of the element". So Zabbix reports the hypervisor state as gray, while in vSphere client the host is shown as green.

We did ask VMware support what the correct way to handle this situation would be; let's see what they answer.

Regarding implementation of sensor refreshing in Zabbix - most probably it could be done, but there is a question how taxing this operation is on vCenters.

Comment by richlv [ 2016 Jan 27 ]

thanks for digging into this. adding something like that to zabbix sounds a bit risky, but we could surely document it.
i'd suggest reopening this issue and setting it to the documentation component.

Comment by Thomas Lohmüller [ 2016 Jan 27 ]

Some more strange issues...

One of our hosts did not respond to the PowerShell code above. It was still reported as "grey". Listing all the sensors using...

$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health --listsensors

... revealed that the line labeled "VMware Rollup Health State" was completely missing on this specific host. We had to remove the host from the VCenter and re-add it. Now the "VMware Rollup Health State" is back again. This API interface on the VCenter feels quite unreliable.

And as we now know, there is (wrong) cached data reported using this API. And this cached value is what Zabbix "sees". So we also don't know if it will report alerts reliably if the overall state of a host changes. It may still report the old, cached value ("green").

Generated at Tue Apr 29 08:26:24 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.