[ZBX-9779] VMWare Host status Gray after update to 2.2.10 Created: 2015 Aug 14 Updated: 2020 Nov 27 Resolved: 2015 Oct 16 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | API (A), Proxy (P), Server (S) |
Affects Version/s: | 2.2.10 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Major |
Reporter: | Tobias Wigand | Assignee: | Unassigned |
Resolution: | Duplicate | Votes: | 0 |
Labels: | vmware | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Ubuntu 14.04 LTS, MySQL |
Attachments: |
Issue Links: |
|
Description |
Hi, we have just updated our Zabbix installation from 2.2.9 to 2.2.10. After that, 10 random ESX 5.1 and 6.0 hosts suddenly show their status as Gray. There is nothing to find in the VCenter, though; all is OK as it was before the update. Server parameters changed from default: StartVMwareCollectors=4. The template is a modified Template Virt VMware without the "Discover VMware VMs" discovery item, as we do not need that. |
Comments |
Comment by richlv [ 2015 Aug 14 ] |
something like that was sort of supposed to be fixed by |
Comment by Tobias Wigand [ 2015 Aug 14 ] |
Saw that bug, too. But for us the problems appeared with 2.2.10; 2.2.9 was OK. Is there anything we can do to safely reverse the fix from 7446 and keep using 2.2.10? We would rather not downgrade to our 2.2.9 backup and lose all data collected with 2.2.10 so far. |
Comment by richlv [ 2015 Aug 14 ] |
you don't have to restore from the backup - database in 2.2.9 and 2.2.10 is exactly the same, so you can just downgrade zabbix server to 2.2.9 and see whether that helps (it would also be a useful thing to test) note that you can keep 2.2.10 frontend, agents and proxies - downgrading the server only is perfectly fine |
Comment by Tobias Wigand [ 2015 Aug 14 ] |
Great, thank you! We have installed the 2.2.9 debs for the server and also one sqlite3 proxy that monitors and all problematic hosts almost instantly switched to Green again. |
Comment by Oleksii Zagorskyi [ 2015 Sep 16 ] |
I'm almost sure this one is duplicate of
So actually in 2.2.9 returned status was incorrect - always OK. |
Comment by Tobias Wigand [ 2015 Sep 17 ] |
I just double-checked that; for us it is not true. Hardware Status for those hosts is displayed, and every item (e.g. Processor, Memory, PCI, etc.) shows a green checkmark and status "Normal". |
Comment by Oleksii Zagorskyi [ 2015 Sep 17 ] |
Ok, reopening this issue. |
Comment by Oleksii Zagorskyi [ 2015 Sep 17 ] |
See the script in attachments. Notes about the script:
1. It requires the curl utility.
2. Please open it and change the variables:
   # VMware URL, for example "https://1.2.3.4/sdk"
   URL=???
   # VMware login user/password
   USER=???
   PASSWORD=???
3. Leave "# vCenter settings" uncommented in case of vCenter.
4. Uncomment "# hypervisor settings" in case of a hypervisor.
If you are using vCenter, do not forget to replace HOSTSYSTEM with the correct value of your VMware host system. Redirect the output to files, then attach them here. |
Comment by Tobias Wigand [ 2015 Sep 17 ] |
OK, the script seems to be working, had to install curl. Used option 3 as we are running vCenter Servers. |
Comment by Oleksii Zagorskyi [ 2015 Sep 23 ] |
The only thing I can help with is how to get the hypervisor names, to make sure you are using the correct names. From the devs' internal notes:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:vim25" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Header/>
  <soapenv:Body>
    <urn:RetrievePropertiesEx>
      <urn:_this type="PropertyCollector">propertyCollector</urn:_this>
      <urn:specSet>
        <urn:propSet>
          <urn:type>HostSystem</urn:type>
        </urn:propSet>
        <urn:objectSet>
          <urn:obj type="Folder">group-d1</urn:obj>
          <urn:skip>false</urn:skip>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>visitFolders</urn:name>
            <urn:type>Folder</urn:type>
            <urn:path>childEntity</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>visitFolders</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>dcToHf</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>dcToVmf</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>crToH</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>crToRp</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>dcToDs</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>hToVm</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>rpToVm</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>dcToVmf</urn:name>
            <urn:type>Datacenter</urn:type>
            <urn:path>vmFolder</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>visitFolders</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>dcToDs</urn:name>
            <urn:type>Datacenter</urn:type>
            <urn:path>datastore</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>visitFolders</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>dcToHf</urn:name>
            <urn:type>Datacenter</urn:type>
            <urn:path>hostFolder</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>visitFolders</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>crToH</urn:name>
            <urn:type>ComputeResource</urn:type>
            <urn:path>host</urn:path>
            <urn:skip>false</urn:skip>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>crToRp</urn:name>
            <urn:type>ComputeResource</urn:type>
            <urn:path>resourcePool</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>rpToRp</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>rpToVm</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>rpToRp</urn:name>
            <urn:type>ResourcePool</urn:type>
            <urn:path>resourcePool</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>rpToRp</urn:name></urn:selectSet>
            <urn:selectSet><urn:name>rpToVm</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>hToVm</urn:name>
            <urn:type>HostSystem</urn:type>
            <urn:path>vm</urn:path>
            <urn:skip>false</urn:skip>
            <urn:selectSet><urn:name>visitFolders</urn:name></urn:selectSet>
          </urn:selectSet>
          <urn:selectSet xsi:type="urn:TraversalSpec">
            <urn:name>rpToVm</urn:name>
            <urn:type>ResourcePool</urn:type>
            <urn:path>vm</urn:path>
            <urn:skip>false</urn:skip>
          </urn:selectSet>
        </urn:objectSet>
      </urn:specSet>
      <urn:options/>
    </urn:RetrievePropertiesEx>
  </soapenv:Body>
</soapenv:Envelope> |
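For anyone who wants to pull the host system identifiers out of the reply to the request above, here is a minimal Python sketch. It assumes the reply wraps each result in an `objects` element containing an `obj` element in the `urn:vim25` namespace; the sample reply and the IDs `host-28` and `host-45` are made up for illustration and are not taken from this ticket. The function name `host_system_ids` is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Namespace used by the vSphere SOAP API in the request above.
VIM_NS = {"vim": "urn:vim25"}

# Hypothetical, heavily trimmed reply; a real one carries more elements.
SAMPLE_RESPONSE = """
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <RetrievePropertiesExResponse xmlns="urn:vim25">
      <returnval>
        <objects><obj type="HostSystem">host-28</obj></objects>
        <objects><obj type="HostSystem">host-45</obj></objects>
      </returnval>
    </RetrievePropertiesExResponse>
  </soapenv:Body>
</soapenv:Envelope>
"""


def host_system_ids(response_xml):
    """Return the text of every <obj type="HostSystem"> element."""
    root = ET.fromstring(response_xml.strip())
    return [
        obj.text
        for obj in root.iterfind(".//vim:obj", VIM_NS)
        # Defensive filter in case other object types show up in the reply.
        if obj.get("type") == "HostSystem"
    ]


if __name__ == "__main__":
    for host_id in host_system_ids(SAMPLE_RESPONSE):
        print(host_id)
```

Each returned ID would be a candidate for the HOSTSYSTEM placeholder mentioned in the script notes, assuming the script expects a managed object ID there.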
Comment by Tobias Wigand [ 2015 Sep 23 ] |
Many thanks for your help. I was not able to adapt the script you attached, but I remembered our old VMWare API install on an ancient Nagios host and gave that a shot.
# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host1 -f credentials-file -l runtime
CHECK_VMWARE_API.PL OK - 2/2 VMs up, overall status=green, connection state=connected, maintenance=no, 1 health issue(s), no config issues | vmcount=2units;; health_issues=1;; config_issues=0;;
# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host2 -f credentials-file -l runtime
CHECK_VMWARE_API.PL OK - 1/1 VMs up, overall status=green, connection state=connected, maintenance=no, All 138 health checks are Green, no config issues | vmcount=1units;; health_issues=0;; config_issues=0;;
# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host1 -f credentials-file -l runtime -s health
CHECK_VMWARE_API.PL OK - 1 health issue(s) found in 138 checks: 1) UNKNOWN[system] Status of VMware Rollup Health State: Über den aktuellen Zustand des Elements kann nicht berichtet werden | Alerts=1;;
# /usr/lib/nagios/plugins/check_vmware_api.pl -D VCenter -H Host2 -f credentials-file -l runtime -s health
CHECK_VMWARE_API.PL OK - All 138 health checks are GREEN: fan (1x); system (1x); CPU (2x); Processors (6x); Software Components (108x); Memory (1x); Storage (5x); power (1x); Management Subsystem Health (3x); temperature (10x); | Alerts=0;;
Related VCenter Bug: |
Comment by Oleksii Zagorskyi [ 2015 Sep 24 ] |
In the output of the 3rd command we see: "1 health issue(s) found in 138 checks". Google translation from German to English: For the Unknown state (label) on an English vSphere I can see this description (summary): I'm not sure whether we need to look into how the Nagios plugin estimates the "overall status". |
Comment by Oleksii Zagorskyi [ 2015 Oct 16 ] |
The discussion looks finished. |
Comment by Thomas Lohmüller [ 2016 Jan 27 ] |
We just upgraded our Zabbix from 2.2.3 to the current 2.4.7 and had the same issue. 36 of our 55 hypervisor hosts changed from "green" to "grey". All of them showed "green" in VCenter. So we started to dig deeper and used this tool to inspect data from the API:
First we let it list all the sensors from one specific host which was reported as "grey":
$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health --listsensors
WARNING: [Unknown] [Type: system] [Name: VMware Rollup Health State] [Label: Unbekannt] [Summary: Über den aktuellen Zustand des Elements kann nicht berichtet werden]
[Ok] [Type: System] [Name: System Board 0 SUPER_CAP_FLT - Predictive failure deasserted] [Label: Grün] [Summary: Sensor wird unter normalen Bedingungen betrieben]
[Ok] [Type: Platform Alert] [Name: System Board 0 POWER_ON_FAIL - Predictive failure deasserted] [Label: Grün] [Summary: Sensor wird unter normalen Bedingungen betrieben]
[Ok] [Type: CPU] [Name: CPU1] [Label: ] [Summary: Physisches Element funktioniert wie erwartet]
... lots of [OK] lines ...
The troublemaker is the first line. As one of the 117 checks is "Unknown" (aka "grey"), Zabbix (and also this check_vmware_esx.pl script) reports the host as grey.
$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health
WARNING: 1 health issue(s) found in 117 checks: 1) [Unknown] [Type: system] [Name: VMware Rollup Health State] [Label: Unbekannt] [Summary: Über den aktuellen Zustand des Elements kann nicht berichtet werden]
This page from the VMware KnowledgeBase describes exactly this problem: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1037330
So we issued the following PowerShell command:
(Get-View (Get-VMHost -Name host.fqdn | Get-View).ConfigManager.HealthStatusSystem).RefreshHealthStatusSystem()
Then we ran the same check_vmware_esx.pl command as before. This time, the result is correct:
$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health
OK: All 117 health checks are GREEN: System (1x), system (1x), Platform Alert (1x), CPU (2x), voltage (31x), Processors (18x), Memory (1x), other (17x), Storage (14x), power (11x), temperature (20x)
Zabbix now also correctly reports this hypervisor host as "green". So it looks like this is a caching problem on the VCenter itself. We have this issue on VCenter 5.5 U3 and also on VCenter 6.0. VMware labels it as a feature, not a bug, so I don't think they will resolve this "issue". Is there a chance to implement this refresh (as in the PowerShell command above) in Zabbix? |
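To check many hosts for this condition without reading the sensor list by eye, the `--listsensors` output could be parsed with a short script. The following is only a sketch: it assumes the line layout is exactly as pasted in this comment (`[State] [Type: ...] [Name: ...] ...`), and the names `parse_sensors` and `summarize` are made up here; they are not part of check_vmware_esx.pl or Zabbix.

```python
import re

# Matches "[State] [Type: ...] [Name: ...]" anywhere in a line, so a leading
# "WARNING: " prefix (as in the first pasted line) is tolerated.
SENSOR_LINE = re.compile(
    r"\[(?P<state>[^\]]+)\]\s+\[Type: (?P<type>[^\]]*)\]\s+\[Name: (?P<name>[^\]]*)\]"
)

# Three lines copied from the output above, used as sample input.
SAMPLE_OUTPUT = """\
WARNING: [Unknown] [Type: system] [Name: VMware Rollup Health State] [Label: Unbekannt] [Summary: Über den aktuellen Zustand des Elements kann nicht berichtet werden]
[Ok] [Type: System] [Name: System Board 0 SUPER_CAP_FLT - Predictive failure deasserted] [Label: Grün] [Summary: Sensor wird unter normalen Bedingungen betrieben]
[Ok] [Type: CPU] [Name: CPU1] [Label: ] [Summary: Physisches Element funktioniert wie erwartet]
"""


def parse_sensors(output):
    """Return one dict (state, type, name) per recognised sensor line."""
    sensors = []
    for line in output.splitlines():
        match = SENSOR_LINE.search(line)
        if match:
            sensors.append(match.groupdict())
    return sensors


def summarize(sensors):
    """Count every sensor whose state is not "Ok" as a health issue."""
    issues = [s for s in sensors if s["state"].lower() != "ok"]
    return f"{len(issues)} health issue(s) found in {len(sensors)} checks"


if __name__ == "__main__":
    parsed = parse_sensors(SAMPLE_OUTPUT)
    print(summarize(parsed))
    for sensor in parsed:
        if sensor["state"].lower() != "ok":
            print(f'{sensor["state"]}: {sensor["name"]}')
```

Treating every state other than "Ok" as an issue is an assumption that mirrors the "1 health issue(s) found in 117 checks" summary; the real plugin may classify states differently.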
Comment by Andris Zeila [ 2016 Jan 27 ] |
Thanks for investigating this issue! We have also found that on some systems the VMware Rollup Health State sensor has an unknown state with the message "Cannot report on the current health state of the element". So Zabbix reports the hypervisor state as gray, while in the vSphere client the host is shown as green. We asked VMware support what the correct way to handle this situation would be; let's see what they answer. Regarding implementing sensor refreshing in Zabbix: most probably it could be done, but there is the question of how taxing this operation is on vCenters. |
Comment by richlv [ 2016 Jan 27 ] |
thanks for digging into this. adding something like that to zabbix sounds a bit risky, but we could surely document it. |
Comment by Thomas Lohmüller [ 2016 Jan 27 ] |
Some more strange issues... One of our hosts did not respond to the above PowerShell code. It was still reported as "grey". Listing all the sensors using...
$ ./check_vmware_esx.pl -f authfile -D host.fqdn -H vcenter.fqdn -S runtime -s health --listsensors
... revealed that the line labeled "VMware Rollup Health State" was completely missing on this specific host. We had to remove the host from the VCenter and re-add it. Now the "VMware Rollup Health State" is back again. This API interface on the VCenter feels quite unreliable. And as we now know, there is (wrong) cached data reported using this API, and this cached value is what Zabbix "sees". So we also don't know whether it will report alerts reliably if the overall state of a host changes. It may still report the old, cached value ("green"). |
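The two failure modes seen so far (rollup sensor present but "Unknown", and rollup sensor missing altogether) can be told apart with a small helper. This is a sketch under assumptions, not Zabbix's actual logic: the input is a list of `(state, name)` pairs already extracted from the sensor list, the function name `rollup_status` is hypothetical, and the state-to-colour mapping only covers what was observed in this thread ("Ok" as green, "Unknown" as gray).

```python
ROLLUP_SENSOR = "VMware Rollup Health State"

# Assumed mapping, based only on the states observed in this thread.
STATE_TO_COLOUR = {"ok": "green", "unknown": "gray"}


def rollup_status(sensors):
    """Classify a host by its rollup sensor.

    sensors: iterable of (state, name) pairs.
    Returns "green", "gray", the lowercased raw state for anything else,
    or "missing" if the rollup sensor is not in the list at all.
    """
    for state, name in sensors:
        if name == ROLLUP_SENSOR:
            key = state.lower()
            return STATE_TO_COLOUR.get(key, key)
    return "missing"


if __name__ == "__main__":
    stale_cache = [("Unknown", ROLLUP_SENSOR), ("Ok", "CPU1")]
    no_rollup = [("Ok", "CPU1")]
    print(rollup_status(stale_cache))  # candidate for RefreshHealthStatusSystem()
    print(rollup_status(no_rollup))    # the case where we had to re-add the host
```

A "gray" result would point at the cached state described above, while "missing" matches the host that had to be removed from the VCenter and re-added.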