[ZBX-25128] Zabbix proxy no longer sends data due to VMware monitoring Created: 2024 Aug 28  Updated: 2025 Jan 12

Status: Open
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P)
Affects Version/s: 6.4.18
Fix Version/s: None

Type: Problem report Priority: Trivial
Reporter: Marcel Renner Assignee: Tomass Janis Bross
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SUSE Linux Enterprise Server 15 SP5
Linux 5.14.21-150500.55.73-default #1 SMP PREEMPT_DYNAMIC Tue Aug 6 15:51:33 UTC 2024 (a0ede6a) x86_64 x86_64 x86_64 GNU/Linux
VMware vCenter 7.0.3


Attachments: Zip Archive zabbix_proxy_vmware_debug.log.zip    
Issue Links:
Duplicate
duplicates ZBX-25865 vmware.hv.network.linkspeed metric fa... Closed

 Description   

We have a vCenter with approx. 50 ESXi hosts and 800 VMs. Under normal circumstances, monitoring works perfectly. Due to the size of the vCenter, we run a dedicated Zabbix proxy just for VMware monitoring.

Recently the vCenter was updated to 7.0.3, and the resulting vCenter reboot interrupted monitoring via the Zabbix proxy. However, the Zabbix proxy did not come back with proxy-internal and VMware metric data afterwards. I restarted the Zabbix proxy service, but it only worked for approx. 60 seconds.

It is not only VMware monitoring that is affected. To be precise, the Zabbix proxy is also no longer able to send internal data (such as Zabbix proxy statistics), while Zabbix agent metrics keep working fine. As a result I am not able to troubleshoot the problem, because all metrics of the Zabbix proxy processes are missing (examples of the item keys I mean are listed below). When I restart the Zabbix proxy, I get a few values for about 60 seconds; then the proxy goes silent again. If I set StartVMwareCollectors to 0, the proxy works perfectly again and no longer hangs. If I set it back to 2, it works for 60 seconds and then goes quiet again.
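
For context, by "metrics of the Zabbix proxy processes" I mean the usual internal checks from the proxy health template, roughly these keys (listed from memory, so treat them as examples rather than the exact item list):

zabbix[process,"vmware collector",avg,busy]
zabbix[process,poller,avg,busy]
zabbix[vmware,buffer,pused]
zabbix[rcache,buffer,pused]
zabbix[wcache,history,pused]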

We have seen this effect before. Back then I solved it by deleting all hosts discovered by VMware monitoring so that Zabbix could discover the VMware environment from scratch. Once all hosts are discovered again, the Zabbix proxy has no problems or performance bottlenecks monitoring the vCenter; at least none that I can see in the Zabbix proxy template or in zabbix_proxy.log.

My theory is that it is simply due to the number of hosts and items: after a restart, some Zabbix process gets overloaded in a way I cannot see, and then the proxy goes silent.

The current configuration is nothing special either. For testing I also increased StartPollers, StartVMwareCollectors and all other values that are > 0, but nothing changed; same effect.

CacheSize=64M
HistoryCacheSize=16M
HistoryIndexCacheSize=4M
StartDiscoverers=0
StartHTTPPollers=1
StartIPMIPollers=1
StartJavaPollers=0
StartODBCPollers=0
StartPingers=1
StartPollers=1
StartPollersUnreachable=1
StartPreprocessors=1
StartSNMPTrapper=0
StartTrappers=1
StartVMwareCollectors=2
VMwareCacheSize=48M
DBHost=
DBName=xxx
DBUser=xxx
EnableRemoteCommands=1
Hostname=XX-XXX-VCSA
Server=server.example.com
ProxyConfigFrequency=10
ProxyOfflineBuffer=168
TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=proxy.example.com
TLSPSKFile=/etc/zabbix/zabbix_proxy.psk 

As mentioned, if I delete everything and let it be rediscovered, it works as well. But after a restart, the Zabbix proxy seems to overload itself completely.

I may be wrong, but I did not see any long waiting times in the profiler either.

 13591:20240828:110746.366 === Profiling statistics for configuration syncer ===
lock_log() mutex : locked:244 holding:0.068031 sec waiting:0.000095 sec
DCreset_interfaces_availability() rwlock : locked:108 holding:0.013740 sec waiting:0.000012 sec
DCsync_configuration() rwlock : locked:1080 holding:0.004209 sec waiting:0.000471 sec
sm_sync_lock() mutex : locked:108 holding:0.000045 sec waiting:0.000069 sec
rwlocks : locked:1188 holding:0.017949 sec waiting:0.000483 sec
mutexes : locked:352 holding:0.068075 sec waiting:0.000164 sec
locking total : locked:1540 holding:0.086024 sec waiting:0.000647 sec
 13623:20240828:110747.564 === Profiling statistics for availability manager ===
lock_log() mutex : locked:1125 holding:0.038500 sec waiting:0.000303 sec
sm_sync_lock() mutex : locked:1090 holding:0.000350 sec waiting:0.000772 sec
rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
mutexes : locked:2215 holding:0.038850 sec waiting:0.001075 sec
locking total : locked:2215 holding:0.038850 sec waiting:0.001075 sec
 13596:20240828:110747.687 === Profiling statistics for data sender ===
lock_log() mutex : locked:1240 holding:0.039787 sec waiting:0.001019 sec
DCget_interfaces_availability() rwlock : locked:1199 holding:0.067390 sec waiting:0.000721 sec
DCconfig_get_items_by_itemids() rwlock : locked:198 holding:0.002250 sec waiting:0.000087 sec
reset_proxy_history_count() mutex : locked:198 holding:0.000119 sec waiting:0.000129 sec
sm_sync_lock() mutex : locked:1201 holding:0.000366 sec waiting:0.001008 sec
rwlocks : locked:1397 holding:0.069640 sec waiting:0.000808 sec
mutexes : locked:2639 holding:0.040271 sec waiting:0.002155 sec
locking total : locked:4036 holding:0.109912 sec waiting:0.002964 sec
 13594:20240828:110737.413 === Profiling statistics for trapper ===
lock_log() mutex : locked:330 holding:0.024679 sec waiting:0.000208 sec
zbx_dc_items_update_nextcheck() rwlock : locked:145 holding:0.000558 sec waiting:0.000008 sec
zbx_dc_config_history_recv_get_items_by_itemids() rwlock : locked:145 holding:0.001134 sec waiting:0.000032 sec
zbx_dc_get_or_create_session() rwlock : locked:326 holding:0.000294 sec waiting:0.000034 sec
DCget_expressions_by_names() rwlock : locked:181 holding:0.000017 sec waiting:0.000012 sec
DCconfig_get_hostid_by_name() rwlock : locked:163 holding:0.000239 sec waiting:0.000052 sec
DCget_psk_by_identity() rwlock : locked:345 holding:0.000975 sec waiting:0.000217 sec
DCconfig_update_autoreg_host() rwlock : locked:9 holding:0.000012 sec waiting:0.000003 sec
DCis_autoreg_host_changed() rwlock : locked:181 holding:0.000305 sec waiting:0.000039 sec
DCcheck_host_permissions() rwlock : locked:181 holding:0.001283 sec waiting:0.000077 sec
DCget_host_by_hostid() rwlock : locked:18 holding:0.000057 sec waiting:0.000001 sec
sm_sync_lock() mutex : locked:315 holding:0.000072 sec waiting:0.000064 sec
rwlocks : locked:1694 holding:0.004875 sec waiting:0.000474 sec
mutexes : locked:645 holding:0.024750 sec waiting:0.000272 sec
locking total : locked:2339 holding:0.029625 sec waiting:0.000746 sec
 13608:20240828:110740.548 === Profiling statistics for task manager ===
lock_log() mutex : locked:254 holding:0.023885 sec waiting:0.000049 sec
sm_sync_lock() mutex : locked:218 holding:0.000087 sec waiting:0.000147 sec
rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
mutexes : locked:472 holding:0.023973 sec waiting:0.000196 sec
locking total : locked:472 holding:0.023973 sec waiting:0.000196 sec
 13610:20240828:110728.434 === Profiling statistics for unreachable poller ===
lock_log() mutex : locked:251 holding:0.020669 sec waiting:0.000066 sec
DCconfig_get_poller_items() rwlock : locked:216 holding:0.000175 sec waiting:0.002376 sec
DCconfig_get_poller_nextcheck() rwlock : locked:216 holding:0.000072 sec waiting:0.000057 sec
sm_sync_lock() mutex : locked:216 holding:0.000081 sec waiting:0.000149 sec
rwlocks : locked:432 holding:0.000247 sec waiting:0.002432 sec
mutexes : locked:467 holding:0.020749 sec waiting:0.000215 sec
locking total : locked:899 holding:0.020996 sec waiting:0.002647 sec
 13597:20240828:110728.947 === Profiling statistics for ipmi manager ===
lock_log() mutex : locked:1115 holding:0.032684 sec waiting:0.001707 sec
DCconfig_get_ipmi_poller_items() rwlock : locked:1080 holding:0.001201 sec waiting:0.004095 sec
sm_sync_lock() mutex : locked:1080 holding:0.000370 sec waiting:0.000806 sec
rwlocks : locked:1080 holding:0.001201 sec waiting:0.004095 sec
mutexes : locked:2195 holding:0.033053 sec waiting:0.002513 sec
locking total : locked:3275 holding:0.034254 sec waiting:0.006608 sec
 13599:20240828:110731.306 === Profiling statistics for http poller ===
lock_log() mutex : locked:251 holding:0.017954 sec waiting:0.000074 sec
zbx_dc_httptest_next() rwlock : locked:216 holding:0.000180 sec waiting:0.000159 sec
sm_sync_lock() mutex : locked:216 holding:0.000086 sec waiting:0.000220 sec
rwlocks : locked:216 holding:0.000180 sec waiting:0.000159 sec
mutexes : locked:467 holding:0.018040 sec waiting:0.000294 sec
locking total : locked:683 holding:0.018220 sec waiting:0.000453 sec
 13591:20240828:110736.291 received configuration data from server at "zbx-srv-prod.solco.global.nttdata.com", datalen 11690
 13601:20240828:110727.433 === Profiling statistics for history syncer ===
dbsyncer_thread() processing : busy:0.115313 sec
lock_log() mutex : locked:35 holding:0.012651 sec waiting:0.002466 sec
zbx_dc_config_history_sync_get_items_by_itemids() rwlock : locked:44 holding:0.000607 sec waiting:0.000023 sec
DCconfig_items_apply_changes() rwlock : locked:44 holding:0.000084 sec waiting:0.000035 sec
change_proxy_history_count() mutex : locked:44 holding:0.000010 sec waiting:0.000009 sec
sync_proxy_history() mutex : locked:1124 holding:0.001068 sec waiting:0.000414 sec
sm_sync_lock() mutex : locked:1080 holding:0.000253 sec waiting:0.000930 sec
rwlocks : locked:88 holding:0.000691 sec waiting:0.000058 sec
mutexes : locked:2283 holding:0.013982 sec waiting:0.003818 sec
locking total : locked:2371 holding:0.014674 sec waiting:0.003876 sec
 13605:20240828:110727.343 === Profiling statistics for self-monitoring ===
lock_log() mutex : locked:1115 holding:0.030636 sec waiting:0.000306 sec
sm_sync_lock() mutex : locked:2160 holding:0.002041 sec waiting:0.001215 sec
rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
mutexes : locked:3275 holding:0.032677 sec waiting:0.001521 sec
locking total : locked:3275 holding:0.032677 sec waiting:0.001521 sec
 13612:20240828:110726.557 === Profiling statistics for icmp pinger ===
lock_log() mutex : locked:247 holding:0.018005 sec waiting:0.000052 sec
DCrequeue_items() rwlock : locked:108 holding:0.000910 sec waiting:0.000020 sec
DCconfig_get_poller_items() rwlock : locked:215 holding:0.000967 sec waiting:0.000151 sec
DCconfig_get_poller_nextcheck() rwlock : locked:215 holding:0.000058 sec waiting:0.000055 sec
DCconfig_get_items_by_itemids() rwlock : locked:108 holding:0.000432 sec waiting:0.000042 sec
sm_sync_lock() mutex : locked:251 holding:0.000092 sec waiting:0.000244 sec
rwlocks : locked:646 holding:0.002367 sec waiting:0.000269 sec
mutexes : locked:498 holding:0.018097 sec waiting:0.000296 sec
locking total : locked:1144 holding:0.020464 sec waiting:0.000566 sec 

I also enabled debug logging for the VMware collector once (see zabbix_proxy_vmware_debug.log.zip). I did not see anything suspicious there: no timeouts, everything ends in SUCCEED. What was strange is that when the metrics stop after 60 seconds, the debug log goes silent as well; no more messages from the VMware collector.
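
For reference, the debug level can also be raised only for the VMware collector processes at runtime instead of setting DebugLevel globally; something along these lines (a sketch, assuming the default config file path):

# raise (and later lower again) the log level of the VMware collector processes at runtime
zabbix_proxy -c /etc/zabbix/zabbix_proxy.conf -R log_level_increase="vmware collector"
zabbix_proxy -c /etc/zabbix/zabbix_proxy.conf -R log_level_decrease="vmware collector"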



 Comments   
Comment by Tomass Janis Bross [ 2024 Dec 11 ]

Hello Marcel,

If you are monitoring 50 ESXi hosts and 800 VMs, having only 2 VMware collectors started means that the proxy is heavily under-configured. The recommended formula is as follows: amount of services < StartVMwareCollectors < (amount of services * 2).
Can you please try configuring more VMware collectors, and then let us know whether this continues to happen.
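
For illustration of the formula only (hypothetical numbers, not a sizing recommendation for this environment): if the proxy monitored 5 VMware services, the formula would give 5 < StartVMwareCollectors < 10, so for example:

# hypothetical example derived from the formula above, not a recommendation for this setup
StartVMwareCollectors=8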

It seems like the VMware collector crashes, or something else odd happens, since logs are no longer generated for that process. Behaviour like this is not graceful in any way, and there should be some information in the logs if the process crashes.
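
Could you also check on the proxy host whether there is any crash trace or core dump? A sketch of what I mean (paths assume the standard package locations and that systemd-coredump is in use):

# look for crash messages in the proxy log and for captured core dumps
grep -iE "crashed|sigsegv|segmentation" /var/log/zabbix/zabbix_proxy.log
coredumpctl list /usr/sbin/zabbix_proxy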

Comment by Alexander Vladishev [ 2025 Jan 12 ]

Based on the symptoms, this closely resembles the issue described in ticket ZBX-25865. Most likely, it’s not the VMware collectors that hang but the pollers after querying the vmware.hv.network.linkspeed metric. This happens when the metric becomes unsupported - for example, if an invalid interface name is provided in the second parameter.
