- Problem report
- Resolution: Unresolved
- Trivial
- None
- 6.4.18
- None
- Environment:
  SUSE Linux Enterprise Server 15 SP5
  Linux 5.14.21-150500.55.73-default #1 SMP PREEMPT_DYNAMIC Tue Aug 6 15:51:33 UTC 2024 (a0ede6a) x86_64 x86_64 x86_64 GNU/Linux
  VMware vCenter 7.0.3
We have a vCenter with approx. 50 ESXi hosts and 800 VMs. Under normal circumstances, monitoring works perfectly. Due to the size of the vCenter, we run a dedicated Zabbix proxy just for VMware monitoring.
Recently the vCenter was updated to 7.0.3, and the reboot of the vCenter interrupted monitoring via the Zabbix proxy. Afterwards the proxy never resumed delivering proxy-internal and VMware metric data. I restarted the Zabbix proxy service, but it only worked for approx. 60 seconds.
It is not only VMware monitoring that is affected. To be precise, the Zabbix proxy also stops sending internal data (such as Zabbix proxy statistics), while Zabbix agent metrics keep working fine. As a result I am not able to troubleshoot the problem: all metrics of the Zabbix proxy processes are missing. When I restart the Zabbix proxy, I get a few values for about 60 seconds; then the proxy goes silent again. If I set StartVMwareCollectors to 0, the proxy works perfectly again and no longer hangs. If I set it back to 2, it works for 60 seconds and then goes quiet again.
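For reference, while the proxy is in the silent state, some diagnostics can be pulled at runtime without a restart. A minimal sketch, assuming the standard runtime control options of a 6.4 proxy (check zabbix_proxy --help for the exact target names):

# raise the log level of only the VMware collector processes
zabbix_proxy -R log_level_increase="vmware collector"

# dump internal diagnostic information (history cache, locks, ...) to the proxy log
zabbix_proxy -R diaginfo

# revert the log level afterwards
zabbix_proxy -R log_level_decrease="vmware collector"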
We have had this effect before. Back then I solved it by deleting all discovered hosts from VMware monitoring so that Zabbix could perform a fresh discovery of the vCenter. Once all hosts are discovered again, the Zabbix proxy has no problems or performance bottlenecks monitoring the vCenter; at least nothing I can see in the Zabbix proxy template or in zabbix_proxy.log.
My theory is that it is simply due to the number of hosts and items: after a restart, some Zabbix process overloads in a way I cannot see, and then everything goes silent.
The current configuration is nothing special either. For testing I also increased StartPollers, StartVMwareCollectors and all other values that are > 0, but nothing changed; same effect.
CacheSize=64M
HistoryCacheSize=16M
HistoryIndexCacheSize=4M
StartDiscoverers=0
StartHTTPPollers=1
StartIPMIPollers=1
StartJavaPollers=0
StartODBCPollers=0
StartPingers=1
StartPollers=1
StartPollersUnreachable=1
StartPreprocessors=1
StartSNMPTrapper=0
StartTrappers=1
StartVMwareCollectors=2
VMwareCacheSize=48M
DBHost=
DBName=xxx
DBUser=xxx
EnableRemoteCommands=1
Hostname=XX-XXX-VCSA
Server=server.example.com
ProxyConfigFrequency=10
ProxyOfflineBuffer=168
TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=proxy.example.com
TLSPSKFile=/etc/zabbix/zabbix_proxy.psk
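Maybe relevant: the silence starts roughly one minute after a restart, which matches the default VMware collection cadence. The VMware-related parameters below are not set in the config above, so they run at what I believe are the 6.4 defaults (worth double-checking against the documentation):

# Seconds between collecting VMware service data (default)
VMwareFrequency=60
# Seconds between collecting VMware performance counter data (default)
VMwarePerfFrequency=60
# Seconds to wait for a response from the VMware service (default)
VMwareTimeout=10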
As mentioned, if I delete everything and let it be rediscovered, it works again. But on a restart with the existing configuration, the Zabbix proxy seems to overload itself completely.
I don't know, but I did not see any long waiting times in the profiler output below:
13591:20240828:110746.366 === Profiling statistics for configuration syncer ===
  lock_log() mutex : locked:244 holding:0.068031 sec waiting:0.000095 sec
  DCreset_interfaces_availability() rwlock : locked:108 holding:0.013740 sec waiting:0.000012 sec
  DCsync_configuration() rwlock : locked:1080 holding:0.004209 sec waiting:0.000471 sec
  sm_sync_lock() mutex : locked:108 holding:0.000045 sec waiting:0.000069 sec
  rwlocks : locked:1188 holding:0.017949 sec waiting:0.000483 sec
  mutexes : locked:352 holding:0.068075 sec waiting:0.000164 sec
  locking total : locked:1540 holding:0.086024 sec waiting:0.000647 sec
13623:20240828:110747.564 === Profiling statistics for availability manager ===
  lock_log() mutex : locked:1125 holding:0.038500 sec waiting:0.000303 sec
  sm_sync_lock() mutex : locked:1090 holding:0.000350 sec waiting:0.000772 sec
  rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
  mutexes : locked:2215 holding:0.038850 sec waiting:0.001075 sec
  locking total : locked:2215 holding:0.038850 sec waiting:0.001075 sec
13596:20240828:110747.687 === Profiling statistics for data sender ===
  lock_log() mutex : locked:1240 holding:0.039787 sec waiting:0.001019 sec
  DCget_interfaces_availability() rwlock : locked:1199 holding:0.067390 sec waiting:0.000721 sec
  DCconfig_get_items_by_itemids() rwlock : locked:198 holding:0.002250 sec waiting:0.000087 sec
  reset_proxy_history_count() mutex : locked:198 holding:0.000119 sec waiting:0.000129 sec
  sm_sync_lock() mutex : locked:1201 holding:0.000366 sec waiting:0.001008 sec
  rwlocks : locked:1397 holding:0.069640 sec waiting:0.000808 sec
  mutexes : locked:2639 holding:0.040271 sec waiting:0.002155 sec
  locking total : locked:4036 holding:0.109912 sec waiting:0.002964 sec
13594:20240828:110737.413 === Profiling statistics for trapper ===
  lock_log() mutex : locked:330 holding:0.024679 sec waiting:0.000208 sec
  zbx_dc_items_update_nextcheck() rwlock : locked:145 holding:0.000558 sec waiting:0.000008 sec
  zbx_dc_config_history_recv_get_items_by_itemids() rwlock : locked:145 holding:0.001134 sec waiting:0.000032 sec
  zbx_dc_get_or_create_session() rwlock : locked:326 holding:0.000294 sec waiting:0.000034 sec
  DCget_expressions_by_names() rwlock : locked:181 holding:0.000017 sec waiting:0.000012 sec
  DCconfig_get_hostid_by_name() rwlock : locked:163 holding:0.000239 sec waiting:0.000052 sec
  DCget_psk_by_identity() rwlock : locked:345 holding:0.000975 sec waiting:0.000217 sec
  DCconfig_update_autoreg_host() rwlock : locked:9 holding:0.000012 sec waiting:0.000003 sec
  DCis_autoreg_host_changed() rwlock : locked:181 holding:0.000305 sec waiting:0.000039 sec
  DCcheck_host_permissions() rwlock : locked:181 holding:0.001283 sec waiting:0.000077 sec
  DCget_host_by_hostid() rwlock : locked:18 holding:0.000057 sec waiting:0.000001 sec
  sm_sync_lock() mutex : locked:315 holding:0.000072 sec waiting:0.000064 sec
  rwlocks : locked:1694 holding:0.004875 sec waiting:0.000474 sec
  mutexes : locked:645 holding:0.024750 sec waiting:0.000272 sec
  locking total : locked:2339 holding:0.029625 sec waiting:0.000746 sec
13608:20240828:110740.548 === Profiling statistics for task manager ===
  lock_log() mutex : locked:254 holding:0.023885 sec waiting:0.000049 sec
  sm_sync_lock() mutex : locked:218 holding:0.000087 sec waiting:0.000147 sec
  rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
  mutexes : locked:472 holding:0.023973 sec waiting:0.000196 sec
  locking total : locked:472 holding:0.023973 sec waiting:0.000196 sec
13610:20240828:110728.434 === Profiling statistics for unreachable poller ===
  lock_log() mutex : locked:251 holding:0.020669 sec waiting:0.000066 sec
  DCconfig_get_poller_items() rwlock : locked:216 holding:0.000175 sec waiting:0.002376 sec
  DCconfig_get_poller_nextcheck() rwlock : locked:216 holding:0.000072 sec waiting:0.000057 sec
  sm_sync_lock() mutex : locked:216 holding:0.000081 sec waiting:0.000149 sec
  rwlocks : locked:432 holding:0.000247 sec waiting:0.002432 sec
  mutexes : locked:467 holding:0.020749 sec waiting:0.000215 sec
  locking total : locked:899 holding:0.020996 sec waiting:0.002647 sec
13597:20240828:110728.947 === Profiling statistics for ipmi manager ===
  lock_log() mutex : locked:1115 holding:0.032684 sec waiting:0.001707 sec
  DCconfig_get_ipmi_poller_items() rwlock : locked:1080 holding:0.001201 sec waiting:0.004095 sec
  sm_sync_lock() mutex : locked:1080 holding:0.000370 sec waiting:0.000806 sec
  rwlocks : locked:1080 holding:0.001201 sec waiting:0.004095 sec
  mutexes : locked:2195 holding:0.033053 sec waiting:0.002513 sec
  locking total : locked:3275 holding:0.034254 sec waiting:0.006608 sec
13599:20240828:110731.306 === Profiling statistics for http poller ===
  lock_log() mutex : locked:251 holding:0.017954 sec waiting:0.000074 sec
  zbx_dc_httptest_next() rwlock : locked:216 holding:0.000180 sec waiting:0.000159 sec
  sm_sync_lock() mutex : locked:216 holding:0.000086 sec waiting:0.000220 sec
  rwlocks : locked:216 holding:0.000180 sec waiting:0.000159 sec
  mutexes : locked:467 holding:0.018040 sec waiting:0.000294 sec
  locking total : locked:683 holding:0.018220 sec waiting:0.000453 sec
13591:20240828:110736.291 received configuration data from server at "zbx-srv-prod.solco.global.nttdata.com", datalen 11690
13601:20240828:110727.433 === Profiling statistics for history syncer ===
  dbsyncer_thread() processing : busy:0.115313 sec
  lock_log() mutex : locked:35 holding:0.012651 sec waiting:0.002466 sec
  zbx_dc_config_history_sync_get_items_by_itemids() rwlock : locked:44 holding:0.000607 sec waiting:0.000023 sec
  DCconfig_items_apply_changes() rwlock : locked:44 holding:0.000084 sec waiting:0.000035 sec
  change_proxy_history_count() mutex : locked:44 holding:0.000010 sec waiting:0.000009 sec
  sync_proxy_history() mutex : locked:1124 holding:0.001068 sec waiting:0.000414 sec
  sm_sync_lock() mutex : locked:1080 holding:0.000253 sec waiting:0.000930 sec
  rwlocks : locked:88 holding:0.000691 sec waiting:0.000058 sec
  mutexes : locked:2283 holding:0.013982 sec waiting:0.003818 sec
  locking total : locked:2371 holding:0.014674 sec waiting:0.003876 sec
13605:20240828:110727.343 === Profiling statistics for self-monitoring ===
  lock_log() mutex : locked:1115 holding:0.030636 sec waiting:0.000306 sec
  sm_sync_lock() mutex : locked:2160 holding:0.002041 sec waiting:0.001215 sec
  rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
  mutexes : locked:3275 holding:0.032677 sec waiting:0.001521 sec
  locking total : locked:3275 holding:0.032677 sec waiting:0.001521 sec
13612:20240828:110726.557 === Profiling statistics for icmp pinger ===
  lock_log() mutex : locked:247 holding:0.018005 sec waiting:0.000052 sec
  DCrequeue_items() rwlock : locked:108 holding:0.000910 sec waiting:0.000020 sec
  DCconfig_get_poller_items() rwlock : locked:215 holding:0.000967 sec waiting:0.000151 sec
  DCconfig_get_poller_nextcheck() rwlock : locked:215 holding:0.000058 sec waiting:0.000055 sec
  DCconfig_get_items_by_itemids() rwlock : locked:108 holding:0.000432 sec waiting:0.000042 sec
  sm_sync_lock() mutex : locked:251 holding:0.000092 sec waiting:0.000244 sec
  rwlocks : locked:646 holding:0.002367 sec waiting:0.000269 sec
  mutexes : locked:498 holding:0.018097 sec waiting:0.000296 sec
  locking total : locked:1144 holding:0.020464 sec waiting:0.000566 sec
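For completeness, profiling output like the above can be toggled at runtime; a sketch, assuming the profiler runtime controls available since Zabbix 6.0:

# enable rwlock/mutex profiling; statistics are written to the proxy log
zabbix_proxy -R prof_enable

# disable it again after enough samples have been collected
zabbix_proxy -R prof_disable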
I once also enabled debug logging for the VMware collector (see zabbix_proxy_vmware_debug.log.zip) but did not see anything there: no timeouts, everything SUCCEED. What was strange: once the metrics stop after 60 seconds, the debug log goes silent as well. No more messages from the VMware collector.
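Since the collectors go silent without logging anything, the next step I can think of is capturing a backtrace of the hung processes. A sketch, assuming gdb and the matching zabbix debuginfo packages are installed (the process-title match relies on how Zabbix names its worker processes; verify with ps first):

# list the VMware collector PIDs via their process title
pgrep -af "vmware collector"

# dump all thread backtraces of one collector (replace <pid> with a PID from above)
gdb -p <pid> -batch -ex "thread apply all bt"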