  1. ZABBIX BUGS AND ISSUES
  2. ZBX-25128

Zabbix proxy no longer sends data due to VMware monitoring


    • Type: Problem report
    • Resolution: Unresolved
    • Priority: Trivial
    • None
    • 6.4.18
    • Proxy (P)
    • None
    • SUSE Linux Enterprise Server 15 SP5
      Linux 5.14.21-150500.55.73-default #1 SMP PREEMPT_DYNAMIC Tue Aug 6 15:51:33 UTC 2024 (a0ede6a) x86_64 x86_64 x86_64 GNU/Linux
      VMware vCenter 7.0.3

      We have a vCenter with approx. 50 ESXi hosts and 800 VMs. Under normal circumstances, monitoring works perfectly. Due to the size of the vCenter, we run a dedicated Zabbix proxy just for VMware monitoring.

      The vCenter was recently updated to 7.0.3. The vCenter reboot during the update interrupted monitoring through the Zabbix proxy, and afterwards the proxy did not resume sending its internal metrics or the VMware metric data. I restarted the Zabbix proxy service, but data only flowed for approx. 60 seconds.

      VMware monitoring is not the only thing affected. The Zabbix proxy also no longer sends internal data (such as Zabbix proxy statistics), while Zabbix agent metrics keep working fine. As a result, I cannot troubleshoot the problem, because all metrics of the Zabbix proxy processes are missing. When I restart the Zabbix proxy, I get a few values for about 60 seconds, then the proxy goes silent again. If I set StartVMwareCollectors to 0, the proxy works perfectly and no longer hangs. If I set it back to 2, it works for 60 seconds and then goes quiet again.

      We have seen this behavior before. At that time I solved it by deleting all hosts discovered by VMware monitoring so that Zabbix could rediscover the VMware environment from scratch. Once all hosts have been rediscovered, the Zabbix proxy monitors the vCenter without problems or performance bottlenecks, at least none that I can see in the Zabbix proxy template or in zabbix_proxy.log.

      My theory is that this is simply caused by the number of hosts and items: after a restart, some Zabbix process becomes overloaded in a way I cannot see, and then the proxy goes silent.

      The current configuration is nothing special either. For testing, I also increased StartPollers, StartVMwareCollectors, and all other values that are > 0, but nothing changed; the effect was the same.

      CacheSize=64M
      HistoryCacheSize=16M
      HistoryIndexCacheSize=4M
      StartDiscoverers=0
      StartHTTPPollers=1
      StartIPMIPollers=1
      StartJavaPollers=0
      StartODBCPollers=0
      StartPingers=1
      StartPollers=1
      StartPollersUnreachable=1
      StartPreprocessors=1
      StartSNMPTrapper=0
      StartTrappers=1
      StartVMwareCollectors=2
      VMwareCacheSize=48M
      DBHost=
      DBName=xxx
      DBUser=xxx
      EnableRemoteCommands=1
      Hostname=XX-XXX-VCSA
      Server=server.example.com
      ProxyConfigFrequency=10
      ProxyOfflineBuffer=168
      TLSConnect=psk
      TLSAccept=psk
      TLSPSKIdentity=proxy.example.com
      TLSPSKFile=/etc/zabbix/zabbix_proxy.psk 

      As mentioned, if I delete everything and let it be rediscovered, it works again. But after a restart, the Zabbix proxy seems to overload itself completely.

      I am not sure what to look for, but I did not see any long waiting times in the profiler output:

       13591:20240828:110746.366 === Profiling statistics for configuration syncer ===
      lock_log() mutex : locked:244 holding:0.068031 sec waiting:0.000095 sec
      DCreset_interfaces_availability() rwlock : locked:108 holding:0.013740 sec waiting:0.000012 sec
      DCsync_configuration() rwlock : locked:1080 holding:0.004209 sec waiting:0.000471 sec
      sm_sync_lock() mutex : locked:108 holding:0.000045 sec waiting:0.000069 sec
      rwlocks : locked:1188 holding:0.017949 sec waiting:0.000483 sec
      mutexes : locked:352 holding:0.068075 sec waiting:0.000164 sec
      locking total : locked:1540 holding:0.086024 sec waiting:0.000647 sec
       13623:20240828:110747.564 === Profiling statistics for availability manager ===
      lock_log() mutex : locked:1125 holding:0.038500 sec waiting:0.000303 sec
      sm_sync_lock() mutex : locked:1090 holding:0.000350 sec waiting:0.000772 sec
      rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
      mutexes : locked:2215 holding:0.038850 sec waiting:0.001075 sec
      locking total : locked:2215 holding:0.038850 sec waiting:0.001075 sec
       13596:20240828:110747.687 === Profiling statistics for data sender ===
      lock_log() mutex : locked:1240 holding:0.039787 sec waiting:0.001019 sec
      DCget_interfaces_availability() rwlock : locked:1199 holding:0.067390 sec waiting:0.000721 sec
      DCconfig_get_items_by_itemids() rwlock : locked:198 holding:0.002250 sec waiting:0.000087 sec
      reset_proxy_history_count() mutex : locked:198 holding:0.000119 sec waiting:0.000129 sec
      sm_sync_lock() mutex : locked:1201 holding:0.000366 sec waiting:0.001008 sec
      rwlocks : locked:1397 holding:0.069640 sec waiting:0.000808 sec
      mutexes : locked:2639 holding:0.040271 sec waiting:0.002155 sec
      locking total : locked:4036 holding:0.109912 sec waiting:0.002964 sec
       13594:20240828:110737.413 === Profiling statistics for trapper ===
      lock_log() mutex : locked:330 holding:0.024679 sec waiting:0.000208 sec
      zbx_dc_items_update_nextcheck() rwlock : locked:145 holding:0.000558 sec waiting:0.000008 sec
      zbx_dc_config_history_recv_get_items_by_itemids() rwlock : locked:145 holding:0.001134 sec waiting:0.000032 sec
      zbx_dc_get_or_create_session() rwlock : locked:326 holding:0.000294 sec waiting:0.000034 sec
      DCget_expressions_by_names() rwlock : locked:181 holding:0.000017 sec waiting:0.000012 sec
      DCconfig_get_hostid_by_name() rwlock : locked:163 holding:0.000239 sec waiting:0.000052 sec
      DCget_psk_by_identity() rwlock : locked:345 holding:0.000975 sec waiting:0.000217 sec
      DCconfig_update_autoreg_host() rwlock : locked:9 holding:0.000012 sec waiting:0.000003 sec
      DCis_autoreg_host_changed() rwlock : locked:181 holding:0.000305 sec waiting:0.000039 sec
      DCcheck_host_permissions() rwlock : locked:181 holding:0.001283 sec waiting:0.000077 sec
      DCget_host_by_hostid() rwlock : locked:18 holding:0.000057 sec waiting:0.000001 sec
      sm_sync_lock() mutex : locked:315 holding:0.000072 sec waiting:0.000064 sec
      rwlocks : locked:1694 holding:0.004875 sec waiting:0.000474 sec
      mutexes : locked:645 holding:0.024750 sec waiting:0.000272 sec
      locking total : locked:2339 holding:0.029625 sec waiting:0.000746 sec
       13608:20240828:110740.548 === Profiling statistics for task manager ===
      lock_log() mutex : locked:254 holding:0.023885 sec waiting:0.000049 sec
      sm_sync_lock() mutex : locked:218 holding:0.000087 sec waiting:0.000147 sec
      rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
      mutexes : locked:472 holding:0.023973 sec waiting:0.000196 sec
      locking total : locked:472 holding:0.023973 sec waiting:0.000196 sec
       13610:20240828:110728.434 === Profiling statistics for unreachable poller ===
      lock_log() mutex : locked:251 holding:0.020669 sec waiting:0.000066 sec
      DCconfig_get_poller_items() rwlock : locked:216 holding:0.000175 sec waiting:0.002376 sec
      DCconfig_get_poller_nextcheck() rwlock : locked:216 holding:0.000072 sec waiting:0.000057 sec
      sm_sync_lock() mutex : locked:216 holding:0.000081 sec waiting:0.000149 sec
      rwlocks : locked:432 holding:0.000247 sec waiting:0.002432 sec
      mutexes : locked:467 holding:0.020749 sec waiting:0.000215 sec
      locking total : locked:899 holding:0.020996 sec waiting:0.002647 sec
       13597:20240828:110728.947 === Profiling statistics for ipmi manager ===
      lock_log() mutex : locked:1115 holding:0.032684 sec waiting:0.001707 sec
      DCconfig_get_ipmi_poller_items() rwlock : locked:1080 holding:0.001201 sec waiting:0.004095 sec
      sm_sync_lock() mutex : locked:1080 holding:0.000370 sec waiting:0.000806 sec
      rwlocks : locked:1080 holding:0.001201 sec waiting:0.004095 sec
      mutexes : locked:2195 holding:0.033053 sec waiting:0.002513 sec
      locking total : locked:3275 holding:0.034254 sec waiting:0.006608 sec
       13599:20240828:110731.306 === Profiling statistics for http poller ===
      lock_log() mutex : locked:251 holding:0.017954 sec waiting:0.000074 sec
      zbx_dc_httptest_next() rwlock : locked:216 holding:0.000180 sec waiting:0.000159 sec
      sm_sync_lock() mutex : locked:216 holding:0.000086 sec waiting:0.000220 sec
      rwlocks : locked:216 holding:0.000180 sec waiting:0.000159 sec
      mutexes : locked:467 holding:0.018040 sec waiting:0.000294 sec
      locking total : locked:683 holding:0.018220 sec waiting:0.000453 sec
       13591:20240828:110736.291 received configuration data from server at "zbx-srv-prod.solco.global.nttdata.com", datalen 11690
       13601:20240828:110727.433 === Profiling statistics for history syncer ===
      dbsyncer_thread() processing : busy:0.115313 sec
      lock_log() mutex : locked:35 holding:0.012651 sec waiting:0.002466 sec
      zbx_dc_config_history_sync_get_items_by_itemids() rwlock : locked:44 holding:0.000607 sec waiting:0.000023 sec
      DCconfig_items_apply_changes() rwlock : locked:44 holding:0.000084 sec waiting:0.000035 sec
      change_proxy_history_count() mutex : locked:44 holding:0.000010 sec waiting:0.000009 sec
      sync_proxy_history() mutex : locked:1124 holding:0.001068 sec waiting:0.000414 sec
      sm_sync_lock() mutex : locked:1080 holding:0.000253 sec waiting:0.000930 sec
      rwlocks : locked:88 holding:0.000691 sec waiting:0.000058 sec
      mutexes : locked:2283 holding:0.013982 sec waiting:0.003818 sec
      locking total : locked:2371 holding:0.014674 sec waiting:0.003876 sec
       13605:20240828:110727.343 === Profiling statistics for self-monitoring ===
      lock_log() mutex : locked:1115 holding:0.030636 sec waiting:0.000306 sec
      sm_sync_lock() mutex : locked:2160 holding:0.002041 sec waiting:0.001215 sec
      rwlocks : locked:0 holding:0.000000 sec waiting:0.000000 sec
      mutexes : locked:3275 holding:0.032677 sec waiting:0.001521 sec
      locking total : locked:3275 holding:0.032677 sec waiting:0.001521 sec
       13612:20240828:110726.557 === Profiling statistics for icmp pinger ===
      lock_log() mutex : locked:247 holding:0.018005 sec waiting:0.000052 sec
      DCrequeue_items() rwlock : locked:108 holding:0.000910 sec waiting:0.000020 sec
      DCconfig_get_poller_items() rwlock : locked:215 holding:0.000967 sec waiting:0.000151 sec
      DCconfig_get_poller_nextcheck() rwlock : locked:215 holding:0.000058 sec waiting:0.000055 sec
      DCconfig_get_items_by_itemids() rwlock : locked:108 holding:0.000432 sec waiting:0.000042 sec
      sm_sync_lock() mutex : locked:251 holding:0.000092 sec waiting:0.000244 sec
      rwlocks : locked:646 holding:0.002367 sec waiting:0.000269 sec
      mutexes : locked:498 holding:0.018097 sec waiting:0.000296 sec
      locking total : locked:1144 holding:0.020464 sec waiting:0.000566 sec 
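
To compare the waiting times across processes without reading every block by hand, the "locking total" line of each profiling block can be extracted and sorted. The following is only a minimal sketch in Python. It assumes the log lines look exactly like the excerpt above (a "=== Profiling statistics for ... ===" header followed later by a "locking total" line), and the function name locking_totals is made up for illustration; it is not part of Zabbix.

```python
import re

# Header line, e.g.
# " 13591:20240828:110746.366 === Profiling statistics for configuration syncer ==="
HEADER_RE = re.compile(
    r"^\s*(\d+):\d{8}:\d{6}\.\d{3} === Profiling statistics for (.+?) ===\s*$"
)
# Summary line, e.g.
# "locking total : locked:1540 holding:0.086024 sec waiting:0.000647 sec"
TOTAL_RE = re.compile(
    r"^\s*locking total : locked:(\d+) holding:([\d.]+) sec waiting:([\d.]+) sec\s*$"
)


def locking_totals(lines):
    """Return (process, pid, locked, holding_sec, waiting_sec) tuples,
    sorted so that the longest total waiting time comes first."""
    rows = []
    current = None  # (process name, pid) of the block currently being read
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            current = (header.group(2), int(header.group(1)))
            continue
        total = TOTAL_RE.match(line)
        if total and current:
            rows.append((
                current[0],
                current[1],
                int(total.group(1)),
                float(total.group(2)),
                float(total.group(3)),
            ))
            current = None  # wait for the next header before accepting a total
    return sorted(rows, key=lambda row: row[4], reverse=True)
```

Called with the lines of zabbix_proxy.log (for example, locking_totals(open("zabbix_proxy.log"))), this puts the process with the longest lock wait at the top. It only summarizes lock contention as reported by the profiler; a process that is blocked outside of these locks, for example on a network call, would not show up here.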

      I also enabled debug logging for the VMware collector once (see zabbix_proxy_vmware_debug.log.zip) and did not see anything unusual: no timeouts, and everything returned SUCCEED. What was strange is that when the metrics stop after 60 seconds, the debug log goes silent as well, with no further messages from the VMware collector.
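
The moment a process goes silent can be read from the log by recording the last timestamp written by each PID. The sketch below is a rough illustration in Python, not a Zabbix tool. It assumes the PID:YYYYMMDD:HHMMSS.mmm line prefix seen in the profiler output above, and the function names last_seen_by_pid and silent_pids as well as the 60-second threshold are hypothetical choices. Matching a PID to its process type (for example a VMware collector) is left out and has to be done separately, for instance from the process list.

```python
import re
from datetime import datetime

# Log prefix as it appears in the excerpt: PID:YYYYMMDD:HHMMSS.mmm
PREFIX_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s")


def last_seen_by_pid(lines):
    """Map each PID to the timestamp of its most recent log line."""
    last = {}
    for line in lines:
        match = PREFIX_RE.match(line)
        if not match:
            continue  # continuation lines carry no PID/timestamp prefix
        pid = int(match.group(1))
        stamp = datetime.strptime(
            f"{match.group(2)} {match.group(3)}.{match.group(4)}",
            "%Y%m%d %H%M%S.%f",
        )
        if pid not in last or stamp > last[pid]:
            last[pid] = stamp
    return last


def silent_pids(last_seen, now, threshold_sec=60.0):
    """Return the PIDs whose latest log line is older than threshold_sec."""
    return sorted(
        pid
        for pid, stamp in last_seen.items()
        if (now - stamp).total_seconds() > threshold_sec
    )
```

Running last_seen_by_pid over the debug log and then silent_pids with the current time would list the processes that stopped logging. Whether a silent PID is actually hung or merely idle cannot be decided from the log alone.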

            tbross Tomass Janis Bross
            Taikocuya Marcel Renner