ZABBIX BUGS AND ISSUES / ZBX-16374

Zabbix trapper processes possible hangup


    • Type: Incident report
    • Resolution: Incomplete
    • Priority: Trivial
    • Fix Version/s: None
    • Affects Version/s: 4.2.3
    • Component/s: Server (S)
    • Zabbix Server (RHEL 7.6), AWS EC2 c4.xlarge
      Zabbix frontend (RHEL 7.6), AWS EC2 m4.large
      Zabbix DB: AWS RDS (MariaDB 10.0.24), db.m4.xlarge

      Server info:

      Incident details:

      1. The share of busy Zabbix server trapper processes rose sharply from its typical value (~1-3%) to 100%.
      2. Zabbix proxies could not connect to the Zabbix server. zabbix_proxy.log contains the following lines:
      10957:20190714:052308.169 received configuration data from server at "*******", datalen 426293
      10958:20190714:052602.513 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
      10958:20190714:052702.517 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
      10958:20190714:052802.520 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
      10958:20190714:052902.523 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
      10958:20190714:053002.526 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
      10958:20190714:053102.529 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
      

      Agents monitored directly by the Zabbix server stayed green and their data was still being collected. The Zabbix server log contains no suspicious records, only messages about items that became unsupported.
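
      To put numbers on the heartbeat failures quoted above, the proxy log can be tallied with a short script. This is only a sketch: it assumes the log prefix has the form PID:YYYYMMDD:HHMMSS.mmm as in the excerpt, and the names parse_line, heartbeat_timeouts and summarize are made up for illustration.

```python
import re
from datetime import datetime

# Assumed log prefix, taken from the excerpt: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")

SAMPLE = '''\
10957:20190714:052308.169 received configuration data from server at "*******", datalen 426293
10958:20190714:052602.513 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
10958:20190714:052702.517 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
10958:20190714:052802.520 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
10958:20190714:052902.523 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
10958:20190714:053002.526 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
10958:20190714:053102.529 cannot send heartbeat message to server at "*******": ZBX_TCP_READ() timed out
'''


def parse_line(line):
    """Return (pid, timestamp, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    pid, date, clock, millis, message = m.groups()
    ts = datetime.strptime(f"{date} {clock}.{millis}", "%Y%m%d %H%M%S.%f")
    return int(pid), ts, message


def heartbeat_timeouts(lines):
    """Timestamps of failed heartbeat messages, in file order."""
    stamps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and "cannot send heartbeat message" in parsed[2]:
            stamps.append(parsed[1])
    return stamps


def summarize(lines):
    """Count the failures and measure the time between the first and last one."""
    stamps = heartbeat_timeouts(lines)
    if not stamps:
        return {"count": 0, "first": None, "last": None, "span_seconds": 0.0}
    span = (stamps[-1] - stamps[0]).total_seconds()
    return {"count": len(stamps), "first": stamps[0], "last": stamps[-1],
            "span_seconds": span}


if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

      Running summarize over the whole zabbix_proxy.log instead of the excerpt would show when the failures started and how long they lasted, which can then be lined up with the trapper busy graph.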

      The environment stayed in that condition for several hours, and the Zabbix server started receiving data from the proxies again only after a restart via

      # systemctl restart zabbix-server

      Proxy log:

      10959:20190715:061449.816 Unable to connect to the server [******************]:10051 [cannot connect to [[******************]:10051]: [111] Connection refused]. Will retry every 1 second(s)
      10959:20190715:061455.824 Connection restored.
      10957:20190715:061824.192 received configuration data from server at "******************", datalen 426293
      10960:20190715:062122.585 executing housekeeper

      It could be related to network issues, but we could not find any evidence of that.

      Network traffic on Zabbix server:

      It's unclear to me why the Zabbix trapper processes suddenly went to 100% busy and stayed there, because TrapperTimeout=15 should prevent trappers from hanging. Could you please tell me what you think about this situation? I would like to prevent it from happening again, so please point me to where else I should look in order to investigate this incident properly.
      Full config:
      zabbix_server.conf
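
      One way to check from the outside whether the trappers are hung or the server is simply down is to send a single request to port 10051 and see how it fails. The sketch below mirrors the two states seen in the proxy logs: a read timeout (the port accepts the connection, but nothing answers) versus connection refused (nothing is listening, as during the restart). Several things here are assumptions: the function names are made up, the "active checks" request and the host name "probe-host" are only example payload, and the framing ("ZBXD", flag byte 0x01, 8-byte little-endian length) is the Zabbix protocol header as I understand it.

```python
import json
import socket
import struct


def build_packet(payload):
    """Frame a JSON payload: 'ZBXD', flags 0x01, 8-byte little-endian length."""
    data = json.dumps(payload).encode("utf-8")
    return b"ZBXD\x01" + struct.pack("<Q", len(data)) + data


def probe(host, port=10051, timeout=5.0, payload=None):
    """Send one request to the trapper port and classify the outcome."""
    if payload is None:
        # Assumed example request; any request a trapper answers would do.
        payload = {"request": "active checks", "host": "probe-host"}
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"            # nothing is listening on the port
    except socket.timeout:
        return "connect timed out"
    except OSError as exc:
        return f"connect failed: {exc}"
    with sock:
        try:
            sock.sendall(build_packet(payload))
            reply = sock.recv(5)    # only the header start is needed
        except socket.timeout:
            return "read timed out"  # accepted by the kernel, never served
        except OSError as exc:
            return f"io failed: {exc}"
    if reply.startswith(b"ZBXD"):
        return "answered"
    return "closed without reply" if not reply else "unexpected reply"


if __name__ == "__main__":
    # Hypothetical address; replace with the real Zabbix server.
    print(probe("127.0.0.1", 10051, timeout=5.0))
```

      If the probe reports "read timed out" while the busy graph shows 100%, the connections are being accepted by the kernel but no trapper is picking them up, which would point at the trappers themselves rather than at the network.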

        1. image-2019-07-15-14-40-58-485.png (24 kB)
        2. image-2019-07-15-14-45-08-944.png (102 kB)
        3. image-2019-07-15-15-16-25-366.png (205 kB)
        4. image-2019-07-15-15-16-57-661.png (94 kB)
        5. zabbix_server.conf (16 kB)
        6. Zbx_server_perf_stats.jpg (141 kB)
        7. zabbix_proxy.conf (17 kB)
        8. zabbix_proxy.conf (17 kB)
        9. image-2019-07-18-16-42-29-891.png (206 kB)
        10. image-2019-07-18-16-43-05-500.png (195 kB)
        11. image-2019-07-18-16-44-21-009.png (154 kB)
        12. image-2019-07-18-16-44-54-449.png (108 kB)
        13. Zbx_srv_cache_usage.jpg (86 kB)
        14. Zbx_srv_data_gathering_process_busy.jpg (124 kB)
        15. Zbx_srv_internal_process_busy.jpg (124 kB)
        16. Zbx_srv_performance.jpg (88 kB)
        17. zbx_prox_performance.jpg (70 kB)
        18. zbx_prox_cache.jpg (60 kB)
        19. zbx_prox_data_gathering.jpg (115 kB)
        20. zbx_prox_internal.jpg (90 kB)

            edgar.akhmetshin Edgar Akhmetshin
            vzabauski Valeriy Zabawski
            Votes: 0
            Watchers: 7

              Created:
              Updated:
              Resolved: