ZABBIX BUGS AND ISSUES / ZBX-5120

Incoming items not (or very slowly) being processed.


    • Type: Incident report
    • Resolution: Won't fix
    • Priority: Trivial
    • Affects Version/s: 1.8.10
    • Component/s: Server (S)
    • Environment: Ubuntu 10.04 LTS

      All our hosts (48) have an item agent.ping that is (passively) updated every 30 seconds. We have a "Host is unreachable" trigger on it, along the lines of 'agent.ping.nodata(180)'.
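
      For reference, a minimal sketch of that trigger in 1.8 expression syntax, assuming the item sits on a template called "Template_Linux" (the template name here is just a placeholder):

      {Template_Linux:agent.ping.nodata(180)}=1

      The expression goes into PROBLEM state when no agent.ping value has arrived for 180 seconds.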

      Suddenly all hosts started becoming unreachable, coming back to a reachable state shortly afterwards, and kept flapping between the two states. A while later, all 48 hosts were marked as unreachable.

      I noticed that the MySQL partition (on the same server) was about 98% utilized load-wise. This was the processlist at the time:

      mysql> show processlist;
      --------------------------------------------------------------------------------------------------------------------------------------------------------

      Id User Host db Command Time State Info Rows_sent Rows_examined Rows_read

      --------------------------------------------------------------------------------------------------------------------------------------------------------

      8838 zabbix localhost zabbix Sleep 17   NULL 47 70 71
      8840 zabbix localhost zabbix Sleep 41   NULL 0 60 61
      8841 zabbix localhost zabbix Sleep 85   NULL 0 53 54
      8842 zabbix localhost zabbix Sleep 81   NULL 0 75 76
      8843 zabbix localhost zabbix Sleep 0   NULL 1 42 4
      8844 zabbix localhost zabbix Sleep 7   NULL 0 0 1
      8845 zabbix localhost zabbix Sleep 2   NULL 0 0 1
      8846 zabbix localhost zabbix Sleep 2   NULL 0 0 3
      8847 zabbix localhost zabbix Query 26 Sending data select value from history_uint where itemid=5935 and clock<=1338728695 44152 0 44153
      8848 zabbix localhost zabbix Sleep 6384   NULL 0 0 1
      8849 zabbix localhost zabbix Sleep 2   NULL 0 53 54
      8850 zabbix localhost zabbix Sleep 204   NULL 0 53 54
      8851 zabbix localhost zabbix Sleep 6384   NULL 0 0 1
      8852 zabbix localhost zabbix Sleep 17   NULL 0 0 1
      8854 zabbix localhost zabbix Sleep 29   NULL 0 0 3
      8855 zabbix localhost zabbix Sleep 0   NULL 0 0 1
      8856 zabbix localhost zabbix Sleep 22   NULL 1 1 2
      8857 zabbix localhost zabbix Sleep 3676   NULL 0 0 2
      8858 zabbix localhost zabbix Sleep 488   NULL 3163 3853 49
      8859 zabbix localhost zabbix Sleep 1448   NULL 2468 3020 49
      8860 zabbix localhost zabbix Sleep 128   NULL 3127 3853 49
      8861 zabbix localhost zabbix Sleep 608   NULL 3154 3853 49
      8862 zabbix localhost zabbix Sleep 7   NULL 3129 3853 49
      11756 root localhost zabbix Query 0 NULL show processlist 0 0 1

      --------------------------------------------------------------------------------------------------------------------------------------------------------

      As you can see, one query was blocking most of the other queries. Each of these queries took around 30-50 seconds, which made all graphs fall more than 20 minutes behind.
      I noticed that every such query was on one of 3 itemids, attached to 2 particular hosts:

      mysql> select * from items where itemid IN (5934, 5935, 5932);
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      itemid type snmp_community snmp_oid snmp_port hostid description key_ delay history trends lastvalue lastclock prevvalue status value_type trapper_hosts units multiplier delta prevorgvalue snmpv3_securityname snmpv3_securitylevel snmpv3_authpassphrase snmpv3_privpassphrase formula error lastlogsize logtimefmt templateid valuemapid delay_flex params ipmi_sensor data_type authtype username password publickey privatekey mtime

      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      5932 0     161 10065 Mogstored process proc.num[mogstored] 60 30 365 1 1338729172 1 1 3     0 0 NULL   0     1   0   5897 0       0 0         0
      5934 0     161 10065 MogileFSD process proc.num[mogilefsd] 60 14 365 18 1338731695 18 1 3     0 0 NULL   0     1   0   5896 0       0 0         0
      5935 0     161 10066 MogileFSD process proc.num[mogilefsd] 60 14 365 18 1338731695 18 1 3     0 0 NULL   0     1   0   5896 0       0 0         0

      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      Both hosts (and only those) had the Mogile Template containing these items.
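
      For what it's worth, it may be useful to check whether the long-running history_uint select from the processlist above actually uses the (itemid, clock) index of the stock schema; a quick sketch, with the literal values copied from the processlist:

      mysql> SHOW INDEX FROM history_uint;
      mysql> EXPLAIN SELECT value FROM history_uint WHERE itemid=5935 AND clock<=1338728695;

      If EXPLAIN reports a full table scan instead of a range read on that index, that alone would go a long way towards explaining the 30-50 second runtimes.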

      This went on for several hours, crippling our monitoring. I decided to disable these 2 hosts, and within a couple of seconds all agent.ping triggers went back to OK and all graphs were fully up to date.

      For now I haven't re-enabled these 2 hosts; I would like to wait and hear your opinion on this first.
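
      Before re-enabling them, it might also be worth checking how much history those three items actually carry; a rough sketch, with the itemids taken from the query above:

      mysql> SELECT itemid, COUNT(*) AS rows_stored, MIN(clock) AS oldest
          ->   FROM history_uint
          ->  WHERE itemid IN (5932, 5934, 5935)
          ->  GROUP BY itemid;

      That should show whether the housekeeper is keeping the stored history for these items within the configured retention.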

      Thanks!

            Assignee: Unassigned
            Reporter: Bart Verwilst (verwilst)
            Votes: 0
            Watchers: 2
