Type: Incident report
Resolution: Won't fix
Priority: Trivial
Component/s: None
Affects Version/s: 1.8.10
Fix Version/s: None
Environment: Ubuntu 10.04 LTS
All our hosts (48) have an item agent.ping that is (passively) updated every 30 seconds. We have a trigger on it, "Host is unreachable", based on agent.ping.nodata(180).
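For reference, the full trigger expression looks something like this (the host name here is only a placeholder):

{somehost:agent.ping.nodata(180)}=1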
Suddenly, all hosts started becoming unreachable, returning to a reachable state shortly after, and flapping between the two states. A while later, all 48 hosts were marked as unreachable.
I noticed the partition holding MySQL (on the same server) was at 98% I/O utilization. This was the processlist at the time:
mysql> show processlist;
+-------+--------+-----------+--------+---------+------+--------------+-------------------------------------------------------------------------+-----------+---------------+-----------+
| Id    | User   | Host      | db     | Command | Time | State        | Info                                                                    | Rows_sent | Rows_examined | Rows_read |
+-------+--------+-----------+--------+---------+------+--------------+-------------------------------------------------------------------------+-----------+---------------+-----------+
| 8838  | zabbix | localhost | zabbix | Sleep   |   17 |              | NULL                                                                    |        47 |            70 |        71 |
| 8840  | zabbix | localhost | zabbix | Sleep   |   41 |              | NULL                                                                    |         0 |            60 |        61 |
| 8841  | zabbix | localhost | zabbix | Sleep   |   85 |              | NULL                                                                    |         0 |            53 |        54 |
| 8842  | zabbix | localhost | zabbix | Sleep   |   81 |              | NULL                                                                    |         0 |            75 |        76 |
| 8843  | zabbix | localhost | zabbix | Sleep   |    0 |              | NULL                                                                    |         1 |            42 |         4 |
| 8844  | zabbix | localhost | zabbix | Sleep   |    7 |              | NULL                                                                    |         0 |             0 |         1 |
| 8845  | zabbix | localhost | zabbix | Sleep   |    2 |              | NULL                                                                    |         0 |             0 |         1 |
| 8846  | zabbix | localhost | zabbix | Sleep   |    2 |              | NULL                                                                    |         0 |             0 |         3 |
| 8847  | zabbix | localhost | zabbix | Query   |   26 | Sending data | select value from history_uint where itemid=5935 and clock<=1338728695 |     44152 |             0 |     44153 |
| 8848  | zabbix | localhost | zabbix | Sleep   | 6384 |              | NULL                                                                    |         0 |             0 |         1 |
| 8849  | zabbix | localhost | zabbix | Sleep   |    2 |              | NULL                                                                    |         0 |            53 |        54 |
| 8850  | zabbix | localhost | zabbix | Sleep   |  204 |              | NULL                                                                    |         0 |            53 |        54 |
| 8851  | zabbix | localhost | zabbix | Sleep   | 6384 |              | NULL                                                                    |         0 |             0 |         1 |
| 8852  | zabbix | localhost | zabbix | Sleep   |   17 |              | NULL                                                                    |         0 |             0 |         1 |
| 8854  | zabbix | localhost | zabbix | Sleep   |   29 |              | NULL                                                                    |         0 |             0 |         3 |
| 8855  | zabbix | localhost | zabbix | Sleep   |    0 |              | NULL                                                                    |         0 |             0 |         1 |
| 8856  | zabbix | localhost | zabbix | Sleep   |   22 |              | NULL                                                                    |         1 |             1 |         2 |
| 8857  | zabbix | localhost | zabbix | Sleep   | 3676 |              | NULL                                                                    |         0 |             0 |         2 |
| 8858  | zabbix | localhost | zabbix | Sleep   |  488 |              | NULL                                                                    |      3163 |          3853 |        49 |
| 8859  | zabbix | localhost | zabbix | Sleep   | 1448 |              | NULL                                                                    |      2468 |          3020 |        49 |
| 8860  | zabbix | localhost | zabbix | Sleep   |  128 |              | NULL                                                                    |      3127 |          3853 |        49 |
| 8861  | zabbix | localhost | zabbix | Sleep   |  608 |              | NULL                                                                    |      3154 |          3853 |        49 |
| 8862  | zabbix | localhost | zabbix | Sleep   |    7 |              | NULL                                                                    |      3129 |          3853 |        49 |
| 11756 | root   | localhost | zabbix | Query   |    0 | NULL         | show processlist                                                        |         0 |             0 |         1 |
+-------+--------+-----------+--------+---------+------+--------------+-------------------------------------------------------------------------+-----------+---------------+-----------+
As you can see, one query was blocking most of the others. Each of these queries took around 30-50 seconds, which made all graphs fall behind by over 20 minutes.
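Note that the blocking query has no lower bound on clock, so each execution potentially reads every stored history row for that itemid (the Rows_read value of 44153 above suggests exactly that). The access path can be checked with EXPLAIN, with the itemid and clock values copied from the processlist above:

mysql> EXPLAIN SELECT value FROM history_uint WHERE itemid=5935 AND clock<=1338728695;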
I noticed every query was on one of 3 itemids, attached to 2 particular hosts:
mysql> select * from items where itemid IN (5934, 5935, 5932);
+--------+------+-----------+--------+-------------------+---------------------+-------+---------+--------+-----------+------------+-----------+--------+------------+------------+
| itemid | type | snmp_port | hostid | description       | key_                | delay | history | trends | lastvalue | lastclock  | prevvalue | status | value_type | templateid |
+--------+------+-----------+--------+-------------------+---------------------+-------+---------+--------+-----------+------------+-----------+--------+------------+------------+
| 5932   | 0    | 161       | 10065  | Mogstored process | proc.num[mogstored] | 60    | 30      | 365    | 1         | 1338729172 | 1         | 1      | 3          | 5897       |
| 5934   | 0    | 161       | 10065  | MogileFSD process | proc.num[mogilefsd] | 60    | 14      | 365    | 18        | 1338731695 | 18        | 1      | 3          | 5896       |
| 5935   | 0    | 161       | 10066  | MogileFSD process | proc.num[mogilefsd] | 60    | 14      | 365    | 18        | 1338731695 | 18        | 1      | 3          | 5896       |
+--------+------+-----------+--------+-------------------+---------------------+-------+---------+--------+-----------+------------+-----------+--------+------------+------------+
(only the relevant columns are shown; the remaining ones are defaults or empty)
Both hosts (and only those) were linked to the Mogile template containing these items.
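For reference, the two hosts can be resolved from the hostids in the output above (a straightforward lookup against the standard Zabbix schema):

mysql> SELECT hostid, host, status FROM hosts WHERE hostid IN (10065, 10066);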
This went on for several hours, crippling our monitoring. I decided to disable these 2 hosts, and within a couple of seconds all agent.ping triggers went back to OK and all graphs were fully up to date.
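For anyone wanting to do the same directly in the database, the equivalent change should be something like the following (a sketch assuming the standard schema, where hosts.status=1 means "not monitored"):

mysql> UPDATE hosts SET status=1 WHERE hostid IN (10065, 10066);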
For now I haven't re-enabled these 2 hosts; I'd like to wait and hear your opinion on this first.
Thanks!