Type: Incident report
Resolution: Won't fix
Priority: Trivial
Component/s: None
Affects Version/s: 1.8.10
Fix Version/s: None
Environment: Ubuntu 10.04 LTS
All our hosts (48) have an item agent.ping that is (passively) updated every 30 seconds. We have a trigger on it, "Host is unreachable", based on agent.ping.nodata(180).
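For reference, the full trigger expression looks something like this (the host name here is only a placeholder):

{somehost:agent.ping.nodata(180)}=1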
Suddenly, all hosts started becoming unreachable, returning to a reachable state shortly after, and flapping between the two states. A while later, all 48 hosts were marked as unreachable.
I noticed the partition holding MySQL (on the same server) was at 98% I/O utilization. This was the processlist at the time:
mysql> show processlist;
+-------+--------+-----------+--------+---------+------+--------------+-------------------------------------------------------------------------+-----------+---------------+-----------+
| Id    | User   | Host      | db     | Command | Time | State        | Info                                                                    | Rows_sent | Rows_examined | Rows_read |
+-------+--------+-----------+--------+---------+------+--------------+-------------------------------------------------------------------------+-----------+---------------+-----------+
| 8838  | zabbix | localhost | zabbix | Sleep   |   17 |              | NULL                                                                    |        47 |            70 |        71 |
| 8840  | zabbix | localhost | zabbix | Sleep   |   41 |              | NULL                                                                    |         0 |            60 |        61 |
| 8841  | zabbix | localhost | zabbix | Sleep   |   85 |              | NULL                                                                    |         0 |            53 |        54 |
| 8842  | zabbix | localhost | zabbix | Sleep   |   81 |              | NULL                                                                    |         0 |            75 |        76 |
| 8843  | zabbix | localhost | zabbix | Sleep   |    0 |              | NULL                                                                    |         1 |            42 |         4 |
| 8844  | zabbix | localhost | zabbix | Sleep   |    7 |              | NULL                                                                    |         0 |             0 |         1 |
| 8845  | zabbix | localhost | zabbix | Sleep   |    2 |              | NULL                                                                    |         0 |             0 |         1 |
| 8846  | zabbix | localhost | zabbix | Sleep   |    2 |              | NULL                                                                    |         0 |             0 |         3 |
| 8847  | zabbix | localhost | zabbix | Query   |   26 | Sending data | select value from history_uint where itemid=5935 and clock<=1338728695 |     44152 |             0 |     44153 |
| 8848  | zabbix | localhost | zabbix | Sleep   | 6384 |              | NULL                                                                    |         0 |             0 |         1 |
| 8849  | zabbix | localhost | zabbix | Sleep   |    2 |              | NULL                                                                    |         0 |            53 |        54 |
| 8850  | zabbix | localhost | zabbix | Sleep   |  204 |              | NULL                                                                    |         0 |            53 |        54 |
| 8851  | zabbix | localhost | zabbix | Sleep   | 6384 |              | NULL                                                                    |         0 |             0 |         1 |
| 8852  | zabbix | localhost | zabbix | Sleep   |   17 |              | NULL                                                                    |         0 |             0 |         1 |
| 8854  | zabbix | localhost | zabbix | Sleep   |   29 |              | NULL                                                                    |         0 |             0 |         3 |
| 8855  | zabbix | localhost | zabbix | Sleep   |    0 |              | NULL                                                                    |         0 |             0 |         1 |
| 8856  | zabbix | localhost | zabbix | Sleep   |   22 |              | NULL                                                                    |         1 |             1 |         2 |
| 8857  | zabbix | localhost | zabbix | Sleep   | 3676 |              | NULL                                                                    |         0 |             0 |         2 |
| 8858  | zabbix | localhost | zabbix | Sleep   |  488 |              | NULL                                                                    |      3163 |          3853 |        49 |
| 8859  | zabbix | localhost | zabbix | Sleep   | 1448 |              | NULL                                                                    |      2468 |          3020 |        49 |
| 8860  | zabbix | localhost | zabbix | Sleep   |  128 |              | NULL                                                                    |      3127 |          3853 |        49 |
| 8861  | zabbix | localhost | zabbix | Sleep   |  608 |              | NULL                                                                    |      3154 |          3853 |        49 |
| 8862  | zabbix | localhost | zabbix | Sleep   |    7 |              | NULL                                                                    |      3129 |          3853 |        49 |
| 11756 | root   | localhost | zabbix | Query   |    0 | NULL         | show processlist                                                        |         0 |             0 |         1 |
+-------+--------+-----------+--------+---------+------+--------------+-------------------------------------------------------------------------+-----------+---------------+-----------+
As you can see, one query was blocking most of the others. Each of these queries took around 30-50 seconds, which made all graphs fall behind by over 20 minutes.
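Note that the blocking query has no lower bound on clock, so each execution potentially reads every stored history row for that itemid (the Rows_read value of 44153 above suggests exactly that). The access path can be checked with EXPLAIN, with the itemid and clock values copied from the processlist above:

mysql> EXPLAIN SELECT value FROM history_uint WHERE itemid=5935 AND clock<=1338728695;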
I noticed every query was on one of 3 itemids, attached to 2 particular hosts:
mysql> select * from items where itemid IN (5934, 5935, 5932);
+--------+------+-----------+--------+-------------------+---------------------+-------+---------+--------+-----------+------------+-----------+--------+------------+------------+
| itemid | type | snmp_port | hostid | description       | key_                | delay | history | trends | lastvalue | lastclock  | prevvalue | status | value_type | templateid |
+--------+------+-----------+--------+-------------------+---------------------+-------+---------+--------+-----------+------------+-----------+--------+------------+------------+
| 5932   | 0    | 161       | 10065  | Mogstored process | proc.num[mogstored] | 60    | 30      | 365    | 1         | 1338729172 | 1         | 1      | 3          | 5897       |
| 5934   | 0    | 161       | 10065  | MogileFSD process | proc.num[mogilefsd] | 60    | 14      | 365    | 18        | 1338731695 | 18        | 1      | 3          | 5896       |
| 5935   | 0    | 161       | 10066  | MogileFSD process | proc.num[mogilefsd] | 60    | 14      | 365    | 18        | 1338731695 | 18        | 1      | 3          | 5896       |
+--------+------+-----------+--------+-------------------+---------------------+-------+---------+--------+-----------+------------+-----------+--------+------------+------------+
(only the relevant columns are shown; the remaining ones are defaults or empty)
Both hosts (and only those) were linked to the Mogile template containing these items.
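For reference, the two hosts can be resolved from the hostids in the output above (a straightforward lookup against the standard Zabbix schema):

mysql> SELECT hostid, host, status FROM hosts WHERE hostid IN (10065, 10066);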
This went on for several hours, crippling our monitoring. I decided to disable these 2 hosts, and within a couple of seconds all agent.ping triggers went back to OK and all graphs were fully up to date.
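For anyone wanting to do the same directly in the database, the equivalent change should be something like the following (a sketch assuming the standard schema, where hosts.status=1 means "not monitored"):

mysql> UPDATE hosts SET status=1 WHERE hostid IN (10065, 10066);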
For now I haven't re-enabled these 2 hosts; I'd like to wait and hear your opinion on this first.
Thanks!