-
Incident report
-
Resolution: Cannot Reproduce
-
Critical
-
None
-
1.9.7 (beta), 1.9.8 (beta)
-
None
-
Debian x32 & x64
I have a problem with the server retrying to get item values.
First found in version 1.9.7 (fresh install); upgrading to 1.9.9 didn't fix it. Tested on 2 servers with lots of clients.
I found some resolved issues describing similar errors, but they do not seem to be completely fixed: upgrading to 1.9.9 doesn't help.
Logs are populated with the following:
17808:20120210:161916.259 resuming Zabbix agent checks on host [lari-casino]: connection restored
17821:20120210:161923.164 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17796:20120210:161925.241 Zabbix agent item [system.swap.size[,pfree]] on host [lari-poker] failed: first network error, wait for 20 seconds
17751:20120210:161929.219 Zabbix agent item [system.cpu.load[,avg15]] on host [lari-casino] failed: first network error, wait for 20 seconds
17812:20120210:161949.182 resuming Zabbix agent checks on host [lari-casino]: connection restored
17782:20120210:161958.749 Zabbix agent item [vm.memory.size[total]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
17782:20120210:162005.730 Zabbix agent item [vfs.fs.size[/,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
17819:20120210:162018.302 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17816:20120210:162025.407 resuming Zabbix agent checks on host [lari-casino]: connection restored
17785:20120210:162102.411 Zabbix agent item [vfs.fs.inode[/home,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
17806:20120210:162122.346 resuming Zabbix agent checks on host [lari-casino]: connection restored
17714:20120210:162124.508 Zabbix agent item [system.cpu.util[,idle,avg1]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
17793:20120210:162126.288 Zabbix agent item [vm.memory.inactive] on host [gw.viaden.com] failed: another network error, wait for 20 seconds
17726:20120210:162140.805 Zabbix agent item [system.cpu.load[,avg1]] on host [lari-casino] failed: first network error, wait for 20 seconds
17805:20120210:162146.459 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17805:20120210:162200.672 resuming Zabbix agent checks on host [lari-casino]: connection restored
Note that the item keys and hosts differ from entry to entry.
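To quantify the pattern in the log excerpt above, a minimal Python sketch like the following could tally failures and restorations per host. The names summarize, FAILED, RESUMED and SAMPLE are hypothetical, and the regular expressions assume only the two message formats visible in the excerpt; other server versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the message formats in the excerpt above (an assumption:
# other Zabbix versions may use different wording).
FAILED = re.compile(
    r"Zabbix agent item \[(?P<key>.+)\] on host \[(?P<host>[^\]]+)\] "
    r"failed: (?P<kind>first|another) network error"
)
RESUMED = re.compile(
    r"resuming Zabbix agent checks on host \[(?P<host>[^\]]+)\]: connection restored"
)

# Four lines copied from the log excerpt, used as sample input.
SAMPLE = """\
17808:20120210:161916.259 resuming Zabbix agent checks on host [lari-casino]: connection restored
17821:20120210:161923.164 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17796:20120210:161925.241 Zabbix agent item [system.swap.size[,pfree]] on host [lari-poker] failed: first network error, wait for 20 seconds
17751:20120210:161929.219 Zabbix agent item [system.cpu.load[,avg15]] on host [lari-casino] failed: first network error, wait for 20 seconds
"""


def summarize(lines):
    """Return (failures, restores): per-host counts of each message type."""
    failures = Counter()
    restores = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            failures[match.group("host")] += 1
            continue
        match = RESUMED.search(line)
        if match:
            restores[match.group("host")] += 1
    return failures, restores


if __name__ == "__main__":
    failures, restores = summarize(SAMPLE.splitlines())
    for host in sorted(set(failures) | set(restores)):
        print(host, "failed:", failures[host], "restored:", restores[host])
```

Feeding the full server log through summarize instead of SAMPLE would show whether the errors cluster on particular hosts or are spread evenly.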
I tested different UnreachableDelay values (from 5 to 20).
This is not a connectivity issue: at the same time I ran multiple zabbix_get requests and got no errors at all.
The agent log with debug enabled shows no errors - it always sends data back.
tcpdump shows a lot of RST flags from the server. That does not look like a normal TCP session teardown.
I tried disabling checks on the host, waiting until the queue cleared, and then starting monitoring again. It didn't help.
Restarting the agent and server sometimes helps, sometimes not. The issue occurs randomly and can disappear after some time (usually a few hours) or persist for a long time.
There are no strange spikes on the internal Zabbix monitoring graphs (except housekeeping tasks); network activity and polling are stable.
That's interesting. Could it be related to some Linux kernel limits on the TCP stack? Do you see anything suspicious in syslog or kern.log?
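As a starting point for the kernel question, a small Python sketch can print a few TCP-related sysctl values from /proc/sys on the server. The shortlist in SYSCTLS and the function name read_sysctls are assumptions made for illustration, not a list taken from this report, and none of these settings is known to be the cause.

```python
from pathlib import Path

# Hypothetical shortlist of settings for a first look; not exhaustive.
# nf_conntrack_max is only present when connection tracking is loaded.
SYSCTLS = [
    "net/core/somaxconn",
    "net/ipv4/tcp_max_syn_backlog",
    "net/ipv4/ip_local_port_range",
    "net/ipv4/tcp_fin_timeout",
    "net/netfilter/nf_conntrack_max",
]


def read_sysctls(names=SYSCTLS, root="/proc/sys"):
    """Return {name: list of value fields, or None if the file is unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(root) / name).read_text().split()
        except OSError:
            # Missing file (module not loaded) or insufficient permissions.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        shown = "n/a" if value is None else " ".join(value)
        print(name.replace("/", "."), "=", shown)
```

Comparing the output between a server that shows the errors and one that does not may help narrow down whether a kernel limit is involved.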