-
Incident report
-
Resolution: Cannot Reproduce
-
Critical
-
None
-
2.0.6
-
centos 5.9, x64
After upgrading to 2.0.6, we see that something is wrong with restoring connections after checks are temporarily disabled. It seems that the temporary disabling has become permanent.
Logs:
5729:20130426:112247.920 Zabbix agent item [proc.num[,,run]] on host [eu2-db-201] failed: first network error, wait for 15 seconds
5732:20130426:112249.594 Zabbix agent item [proc.num[,,run]] on host [eu2-s-55] failed: first network error, wait for 15 seconds
5730:20130426:112253.870 Zabbix agent item [vfs.dev.read[/dev/sdb1,ops,avg1]] on host [eu2-db-205] failed: first network error, wait for 15 seconds
5729:20130426:112253.871 Zabbix agent item [vfs.fs.size[/mnt/mysql,pfree]] on host [eu2-db-205] failed: first network error, wait for 15 seconds
5732:20130426:112257.396 SNMP item [BW_baseapps_count_1_percent] on host [eu2-s-33] failed: first network error, wait for 15 seconds
5728:20130426:112257.410 SNMP item [MYSQL_status_Opened_tables] on host [eu2-s-33] failed: first network error, wait for 15 seconds
5729:20130426:112328.316 Zabbix agent item [net.if.total[eth3, bytes]] on host [eu2-s-20] failed: first network error, wait for 15 seconds
5730:20130426:112328.316 Zabbix agent item [fs.readonly] on host [eu2-s-20] failed: first network error, wait for 15 seconds
5728:20130426:112328.316 Zabbix agent item [system.cpu.util[,system,avg1]] on host [eu2-s-20] failed: first network error, wait for 15 seconds
5729:20130426:112330.814 Zabbix agent item [proc.num[,,run]] on host [wowpeu2-st1-4] failed: first network error, wait for 15 seconds
5731:20130426:112334.967 Zabbix agent item [net.if.total[eth0, bytes]] on host [eu2-st6-5] failed: first network error, wait for 15 seconds
5729:20130426:112334.968 Zabbix agent item [agent.ping] on host [eu2-st6-5] failed: first network error, wait for 15 seconds
5730:20130426:112334.968 Zabbix agent item [system.cpu.util[,idle,avg1]] on host [eu2-st6-5] failed: first network error, wait for 15 seconds
5728:20130426:112335.261 Zabbix agent item [system.cpu.util[,system,avg1]] on host [eu2-gc2012-1] failed: first network error, wait for 15 seconds
5729:20130426:112351.068 SNMP item [BW_baseapps_count_1_percent] on host [eu2-s-4] failed: first network error, wait for 15 seconds
5728:20130426:112351.114 SNMP item [MYSQL_status_Innodb_row_lock_time] on host [eu2-s-4] failed: first network error, wait for 15 seconds
5731:20130426:112400.377 Zabbix agent item [vfs.dev.read[/dev/mapper/vg00-root,ops,avg1]] on host [eu2-db-stagings-1] failed: first network error, wait for 15 seconds
5730:20130426:112400.378 Zabbix agent item [vfs.dev.read[/dev/sda2,ops,avg1]] on host [eu2-db-stagings-1] failed: first network error, wait for 15 seconds
After restart:
9067:20130426:130605.918 Starting Zabbix Proxy (active) [eu2-mgmt-1]. Zabbix 2.0.6 (revision 35158).
9067:20130426:130605.918 **** Enabled features ****
9067:20130426:130605.918 SNMP monitoring: YES
9067:20130426:130605.918 IPMI monitoring: YES
9067:20130426:130605.918 WEB monitoring: YES
9067:20130426:130605.918 ODBC: NO
9067:20130426:130605.918 SSH2 support: YES
9067:20130426:130605.918 IPv6 support: NO
9067:20130426:130605.918 **************************
9069:20130426:130606.307 proxy #1 started [configuration syncer #1]
9070:20130426:130606.307 proxy #2 started [heartbeat sender #1]
9071:20130426:130606.308 proxy #3 started [data sender #1]
9078:20130426:130606.312 proxy #10 started [trapper #1]
9079:20130426:130606.312 proxy #11 started [trapper #2]
9080:20130426:130606.312 proxy #12 started [trapper #3]
9081:20130426:130606.313 proxy #13 started [trapper #4]
9082:20130426:130606.313 proxy #14 started [trapper #5]
9083:20130426:130606.313 proxy #15 started [icmp pinger #1]
9084:20130426:130606.314 proxy #16 started [housekeeper #1]
9084:20130426:130606.314 executing housekeeper
9085:20130426:130606.314 proxy #17 started [http poller #1]
9087:20130426:130606.315 proxy #19 started [history syncer #1]
9088:20130426:130606.315 proxy #20 started [history syncer #2]
9089:20130426:130606.315 proxy #21 started [history syncer #3]
9090:20130426:130606.316 proxy #22 started [history syncer #4]
9091:20130426:130606.316 proxy #23 started [ipmi poller #1]
9092:20130426:130606.316 proxy #24 started [ipmi poller #2]
9093:20130426:130606.316 proxy #25 started [ipmi poller #3]
9067:20130426:130606.316 proxy #0 started [main process]
9074:20130426:130606.339 proxy #6 started [poller #3]
9072:20130426:130606.340 proxy #4 started [poller #1]
9077:20130426:130606.340 proxy #9 started [unreachable poller #1]
9076:20130426:130606.340 proxy #8 started [poller #5]
9075:20130426:130606.341 proxy #7 started [poller #4]
9073:20130426:130606.342 proxy #5 started [poller #2]
9086:20130426:130606.343 proxy #18 started [discoverer #1]
9069:20130426:130607.966 Received configuration data from server. Datalen 6755893
9077:20130426:130611.347 resuming Zabbix agent checks on host [eu2-bt]: connection restored
9084:20130426:130612.291 housekeeper deleted 670557 records from history (spent 5.975718 seconds)
9077:20130426:130615.177 resuming Zabbix agent checks on host [eu2-s-96]: connection restored
9077:20130426:130615.178 resuming Zabbix agent checks on host [eu2-s-47]: connection restored
9077:20130426:130615.179 resuming Zabbix agent checks on host [eu2-s-90]: connection restored
9077:20130426:130615.180 resuming Zabbix agent checks on host [eu2-s-57]: connection restored
9077:20130426:130615.181 resuming Zabbix agent checks on host [eu2-s-72]: connection restored
9077:20130426:130615.185 resuming Zabbix agent checks on host [eu2-jabber]: connection restored
9077:20130426:130615.190 resuming Zabbix agent checks on host [eu2-wgniru]: connection restored
9077:20130426:130615.194 resuming Zabbix agent checks on host [eu2-gc2012-5]: connection restored
9077:20130426:130615.209 resuming Zabbix agent checks on host [eu2-backyard-ct]: connection restored
9077:20130426:130615.211 resuming Zabbix agent checks on host [wowpeu2-st1-2]: connection restored
9077:20130426:130615.212 resuming Zabbix agent checks on host [eu2-s-20]: connection restored
9077:20130426:130615.218 resuming Zabbix agent checks on host [wowpeu2-st1-4]: connection restored
9077:20130426:130615.225 resuming Zabbix agent checks on host [eu2-st1-3]: connection restored
9077:20130426:130615.226 resuming Zabbix agent checks on host [eu2-s-36]: connection restored
9077:20130426:130615.238 resuming Zabbix agent checks on host [eu2-s-58]: connection restored
9077:20130426:130615.239 resuming Zabbix agent checks on host [eu2-db-207]: connection restored
9077:20130426:130615.240 resuming Zabbix agent checks on host [eu2-s-37]: connection restored
9077:20130426:130615.242 resuming Zabbix agent checks on host [eu2-st1-9]: connection restored
9077:20130426:130615.246 resuming Zabbix agent checks on host [eu2-wgq-1]: connection restored
9077:20130426:130621.255 resuming Zabbix agent checks on host [eu2-st2-4]: connection restored
9077:20130426:130621.256 resuming Zabbix agent checks on host [eu2-db-209]: connection restored
9077:20130426:130621.258 resuming Zabbix agent checks on host [eu2-blitz-1]: connection restored
9077:20130426:130621.266 resuming Zabbix agent checks on host [eu2-s-25]: connection restored
9077:20130426:130621.267 resuming Zabbix agent checks on host [eu2-s-104]: connection restored
9077:20130426:130621.268 resuming Zabbix agent checks on host [eu2-s-40]: connection restored
9077:20130426:130621.272 resuming Zabbix agent checks on host [eu2-s-86]: connection restored
9077:20130426:130621.304 resuming Zabbix agent checks on host [eu2-st3-7]: connection restored
9077:20130426:130621.306 resuming Zabbix agent checks on host [eu2-db-205]: connection restored
9077:20130426:130621.323 resuming SNMP checks on host [wowseu2-frm-ru]: connection restored
9077:20130426:130621.328 resuming Zabbix agent checks on host [wowpeu2-st1-8]: connection restored
9077:20130426:130621.335 resuming SNMP checks on host [wowpeu2-st3-1]: connection restored
9077:20130426:130621.440 resuming SNMP checks on host [eu2-knl-ru]: connection restored
9077:20130426:130621.443 resuming Zabbix agent checks on host [wowpeu2-st2-8]: connection restored
9077:20130426:130621.444 resuming Zabbix agent checks on host [eu2-s-26]: connection restored
9077:20130426:130621.446 resuming Zabbix agent checks on host [eu2-gnls-balancer]: connection restored
9077:20130426:130621.448 resuming Zabbix agent checks on host [eu2-gnls-node1]: connection restored
9077:20130426:130621.456 resuming Zabbix agent checks on host [eu2-s-4]: connection restored
9077:20130426:130621.458 resuming Zabbix agent checks on host [eu2-s-61]: connection restored
9077:20130426:130621.459 resuming Zabbix agent checks on host [eu2-s-94]: connection restored
9077:20130426:130621.462 resuming Zabbix agent checks on host [eu2-gnls-ptl]: connection restored
9077:20130426:130621.575 resuming SNMP checks on host [wowpeu2-st1-7]: connection restored
9077:20130426:130621.584 resuming Zabbix agent checks on host [eu2-s-105]: connection restored
9077:20130426:130621.619 resuming Zabbix agent checks on host [wowpeu2-st1-6]: connection restored
9077:20130426:130621.736 resuming SNMP checks on host [eu2-st6-5]: connection restored
9077:20130426:130621.880 resuming SNMP checks on host [eu2-st4-6]: connection restored
9077:20130426:130622.109 resuming Zabbix agent checks on host [eu2-knl-eu]: connection restored
and so on...
So the main problem is that hosts do not resume checks after a network error.
For comparison, after downgrading to 2.0.5, checks resume as expected:
30752:20130429:112039.864 resuming SNMP checks on host [woteu2-s-4]: connection restored
30752:20130429:112054.144 resuming SNMP checks on host [woteu2-s-33]: connection restored
30749:20130429:112138.664 SNMP item [BW_resend_max_percent] on host [woteu2-s-33] failed: first network error, wait for 15 seconds
30748:20130429:112138.764 SNMP item [MYSQL_status_Sort_rows] on host [woteu2-s-33] failed: another network error, wait for 15 seconds
30752:20130429:112154.610 resuming SNMP checks on host [woteu2-s-33]: connection restored
30757:20130429:112208.488 cannot send list of active checks to [127.0.0.1]: host [Zabbix server] not found
30750:20130429:112227.532 SNMP item [BW_cluster_onlinePlayers] on host [woteu2-s-4] failed: first network error, wait for 15 seconds
30749:20130429:112228.372 SNMP item [MYSQL_status_Qcache_hits] on host [woteu2-s-4] failed: another network error, wait for 15 seconds
30752:20130429:112250.075 resuming SNMP checks on host [woteu2-s-4]: connection restored
30749:20130429:112322.889 SNMP item [BW_cluster_onlinePlayers] on host [woteu2-s-33] failed: first network error, wait for 15 seconds
30752:20130429:112343.378 resuming SNMP checks on host [woteu2-s-33]: connection restored
30756:20130429:112408.613 cannot send list of active checks to [127.0.0.1]: host [Zabbix server] not found
30750:20130429:112517.554 SNMP item [BW_cluster_onlinePlayers] on host [woteu2-s-4] failed: first network error, wait for 15 seconds
30751:20130429:112517.991 SNMP item [MYSQL_disk_rootdir_bytes_read] on host [woteu2-s-4] failed: another network error, wait for 15 seconds
30748:20130429:112519.138 SNMP item [MYSQL_disk_rootdir_bytes_write] on host [woteu2-s-4] failed: another network error, wait for 15 seconds
30751:20130429:112533.688 SNMP item [BW_baseapps_count_1_percent] on host [woteu2-s-33] failed: first network error, wait for 15 seconds
30750:20130429:112534.971 SNMP item [BW_baseapps_count_2_percents] on host [woteu2-s-33] failed: another network error, wait for 15 seconds
30749:20130429:112534.973 SNMP item [MYSQL_status_Open_tables] on host [woteu2-s-33] failed: another network error, wait for 15 seconds
30752:20130429:112539.399 resuming SNMP checks on host [woteu2-s-4]: connection restored
30752:20130429:112549.631 resuming SNMP checks on host [woteu2-s-33]: connection restored
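To make the difference between the two versions easier to check, here is a rough sketch of a script that lists hosts which logged a network error but no later "connection restored" line. It assumes the log has one entry per line in the format shown in the excerpts above. The function name `hosts_not_resumed` and the regular expressions are my own and are not part of Zabbix.

```python
import re

# Both patterns are guesses based only on the log excerpts in this report.
# Item keys may contain brackets themselves, e.g. [proc.num[,,run]],
# so the item part is matched greedily up to "] on host [".
FAIL_RE = re.compile(
    r"^\d+:(\d{8}:\d{6}\.\d{3}) (Zabbix agent|SNMP) item \[.*\] "
    r"on host \[([^\]]+)\] failed: (?:first|another) network error"
)
RESUME_RE = re.compile(
    r"^\d+:\d{8}:\d{6}\.\d{3} resuming (Zabbix agent|SNMP) checks "
    r"on host \[([^\]]+)\]: connection restored"
)


def hosts_not_resumed(lines):
    """Return {(check_type, host): timestamp of the first unresolved error}.

    Agent and SNMP checks are tracked separately, because the proxy
    logs a separate "resuming ... checks" line for each of them.
    """
    pending = {}
    for line in lines:
        line = line.strip()
        fail = FAIL_RE.match(line)
        if fail:
            timestamp, check_type, host = fail.groups()
            # Remember only the first error since the last resume.
            pending.setdefault((check_type, host), timestamp)
            continue
        resume = RESUME_RE.match(line)
        if resume:
            check_type, host = resume.groups()
            # The connection was restored, so the host is no longer pending.
            pending.pop((check_type, host), None)
    return pending


# A few entries copied from the excerpts above, used as sample input.
SAMPLE = [
    "30749:20130429:112138.664 SNMP item [BW_resend_max_percent] on host [woteu2-s-33] failed: first network error, wait for 15 seconds",
    "30748:20130429:112138.764 SNMP item [MYSQL_status_Sort_rows] on host [woteu2-s-33] failed: another network error, wait for 15 seconds",
    "30752:20130429:112154.610 resuming SNMP checks on host [woteu2-s-33]: connection restored",
    "30750:20130429:112227.532 SNMP item [BW_cluster_onlinePlayers] on host [woteu2-s-4] failed: first network error, wait for 15 seconds",
    "5730:20130426:112253.870 Zabbix agent item [vfs.dev.read[/dev/sdb1,ops,avg1]] on host [eu2-db-205] failed: first network error, wait for 15 seconds",
]
```

Calling `hosts_not_resumed(open(path))` on a proxy log (where `path` is wherever your log file lives) should give the hosts still waiting to resume at the end of the file. A host that failed shortly before the end of the log will also show up, so the timestamps need to be compared against the last line of the log.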
Please, this is very critical for us. Right now we are downgrading to version 2.0.5.