Incident report
Resolution: Fixed
Blocker
1.8.9
1.8.9 (svn head)
Deep debugging led to this issue report.
Configuration:
A single enabled host with three agent items. The keys are:
agent.version (itemid=22535)
agent.version[] (itemid=22536) <-- DISABLED during stage 1 of the experiment (a similar key was chosen to make searching the log file easier)
sleep5 (itemid=22537)
The update interval for all of them is 30 seconds.
Zabbix agent config (added UserParameter):
UserParameter=sleep5,sleep 5
Zabbix server config is default.
Stage 1:
In the zabbix_server.log I see:
30817:20111027:133610.349 Zabbix agent item [sleep5] on host [it0] failed: first network error, wait for 15 seconds
30820:20111027:133628.418 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:133646.421 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:133704.426 temporarily disabling Zabbix agent checks on host [it0]: host unavailable
As a result, the item "agent.version" is never checked, because the key "sleep5" times out.
This is very bad.
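The timing in the Stage 1 log can be checked with a short script. The following is a minimal sketch, not part of Zabbix: it parses the timestamps of the four log lines above and prints how long the host had been failing at each step. The function names are illustrative. The roughly 18-second spacing between errors is presumably the 15-second wait plus the server's check timeout (assumed here to be the default 3 seconds), and the host is disabled at the first check where the elapsed time exceeds UnreachablePeriod=45.

```python
from datetime import datetime

# Stage 1 log lines copied from the report above.
LOG = """\
30817:20111027:133610.349 Zabbix agent item [sleep5] on host [it0] failed: first network error, wait for 15 seconds
30820:20111027:133628.418 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:133646.421 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:133704.426 temporarily disabling Zabbix agent checks on host [it0]: host unavailable
"""

UNREACHABLE_PERIOD = 45  # seconds, value quoted in this report


def parse_timestamp(line):
    """Return the datetime encoded in a 'PID:YYYYMMDD:HHMMSS.mmm message' line."""
    _pid, date, rest = line.split(":", 2)
    clock = rest.split(" ", 1)[0]
    return datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f")


def elapsed_since_first(log_text):
    """Seconds elapsed between the first log line and every log line."""
    stamps = [parse_timestamp(l) for l in log_text.splitlines() if l.strip()]
    return [(t - stamps[0]).total_seconds() for t in stamps]


if __name__ == "__main__":
    for seconds in elapsed_since_first(LOG):
        # Compare the elapsed failure time with UnreachablePeriod.
        past_limit = seconds > UNREACHABLE_PERIOD
        note = "past UnreachablePeriod" if past_limit else "within UnreachablePeriod"
        print(f"{seconds:7.3f}s since first error: {note}")
```

Only the last line, the one where the server disables the host, falls past the 45-second limit.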
Stage 2:
At this point I enabled the item with the key "agent.version[]",
and the server behavior changed to this:
30820:20111027:134834.519 enabling Zabbix agent checks on host [it0]: host became available
30815:20111027:134840.435 Zabbix agent item [sleep5] on host [it0] failed: first network error, wait for 15 seconds
30820:20111027:134858.525 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:134916.528 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:134931.530 resuming Zabbix agent checks on host [it0]: connection restored
30819:20111027:134940.443 Zabbix agent item [sleep5] on host [it0] failed: first network error, wait for 15 seconds
30820:20111027:134958.538 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:135016.541 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:135031.544 resuming Zabbix agent checks on host [it0]: connection restored
30815:20111027:135040.450 Zabbix agent item [sleep5] on host [it0] failed: first network error, wait for 15 seconds
30820:20111027:135058.550 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:135116.553 Zabbix agent item [sleep5] on host [it0] failed: another network error, wait for 15 seconds
30820:20111027:135131.556 resuming Zabbix agent checks on host [it0]: connection restored
.... etc, etc, etc
The key "agent.version[]" keeps the host from becoming unavailable after three network errors (UnreachablePeriod=45 seconds).
Now the keys "agent.version" and "agent.version[]" are checked, but at incorrect and unstable intervals:
-> "agent.version"
2011.Oct.27 13:51:36 1.8.9rc1
2011.Oct.27 13:50:36 1.8.9rc1
2011.Oct.27 13:49:36 1.8.9rc1
2011.Oct.27 13:48:36 1.8.9rc1
-> "agent.version[]"
2011.Oct.27 13:51:36 1.8.9rc1
2011.Oct.27 13:51:31 1.8.9rc1
2011.Oct.27 13:50:36 1.8.9rc1
2011.Oct.27 13:50:31 1.8.9rc1
2011.Oct.27 13:49:36 1.8.9rc1
2011.Oct.27 13:49:31 1.8.9rc1
2011.Oct.27 13:48:36 1.8.9rc1
2011.Oct.27 13:48:34 1.8.9rc1
See also the "items_interval.png" screenshot.
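To make the irregularity concrete, the gaps between consecutive values can be computed from the history listed above. This is a minimal sketch using only the timestamps from this report; the names `HISTORY` and `gaps` are illustrative and not part of Zabbix.

```python
from datetime import datetime

# Value timestamps copied from the history listed in this report (27 Oct 2011).
HISTORY = {
    "agent.version": ["13:51:36", "13:50:36", "13:49:36", "13:48:36"],
    "agent.version[]": [
        "13:51:36", "13:51:31", "13:50:36", "13:50:31",
        "13:49:36", "13:49:31", "13:48:36", "13:48:34",
    ],
}

CONFIGURED_INTERVAL = 30  # seconds, the update interval set for all items


def gaps(times):
    """Return the gaps in seconds between consecutive values, oldest first."""
    stamps = sorted(datetime.strptime(t, "%H:%M:%S") for t in times)
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for key, times in HISTORY.items():
        print(f"{key}: gaps {gaps(times)} (configured: {CONFIGURED_INTERVAL}s)")
```

With these inputs, "agent.version" arrives every 60 seconds instead of every 30, while "agent.version[]" alternates between short gaps of a few seconds and gaps of 55 seconds.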
However, the key "sleep5" is not marked with an error state anywhere in the GUI.
The only places where the cause is visible are zabbix_server.log and the queue in the GUI.
That is not good either.
The server was restarted, and the filtered log (debuglevel=4) is attached.
Filter is: grep -E " started|agent.version|network error|connection restored|became available|sleep5|agent result" zabbix_server.log > demo.log
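For anyone reproducing this without grep, the same filter can be sketched in Python. This is only an illustrative equivalent of the command above; the function name `filter_log` is an assumption, and the pattern is copied unchanged from the grep expression.

```python
import re
import sys

# Same alternation as the grep -E expression used for demo.log.
PATTERN = re.compile(
    r" started|agent.version|network error|connection restored"
    r"|became available|sleep5|agent result"
)


def filter_log(lines):
    """Keep only the lines that match the filter pattern."""
    return [line for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    # Usage: python filter_log.py < zabbix_server.log > demo.log
    sys.stdout.writelines(filter_log(sys.stdin))
```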
I recommend redesigning this behavior.
Specification: https://www.zabbix.org/wiki/Docs/specs/ZBX-4284
is duplicated by
ZBXNEXT-1022 heartbeat communication for between Zabbix server and agent (Reopened)
ZBXNEXT-840 if some snmp values timeout, the whole host doesn't get snmp values (Closed)
ZBX-4371 Low level discovery bind to snmp_errors (Closed)
ZBX-5943 Zabbix treats SNMP noSuchName as "host unavailable" (Closed)
ZBX-9609 The host is treated as unavailable because of one failed check. (Closed)