[ZBX-4284] Possible wrong host (agent!,snmp?) disabling and bad handling of Unreachable items (hosts) Created: 2011 Oct 27 Updated: 2018 Jun 14 Resolved: 2016 Feb 24 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 1.8.9 |
Fix Version/s: | 2.2.11rc1, 2.4.7rc1, 2.5.0 |
Type: | Incident report | Priority: | Blocker |
Reporter: | Oleksii Zagorskyi | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 9 |
Labels: | pollers, timeout, unavailable | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
1.8.9 (svn head) |
Attachments: |
![]() ![]() ![]() ![]() |
||||||||||||||||||||||||
Issue Links: |
|
Description |
Deep debugging -> this issue report Configuration: Zabbix agent config (added UserParameter): Zabbix server config is default. Stage 1: And the item "agent.version" will never be checked !!! because the key "sleep5" gives timeout. And it's very bad. Stage 2: 30820:20111027:134834.519 enabling Zabbix agent checks on host [it0]: host became available The key "agent.version[]" does not give to become host unavailable after three network errors (UnreachablePeriod=45 seconds). -> "agent.version[]" See "items_interval.png" screenshot additionally. But the key "sleep5" are not marked anyhow with the error state in the GUI. Single place where we can see the reason is zabbix_sever.log and a queue in the GUI. That's too not very good. Server was restarted and the filtered log (debuglevel=4) is attached. I would recommend to redesign this behavior. Specification: https://www.zabbix.org/wiki/Docs/specs/ZBX-4284 |
Comments |
Comment by Oleksii Zagorskyi [ 2012 Apr 21 ] |
<zalex> no, the |
Comment by Eric Gearhart [ 2012 Apr 22 ] |
On the SNMP specific side of this issue, if I change my items and LLD discovery rules over to SNMPv2, the "first network error" problems completely disappear. As soon as I change back over to SNMPv3 the "first network error" problem reappears, and my SNMP hosts (Cisco routers and switches) get disabled by Zabbix. As the title of this bug alludes to, I think items that are low-evel discovered should be reporting back as "OID not supported" and automatically marked as disabled. Instead, it appears that SNMPv3 items that should be reporting back "not supported" are reporting back as timeouts. |
Comment by Oleksii Zagorskyi [ 2012 May 25 ] |
Because of single slow Userparameter (takes more than 20 seconds) and customized UnreachableDelay=27 and Timeout=20 for the zabbix server, one user had such issue. Here is part of filtered log (real server with 1500 hosts, 186 NVPS): 30304:20120522:202105.398 enabling Zabbix agent checks on host [host.name]: host became available 29776:20120522:202206.198 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:202253.431 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30308:20120522:202353.670 enabling Zabbix agent checks on host [host.name]: host became available 30024:20120522:202508.563 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30307:20120522:202555.081 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30307:20120522:202655.102 enabling Zabbix agent checks on host [host.name]: host became available 29965:20120522:202804.243 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:202831.493 resuming Zabbix agent checks on host [host.name]: connection restored 30000:20120522:202907.101 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:202934.508 resuming Zabbix agent checks on host [host.name]: connection restored 30090:20120522:203007.613 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:203054.173 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30304:20120522:203154.560 enabling Zabbix agent checks on host [host.name]: host became available 30122:20120522:203306.661 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30308:20120522:203353.720 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30308:20120522:203453.113 enabling Zabbix agent checks on host [host.name]: host became available 29579:20120522:203601.715 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30307:20120522:203648.344 resuming Zabbix agent checks on host [host.name]: connection restored 30308:20120522:203648.521 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30307:20120522:203735.352 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30304:20120522:203835.011 enabling Zabbix agent checks on host [host.name]: host became available 29836:20120522:203905.405 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30307:20120522:203932.370 resuming Zabbix agent checks on host [host.name]: connection restored 29681:20120522:204007.878 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30307:20120522:204034.385 resuming Zabbix agent checks on host [host.name]: connection restored 29687:20120522:204106.037 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:204153.092 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30304:20120522:204253.108 enabling Zabbix agent checks on host [host.name]: host became available 29624:20120522:204403.412 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:204450.154 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30304:20120522:204550.177 enabling Zabbix agent checks on host [host.name]: host became available 29952:20120522:204707.395 Zabbix agent item [run_this["sudo /home/script"]] on host [host.name] failed: first network error, wait for 27 seconds 30304:20120522:204754.204 temporarily disabling Zabbix agent checks on host [host.name]: host unavailable 30304:20120522:204854.225 enabling Zabbix agent checks on host [host.name]: host became available |
Comment by Oleksii Zagorskyi [ 2013 Mar 17 ] |
Another example, see attached 60_items_update_interval_10sec.png I have 60 agent items: keys are "echo[1]-echo[59]" (ids=23370-23428) and single item key "sleep[5]" (id=23429) - they all created sequentially. On the graph the sleep[5] item was enabled for some period and for that period data on the graph look worse. Some part of server log for that period (DL=3): 31958:20130317:143149.622 resuming Zabbix agent checks on host [IT0]: connection restored 31955:20130317:143202.669 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143217.649 resuming Zabbix agent checks on host [IT0]: connection restored 31953:20130317:143222.706 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143237.686 resuming Zabbix agent checks on host [IT0]: connection restored 31954:20130317:143242.724 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143257.787 resuming Zabbix agent checks on host [IT0]: connection restored 31953:20130317:143302.751 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143317.825 resuming Zabbix agent checks on host [IT0]: connection restored 31957:20130317:143322.799 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143337.863 resuming Zabbix agent checks on host [IT0]: connection restored 31956:20130317:143342.827 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143401.939 resuming Zabbix agent checks on host [IT0]: connection restored 31953:20130317:143412.871 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143427.951 resuming Zabbix agent checks on host [IT0]: connection restored 31956:20130317:143432.898 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143447.987 resuming Zabbix agent checks on host [IT0]: connection restored 31954:20130317:143452.923 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143507.103 resuming Zabbix agent checks on host [IT0]: connection restored 31957:20130317:143512.940 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143527.117 resuming Zabbix agent checks on host [IT0]: connection restored 31957:20130317:143532.985 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143547.147 resuming Zabbix agent checks on host [IT0]: connection restored 31956:20130317:143552.987 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143613.248 resuming Zabbix agent checks on host [IT0]: connection restored 31955:20130317:143622.145 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143640.295 Zabbix agent item [sleep[5]] on host [IT0] failed: another network error, wait for 15 seconds 31958:20130317:143655.300 resuming Zabbix agent checks on host [IT0]: connection restored 31956:20130317:143702.048 Zabbix agent item [sleep[5]] on host [IT0] failed: first network error, wait for 15 seconds 31958:20130317:143719.394 resuming Zabbix agent checks on host [IT0]: connection restored Also additional graph - update interval for all items have been changed to 60 seconds - attached 60_items_update_interval_60sec.png Conclusion - single timeouted item can "kill" agents monitoring. Maybe we should consider separately cases when we cannot establish a TCP connection to a host and when it timeouted by response ? |
Comment by Oleksii Zagorskyi [ 2013 Mar 17 ] |
Related case but for SNMP only is there - |
Comment by richlv [ 2013 Mar 17 ] |
this reminded me about an issue i reported earlier... or so i thought. i searched for a LONG time. eventually i wound draft text in some file =) i'm not creating a new issue on this, but my text went like this :
extending on this idea, a rough logic draft :
i can think of so many scenarios that would benefit from different values of "n" - not sure whether there's and easy way out to work properly in all of those |
Comment by Aleksandrs Saveljevs [ 2015 Jun 17 ] |
There was the following idea during initial investigation. Currently, when we have any kind of network timeout (both connect and read), we move all items of the host to the unreachable poller queue. As zalex_ua proposed, we can distinguish between connect timeout and read timeout. So if we could connect to a host, but there was a read timeout, we can move only this item to the unreachable poller queue. However, there are two major considerations:
Note that if the above solution will only solve the problem for Zabbix agent items, then implementing it might not be worth it, because there is a much simpler solution for Zabbix agent items - just make Zabbix server timeout larger than Zabbix agent timeout (which is already so by default since |
Comment by Aleksandrs Saveljevs [ 2015 Jun 17 ] |
(1) During investigation described above, it was considered to use TIMEOUT_ERROR constant for read timeouts. In the process, the following bug with TIMEOUT_ERROR was discovered (description below is based on 2.0 branch). Currently, we only use TIMEOUT_ERROR in src/libs/zbxexec/execute.c and it is returned by zbx_execute(). That function is called in zbx_execute_script() (note just in case that "result" will not be populated in case of TIMEOUT_ERROR), which itself is called by execute_script(), which is called in node_process_command(). So TIMEOUT_ERROR is propagated from zbx_execute() to node_process_command(), where it is used as follows: response = (FAIL == ret) ? ZBX_PROTO_VALUE_FAILED : ZBX_PROTO_VALUE_SUCCESS; So TIMEOUT_ERROR will be considered a success when we execute a command from the frontend or as requested by a master node. wiper This bug was fixed in 2.2, so I created a new bugreport for 2.0 branch. See |
Comment by Andris Zeila [ 2015 Jun 19 ] |
Specification available at https://www.zabbix.org/wiki/Docs/specs/ZBX-4284 |
Comment by Andris Zeila [ 2015 Jun 25 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-4284 |
Comment by dimir [ 2015 Aug 13 ] |
Please review my small changes in r54968. wiper Thanks! |
Comment by Andris Zeila [ 2015 Aug 14 ] |
Released in:
|
Comment by richlv [ 2015 Aug 14 ] |
(2) regarding the documentation, this improvement is really great and should go to go in whatsnew, and it would also be great to describe the new behaviour in the main manual. wiper documentation:
I added to 2.2, after it's reviewed - I will copy to 2.4/3.0 documentation <richlv> thanks, did some minor style changes, otherwise looks good to me - but would be great to have martins-v review it martins-v Reading it the first time, I had some difficulty being sure what "this item" refers to in the sentence second part. So changed it to "problematic item" to clear ambiguity. But please check if I understood it right. Other than that, all seemed good. wiper Copied to 2.4 and 3.0. Not entirely sure if I found right places to insert it. <richlv> revisited the paragraph. it now says "...without affecting host availability", but that's a bit confusing - when will a host be set to be unreachable then ? after a single timeout ? if so, that's affecting host availability. after two items fail on a host ? that wouldn't work well on hosts with a single item wiper When item fails with timeout for a first time - the host is internally marked as unreachable. If item fails for a second time and there was a succesfull check between the failures - the item is marked as not supported without affecting host availability. If host has only one item - there won't be succesfull checks between timeouts and host will be processed as unreachable (and marked as unavailable after UnreachablePeriod) as before. <richlv> thanks. "internally marked" - only in the server cache ? when is that written out to the database ? wiper after it stays unreachable for UnreachablePeriod seconds. Then its flagged as unavailable and database is updated. asaveljevs So the documentation basically says the following:
I must admit that I find it hard to infer this behavior from the specification at https://www.zabbix.org/wiki/Docs/specs/ZBX-4284#Specification , which says:
It does not describe how different items affect one another: TIMEOUT_ERROR marks an item as unreachable and, according to the specification, it is stored as not supported with the unreachable poller. What if we have item checks A, B, C and both A and B time out? Will the host be marked as unreachable after B, but before C? Does bug wiper The documentation was updated when fixing asaveljevs CLOSED sasha Great work. Thanks a lot! |
Comment by Aleksandrs Saveljevs [ 2015 Dec 29 ] |
This seems to have introduced a bug: |
Comment by Andris Zeila [ 2016 Jan 21 ] |
Continuing from (2) If A & B time out with TIMEOUT_ERROR, then nothing will be done to the host. If A & B time out with NETWORK_ERROR, then host will be marked as unreachable (disable_until != 0) after the A timeout. If the B timeout will happen after UnreachablePeriod, then host will be flagged unavailable. Otherwise B timeout will simply increase disable_until by the UnreachableDelay. Successfull C check will remove unreachable flag (reset the disable_until field to 0). |
Comment by Andris Zeila [ 2016 Feb 09 ] |
A serious problem was found in one real use case. The network was configured so that the server request to agent went through, but the agent response didn't. So server decided that those items are timeouting and instead of marking host as unreachable (which would be the correct action) the items were made unsupported and left for unreachable poller to process them at their defined intervals. Of course the unreachable poller could not handle it. Based on all the problems that surfaced after this attempt of timeout item detection we dropped this detection in Apparently some kind of per item throttling (similar to all item throttling on unreachable hosts) should be implemented before trying to identify single timeouting items without affecting host unreachability. |
Comment by Oleksii Zagorskyi [ 2016 Feb 12 ] |
Andris, accidentally got a case when stopped zabbix agent on my workstation caused this: 21696:20160212:174407.279 Starting Zabbix Server. Zabbix 2.4.7rc1 (revision 54982). 21701:20160212:174407.327 server #4 started [poller #2] 21704:20160212:174407.334 server #7 started [poller #5] 21713:20160212:174407.344 server #16 started [history syncer #1] ... 21701:20160212:174936.967 Item [Zabbix server:net.if.out[eth0]] error: Get value from agent failed: cannot connect to [[127.0.0.1]:10050]: [111] Connection refused 21704:20160212:174936.967 End of get_value_agent():NETWORK_ERROR 21701:20160212:174936.967 End of get_value():NETWORK_ERROR 21704:20160212:174936.967 Item [Zabbix server:vm.memory.size[available]] error: Get value from agent failed: cannot connect to [[127.0.0.1]:10050]: [111] Connection refused 21701:20160212:174936.967 In deactivate_host() hostid:10084 itemid:31536 type:0 21704:20160212:174936.967 End of get_value():NETWORK_ERROR 21704:20160212:174936.967 In deactivate_host() hostid:10084 itemid:23316 type:0 21701:20160212:174936.967 query [txnlev:1] [begin;] 21704:20160212:174936.967 End of deactivate_host() 21704:20160212:174936.967 End of get_values():1 21704:20160212:174936.967 In DCconfig_get_poller_nextcheck() poller_type:0 21704:20160212:174936.967 End of DCconfig_get_poller_nextcheck():1455292176 21701:20160212:174936.967 query [txnlev:1] [update hosts set errors_from=1455292176,disable_until=1455292191,error='Get value from agent failed: cannot connect to [[127.0.0.1]:10050]: [111] Connection refused' where hostid=10084] ... 21713:20160212:174937.138 item "Zabbix server:vm.memory.size[available]" became not supported: Get value from agent failed: cannot connect to [[127.0.0.1]:10050]: [111] Connection refused 21713:20160212:174937.138 In DCadd_nextcheck() 21713:20160212:174937.138 End of DCadd_nextcheck() 21713:20160212:174937.138 multiply_item_value_uint64() 19981,8 159848 21713:20160212:174937.138 query [txnlev:1] [update items set state=1,error='Get value from agent failed: cannot connect to [[127.0.0.1]:10050]: [111] Connection refused' where itemid=23316; note how an item became not supported while I'd not expect that. I have a full server debug log for that period of time. |
Comment by Andris Zeila [ 2016 Feb 15 ] |
wiper Are you really sure the sever had at least this revision? With the fix only items failing with NOTSUPPORTED, AGENT_ERROR or CONFIG_ERROR errors will be made not supported. zalex_ua You have referenced another issue report - |
Comment by Oleksii Zagorskyi [ 2018 May 30 ] |
Not sure it was really fixed or we lost those improvement. Further discussion in Now it's clear that changes here were rolled back. Details in the ZBX. |