[ZBX-10215] host availability not updated for connection errors on timeouting items Created: 2015 Dec 28 Updated: 2017 May 30 Resolved: 2016 Feb 09 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 3.0.0alpha5 |
Fix Version/s: | 2.2.12rc1, 2.4.8rc1, 3.0.0rc2 |
Type: | Incident report | Priority: | Major |
Reporter: | Aleksandrs Saveljevs | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | timeout, unreachable | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Issue Links: |
|
Description |
This seems to have been introduced in

Suppose we have a host with a "sleeping" item (one that simply sleeps when queried, thus causing a timeout). If we query that item on a running agent, then the server will mark the agent as reachable, but the item as unreachable. If we now stop the agent, then the corresponding Zabbix host will not be marked as unavailable.

The reason seems to be the last lines of the following code:

    static void deactivate_host(DC_ITEM *item, zbx_timespec_t *ts, int *available, const char *error)
    {
        const char *__function_name = "deactivate_host";
        zbx_host_availability_t in, out;

        zabbix_log(LOG_LEVEL_DEBUG, "In %s() hostid:" ZBX_FS_UI64 " itemid:" ZBX_FS_UI64 " type:%d",
                __function_name, item->host.hostid, item->itemid, (int)item->type);

        if (FAIL == host_get_availability(&item->host, item->type, &in))
            goto out;

        /* if the item is still flagged as unreachable while the host is reachable, */
        /* it means that this is item rather than network failure */
        if (0 == in.errors_from && 0 != item->unreachable)
            goto out;
        ...

If an item is unreachable, then it does not influence host's availability, even if the error is NETWORK_ERROR, not TIMEOUT_ERROR.

Another consequence of the bug is that deactivate_host() does not set "last_available" to HOST_AVAILABLE_FALSE in this case, and the item becomes not supported, as opposed to host becoming unavailable (see get_values() function in poller.c). |
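The effect of the second guard can be modelled in isolation. The following is only a minimal sketch, not Zabbix source: the `sketch_*` types, enum values and function are hypothetical stand-ins for the fields used in the quoted condition. It shows that the guard takes the early exit for an item flagged as unreachable on a host with no recorded errors, whatever kind of error was reported.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the Zabbix definitions. */
typedef enum { SKETCH_TIMEOUT_ERROR, SKETCH_NETWORK_ERROR } sketch_error_t;

typedef struct
{
    int errors_from;      /* 0 while the host has no recorded errors */
    int item_unreachable; /* non-zero once the item was flagged after a timeout */
} sketch_state_t;

/* Returns 1 when the quoted guard would "goto out" before availability is updated. */
int sketch_guard_skips(const sketch_state_t *st, sketch_error_t err)
{
    assert(NULL != st);

    (void)err; /* the quoted condition never inspects the kind of error */

    return 0 == st->errors_from && 0 != st->item_unreachable;
}
```

In this model, a flagged item on an otherwise healthy host skips the availability update for `SKETCH_NETWORK_ERROR` exactly as it does for `SKETCH_TIMEOUT_ERROR`, which matches the behaviour described in the report once the agent is stopped.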
Comments |
Comment by Andris Zeila [ 2016 Jan 27 ] |
Currently, items that have timed out are processed on unreachable pollers with their original update period. This imposes additional workload on the unreachable pollers and in some cases could even stall them. As a quick fix, it was decided to treat timeouts as network errors and stop trying to identify single items with errors. |
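One way to read the quick fix is as a policy in which a timeout counts against the host in the same way as a network error, with no per-item exception. The sketch below illustrates that reading only; it is not the committed change, and every `sketch_*` name, including the error counter, is a hypothetical stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for illustration; not the Zabbix constants. */
typedef enum
{
    SKETCH_OK,
    SKETCH_ITEM_ERROR,
    SKETCH_TIMEOUT_ERROR,
    SKETCH_NETWORK_ERROR
} sketch_result_t;

typedef struct
{
    int host_errors; /* consecutive host-level failures seen so far */
} sketch_host_t;

/* Returns 1 when the result is counted against the host, 0 otherwise. */
int sketch_record_result(sketch_host_t *host, sketch_result_t res)
{
    assert(NULL != host);

    switch (res)
    {
        case SKETCH_TIMEOUT_ERROR: /* treated exactly like a network error */
        case SKETCH_NETWORK_ERROR:
            host->host_errors++;
            return 1;
        case SKETCH_OK:
            host->host_errors = 0; /* a successful check clears the streak */
            return 0;
        default:
            return 0; /* item-level errors leave the host untouched */
    }
}
```

The trade-off discussed in the comments that follow is visible here: a single slow item now counts toward host-level failures, because the sketch no longer tries to tell item failures from network failures.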
Comment by Aleksandrs Saveljevs [ 2016 Jan 27 ] |
Trying to identify single items with errors seems to be a useful feature. Will a "not so quick fix" be considered later?
wiper: After we add throttling for timed-out items, we will be able to process them without deactivating hosts. So items failing with timeouts will be handled. The question is whether we should identify single items failing with a network error.
asaveljevs: So will we keep this issue open or register a new ZBX?
wiper: I think we should register a new ZBXNEXT. |
Comment by Andris Zeila [ 2016 Jan 27 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-10215 |
Comment by richlv [ 2016 Jan 27 ] |
does this negate
wiper: Not entirely. The 'bad' checks will still get lower priority, so we will not have a situation where an item that failed with a timeout is the first item to be checked again after the unreachable period has passed. |
Comment by Aleksandrs Saveljevs [ 2016 Feb 08 ] |
(1) As mentioned in the issue description, is it alright that deactivate_host() does not always set "available" to HOST_AVAILABLE_FALSE (consider "goto out" cases) and for this reason items become not supported?
wiper: RESOLVED in r58312
asaveljevs: CLOSED |
Comment by Andris Zeila [ 2016 Feb 08 ] |
Released in:
|
Comment by Andris Zeila [ 2016 Feb 08 ] |
(2) Documentation:
asaveljevs: https://www.zabbix.com/documentation/2.4/manual/introduction/whatsnew248#daemon_improvements page refers to 2.2.11, but it should refer to 2.4.7. REOPENED.
wiper: fixed version.
asaveljevs: Also, "What's new" pages and upgrade notes say:
Is this really the reason? I thought that the actual reason is that properly throttling these unreachable items (so that they do not stall unreachable pollers) would take a while to implement and it is a bit too late to do that in Zabbix 3.0.
wiper: Yes, but in practice the problem surfaced because there were network problems that were not detected by Zabbix: items from multiple hosts were flagged as unreachable instead of the host being marked as unreachable. So the checks were simply moved to the unreachable poller rather than being delayed by UnreachablePeriod, and the unreachable poller could not manage so many items timing out. Isn't that more dangerous than not throttling items that really timed out?
asaveljevs: How about "... because the implementation turned out to be incomplete"?
wiper: I added commentary to
asaveljevs: Great! CLOSED. |