[ZBX-10215] host availability not updated for connection errors on timeouting items Created: 2015 Dec 28  Updated: 2017 May 30  Resolved: 2016 Feb 09

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 3.0.0alpha5
Fix Version/s: 2.2.12rc1, 2.4.8rc1, 3.0.0rc2

Type: Incident report Priority: Major
Reporter: Aleksandrs Saveljevs Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: timeout, unreachable
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by ZBX-10498 agent.ping.nodata({$period}) stopped ... Closed

 Description   

This seems to have been introduced in ZBX-4284.

Suppose we have a host with a "sleeping" item (one that simply sleeps when queried, thus causing a timeout). If we query that item on a running agent, then the server will mark the agent as reachable, but the item as unreachable.

If we now stop the agent, then the corresponding Zabbix host will not be marked as unavailable. The reason seems to be the last lines of the following code:

static void	deactivate_host(DC_ITEM *item, zbx_timespec_t *ts, int *available, const char *error)
{
	const char		*__function_name = "deactivate_host";

	zbx_host_availability_t	in, out;

	zabbix_log(LOG_LEVEL_DEBUG, "In %s() hostid:" ZBX_FS_UI64 " itemid:" ZBX_FS_UI64 " type:%d",
			__function_name, item->host.hostid, item->itemid, (int)item->type);

	if (FAIL == host_get_availability(&item->host, item->type, &in))
		goto out;

	/* if the item is still flagged as unreachable while the host is reachable, */
	/* it means that this is item rather than network failure                   */
	if (0 == in.errors_from && 0 != item->unreachable)
		goto out;
	
	...

If an item is flagged as unreachable, then it does not influence the host's availability, even if the error is NETWORK_ERROR rather than TIMEOUT_ERROR.
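To make the effect of that check concrete, here is a minimal stand-alone sketch. It models only the quoted condition; the enum, the function name and its parameters are hypothetical and are not part of the Zabbix sources.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two error kinds named in this report. */
enum poll_error { POLL_NETWORK_ERROR, POLL_TIMEOUT_ERROR };

/*
 * Models only the early-exit test quoted above:
 *   if (0 == in.errors_from && 0 != item->unreachable) goto out;
 * Returns 1 when the host availability update would be skipped.
 */
static int	skips_availability_update(int errors_from, int item_unreachable, enum poll_error err)
{
	assert(errors_from >= 0);
	(void)err;	/* the quoted check never looks at the error type */

	return 0 == errors_from && 0 != item_unreachable;
}
```

With a host that is still considered reachable (errors_from is 0) and an item already flagged as unreachable, the function returns 1 for both error kinds, which is the behaviour described above: a genuine network error no longer affects host availability.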

Another consequence of the bug is that deactivate_host() does not set "last_available" to HOST_AVAILABLE_FALSE in this case, so the item becomes not supported instead of the host becoming unavailable (see the get_values() function in poller.c).



 Comments   
Comment by Andris Zeila [ 2016 Jan 27 ]

Currently, timed-out items are processed by unreachable pollers with their original update period. This imposes additional workload on the unreachable pollers and in some cases can even stall them.

As a quick fix, it was decided to treat timeouts as network errors and to stop trying to identify single items with errors.
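The idea of the quick fix can be sketched as follows. This is an illustration under assumptions, not the committed change: the result codes are simplified and the names are hypothetical; only the notion that a timeout is handled like a network error comes from this comment.

```c
#include <assert.h>

/* Hypothetical, simplified result codes for a single item check. */
enum check_result { CHECK_SUCCEED, CHECK_NETWORK_ERROR, CHECK_TIMEOUT_ERROR };

/* Returns 1 if the result should count against host availability. */
static int	counts_against_host(enum check_result res)
{
	switch (res)
	{
		case CHECK_NETWORK_ERROR:
		case CHECK_TIMEOUT_ERROR:	/* quick fix: handled exactly like a network error */
			return 1;
		case CHECK_SUCCEED:
			return 0;
	}

	assert(0 && "unknown check result");
	return 0;
}
```

Because both error kinds take the same branch, no per-item bookkeeping is needed to decide whether the failure belongs to the item or to the host.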

Comment by Aleksandrs Saveljevs [ 2016 Jan 27 ]

Trying to identify single items with errors seems to be a useful feature. Will a "not so quick fix" be considered later?

wiper After we add throttling for timed-out items, we will be able to process them without deactivating hosts, so items failing with timeouts will be handled. The question is whether we should identify single items failing with a network error.

asaveljevs So will we keep this issue open or register a new ZBX?

wiper I think we should register a new ZBXNEXT

Comment by Andris Zeila [ 2016 Jan 27 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-10215

Comment by richlv [ 2016 Jan 27 ]

does this negate ZBX-4284 then ?

wiper Not entirely. The 'bad' checks will still get lower priority, so we will not have a situation where an item that failed with a timeout is the first item to be checked again after the unreachable period has passed.

Comment by Aleksandrs Saveljevs [ 2016 Feb 08 ]

(1) As mentioned in the issue description, is it alright that deactivate_host() does not always set "available" to HOST_AVAILABLE_FALSE (consider "goto out" cases) and for this reason items become not supported?

wiper RESOLVED in r58312

asaveljevs CLOSED

Comment by Andris Zeila [ 2016 Feb 08 ]

Released in:

  • pre-2.2.12rc1 r58315
  • pre-2.4.8rc1 r58316
  • pre-3.0.0rc2 r58317

Comment by Andris Zeila [ 2016 Feb 08 ]

(2) Documentation:

asaveljevs https://www.zabbix.com/documentation/2.4/manual/introduction/whatsnew248#daemon_improvements page refers to 2.2.11, but it should refer to 2.4.7. REOPENED.

wiper fixed version.

asaveljevs Also, "What's new" pages and upgrade notes say:

The detection of a single item failing with network/timeout error introduced in Zabbix 2.2.11 was removed because of inability to distinguish possible network errors.

Is this really the reason? I thought that the actual reason is that properly throttling these unreachable items (so that they do not stall unreachable pollers) would take a while to implement and it is a bit too late to do that in Zabbix 3.0.

wiper Yes, but in practice the problem surfaced because there were network problems that Zabbix did not detect: items from multiple hosts were flagged as unreachable instead of the hosts being marked as unreachable. So the checks were simply moved to the unreachable poller rather than being delayed by UnreachablePeriod, and the unreachable poller could not manage so many timing-out items. Isn't that more dangerous than not throttling items that really timed out?
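A rough back-of-the-envelope check shows why the unreachable pollers can fall behind. The sketch assumes that every timing-out check blocks one poller for the full timeout and that each item is checked once per update period; the function and its parameters are hypothetical, and any figures passed to it are illustrative rather than taken from this issue.

```c
#include <assert.h>

/*
 * Returns 1 if the unreachable pollers can finish every timing-out check
 * within one update period, under the assumptions stated above.
 */
static int	pollers_keep_up(int items, int timeout_sec, int period_sec, int pollers)
{
	assert(items >= 0 && timeout_sec > 0 && period_sec > 0 && pollers > 0);

	/* seconds of blocked poller time needed per period vs. seconds available */
	return (long)items * timeout_sec <= (long)pollers * period_sec;
}
```

For example, 100 timing-out items with a 3 second timeout need 300 seconds of poller time per period. A single unreachable poller with a 60 second update period has only 60 seconds available, so pollers_keep_up(100, 3, 60, 1) returns 0.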

asaveljevs How about "... because the implementation turned out to be incomplete"?

wiper I added commentary to ZBX-4284 describing the problems of that solution.
RESOLVED

asaveljevs Great! CLOSED.

Generated at Thu Apr 25 18:09:09 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.