[ZBX-3694] Later retry on failed items with a great delay due to calculate_item_nextcheck() algorithm
Created: 2011 Apr 06  Updated: 2017 May 30  Resolved: 2012 Oct 08

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 1.8.4
Fix Version/s: None

Type: Incident report
Priority: Major
Reporter: Ricardo Santos
Assignee: Unassigned
Resolution: Won't fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Items (passive) are checked by a proxy

Attachments: UJajctq7.txt, kSTmiHy8.txt

 Description   
  • We have a passive item with a delay of 86400 seconds (1 day).
  • The lastclock for this item is 1301956473 (2011-04-04 19:34:33).
  • The check failed yesterday (2011-04-05).
  • The current time is 1302117161 (2011-04-06 15:12:41).
  • The item has gone 44 hours without a check.

According to calculate_item_nextcheck(), the next check will be at 1302129262 (2011-04-06 19:34:22), by which time the item will have gone 48 hours without a check.

2011-04-04 19:34:33 - last check
2011-04-05 19:34:22 - fail
2011-04-06 15:12:41 - now
2011-04-06 19:34:22 - next check

See the attachments for more details.



 Comments   
Comment by Aleksandrs Saveljevs [ 2011 Apr 07 ]

Did you see any complaints from Zabbix proxy about that host (something like "26047:20110406:175733.971 Zabbix Host [svn]: first network error, wait for 15 seconds") or about that item (something like "26048:20110406:175822.425 Item [svn:system.cpu.util[,system,avg1]] is not supported") on April 5th?

Note also that if an item became unsupported but its host is monitored by a proxy, you will not see the error message in the GUI (see ZBX-2604). Also, the "lastclock" field is not updated in the "items" table on the proxy (this is intended and is not a bug).

We also assume that you did not reboot your host, Zabbix agent, or Zabbix proxy anywhere near that time. Please correct us if this assumption is false.

Comment by Ricardo Santos [ 2011 Apr 07 ]

Unfortunately I don't have the logs from April 5th anymore to confirm the error.

Even if the item had been set as unsupported, I think the nextcheck should have been 10 minutes later (the default refresh time). But the "error" column in the proxy's "items" table is empty.

Let's assume the proxy was down at that time. When should the next check be?

Comment by Aleksandrs Saveljevs [ 2011 Apr 07 ]

Items are checked at fixed times, calculated by function calculate_item_nextcheck() based on itemid and delay. So regardless of when you start a proxy, that function returns "2011-04-06 19:34:22" as item's next check (after "2011-04-05 19:34:23"), so it should be checked at that time.
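
To make the fixed-slot behaviour concrete, here is a minimal Python sketch of the scheduling rule described above. It is not the Zabbix source: the comment only says the result depends on itemid and delay, so the offset rule (itemid modulo delay) is an assumption, and the function name and the item ID are hypothetical. The item ID was chosen so that the slots line up with the timestamps in this report. Flexible intervals are not modelled.

```python
def sketch_nextcheck(itemid: int, delay: int, now: int) -> int:
    """Return the first fixed slot strictly after `now` (illustrative only)."""
    # Assumption: each item gets a fixed phase inside the delay window.
    offset = itemid % delay
    # Start of the current delay window plus the item's phase.
    nextcheck = delay * (now // delay) + offset
    # If that slot is not in the future, move to the next window.
    while nextcheck <= now:
        nextcheck += delay
    return nextcheck


DELAY = 86400            # 1 day, as in the report
ITEMID = 81262           # hypothetical ID, picked to match the reported slot
LASTCLOCK = 1301956473   # last successful check (from the report)
FAILED_AT = 1302042862   # failed check, one delay before the reported next check
NOW = 1302117161         # "now" (from the report)

next_after_fail = sketch_nextcheck(ITEMID, DELAY, FAILED_AT)
next_from_now = sketch_nextcheck(ITEMID, DELAY, NOW)

print(next_after_fail)                               # 1302129262
print(next_from_now)                                 # 1302129262
print(round((next_from_now - LASTCLOCK) / 3600, 1))  # 48.0 hours without data
```

Under these assumptions the result is the same whether it is computed right after the failure or at any later moment before the slot, which is why restarting the proxy does not bring the check forward.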

Comment by Aleksandrs Saveljevs [ 2011 Apr 08 ]

Did the problem ever happen before?

Comment by Ricardo Santos [ 2011 Apr 08 ]

It happens all the time; please look at the logs in the UJajctq7.txt attachment.

The lastclock was "2011-04-07 06:58:13" and something was wrong with host99. Now the nextcheck will be tomorrow.

Comment by Aleksandrs Saveljevs [ 2011 Apr 08 ]

So Zabbix did try to check the item at the right time, but the host could not be contacted for some reason, and Zabbix scheduled the next check of this item for tomorrow, according to the specified delay. Zabbix seems to work as expected.

We do not wish to check all delayed items as soon as the host is reachable again because that might create an intensive load on the system. For instance, suppose a network hub went down and made 100 hosts unreachable. When the hub comes back, the server would check all 100 items on those 100 hosts simultaneously. A similar scenario is when Zabbix server has been down for a long time: it would then try to check all items on all hosts.
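
The load argument can be illustrated with a small Python sketch. It compares the peak number of checks per second when every delayed item is checked the moment the hub returns against the fixed-slot schedule. The slot rule (itemid modulo delay), the function name, and the 100 item IDs are assumptions made for illustration, not taken from the Zabbix source.

```python
from collections import Counter


def sketch_nextcheck(itemid: int, delay: int, now: int) -> int:
    """First fixed slot strictly after `now`; the offset rule is assumed."""
    nextcheck = delay * (now // delay) + itemid % delay
    while nextcheck <= now:
        nextcheck += delay
    return nextcheck


DELAY = 86400
RECOVERED_AT = 1302117161             # moment the hub comes back (example value)
ITEMIDS = range(20000, 30000, 100)    # 100 hypothetical item IDs, one per host

# Strategy A: check every delayed item as soon as the hosts are reachable.
immediate = Counter(RECOVERED_AT for _ in ITEMIDS)

# Strategy B: keep each item on its own fixed slot.
scheduled = Counter(sketch_nextcheck(i, DELAY, RECOVERED_AT) for i in ITEMIDS)

peak_immediate = max(immediate.values())
peak_scheduled = max(scheduled.values())

print(peak_immediate)  # 100 checks in the same second
print(peak_scheduled)  # 1 check per second at most for these IDs
```

The trade-off is the one raised in this issue: the fixed slots flatten the load, but an item with a one-day delay that misses its slot waits a full day for the next one.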

So, unless you have a different idea, I propose we close this issue as "Won't fix".

Comment by Ricardo Santos [ 2011 Apr 08 ]

I agree that this algorithm is great for avoiding overload. But it also generates a false positive in the queue list.

As a workaround, I'll change the item delay from 1 day to 15 minutes.

Comment by Aleksandrs Saveljevs [ 2011 Apr 08 ]

There are other issues related to checking items with a long delay, e.g., ZBXNEXT-473. Fixing those issues might eventually solve your problem, too. Closing.

Comment by richlv [ 2011 Apr 08 ]

Also somewhat related to ZBXNEXT-388 ("deal with new items sooner").

Comment by Ricardo Santos [ 2011 Apr 08 ]

I successfully applied another kind of fix: changing the item status from OK (0) to UNSUPPORTED (3). When an item is UNSUPPORTED, its delay is overridden by CONFIG_REFRESH_UNSUPPORTED.

Something like this:
UPDATE items
  LEFT JOIN hosts ON items.hostid = hosts.hostid
SET items.status = 3,
    items.error = 'Check failed. Retrying...'
WHERE items.status = 0
  AND hosts.status = 0
  AND items.templateid > 0
  AND (items.lastclock + items.delay + 600) < UNIX_TIMESTAMP();
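
The effect of this workaround can be sketched in Python. The overdue test mirrors the last WHERE condition of the query, and the delay override follows the description above. The 600-second value for CONFIG_REFRESH_UNSUPPORTED is taken from the "10 minutes" default mentioned earlier in this thread; the function names, the slot rule (itemid modulo delay), and the item ID are assumptions for illustration only.

```python
ITEM_STATUS_OK = 0
ITEM_STATUS_UNSUPPORTED = 3
CONFIG_REFRESH_UNSUPPORTED = 600  # seconds; the 10-minute default cited in this thread


def is_overdue(lastclock: int, delay: int, now: int, grace: int = 600) -> bool:
    """Mirror of: (items.lastclock + items.delay + 600) < UNIX_TIMESTAMP()."""
    return lastclock + delay + grace < now


def effective_delay(status: int, delay: int) -> int:
    """Unsupported items are retried on the refresh interval, not their own delay."""
    return CONFIG_REFRESH_UNSUPPORTED if status == ITEM_STATUS_UNSUPPORTED else delay


def sketch_nextcheck(itemid: int, delay: int, now: int) -> int:
    """First fixed slot strictly after `now`; the offset rule is assumed."""
    nextcheck = delay * (now // delay) + itemid % delay
    while nextcheck <= now:
        nextcheck += delay
    return nextcheck


ITEMID = 81262          # hypothetical item ID
LASTCLOCK = 1301956473  # from the report
NOW = 1302117161        # from the report
DELAY = 86400

overdue = is_overdue(LASTCLOCK, DELAY, NOW)
wait_ok = sketch_nextcheck(ITEMID, effective_delay(ITEM_STATUS_OK, DELAY), NOW) - NOW
wait_unsupported = sketch_nextcheck(
    ITEMID, effective_delay(ITEM_STATUS_UNSUPPORTED, DELAY), NOW
) - NOW

print(overdue)           # True: the query would flag this item
print(wait_ok)           # seconds until the next check with the 1-day delay
print(wait_unsupported)  # seconds until the next check once marked UNSUPPORTED
```

With these inputs the flagged item waits at most one refresh interval instead of several hours.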

This is my workaround, not my suggestion for a solution.

I think the solution for ZBXNEXT-473 is to create a new item status, such as CHECKNOW (4).

Comment by richlv [ 2012 Oct 08 ]

reopen to change "fix version"
