[ZBX-3694] Later retry on failed items with a great delay due calculate_item_nextcheck() algorithm Created: 2011 Apr 06 Updated: 2017 May 30 Resolved: 2012 Oct 08 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 1.8.4 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Major |
Reporter: | Ricardo Santos | Assignee: | Unassigned |
Resolution: | Won't fix | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Items (passive) are checked by a proxy |
Attachments: |
Description |
According to calculate_item_nextcheck(), the next check will be at 1302129262 (2011-04-06 19:34:22), by which time the item will have gone 48 hours without a check. Last check: 2011-04-04 19:34:33. More details are in the attachment. |
Comments |
Comment by Aleksandrs Saveljevs [ 2011 Apr 07 ] |
Did you see any complaints from Zabbix proxy about that host (something like "26047:20110406:175733.971 Zabbix Host [svn]: first network error, wait for 15 seconds") or about that item (something like "26048:20110406:175822.425 Item [svn:system.cpu.util[,system,avg1]] is not supported") on April 5th? Note also that if an item became unsupported, but its host is monitored by proxy, you will not see the error message in GUI (see We also assume that you did not reboot your host, Zabbix agent, or Zabbix proxy anywhere near that time. Please correct us if this assumption is false. |
Comment by Ricardo Santos [ 2011 Apr 07 ] |
Unfortunately, I no longer have the logs from April 5th to confirm the error. Even if the item had been set as unsupported, I think the nextcheck should be 10 minutes later (default refresh time). But the "error" column in table "item" on the proxy is empty. Let's assume that the proxy was down at that time. When should the next check be? |
Comment by Aleksandrs Saveljevs [ 2011 Apr 07 ] |
Items are checked at fixed times, calculated by function calculate_item_nextcheck() based on itemid and delay. So regardless of when you start a proxy, that function returns "2011-04-06 19:34:22" as item's next check (after "2011-04-05 19:34:23"), so it should be checked at that time. |
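The fixed schedule described above can be illustrated with a minimal sketch. It assumes that the item ID modulo the delay is used as an offset inside consecutive delay-sized windows; the real calculate_item_nextcheck() may handle further details that are omitted here. The name `next_check`, the item ID, and the timestamps are hypothetical.

```python
def next_check(itemid: int, delay: int, now: int) -> int:
    """Return the first scheduled check time strictly after `now`.

    Assumption: checks fall at a fixed offset (itemid % delay) inside
    consecutive windows of `delay` seconds counted from the Unix epoch.
    """
    candidate = delay * (now // delay) + itemid % delay
    while candidate <= now:
        candidate += delay
    return candidate


DAY = 86400
itemid = 23456  # hypothetical item ID

# The result depends only on itemid, delay and the current time, so
# starting the proxy earlier or later does not move the schedule.
early_start = 1302080000
late_start = early_start + 8 * 3600
print(next_check(itemid, DAY, early_start))
print(next_check(itemid, DAY, late_start))

# If the check at a scheduled time fails, the next slot is a full delay later.
slot = next_check(itemid, DAY, early_start)
print(next_check(itemid, DAY, slot) - slot)
```

Under this assumption, an item with a one-day delay that misses its slot is not retried until the same offset on the following day, which matches the behaviour reported in this issue.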
Comment by Aleksandrs Saveljevs [ 2011 Apr 08 ] |
Did the problem ever happen before? |
Comment by Ricardo Santos [ 2011 Apr 08 ] |
It happens all the time; please look at the logs in the UJajctq7.txt attachment. The lastclock was "2011-04-07 06:58:13" and something was wrong with host99. Now the nextcheck will be tomorrow. |
Comment by Aleksandrs Saveljevs [ 2011 Apr 08 ] |
So Zabbix did try to check the item at the right time, but the host could not be contacted for some reason, and Zabbix scheduled the next check of this item for tomorrow, according to the specified delay. Zabbix seems to work as expected. The reason we do not wish to check all delayed items as soon as the host is reachable again is that doing so might create an intensive load on the system. For instance, suppose a network hub went down and made 100 hosts unreachable. When the hub comes back, the server would check all 100 items on each of those 100 hosts simultaneously. A similar scenario is when Zabbix server has been down for a long time - it would then try to check all items on all hosts at once. So, unless you have a different idea, I propose we close this issue as "Won't fix". |
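The load argument above can be sketched with a small comparison. It is an illustration only, under two assumptions: item IDs are sequential, and the schedule uses the item ID modulo the delay as a fixed offset. All names, IDs, and timestamps are hypothetical.

```python
from collections import Counter


def next_check(itemid: int, delay: int, now: int) -> int:
    """First slot strictly after `now`, at offset itemid % delay (assumed scheme)."""
    candidate = delay * (now // delay) + itemid % delay
    while candidate <= now:
        candidate += delay
    return candidate


def peak_per_second(check_times) -> int:
    """Largest number of checks that fall on the same second."""
    return max(Counter(check_times).values())


DAY = 86400
now = 1_000_000                    # hypothetical moment the hub comes back
itemids = range(1, 100 * 100 + 1)  # 100 hosts x 100 items, hypothetical IDs

# Policy A: retry every delayed item the moment the hosts are reachable.
immediate = [now for _ in itemids]

# Policy B: keep each item on its fixed, itemid-derived slot.
scheduled = [next_check(i, DAY, now) for i in itemids]

print("immediate retry, peak checks per second:", peak_per_second(immediate))
print("fixed schedule, peak checks per second:", peak_per_second(scheduled))
```

With these inputs the fixed schedule keeps the checks spread over time, while the immediate policy puts every check on the same second. The cost is the long wait for items with a large delay that this issue describes.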
Comment by Ricardo Santos [ 2011 Apr 08 ] |
I agree that this algorithm is great for avoiding overload. But it also generates a false positive in the queue list. As a fix, I'll change the items' delay from 1 day to 15 minutes. |
Comment by Aleksandrs Saveljevs [ 2011 Apr 08 ] |
There are other issues related to checking items with a long delay, e.g., |
Comment by richlv [ 2011 Apr 08 ] |
somewhat also related to |
Comment by Ricardo Santos [ 2011 Apr 08 ] |
I successfully applied another kind of fix: changing the item status from OK (0) to UNSUPPORTED (3). Then, when an item is UNSUPPORTED, the delay is overridden by CONFIG_REFRESH_UNSUPPORTED. Something like this: It's my fix, not my suggestion for a solution. I think that the solution about |
Comment by richlv [ 2012 Oct 08 ] |
reopen to change "fix version" |