[ZBX-9016] Some items in the queue indefinitely after host reboot Created: 2014 Nov 10  Updated: 2017 May 30  Resolved: 2014 Dec 04

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 2.4.1
Fix Version/s: 2.2.8rc1, 2.4.3rc1, 2.5.0

Type: Incident report Priority: Critical
Reporter: Stanislav Antic Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: cache, queue, unreachable
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6 and PostgreSQL 9.1


Attachments: PNG File busy_unreachable.png     PNG File no_data.png     PNG File no_data_graph.png    
Issue Links:
Duplicate
is duplicated by ZBX-8475 Is it safe to change "status" column ... Closed

 Description   

After rebooting two of our servers, some of their items no longer get updated; they just stay in the queue. Attached are screenshots of the queue (detail view) and the latest data graph for one of these items.
I tried to reproduce this by manually restarting a test server, but without success.
I didn't see any errors in the Zabbix server log.



 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Nov 11 ]

Are these items "Zabbix agent" or "Zabbix agent (active)"? Do we understand correctly that some items on these hosts do get updated, but some do not? Is there anything special about those items that do not? Are you monitoring these hosts through a proxy?

Comment by Stanislav Antic [ 2014 Nov 11 ]

Are these items "Zabbix agent" or "Zabbix agent (active)"?

These are all "Zabbix agent" items.

Do we understand correctly that some items on these hosts do get updated, but some do not?

Yes, some items are correctly updated on those hosts. There are three hosts: Windows, FreeBSD and Linux.

Is there anything special about those items that do not?

I don't think there is anything special about them: some of the items are discovery items, some are not.

Are you monitoring these hosts through a proxy?

No, directly.

Comment by Aleksandrs Saveljevs [ 2014 Nov 12 ]

When you rebooted those servers, did you put those hosts into "no data" maintenance? If so, this looks to be a regression from ZBX-8541 - we update item "nextcheck" when a host comes out of maintenance, but we do not change their position in the queue.

Comment by Aleksandrs Saveljevs [ 2014 Nov 12 ]

Actually, disregard that - ZBX-8541 only affects items without a poller (like "Zabbix agent (active)" items).

Comment by Aleksandrs Saveljevs [ 2014 Nov 12 ]

If your servers were down for a while, their monitoring might have been taken over by unreachable pollers. How many unreachable pollers do you have? Are you monitoring how busy they are using internal items such as "zabbix[process,unreachable poller,avg,busy]"? If so, could you please post a graph that shows whether they are stuck? If they are stuck, could you please run "strace" on them so that we know what they are stuck on?

Comment by Stanislav Antic [ 2014 Nov 12 ]

It doesn't look like they are busy right now. Also, all "unreachable poller" processes show "got values 0".

I left out one piece of information: these hosts were disabled while they were offline; they were not in maintenance during that time.

Comment by Aleksandrs Saveljevs [ 2014 Nov 13 ]

Without ZBXNEXT-2588, it might be a bit hard to arrive at the exact cause of the problem, but we have a conjecture.

It may be that these hosts became disabled while these items were being processed. After items are processed, we call DCrequeue_items(). There, if a host is disabled, we do not requeue items. However, we also do not change dc_item->location - it stays ZBX_LOC_POLLER. And if it stays ZBX_LOC_POLLER, the item never gets back into the queue.
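
The conjecture can be illustrated with a small, self-contained model. This is only a sketch under assumptions, not the actual Zabbix source: the type names, the LOC_* constants, and the functions take_for_polling(), requeue_buggy(), requeue_fixed(), on_host_enabled() and simulate() are all hypothetical stand-ins. In particular, the rule that an item is put back into the queue on host enable only when it is "nowhere" is an assumption made for the model, and requeue_fixed() shows one possible remedy rather than the change that was committed.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the configuration cache types. */
typedef enum { LOC_NOWHERE, LOC_QUEUE, LOC_POLLER } item_location_t;
typedef enum { HOST_MONITORED, HOST_DISABLED } host_status_t;

typedef struct { host_status_t status; } host_t;
typedef struct { const host_t *host; item_location_t location; } item_t;

/* A poller takes a queued item for processing. */
static void take_for_polling(item_t *item)
{
	assert(LOC_QUEUE == item->location);
	item->location = LOC_POLLER;
}

/* Suspected behaviour: for a disabled host the item is not requeued,
 * but its location is left untouched, so it still claims to be in a poller. */
static void requeue_buggy(item_t *item)
{
	if (HOST_DISABLED == item->host->status)
		return;

	item->location = LOC_QUEUE;
}

/* One possible remedy (an assumption): mark the item as being nowhere. */
static void requeue_fixed(item_t *item)
{
	if (HOST_DISABLED == item->host->status)
	{
		item->location = LOC_NOWHERE;
		return;
	}

	item->location = LOC_QUEUE;
}

/* Assumed rule: when a host is enabled again, only items that are
 * currently nowhere are put back into the queue. */
static void on_host_enabled(item_t *item)
{
	if (LOC_NOWHERE == item->location)
		item->location = LOC_QUEUE;
}

/* Replays the suspected sequence and returns the item's final location. */
static item_location_t simulate(int use_fix)
{
	host_t host = { HOST_MONITORED };
	item_t item = { &host, LOC_QUEUE };

	take_for_polling(&item);        /* item is being processed */
	host.status = HOST_DISABLED;    /* host gets disabled in the meantime */

	if (use_fix)
		requeue_fixed(&item);
	else
		requeue_buggy(&item);

	host.status = HOST_MONITORED;   /* host is enabled again later */
	on_host_enabled(&item);

	return item.location;
}
```

In this model, simulate(0) leaves the item at LOC_POLLER forever, which matches an item that never gets back into the queue, while simulate(1) ends with the item back at LOC_QUEUE.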

Comment by Stanislav Antic [ 2014 Nov 13 ]

The assumption looks reasonable. Also, we rebooted the server and it's OK now, so whatever was wrong was in the in-memory state.

Comment by Aleksandrs Saveljevs [ 2014 Nov 14 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-9016 .

Comment by Aleksandrs Saveljevs [ 2014 Nov 20 ]

Issue ZBX-8475 seems to have the same cause.

Comment by Alexander Vladishev [ 2014 Dec 04 ]

(1) Please review my changes in r50994 before a merge.

asaveljevs Thank you! CLOSED.

Comment by Aleksandrs Saveljevs [ 2014 Dec 04 ]

Fixed in pre-2.2.8 r50997, pre-2.4.3 r50998, and pre-2.5.0 (trunk) r50999.
