[ZBX-8095] Gaps in agent.ping items cause random triggers being activated Created: 2014 Apr 16  Updated: 2017 May 30  Resolved: 2014 Jun 11

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.3
Fix Version/s: 2.2.4rc2, 2.3.2

Type: Incident report Priority: Major
Reporter: Bart Verwilst Assignee: Unassigned
Resolution: Fixed Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.5 x86_64


Attachments: PNG File Schermafdruk.png     PNG File agent.ping-configuration.png     Text File ipana-zabbix-server-items.txt     PNG File ipana.png     File just-fix.patch     File nextcheck-queueing-more-plus-heap-fix-and-debug.patch     File nextcheck-queueing-more-plus-heap-fix.patch     File nextcheck-queueing-more.patch     File nextcheck-queueing.patch    
Issue Links:
Duplicate
is duplicated by ZBX-8109 Frequest False alert's on Zabbix agen... Closed

 Description   

Ever since we upgraded from 2.0 to 2.2.x, we get random 'server is unreachable' triggers being activated and then resolved a minute or so later. This happens once or twice every day (and night), with no apparent cause.

Here are the agent.ping values from around the time the alert was triggered:

2014.Apr.16 03:36:58 Up (1)
2014.Apr.16 03:35:28 Up (1)
2014.Apr.16 03:33:58 Up (1)
2014.Apr.16 03:32:28 Up (1)
2014.Apr.16 03:32:16 Up (1)
2014.Apr.16 03:21:58 Up (1)
2014.Apr.16 03:20:29 Up (1)
2014.Apr.16 03:18:58 Up (1)
2014.Apr.16 03:17:28 Up (1)
2014.Apr.16 03:15:58 Up (1)

As you can see, there is a 10-minute gap in the values. We set DebugLevel to 4 on both server and agent, and the logs show that the server never requests those items during the gap, which makes it look like a server issue rather than an agent one.

A bit more info about our environment:

Number of hosts (monitored/not monitored/templates): 444 (384 / 7 / 53)
Number of items (monitored/disabled/not supported): 26142 (24525 / 513 / 1104)
Number of triggers (enabled/disabled) [problem/ok]: 5406 (5401 / 5) [51 / 5350]
Required server performance, new values per second: 144.63

Item: Agent ping (Triggers: 2), key: agent.ping, interval: 60, history: 7, trends: 7, type: Zabbix agent, application: Zabbix agent
Trigger:

{Template Zabbix Agent:agent.ping.nodata(5m)}=1

PS: Our template has an update interval of 60, while all hosts show it as 90; I guess that's worthy of another bug report.

The internal Zabbix server items are quite idle; the busy poller percentage, for example, is around 10-15%. It used to be 20-25%, but we increased the pollers from 5 to 12 in the hope of alleviating this problem. It didn't help.

Anything else we can provide?



 Comments   
Comment by richlv [ 2014 Apr 16 ]

Missing data does not by itself point to a clear bug; it could be a network issue or some misconfiguration.

Comment by Bart Verwilst [ 2014 Apr 16 ]

cat zabbix-ipana-20140410.log | awk '{ print $6 " " substr($1,6,13)}' | sort

As you can see, for the particular host that was randomly marked as down at that time, all checks are requested without a hitch except for agent.ping, which is silent for 10 minutes. For clarification, the above is taken from the server log at DebugLevel 4.

Comment by Aleksandrs Saveljevs [ 2014 Apr 17 ]

Related issue: ZBX-8109.

Comment by Nikolajs Agafonovs (Inactive) [ 2014 Apr 25 ]

Could you please attach the full Zabbix server log with DebugLevel 4 at that period (2014.Apr.16 03:15:58-03:36:58)?

Comment by Bart Verwilst [ 2014 Apr 25 ]

Logs are bzip2-compressed, roughly 30 MB, and can be found here: http://www.trimbletl.com/files/ZBX-8095-zbxsrv-level4.log.bz2

This is not about the period 2014.Apr.16 03:15:58-03:36:58, but about the one from the ipana-zabbix-server-items.txt attachment, with a gap from 20140410:1031 to 20140410:1042. (It contains the debug output from 10:20 to 10:59, IIRC.)

Comment by Nikolajs Agafonovs (Inactive) [ 2014 Apr 25 ]

Which is the related host name?

Comment by Bart Verwilst [ 2014 Apr 25 ]

centos6-install-ipana

Comment by Nikolajs Agafonovs (Inactive) [ 2014 Apr 28 ]

Could you please attach a screenshot of that item's configuration (as I understand it, it is hostid:10226, itemid:47808)? Also, does the problem happen every day at a certain time? And is this happening only with one host, or with different ones?

Comment by Bart Verwilst [ 2014 Apr 28 ]

The problem happens daily, at seemingly random times and on seemingly random hosts. Ipana just happened to be the one that triggered while my DebugLevel was at 4. It didn't occur again for that host, but every day a couple of other hosts behave exactly the same way.

Comment by Nikolajs Agafonovs (Inactive) [ 2014 Apr 28 ]

Could you actually click the "Agent ping" item to open its configuration? We would like to see a screenshot of the form where the item's "Name", "Value type", "Key", "Delay", etc. are set.

Comment by Nikolajs Agafonovs (Inactive) [ 2014 Apr 28 ]

Would it be possible to attach the log for that "Ipana" case (and tell us the exact host name)?

Comment by Bart Verwilst [ 2014 May 05 ]

The logs http://www.trimbletl.com/files/ZBX-8095-zbxsrv-level4.log.bz2 are about ipana ( hostname centos6-install-ipana ).

Comment by Bart Verwilst [ 2014 May 05 ]

agent.ping configuration window as requested.

Comment by Nikolajs Agafonovs (Inactive) [ 2014 May 07 ]

Please show a screenshot of the agent.ping item configuration specifically for the centos6-install-ipana host. You showed it for the template, which is not informative enough.

Comment by Bart Verwilst [ 2014 May 07 ]

Screenshot of ipana's host agent.ping configuration

Comment by Bart Verwilst [ 2014 May 12 ]

Anything else I can provide to you?

Comment by Nikolajs Agafonovs (Inactive) [ 2014 May 12 ]

We can't reproduce this issue. It seems that 'evaluate_function()' works correctly all the time for 'agent.ping', but 'get_value_agent()' pauses for 10 minutes, after which it is executed twice. We'll try to investigate this issue further.

grep "centos6-install-ipana:agent.ping.nodata" ZBX-8095-zbxsrv-level4.log # there is NO pause (executes every 30sec)

grep "In get_value_agent() host:'centos6-install-ipana' addr:'centos6-install-ipana.adc.ttl' key:'agent.ping'" ZBX-8095-zbxsrv-level4.log # there IS 10min pause and then 2 times within 2seconds (executes every 90sec)

Comment by Bart Verwilst [ 2014 May 12 ]

If we can assist by running our server with more debugging enabled specifically for this case, we'll be more than happy to do so, for example in the form of a patch that adds this debugging.

Comment by soichirou jyashiki [ 2014 May 20 ]

I also confirmed the similar problem.
However, what I am using is 2.0.8
I think. There is a problem with the NODATA function.
The cause of this problem ...

1.Monitoring server high load
2.timer process is busy
3.you have used the of many NODATA funcction

Element to be used in the evaluation

item.lastclock
now = (int)time(NULL);

Acquisition timing of the two is different.

I think. You can try the Tuning StartTimers

I'm sorry if my English sentences are incorrect
i can't speak english...
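
For illustration, here is a minimal sketch of the timing mismatch described above. This is not Zabbix source; nodata_fires() is a hypothetical stand-in that compares item.lastclock against a "now" taken at evaluation time, showing how a delayed evaluation can cross the nodata() threshold even though values arrive on schedule.

#include <stdio.h>
#include <time.h>

/* hypothetical, simplified stand-in for a nodata(period) check:
 * returns 1 ("no data") if no value arrived within 'period' seconds of 'now' */
static int nodata_fires(int lastclock, int now, int period)
{
    return (now - lastclock > period) ? 1 : 0;
}

int main(void)
{
    int now       = (int)time(NULL);
    int lastclock = now - 295;          /* last value stored 295 seconds ago */

    /* evaluated on time: 295 <= 300, no problem is reported */
    printf("on time:   %d\n", nodata_fires(lastclock, now, 300));

    /* evaluated 10 seconds late by a busy timer process: 305 > 300, false alarm */
    printf("10 s late: %d\n", nodata_fires(lastclock, now + 10, 300));

    return 0;
}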

Comment by Aleksandrs Saveljevs [ 2014 May 22 ]

Bart, could you please execute the following query on your database?

select * from items where itemid=47808\G

It seems a bit suspicious to me that the item is updated every 90 seconds, whereas the screenshot you posted shows 60 seconds. You might have changed the interval after the log fragment though.

My current guess, provided the database is consistent, is that the problem might be with "nextcheck" calculation and item queueing. For this reason I have prepared a patch for you against 2.2.3 that adds more debugging information for DebugLevel=4 (see nextcheck-queueing.patch). Could you please run Zabbix server for a while with this patch until the next time the issue occurs and provide a log file in the same wonderful way that you already did?

Also, are there any other custom patches that you are running?

Comment by Bart Verwilst [ 2014 May 23 ]

mysql> select * from items where itemid=47808\G
itemid: 47808
type: 0
snmp_community:
snmp_oid:
hostid: 10226
name: Agent ping
key_: agent.ping
delay: 60
history: 7
trends: 7
status: 0
value_type: 3
trapper_hosts:
units:
multiplier: 0
delta: 0
snmpv3_securityname:
snmpv3_securitylevel: 0
snmpv3_authpassphrase:
snmpv3_privpassphrase:
formula: 1
error:
lastlogsize: 0
logtimefmt:
templateid: 10020
valuemapid: 10
delay_flex:
params:
ipmi_sensor:
data_type: 0
authtype: 0
username:
password:
publickey:
privatekey:
mtime: 0
flags: 0
filter:
interfaceid: 142
port:
description: The agent always returns 1 for this item. It could be used in combination with nodata() for availability check.
inventory_link: 0
lifetime: 0
snmpv3_authprotocol: 0
snmpv3_privprotocol: 0
state: 0
snmpv3_contextname:
1 row in set (0.00 sec)

Yeah, we got the interval fixed in the meantime, so the 60 vs 90 seconds discrepancy can be disregarded.

Comment by Bart Verwilst [ 2014 May 23 ]

Patched version is currently running with debuglevel 4.

11:00:42 centos6-zabbix-zalador | zabbix | /home/verwilst # tail -f /var/log/zabbix/zabbix_server.log | grep update_item_queue
26662:20140523:110044.282 DCupdate_item_queue() itemid:71984 location:0 poller_type:0 nextcheck:'1400835704,2014.05.23 11:01:44' old_nextcheck:'1400835704,2014.05.23 11:01:44'
26659:20140523:110045.242 DCupdate_item_queue() itemid:56385 location:0 poller_type:0 nextcheck:'1400835705,2014.05.23 11:01:45' old_nextcheck:'1400835705,2014.05.23 11:01:45'
26662:20140523:110045.308 DCupdate_item_queue() itemid:61545 location:0 poller_type:0 nextcheck:'1400835705,2014.05.23 11:01:45' old_nextcheck:'1400835705,2014.05.23 11:01:45'
26664:20140523:110045.349 DCupdate_item_queue() itemid:56145 location:0 poller_type:0 nextcheck:'1400835705,2014.05.23 11:01:45' old_nextcheck:'1400835705,2014.05.23 11:01:45'
26660:20140523:110045.380 DCupdate_item_queue() itemid:71145 location:0 poller_type:0 nextcheck:'1400835705,2014.05.23 11:01:45' old_nextcheck:'1400835705,2014.05.23 11:01:45'
<snip>

This is what you added to the server with your patch, is it not? Which proves that it is working. We should have new data very soon, after the weekend at the latest.

Thanks for the patch!

Comment by Bart Verwilst [ 2014 May 23 ]

And finally, we did not patch zabbix_server in any other way.

Comment by Aleksandrs Saveljevs [ 2014 May 23 ]

Thank you, Bart! Shall be waiting for the log file after the weekend.

Comment by Bart Verwilst [ 2014 May 27 ]

As promised: http://www.trimbletl.com/files/zabbix_server_20140527_0800-0830.log.gz

Event that was triggered (recovered 5 minutes later):

Average: Host centos6-clouderamanager-caroni is unreachable
Time: 2014.05.27 08:17:30
Trigger: 109486
Event: 1396908

Comment by Aleksandrs Saveljevs [ 2014 May 28 ]

If we do a grep on that item ID, we will notice something strange:

$ grep itemid:83596 zabbix_server_20140527_0800-0830.log | grep -v -e query -e zbx_vc_
...
7991:20140527:081213.517 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171136,2014.05.27 08:12:16' old_nextcheck:'1401171136,2014.05.27 08:12:16'
7998:20140527:081216.441 In activate_host() hostid:10584 itemid:83596 type:0
7998:20140527:081216.442 DCupdate_item_queue() itemid:83596 location:0 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081314.015 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081414.645 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081515.166 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081615.685 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081716.173 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081816.700 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:081917.191 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:082017.803 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7991:20140527:082118.303 DCupdate_item_queue() itemid:83596 location:1 poller_type:0 nextcheck:'1401171196,2014.05.27 08:13:16' old_nextcheck:'1401171196,2014.05.27 08:13:16'
7998:20140527:082216.058 In activate_host() hostid:10584 itemid:83596 type:0
7998:20140527:082216.058 DCupdate_item_queue() itemid:83596 location:0 poller_type:0 nextcheck:'1401171796,2014.05.27 08:23:16' old_nextcheck:'1401171796,2014.05.27 08:23:16'
...

The strange thing is that the configuration syncer (PID 7991) keeps updating the item with a nextcheck of 08:13:16, even after that time has long passed, until the item finally gets checked, for an unknown reason, at the correct time (08:22:16), despite having a weird and outdated nextcheck.

This leads me to believe that there are some queueing problems going on, so I have added more debugging information to the patch (see nextcheck-queueing-more.patch) that might help to investigate this further. Bart, could you please try running the server with the new patch?

Comment by Bart Verwilst [ 2014 Jun 02 ]

Patched binary installed, stay tuned for some new logs.

Comment by Bart Verwilst [ 2014 Jun 02 ]

The new log for your enjoyment: http://www.trimbletl.com/files/zabbix_server_20140602_1030-1055.log.gz

The event that was triggered (and OK 5 minutes later):

Average: Host centos6-mail-mack is unreachable
Time: 2014.06.02 10:47:00
Trigger: 109395
Event: 1436455

Comment by Aleksandrs Saveljevs [ 2014 Jun 02 ]

We have found one bug in the binary heap implementation. Namely, in the zbx_binary_heap_remove_direct() function, we should not only try bubbling the replacement element down, but also try bubbling it up (see the sketch below).

Bart, I am still in the process of investigating the issue, but could you please try patching your server with nextcheck-queueing-more-plus-heap-fix.patch (this includes both the old debugging information and the heap fix) and see whether that makes it any better?
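
For illustration, here is a minimal generic min-heap sketch of the fix being described. This is not the actual zbx_binary_heap code; it only shows why, after the removed slot is filled with the heap's last element, the replacement must be sifted up as well as down (at most one of the two calls actually moves it).

#include <stdio.h>
#include <stddef.h>

static void swap_ints(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* move the element at index i towards the root while it is smaller than its parent */
static void sift_up(int *h, size_t i)
{
    while (0 < i && h[(i - 1) / 2] > h[i])
    {
        swap_ints(&h[(i - 1) / 2], &h[i]);
        i = (i - 1) / 2;
    }
}

/* move the element at index i towards the leaves while it is larger than a child */
static void sift_down(int *h, size_t n, size_t i)
{
    for (;;)
    {
        size_t l = 2 * i + 1, r = 2 * i + 2, m = i;

        if (l < n && h[l] < h[m]) m = l;
        if (r < n && h[r] < h[m]) m = r;
        if (m == i) break;

        swap_ints(&h[m], &h[i]);
        i = m;
    }
}

/* remove the element at index i from a heap of n elements; returns the new size */
static size_t heap_remove_direct(int *h, size_t n, size_t i)
{
    h[i] = h[--n];      /* move the last element into the freed slot */
    sift_down(h, n, i); /* it may be too large for this subtree ... */
    sift_up(h, i);      /* ... or too small for its new ancestors (the missing case) */
    return n;
}

int main(void)
{
    /* a valid min-heap; removing index 3 (value 21) puts the last element (4)
     * in its place, and 4 must bubble UP past its new parent 20 to keep the
     * heap valid -- the case a down-only remove misses */
    int h[] = { 1, 20, 2, 21, 22, 3, 4 };
    size_t n = heap_remove_direct(h, sizeof(h) / sizeof(h[0]), 3);

    for (size_t i = 0; i < n; i++)
        printf("%d ", h[i]);
    printf("\n");

    return 0;
}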

Comment by Bart Verwilst [ 2014 Jun 03 ]

Patch has been installed. Will keep you posted!

Comment by Bart Verwilst [ 2014 Jun 03 ]

Well, we just had another false positive:

Average: Host centos6-ldap-ledgic is unreachable
Time: 2014.06.03 13:07:00
Trigger: 109378
Event: 1442183

Corresponding logs: http://www.trimbletl.com/files/zabbix_server_20140603_1300-1314.log.gz

Comment by Aleksandrs Saveljevs [ 2014 Jun 03 ]

Thank you! But the bug with the binary heap has to be fixed anyway.

Comment by Bart Verwilst [ 2014 Jun 03 ]

Sure, we can call it the first positive side-effect of this issue. Throw any other patches my way, I'm standing by!

Comment by Aleksandrs Saveljevs [ 2014 Jun 05 ]

Just noticed a pattern in the logs:

$ grep 'get.value.agent.*centos6-clouderamanager-caroni.*agent.ping' zabbix_server_20140527_0800-0830.log
...
8000:20140527:081116.028 In get_value_agent() host:'centos6-clouderamanager-caroni' addr:'centos6-clouderamanager-caroni.adc.ttl' key:'agent.ping'
7998:20140527:081216.440 In get_value_agent() host:'centos6-clouderamanager-caroni' addr:'centos6-clouderamanager-caroni.adc.ttl' key:'agent.ping'
7998:20140527:082216.056 In get_value_agent() host:'centos6-clouderamanager-caroni' addr:'centos6-clouderamanager-caroni.adc.ttl' key:'agent.ping'
8002:20140527:082316.266 In get_value_agent() host:'centos6-clouderamanager-caroni' addr:'centos6-clouderamanager-caroni.adc.ttl' key:'agent.ping'
...
$ grep 'get.value.agent.*centos6-mail-mack.*agent.ping' zabbix_server_20140602_1030-1055.log
...
22438:20140602:104046.170 In get_value_agent() host:'centos6-mail-mack' addr:'centos6-mail-mack.adc.ttl' key:'agent.ping'
22439:20140602:104146.976 In get_value_agent() host:'centos6-mail-mack' addr:'centos6-mail-mack.adc.ttl' key:'agent.ping'
22438:20140602:105216.907 In get_value_agent() host:'centos6-mail-mack' addr:'centos6-mail-mack.adc.ttl' key:'agent.ping'
22443:20140602:105246.927 In get_value_agent() host:'centos6-mail-mack' addr:'centos6-mail-mack.adc.ttl' key:'agent.ping'
...
$ grep 'get.value.agent.*centos6-ldap-ledgic.*agent.ping' zabbix_server_20140603_1300-1314.log 
4840:20140603:130048.177 In get_value_agent() host:'centos6-ldap-ledgic' addr:'centos6-ldap-ledgic.adc.ttl' key:'agent.ping'
4834:20140603:130148.502 In get_value_agent() host:'centos6-ldap-ledgic' addr:'centos6-ldap-ledgic.adc.ttl' key:'agent.ping'
4842:20140603:131216.722 In get_value_agent() host:'centos6-ldap-ledgic' addr:'centos6-ldap-ledgic.adc.ttl' key:'agent.ping'
4844:20140603:131248.915 In get_value_agent() host:'centos6-ldap-ledgic' addr:'centos6-ldap-ledgic.adc.ttl' key:'agent.ping'
4843:20140603:131348.178 In get_value_agent() host:'centos6-ldap-ledgic' addr:'centos6-ldap-ledgic.adc.ttl' key:'agent.ping'

The item that disappears for 10 minutes always gets checked at 16 seconds past the minute.

Comment by Aleksandrs Saveljevs [ 2014 Jun 05 ]

Bart, would it be possible for you to attach the templates you have on these hosts so that I can try to reproduce the issue locally with settings as close as possible to yours?

Comment by Aleksandrs Saveljevs [ 2014 Jun 06 ]

In the last log, the following lines were interesting:

...
4834:20140603:130148.504 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:-1 old_itemid_at_index:0 old_size_of_heap:21926 old_min_nextcheck_in_heap:'1401793309,2014.06.03 13:01:49' location:0 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4834:20140603:130148.504 DCupdate_item_queue() itemid:48048 new_index_in_heap:21926 new_itemid_at_index:48048 new_size_of_heap:21927 new_min_nextcheck_in_heap:'1401793309,2014.06.03 13:01:49'
4831:20140603:130155.723 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793316,2014.06.03 13:01:56' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130256.528 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793377,2014.06.03 13:02:57' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130357.387 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21933 old_min_nextcheck_in_heap:'1401793437,2014.06.03 13:03:57' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130458.336 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21932 old_min_nextcheck_in_heap:'1401793499,2014.06.03 13:04:59' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130559.748 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793560,2014.06.03 13:06:00' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130700.487 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793620,2014.06.03 13:07:00' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130801.059 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793681,2014.06.03 13:08:01' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:130901.667 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21933 old_min_nextcheck_in_heap:'1401793742,2014.06.03 13:09:02' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:131002.400 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21930 old_min_nextcheck_in_heap:'1401793803,2014.06.03 13:10:03' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:131103.045 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:17433 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793863,2014.06.03 13:11:03' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4831:20140603:131203.620 DCupdate_item_queue() itemid:48048 state:0 old_index_in_heap:2178 old_itemid_at_index:48048 old_size_of_heap:21934 old_min_nextcheck_in_heap:'1401793924,2014.06.03 13:12:04' location:1 poller_type:0 nextcheck:'1401793368,2014.06.03 13:02:48' old_poller_type:0 old_nextcheck:'1401793368,2014.06.03 13:02:48'
4842:20140603:131216.722 DCconfig_get_poller_items() itemid:48048 nextcheck:'1401793368,2014.06.03 13:02:48'
4842:20140603:131216.722 DCconfig_get_poller_items() itemid:48048 taking item for processing
...

We can see that for 10 minutes the item stays in the heap at the same place without moving.

This motivated me to add more debugging information to the heap to try to understand why.

Bart, could you please try the new patch nextcheck-queueing-more-plus-heap-fix-and-debug.patch? It includes all previous patches.

Comment by Bart Verwilst [ 2014 Jun 10 ]

Patch installed.

Comment by Bart Verwilst [ 2014 Jun 10 ]

http://www.trimbletl.com/files/zabbix_server_20140610_1200-1230.log.bz2

Average: Host centos6-testbench-tailor is unreachable
Time: 2014.06.10 12:07:00
Trigger: 109383
Event: 1483553

and

Average: Host centos6-radius-roshi is unreachable
Time: 2014.06.10 12:17:30
Trigger: 109408
Event: 1483609

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Bart, could you please execute the following query in the meantime: "select * from items where itemid=131536\G"?

Comment by Bart Verwilst [ 2014 Jun 11 ]

mysql> select * from items where itemid=131536\G
*************************** 1. row ***************************
               itemid: 131536
                 type: 0
       snmp_community: 
             snmp_oid: 
               hostid: 10644
                 name: tupreturned
                 key_: postgresql[tupreturned]
                delay: 30
              history: 90
               trends: 365
               status: 0
           value_type: 0
        trapper_hosts: 
                units: 
           multiplier: 0
                delta: 0
  snmpv3_securityname: 
 snmpv3_securitylevel: 0
snmpv3_authpassphrase: 
snmpv3_privpassphrase: 
              formula: 1
                error: Type of received value [2535931851232.000000] is not suitable for value type [Numeric (float)]
          lastlogsize: 0
           logtimefmt: 
           templateid: 130436
           valuemapid: NULL
           delay_flex: 
               params: 
          ipmi_sensor: 
            data_type: 0
             authtype: 0
             username: 
             password: 
            publickey: 
           privatekey: 
                mtime: 0
                flags: 0
               filter: 
          interfaceid: 578
                 port: 
          description: 
       inventory_link: 0
             lifetime: 30
  snmpv3_authprotocol: 0
  snmpv3_privprotocol: 0
                state: 1
   snmpv3_contextname: 
1 row in set (0.00 sec)

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Thank you, Bart, I have found the problem. Now I just need a bit of time to formulate it.

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Broken in:

$ svn log -c 37120
------------------------------------------------------------------------
r37120 | pavels | 2013-07-18 15:30:02 +0300 (Thu, 18 Jul 2013) | 10 lines

A.F.....S. [ZBXNEXT-1689] implemented server-side value cache

Before the change:
- Information about last and previous item values was stored in the items tables which required frequent updates and caused table availability problems.

After the change:
- Zabbix server stores last item values in the value cache. Frontend retrieves those values directly from the history tables. If the history storage period is set to 0, item last value information will not be available in the overview, latest data pages and the {ITEM.LASTVALUE} macro will not work. Item last check dates will no longer be displayed for unsupported items. Item change value will no longer be displayed in the latest data page when an item receives the first value.
- Item lastclock, lastns, lastvalue, prevvalue and prevorgvalue columns have been dropped.
- Item API prevorgvalue property was removed, lastclock, lastns, lastvalue and prevvalue properties are still supported. Host.get with_historical_items and hostgroup.get with_historical_items parameters were removed.
- The item queue is now requested from the Zabbix server instead of being calculated on the frontend side.
------------------------------------------------------------------------
Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Taking the last log as an example, the story goes like this.

The "agent.ping" item was checked at 12:01:48 and was put into the queue with "nextcheck" at 12:02:48. At some point it arrived at index 7750 in the queue and sat there happily. Then, at 12:02:16, the "postgresql[tupreturned]" item mentioned above was checked and requeued with "nextcheck" at 12:02:46. During requeueing it arrived at index 3874 in the queue (the parent of the "agent.ping" item). However, database syncer later found out that the received value for "postgresql[tupreturned]" was not suitable, made it not supported and tried to requeue it with "nextcheck" at 12:12:16 (this happens at the end of DCcalculate_item_delta_float() function).

At this point all is well, except that there is a bug in DCrequeue_reachable_item() and DCrequeue_unreachable_item(): both update "dc_item->nextcheck" before remembering the old value in the "old_nextcheck" variable. Hence, in DCupdate_item_queue(), the variables "dc_item->nextcheck" and "old_nextcheck" coincide, and the queue is not updated (see the sketch below). So we end up with an inconsistent ordering in the queue: "postgresql[tupreturned]" has its "nextcheck" at 12:12:16, while "agent.ping" sits behind it with a "nextcheck" of 12:02:48. As a result, "agent.ping" only gets checked after "postgresql[tupreturned]", at 12:12:16.
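
For illustration, here is a simplified sketch of that ordering bug and the corresponding fix. It is not the actual Zabbix source: the struct, the update_item_queue() stub, and the two requeue functions are stand-ins for dc_item, DCupdate_item_queue(), and the DCrequeue_*_item() functions, kept only to show why saving old_nextcheck after the assignment turns the queue update into a no-op.

#include <stdio.h>

typedef struct { int itemid; int nextcheck; } dc_item_t;

/* hypothetical stand-in for DCupdate_item_queue(): it reorders the item in
 * the queue only when its nextcheck actually changed */
static void update_item_queue(dc_item_t *item, int old_nextcheck)
{
    if (item->nextcheck == old_nextcheck)
        printf("item %d: nextcheck unchanged, queue left as is\n", item->itemid);
    else
        printf("item %d: requeued for %d\n", item->itemid, item->nextcheck);
}

/* the broken pattern: nextcheck is overwritten before the old value is saved */
static void requeue_buggy(dc_item_t *item, int new_nextcheck)
{
    item->nextcheck = new_nextcheck;        /* updated first ...            */
    int old_nextcheck = item->nextcheck;    /* ... so the old value is lost */

    update_item_queue(item, old_nextcheck); /* always sees "no change"      */
}

/* the fixed pattern: remember the old value, then update */
static void requeue_fixed(dc_item_t *item, int new_nextcheck)
{
    int old_nextcheck = item->nextcheck;

    item->nextcheck = new_nextcheck;
    update_item_queue(item, old_nextcheck); /* sees the change and re-sorts */
}

int main(void)
{
    dc_item_t item = { 131536, 1402387336 };

    requeue_buggy(&item, 1402387936);   /* the queue never learns the new time */
    requeue_fixed(&item, 1402388536);   /* the queue is re-sorted correctly    */

    return 0;
}

Capturing the old value before the assignment lets the queue update see the change and re-sort the item, so the stale ordering described above cannot persist.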

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

The binary heap bug has been split out into ZBX-8343.

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8095 .

Comment by Bart Verwilst [ 2014 Jun 11 ]

Can we get a patch with the fix so we can test and verify it before 2.2.4 is out? Maybe without the debugging?
Thanks a lot for the awesome debugging skills!

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Bart, thank you for being so cooperative!

You can find the patch for Zabbix 2.2.3 with just the fix in "just-fix.patch".

Comment by Bart Verwilst [ 2014 Jun 11 ]

Installed, will keep you posted!

Comment by Aleksandrs Saveljevs [ 2014 Jun 11 ]

Fixed in pre-2.2.4 r46421 and pre-2.3.2 (trunk) r46422.

Comment by Bart Verwilst [ 2014 Jun 19 ]

We haven't had a single false unreachable since installing this patch! Thanks a lot!
