[#ZBX-14341] Passively collected items may get many more values than one per check

[ZBX-14341] Passively collected items may get many more values than one per check Created: 2018 May 09 Updated: 2024 Apr 10 Resolved: 2018 Aug 05
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Agent (G), Proxy (P), Server (S)
Affects Version/s:	3.4.8
Fix Version/s:	4.0.0beta1, 4.0 (plan)

Type:

Problem report

Priority:

Major

Reporter:

Marc

Assignee:

Andris Zeila

Resolution:

Fixed

Votes:

Labels:

protocols

Remaining Estimate:

Not Specified

Time Spent:

Not Specified

Original Estimate:

Not Specified

Attachments:

ExampleItem.txt

ExampleItemHistory.txt

ExampleItemHistoryPerClock.txt

Issue Links:

Duplicate
duplicates	ZBXNEXT-3280	Zabbix protocol should prevent unnece...	Open
Sub-task
part of	ZBX-10176	agent may lose active check data	Open

Team:

Team A

Sprint:

Sprint 33, Sprint 34, Sprint 35, Sprint 36, Sprint 38, Sprint 39

Story Points:

Description

Frankly, I was not able to come up with a good issue summary. Feel free to change it to a more appropriate one

We noticed today Zabbix lagging behind. After a quick investigation we found out that there was always one history syncer busy what slewed down the synchronization as a whole.

A quick investigation on "Triggers top 100" revealed many thousands of status changes for each of round about 15 triggers.

This may happen easily with log or trapper items. In this case however, all were triggers based on web scenarios items. The triggers itself made only use of the count() function and did not create multiple events. By the way, update interval of the web scenarios in question was configured to be once a minute.

So we looked on the history tables of the Zabbix server database. There we found out that for items in questions, occasionally over thousand values for the same second exist. Where it would have been expected to be only one value for the same second and that only every minute.

We haven't further investigated this, but we assume this was not limited to Web scenarios as we found at least one item of different type with similar symptoms (User parameter).

The only thing all seem to have in common was to be monitored by the same (passive) Zabbix proxy.

Zabbix proxy and Zabbix server were on the same version.

Comments

Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

May be caused by ZBX-12678.

In the following scenario due to network delays or slow processing on the server side proxy disconnects before receiving confirmation from server that data has been received and processed. But the server eventually gets the data and successfully processes them. Unfortunately, proxy will not know about it and will attempt to send same data again when server requests history next time.

Server	Proxy
sends request
waits...	receives request
waits...	...prepares data
waits...	...sends data
receives data	waits...
processes data...	waits...
processes data...	disconnects
processes data...
sends response

With default ProxyDataFrequency of 1 second it's quite plausible that this happened 2000 times in a row.

Comment by Vladislavs Sokurenko [ 2018 May 10 ]

Thank you for your report, how big is the value of Timeout in proxy configuration file ?
Do you see such errors in Zabbix proxy log:

cannot send proxy data to server at

Comment by Marc [ 2018 May 10 ]

I can confirm ProxyDataFrequency to be on its default of 1 second. I presume the same for the proxies timeout setting. Will have to go to another laptop to check. Will report back on this later on.

Don't have any evidence that the server was slow on processing (that was not related to this issue). But the connection to this particular proxy goes over a DSL fixed line with quite low bandwidth.

Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

Could you also check other items monitored by that proxy? Do all of them have duplicate values around the same time? This could be a very strong evidence to support my hypothesis.

Comment by Marc [ 2018 May 10 ]

There is a ton of time outs logged in the proxy's log:

[...]
2310:20180509:040140.616 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2310:20180509:040150.824 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2312:20180509:040201.580 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2312:20180509:040213.659 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2311:20180509:040220.772 received configuration data from server at "127.0.0.1", datalen 92101
2313:20180509:040226.217 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2311:20180509:040238.336 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2310:20180509:040249.494 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
[...]

@vso, I have to correct myself. The proxy's Timeout setting is set to 4 seconds.

Comment by Marc [ 2018 May 10 ]

@glebs.ivanovskis, In the database I find 5408 items monitored by proxy in question that have more than one value per second. There seem 89 items to be affected by this issue. The proxy is responsible for 324 items in total. The proxy by the way does not do any log or trap based monitoring.

Count of values per same clock value per item:

zabbix=# SELECT
zabbix-# *
zabbix-# FROM (SELECT
zabbix(# v.clock,
zabbix(# v.itemid,
zabbix(# COUNT(*) cnt
zabbix(# FROM history_uint_p1525824000 v,
zabbix(# items i,
zabbix(# hosts h
zabbix(# WHERE v.itemid = i.itemid
zabbix(# AND i.hostid = h.hostid
zabbix(# AND h.proxy_hostid = 12498
zabbix(# GROUP BY v.clock,
zabbix(# v.itemid) s
zabbix-# WHERE s.cnt > 1
zabbix-# ;
clock | itemid | cnt
------------+---------+------
1525827743 | 1108763 | 2424
1525827744 | 1108763 | 5
1525827744 | 1108824 | 2424
1525827745 | 1108824 | 5
1525827745 | 1108825 | 2424
1525827746 | 1108825 | 5
1525827747 | 520573 | 1995
1525827748 | 520623 | 2016

--- SNIP --- SNAP ---

1525832392 | 520639 | 1934
1525832394 | 520647 | 1898
1525832394 | 520651 | 1919
(5408 rows)

zabbix=#

Number of affected items in total:

zabbix=# SELECT
zabbix-# COUNT(*)
zabbix-# FROM (SELECT DISTINCT
zabbix(# s.itemid
zabbix(# FROM (SELECT
zabbix(# v.clock,
zabbix(# v.itemid,
zabbix(# COUNT(*) cnt
zabbix(# FROM history_uint_p1525824000 v,
zabbix(# items i,
zabbix(# hosts h
zabbix(# WHERE v.itemid = i.itemid
zabbix(# AND i.hostid = h.hostid
zabbix(# AND h.proxy_hostid = 12498
zabbix(# GROUP BY v.clock,
zabbix(# v.itemid) s
zabbix(# WHERE s.cnt > 1) t;
count
-------
89
(1 row)

zabbix=#

Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

Thank you for the information!

Actually, if items have longer update intervals than the duration of network problems, then these items may not be affected.

Comment by Marc [ 2018 May 10 ]

I missed to check other data types too:

data type	count
history_uint	89
history	114
history_log	0
history_str	33
history_text	1

Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

So 237 out of 324 items are affected.

Could you also give us information about update intervals? Sort of:

select i.delay,count(*) from hosts h,items i where h.proxy_hostid=12498 and i.hostid=h.hostid group by i.type,i.delay;

Comment by Marc [ 2018 May 10 ]

zabbix=# SELECT i.type,
zabbix-# i.delay,
zabbix-# COUNT(*)
zabbix-# FROM hosts h,
zabbix-# items i
zabbix-# WHERE h.proxy_hostid = 12498
zabbix-# AND i.hostid = h.hostid
zabbix-# AND i.status = 0
zabbix-# AND i.state = 0
zabbix-# GROUP BY i.type,
zabbix-# i.delay
zabbix-# ORDER BY i.type,
zabbix-# i.delay;
type  | delay | count
------+-------+-------
0     | 1d    | 2
0     | 1h    | 21
0     | 1m    | 72
0     | 2m    | 1
0     | 5m    | 1
5     | 10m   | 1
5     | 1m    | 13
5     | 30s   | 1
7     | 10m   | 2
7     | 1h    | 6
7     | 1m    | 38
7     | 30s   | 1
7     | 5m    | 1
7     | 5s    | 13
9     | 1m    | 141
9     | 20s   | 24

Limited it to enabled and supported items. Please not that item's status and state do not necessarily correlate with the ones at the time of facing the issue.

Comment by Andris Zeila [ 2018 Jun 21 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-14341

Comment by Andris Zeila [ 2018 Aug 03 ]

Released in:

pre-4.0.0beta1 r83443

Comment by richlv [ 2018 Aug 30 ]

Fixed version comment seems to be missing. Feel free to delete this comment once it is added.

...now it says fixed in alpha10, but that version does not seem to exist - beta1 does.

Generated at Wed Apr 24 17:17:40 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.

[ZBX-14341] Passively collected items may get many more values than one per check Created: 2018 May 09 Updated: 2024 Apr 10 Resolved: 2018 Aug 05