[ZBX-14341] Passively collected items may get many more values than one per check Created: 2018 May 09  Updated: 2024 Apr 10  Resolved: 2018 Aug 05

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G), Proxy (P), Server (S)
Affects Version/s: 3.4.8
Fix Version/s: 4.0.0beta1, 4.0 (plan)

Type: Problem report Priority: Major
Reporter: Marc Assignee: Andris Zeila
Resolution: Fixed Votes: 0
Labels: protocols
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File ExampleItem.txt     Text File ExampleItemHistory.txt     Text File ExampleItemHistoryPerClock.txt    
Issue Links:
Duplicate
duplicates ZBXNEXT-3280 Zabbix protocol should prevent unnece... Open
Sub-task
part of ZBX-10176 agent may lose active check data Open
Team: Team A
Sprint: Sprint 33, Sprint 34, Sprint 35, Sprint 36, Sprint 38, Sprint 39
Story Points: 2

 Description   

Frankly, I was not able to come up with a good issue summary. Feel free to change it to a more appropriate one

We noticed today Zabbix lagging behind. After a quick investigation we found out that there was always one history syncer busy what slewed down the synchronization as a whole.

A quick investigation on "Triggers top 100" revealed many thousands of status changes for each of round about 15 triggers.

This may happen easily with log or trapper items. In this case however, all were triggers based on web scenarios items. The triggers itself made only use of the count() function and did not create multiple events. By the way, update interval of the web scenarios in question was configured to be once a minute.

So we looked on the history tables of the Zabbix server database. There we found out that for items in questions, occasionally over thousand values for the same second exist. Where it would have been expected to be only one value for the same second and that only every minute.

We haven't further investigated this, but we assume this was not limited to Web scenarios as we found at least one item of different type with similar symptoms (User parameter).

The only thing all seem to have in common was to be monitored by the same (passive) Zabbix proxy.

Zabbix proxy and Zabbix server were on the same version.

 



 Comments   
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

May be caused by ZBX-12678.

In the following scenario due to network delays or slow processing on the server side proxy disconnects before receiving confirmation from server that data has been received and processed. But the server eventually gets the data and successfully processes them. Unfortunately, proxy will not know about it and will attempt to send same data again when server requests history next time.

Server Proxy
sends request  
waits... receives request
waits... ...prepares data
waits... ...sends data
receives data waits...
processes data... waits...
processes data... disconnects
processes data...  
sends response  

With default ProxyDataFrequency of 1 second it's quite plausible that this happened 2000 times in a row.

Comment by Vladislavs Sokurenko [ 2018 May 10 ]

Thank you for your report, how big is the value of Timeout in proxy configuration file ?
Do you see such errors in Zabbix proxy log:

cannot send proxy data to server at
Comment by Marc [ 2018 May 10 ]

I can confirm ProxyDataFrequency to be on its default of 1 second. I presume the same for the proxies timeout setting. Will have to go to another laptop to check. Will report back on this later on.

Don't have any evidence that the server was slow on processing (that was not related to this issue). But the connection to this particular proxy goes over a DSL fixed line with quite low bandwidth.

Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

Could you also check other items monitored by that proxy? Do all of them have duplicate values around the same time? This could be a very strong evidence to support my hypothesis.

Comment by Marc [ 2018 May 10 ]

There is a ton of time outs logged in the proxy's log:

[...]
2310:20180509:040140.616 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2310:20180509:040150.824 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2312:20180509:040201.580 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2312:20180509:040213.659 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2311:20180509:040220.772 received configuration data from server at "127.0.0.1", datalen 92101
2313:20180509:040226.217 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2311:20180509:040238.336 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
2310:20180509:040249.494 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out
[...]

 

@vso, I have to correct myself. The proxy's Timeout setting is set to 4 seconds.

Comment by Marc [ 2018 May 10 ]

@glebs.ivanovskis, In the database I find 5408 items monitored by proxy in question that have more than one value per second. There seem 89 items to be affected by this issue. The proxy is responsible for 324 items in total. The proxy by the way does not do any log or trap based monitoring.

Count of values per same clock value per item:

zabbix=# SELECT
zabbix-# *
zabbix-# FROM (SELECT
zabbix(# v.clock,
zabbix(# v.itemid,
zabbix(# COUNT(*) cnt
zabbix(# FROM history_uint_p1525824000 v,
zabbix(# items i,
zabbix(# hosts h
zabbix(# WHERE v.itemid = i.itemid
zabbix(# AND i.hostid = h.hostid
zabbix(# AND h.proxy_hostid = 12498
zabbix(# GROUP BY v.clock,
zabbix(# v.itemid) s
zabbix-# WHERE s.cnt > 1
zabbix-# ;
clock | itemid | cnt
------------+---------+------
1525827743 | 1108763 | 2424
1525827744 | 1108763 | 5
1525827744 | 1108824 | 2424
1525827745 | 1108824 | 5
1525827745 | 1108825 | 2424
1525827746 | 1108825 | 5
1525827747 | 520573 | 1995
1525827748 | 520623 | 2016

--- SNIP --- SNAP ---

1525832392 | 520639 | 1934
1525832394 | 520647 | 1898
1525832394 | 520651 | 1919
(5408 rows)

zabbix=#

 

Number of affected items in total:

zabbix=# SELECT
zabbix-# COUNT(*)
zabbix-# FROM (SELECT DISTINCT
zabbix(# s.itemid
zabbix(# FROM (SELECT
zabbix(# v.clock,
zabbix(# v.itemid,
zabbix(# COUNT(*) cnt
zabbix(# FROM history_uint_p1525824000 v,
zabbix(# items i,
zabbix(# hosts h
zabbix(# WHERE v.itemid = i.itemid
zabbix(# AND i.hostid = h.hostid
zabbix(# AND h.proxy_hostid = 12498
zabbix(# GROUP BY v.clock,
zabbix(# v.itemid) s
zabbix(# WHERE s.cnt > 1) t;
count
-------
89
(1 row)

zabbix=#
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

Thank you for the information!

Actually, if items have longer update intervals than the duration of network problems, then these items may not be affected.

Comment by Marc [ 2018 May 10 ]

I missed to check other data types too:

data type count
history_uint 89
history 114
history_log 0
history_str 33
history_text 1
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ]

So 237 out of 324 items are affected.

Could you also give us information about update intervals? Sort of:

select i.delay,count(*) from hosts h,items i where h.proxy_hostid=12498 and i.hostid=h.hostid group by i.type,i.delay;
Comment by Marc [ 2018 May 10 ]
zabbix=# SELECT i.type,
zabbix-# i.delay,
zabbix-# COUNT(*)
zabbix-# FROM hosts h,
zabbix-# items i
zabbix-# WHERE h.proxy_hostid = 12498
zabbix-# AND i.hostid = h.hostid
zabbix-# AND i.status = 0
zabbix-# AND i.state = 0
zabbix-# GROUP BY i.type,
zabbix-# i.delay
zabbix-# ORDER BY i.type,
zabbix-# i.delay;
type  | delay | count
------+-------+-------
0     | 1d    | 2
0     | 1h    | 21
0     | 1m    | 72
0     | 2m    | 1
0     | 5m    | 1
5     | 10m   | 1
5     | 1m    | 13
5     | 30s   | 1
7     | 10m   | 2
7     | 1h    | 6
7     | 1m    | 38
7     | 30s   | 1
7     | 5m    | 1
7     | 5s    | 13
9     | 1m    | 141
9     | 20s   | 24

Limited it to enabled and supported items. Please not that item's status and state do not necessarily correlate with the ones at the time of facing the issue. 

Comment by Andris Zeila [ 2018 Jun 21 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-14341

Comment by Andris Zeila [ 2018 Aug 03 ]

Released in:

  • pre-4.0.0beta1 r83443
Comment by richlv [ 2018 Aug 30 ]

Fixed version comment seems to be missing. Feel free to delete this comment once it is added.

...now it says fixed in alpha10, but that version does not seem to exist - beta1 does.

Generated at Wed Apr 24 17:17:40 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.