[ZBX-14341] Passively collected items may get many more values than one per check Created: 2018 May 09 Updated: 2018 Aug 30 Resolved: 2018 Aug 05 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G), Proxy (P), Server (S) |
Affects Version/s: | 3.4.8 |
Fix Version/s: | 4.0.0beta1, 4.0 (plan) |
Type: | Problem report | Priority: | Major |
Reporter: | Marc | Assignee: | Andris Zeila |
Resolution: | Fixed | Votes: | 0 |
Labels: | protocols | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: | ExampleItem.txt ExampleItemHistory.txt ExampleItemHistoryPerClock.txt | ||||||||||||||||
Issue Links: |
|
||||||||||||||||
Team: | Team A | ||||||||||||||||
Sprint: | Sprint 33, Sprint 34, Sprint 35, Sprint 36, Sprint 38, Sprint 39 | ||||||||||||||||
Story Points: | 2 |
Description |
Frankly, I was not able to come up with a good issue summary. Feel free to change it to a more appropriate one We noticed today Zabbix lagging behind. After a quick investigation we found out that there was always one history syncer busy what slewed down the synchronization as a whole. A quick investigation on "Triggers top 100" revealed many thousands of status changes for each of round about 15 triggers. This may happen easily with log or trapper items. In this case however, all were triggers based on web scenarios items. The triggers itself made only use of the count() function and did not create multiple events. By the way, update interval of the web scenarios in question was configured to be once a minute. So we looked on the history tables of the Zabbix server database. There we found out that for items in questions, occasionally over thousand values for the same second exist. Where it would have been expected to be only one value for the same second and that only every minute. We haven't further investigated this, but we assume this was not limited to Web scenarios as we found at least one item of different type with similar symptoms (User parameter). The only thing all seem to have in common was to be monitored by the same (passive) Zabbix proxy. Zabbix proxy and Zabbix server were on the same version.
|
Comments |
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ] | ||||||||||||||||||||
May be caused by ZBX-12678. In the following scenario due to network delays or slow processing on the server side proxy disconnects before receiving confirmation from server that data has been received and processed. But the server eventually gets the data and successfully processes them. Unfortunately, proxy will not know about it and will attempt to send same data again when server requests history next time.
With default ProxyDataFrequency of 1 second it's quite plausible that this happened 2000 times in a row. | ||||||||||||||||||||
Comment by Vladislavs Sokurenko [ 2018 May 10 ] | ||||||||||||||||||||
Thank you for your report, how big is the value of Timeout in proxy configuration file ? cannot send proxy data to server at | ||||||||||||||||||||
Comment by Marc [ 2018 May 10 ] | ||||||||||||||||||||
I can confirm ProxyDataFrequency to be on its default of 1 second. I presume the same for the proxies timeout setting. Will have to go to another laptop to check. Will report back on this later on. Don't have any evidence that the server was slow on processing (that was not related to this issue). But the connection to this particular proxy goes over a DSL fixed line with quite low bandwidth. | ||||||||||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ] | ||||||||||||||||||||
Could you also check other items monitored by that proxy? Do all of them have duplicate values around the same time? This could be a very strong evidence to support my hypothesis. | ||||||||||||||||||||
Comment by Marc [ 2018 May 10 ] | ||||||||||||||||||||
There is a ton of time outs logged in the proxy's log: [...] 2310:20180509:040140.616 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out 2310:20180509:040150.824 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out 2312:20180509:040201.580 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out 2312:20180509:040213.659 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out 2311:20180509:040220.772 received configuration data from server at "127.0.0.1", datalen 92101 2313:20180509:040226.217 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out 2311:20180509:040238.336 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out 2310:20180509:040249.494 cannot send proxy data to server at "127.0.0.1": SSL_read() timed out [...]
@vso, I have to correct myself. The proxy's Timeout setting is set to 4 seconds. | ||||||||||||||||||||
Comment by Marc [ 2018 May 10 ] | ||||||||||||||||||||
@glebs.ivanovskis, In the database I find 5408 items monitored by proxy in question that have more than one value per second. There seem 89 items to be affected by this issue. The proxy is responsible for 324 items in total. The proxy by the way does not do any log or trap based monitoring. Count of values per same clock value per item: zabbix=# SELECT zabbix-# * zabbix-# FROM (SELECT zabbix(# v.clock, zabbix(# v.itemid, zabbix(# COUNT(*) cnt zabbix(# FROM history_uint_p1525824000 v, zabbix(# items i, zabbix(# hosts h zabbix(# WHERE v.itemid = i.itemid zabbix(# AND i.hostid = h.hostid zabbix(# AND h.proxy_hostid = 12498 zabbix(# GROUP BY v.clock, zabbix(# v.itemid) s zabbix-# WHERE s.cnt > 1 zabbix-# ; clock | itemid | cnt ------------+---------+------ 1525827743 | 1108763 | 2424 1525827744 | 1108763 | 5 1525827744 | 1108824 | 2424 1525827745 | 1108824 | 5 1525827745 | 1108825 | 2424 1525827746 | 1108825 | 5 1525827747 | 520573 | 1995 1525827748 | 520623 | 2016 --- SNIP --- SNAP --- 1525832392 | 520639 | 1934 1525832394 | 520647 | 1898 1525832394 | 520651 | 1919 (5408 rows) zabbix=#
Number of affected items in total: zabbix=# SELECT zabbix-# COUNT(*) zabbix-# FROM (SELECT DISTINCT zabbix(# s.itemid zabbix(# FROM (SELECT zabbix(# v.clock, zabbix(# v.itemid, zabbix(# COUNT(*) cnt zabbix(# FROM history_uint_p1525824000 v, zabbix(# items i, zabbix(# hosts h zabbix(# WHERE v.itemid = i.itemid zabbix(# AND i.hostid = h.hostid zabbix(# AND h.proxy_hostid = 12498 zabbix(# GROUP BY v.clock, zabbix(# v.itemid) s zabbix(# WHERE s.cnt > 1) t; count ------- 89 (1 row) zabbix=# | ||||||||||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ] | ||||||||||||||||||||
Thank you for the information! Actually, if items have longer update intervals than the duration of network problems, then these items may not be affected. | ||||||||||||||||||||
Comment by Marc [ 2018 May 10 ] | ||||||||||||||||||||
I missed to check other data types too:
| ||||||||||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2018 May 10 ] | ||||||||||||||||||||
So 237 out of 324 items are affected. Could you also give us information about update intervals? Sort of: select i.delay,count(*) from hosts h,items i where h.proxy_hostid=12498 and i.hostid=h.hostid group by i.type,i.delay; | ||||||||||||||||||||
Comment by Marc [ 2018 May 10 ] | ||||||||||||||||||||
zabbix=# SELECT i.type, zabbix-# i.delay, zabbix-# COUNT(*) zabbix-# FROM hosts h, zabbix-# items i zabbix-# WHERE h.proxy_hostid = 12498 zabbix-# AND i.hostid = h.hostid zabbix-# AND i.status = 0 zabbix-# AND i.state = 0 zabbix-# GROUP BY i.type, zabbix-# i.delay zabbix-# ORDER BY i.type, zabbix-# i.delay; type | delay | count ------+-------+------- 0 | 1d | 2 0 | 1h | 21 0 | 1m | 72 0 | 2m | 1 0 | 5m | 1 5 | 10m | 1 5 | 1m | 13 5 | 30s | 1 7 | 10m | 2 7 | 1h | 6 7 | 1m | 38 7 | 30s | 1 7 | 5m | 1 7 | 5s | 13 9 | 1m | 141 9 | 20s | 24 Limited it to enabled and supported items. Please not that item's status and state do not necessarily correlate with the ones at the time of facing the issue. | ||||||||||||||||||||
Comment by Andris Zeila [ 2018 Jun 21 ] | ||||||||||||||||||||
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-14341 | ||||||||||||||||||||
Comment by Andris Zeila [ 2018 Aug 03 ] | ||||||||||||||||||||
Released in:
| ||||||||||||||||||||
Comment by richlv [ 2018 Aug 30 ] | ||||||||||||||||||||
Fixed version comment seems to be missing. Feel free to delete this comment once it is added. ...now it says fixed in alpha10, but that version does not seem to exist - beta1 does. |