[ZBX-12281] SNMPv2 poller and/or proxy creates duplicate history items with same timestamp Created: 2017 Jun 14 Updated: 2024 Apr 10 Resolved: 2017 Oct 10 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P) |
Affects Version/s: | 3.2.5 |
Fix Version/s: | 3.0.12rc1, 3.2.9rc1, 3.4.3rc1, 4.0.0alpha1, 4.0 (plan) |
Type: | Problem report | Priority: | Major |
Reporter: | Jori Hämäläinen | Assignee: | Vjaceslavs Bogdanovs |
Resolution: | Fixed | Votes: | 0 |
Labels: | duplicates, history | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Zabbix-server (ext. mysql) + Zabbix-proxy(sqlite + passive-mode) collecting data |
Attachments: |
(graph screenshots attached; the comments below reference zabbix_zero.gif and zabbix_zero_value_via_polling.gif) |
||||||||
Issue Links: |
|
||||||||
Team: | |||||||||
Sprint: | Sprint 16, Sprint 17, Sprint 18 | ||||||||
Story Points: | 2 |
Description |
I noticed that on graphs there are drops to zero, and investigating further I saw the following values in Latest data. Note the timestamp "2017-06-14 06:51:28", where the database returns two values: one with zero (false) and one with the real value.
2017-06-14 07:17:14  270769855.7631
2017-06-14 07:11:33  284481628.6498
2017-06-14 07:06:01  289079965.9833
2017-06-14 07:02:37  277084788.7457
2017-06-14 06:56:01  228357732.88
2017-06-14 06:51:28  0
2017-06-14 06:51:28  249467524.7403
2017-06-14 06:46:40  263783339.0135
2017-06-14 06:41:27  243063043.2119
2017-06-14 06:36:00  218321687.2289
2017-06-14 06:31:45  222512429.1539 |
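(For reference, a duplicate pair like the one at 06:51:28 can be checked directly in the database; a minimal sketch, assuming a numeric float item stored in the history table and a hypothetical itemid, MySQL syntax:)

```sql
-- List recent values for one item; two rows with the same clock
-- (second-precision timestamp) indicate a duplicate.
SELECT itemid, clock, ns, value
  FROM history
 WHERE itemid = 71575
   AND clock >= UNIX_TIMESTAMP('2017-06-14 06:30:00')
 ORDER BY clock DESC;
```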
Comments |
Comment by Jori Hämäläinen [ 2017 Jun 14 ] | ||||||||||||
The Grafana staircase line shows the same behaviour, and I took 4 items to demonstrate that it happens from time to time with all values. | ||||||||||||
Comment by Ingus Vilnis [ 2017 Jun 14 ] | ||||||||||||
Just a few observations first. Then check Monitoring -> Graphs -> select your Zabbix server -> Zabbix data gathering process busy and see how busy the poller processes are. Please also check the zabbix_server.log file for that time period for any messages regarding those hosts. | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jun 15 ] | ||||||||||||
The polling interval is the "standard" 300 s / 5 minutes: only "Update interval", no custom intervals. Logs (server & proxy) do not reveal anything in the timeframe I described in the initial ticket, but I only have the default debug level (2?) active on the production Zabbix. Attached graph. Far-fetched but perhaps worth mentioning:
| ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jun 15 ] | ||||||||||||
Server & proxy data gathering busy graphs. The spike on the left side is where I removed half of the polled items, going from ~500k (573 nvps) to ~200k (now at 200 nvps). But the system could handle that ~600 nvps well on 3 different VMware machines: master, proxy and MySQL DB. | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jun 15 ] | ||||||||||||
I ran a direct MySQL query to evaluate the problem. For the latest 48 h, my database returned roughly 1300 duplicate lines spread over all the items I have (compared by second-precision timestamp only, not by nanosecond). ifHCInOctets had a similar number of duplicate lines. CounterChange ifHCInOctets (a delta "simple change" item) returned ~2300 duplicate database entries. ifInUcastPkts (delta "speed per second") returned ~2300 duplicate entries. ifOperStatus (stored "as is") also returned ~2400 duplicate rows. Template_ping (ping status & rtt, stored "as is") returned an empty set. So is this an SNMP-related issue? All items are SNMPv2 with bulk mode enabled on the host. | ||||||||||||
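(The exact query is not attached; a minimal sketch of a duplicate check of the kind described, comparing by second-precision timestamp only and assuming the counters live in history_uint:)

```sql
-- Count rows that share the same item and the same second-precision clock
-- within the last 48 hours (MySQL/MariaDB syntax).
SELECT itemid, clock, COUNT(*) AS cnt
  FROM history_uint
 WHERE clock > UNIX_TIMESTAMP(NOW() - INTERVAL 48 HOUR)
 GROUP BY itemid, clock
HAVING COUNT(*) > 1;
```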
Comment by Jori Hämäläinen [ 2017 Jun 15 ] | ||||||||||||
sysUpTime returned an empty set for duplicates. ENTITY-SENSOR-MIB sensor thresholds (1-4) returned 35000 duplicates in the history database. Zabbix OS Linux interface discovery prototypes:
| ||||||||||||
Comment by Oleksii Zagorskyi [ 2017 Jun 15 ] | ||||||||||||
Just theoretically: maybe duplicate OIDs exist in the bulk responses? DebugLevel should definitely be set to 3, to include warnings about network errors, which can be very relevant here. | ||||||||||||
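(For reference, a sketch of the suggested setting as it would appear in zabbix_server.conf / zabbix_proxy.conf; level 3 adds warnings on top of critical information and errors:)

```
DebugLevel=3
```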
Comment by Jori Hämäläinen [ 2017 Jun 16 ] | ||||||||||||
Here are 2 days of graphs. Somehow, for "yesterday", Zabbix has fixed itself, but the day before yesterday shows odd behaviour, mostly for packets-per-second KPIs but with some drops on the Mbps side as well. Another thing that comes to mind: as Zabbix deleted ~300k LLD items, could this be due to housekeeping database load? Is that possible? Server housekeeping settings: interval 12 h and 15k max delete. As the graphs seem to be normal now, I am not increasing the debug level to 3, but if I see this behaviour again I will check with the additional debug levels. | ||||||||||||
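(The housekeeping settings mentioned correspond to these zabbix_server.conf parameters, shown here only for reference; housekeeping can also be configured in the frontend:)

```
HousekeepingFrequency=12
MaxHousekeeperDelete=15000
```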
Comment by Jori Hämäläinen [ 2017 Jun 16 ] | ||||||||||||
I ran a query that also returns the nanosecond timestamp, and it seems that the ns values are different:
71575 1497330360 281217022.2940 643449884
Is it possible that Zabbix queried the item twice within a very short period, so that the router's ifHCInOctets had no time to increase? The difference here is 40782 ns = 41 us. In another case the difference is ~15 us. Oleksiy, disabling bulk would be an option, but right now the system has become more stable, so I will keep following it. | ||||||||||||
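(A sketch of a query of this kind, pairing same-second rows for one item and showing the gap between their nanosecond parts; the itemid and the history (float) table are assumptions based on the row above:)

```sql
-- Same item, same second, different nanoseconds (MySQL/MariaDB syntax).
SELECT a.itemid, a.clock, a.value AS value1, b.value AS value2,
       b.ns - a.ns AS ns_diff
  FROM history a
  JOIN history b
    ON a.itemid = b.itemid AND a.clock = b.clock AND a.ns < b.ns
 WHERE a.itemid = 71575
 ORDER BY a.clock DESC;
```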
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 02 ] | ||||||||||||
Is server -> proxy connection stable? | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 23 ] | ||||||||||||
ZBXNEXT-3280 may be related. | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jul 31 ] | ||||||||||||
The server - proxy connection should be fairly stable. I just returned from my yearly vacation, so I have not followed this incident for 4 weeks. Looking quickly, the graphs seem to be fine and data is being collected, so I consider the proxy->server uploads fine. ZBXNEXT-3280: I took a quick glance at it, and to me it sounds like a connectivity issue. The issue in ZBX-12281 is that some of the data is valid (like 12345 bps) while other uploaded data is invalid (zero = 0 bps). So could ZBXNEXT-3280 cause this? | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ] | ||||||||||||
Yes, I think it can. If the proxy collected values x, y, z but sent y to the server twice, the server will be processing values x, y, y, z, and if the item's Store value is Delta you will end up with y-x, y-y=0, z-y in the database. What is the proxypoller process busyness level? | ||||||||||||
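(To make the arithmetic concrete, with illustrative numbers only: if the proxy collected counter readings 100, 250 and 400 but 250 was sent twice, the server processes 100, 250, 250, 400, and a Delta (simple change) item stores 150, 0, 150. The zero in the middle is exactly the kind of drop seen on the graphs.)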
Comment by Jori Hämäläinen [ 2017 Jul 31 ] | ||||||||||||
Zero value: that sounds logical as well. I was thinking the proxy was doing the "as speed" calculation, but if it is done by the server against the master database then a zero value is more than logical. I looked at the data for the reported incident period (June 14-16) from the proxy KPI data. Zabbix cache usage, % free: minimum value >99% (history index & write cache). I added a Zabbix proxy queue graph of 2 weeks; it shows a jump from discovery & low-level discovery on the 12th of June. | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jul 31 ] | ||||||||||||
Proxy values per second and queue graph of 2 weeks. | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ] | ||||||||||||
Could you take a graph of the proxypoller process busyness item for the same period as zabbix_zero_value_via_polling.gif? | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jul 31 ] | ||||||||||||
Proxy poller busy graph added with [all] from 10/6 to 17/6. | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ] | ||||||||||||
Sorry, it's [avg]. | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jul 31 ] | ||||||||||||
New iteration of proxy poller busy with avg & max as a graph | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ] | ||||||||||||
Hm, nothing special here... There are quite substantial load spikes on June 13, but zeros/duplicates continue on June 14 when the proxy poller is pretty relaxed. It seems that on zabbix_zero.gif | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Jul 31 ] | ||||||||||||
As posted in the run on Jun 15th 13:04 (covering Jun 13th 13:00 -> Jun 15th 13:00), the direct database queries revealed many thousands of item IDs with same-timestamp data. There were different types of data (as is, delta value and delta speed), per the Jun 15th posting. This brings me back to the earlier point: "Yes, I think it can. If the proxy collected values x, y, z but sent y to the server twice, the server will be processing values x, y, y, z, and if the item's Store value is Delta you will end up with y-x, y-y=0, z-y in the database." | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 29 ] | ||||||||||||
Dear jhgrc, what is the value of the ProxyDataFrequency parameter in the server's configuration file? | ||||||||||||
Comment by Jori Hämäläinen [ 2017 Aug 30 ] | ||||||||||||
### Option: ProxyDataFrequency
ProxyDataFrequency=1
I am thinking that, as the Zabbix installation was new when the "0"-data problems occurred (a lot of low-level discoveries etc.) and I was running the proxy with SQLite (I upgraded to MariaDB later), could the problem be high load combined with the "wrong" database for that load? I think I had ~700 nvps over the proxy for a while. Now nvps has settled at around 300 nvps after making better LLD filters. | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 30 ] | ||||||||||||
Thank you! No, nothing is wrong with the database in this case. We are currently blaming a too-low ProxyDataFrequency, network latencies and the absence of proper concurrency handling on the Zabbix side when polling passive proxies. | ||||||||||||
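(For context, ProxyDataFrequency in zabbix_server.conf controls how often the server requests history data from a passive proxy, in seconds (range 1-3600, default 1). A sketch with a purely illustrative larger value to space out the requests; this is not a recommendation made in this ticket:)

```
ProxyDataFrequency=5
```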
Comment by Jori Hämäläinen [ 2017 Aug 30 ] | ||||||||||||
With Wireshark I measured >3 ms of network + Linux stack latency (measured DATA-ACK) between the server and the proxy on the Zabbix protocol port. I have no reason to believe it was different at the beginning. | ||||||||||||
Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 31 ] | ||||||||||||
vjaceslavs proposed a scenario in which a passive proxy may send the same historical data twice without any network or DB errors (the sequence can be followed in the log below):
1. The server connects to the passive proxy and requests history data (connection/request #1); trapper #1 reads proxy_history.history_lastid and selects the records after it.
2. Trapper #1 sends those records to the server and waits for the confirmation; history_lastid is only updated after the confirmation is received.
3. Meanwhile the server opens a second connection and requests history data again (connection/request #2) before trapper #1 has updated history_lastid.
4. Trapper #4 reads the still-unchanged history_lastid, selects the same records and sends them to the server a second time.
5. Both trappers eventually update history_lastid to the same value; the server has received, and processes, the same batch twice.
And here is a debug log of the proxy trappers when this is happening in the real world:
connection #1 ---->
17139:20170830:200944.906 In zbx_tls_accept()
17139:20170830:200944.907 zbx_psk_server_cb() requested PSK identity "P21"
17139:20170830:200944.907 End of zbx_tls_accept():SUCCEED (established TLSv1.2 PSK-AES128-CBC-SHA)
17139:20170830:200944.907 __zbx_zbx_setproctitle() title:'trapper #1 [processing data]'
request #1 ------->
17139:20170830:200944.908 trapper got '{"request":"history data"}'
17139:20170830:200944.908 In send_proxyhistory()
query lastid #1 -->
17139:20170830:200944.908 In proxy_get_lastid() field:'proxy_history.history_lastid'
17139:20170830:200944.908 query [txnlev:0] [select nextid from ids where table_name='proxy_history' and field_name='history_lastid']
17139:20170830:200944.910 End of proxy_get_lastid():6709738
17139:20170830:200944.910 In proxy_get_history_data()
17139:20170830:200944.910 query [txnlev:0] [select id,itemid,clock,ns,timestamp,source,severity,value,logeventid,state,lastlogsize,mtime,flags from proxy_history where id>6709738 order by id limit 1000]
history data #1 -->
17139:20170830:200944.916 End of proxy_get_history_data():149 records_processed:149 lastid:6709887
waiting... #1 ---->
17139:20170830:200944.917 In zbx_recv_response()
connection #2 ---->
17142:20170830:200945.070 In zbx_tls_accept()
17142:20170830:200945.072 zbx_psk_server_cb() requested PSK identity "P21"
17142:20170830:200945.075 End of zbx_tls_accept():SUCCEED (established TLSv1.2 PSK-AES128-CBC-SHA)
17142:20170830:200945.084 __zbx_zbx_setproctitle() title:'trapper #4 [processing data]'
confirmation #1 -->
17139:20170830:200945.103 zbx_recv_response() '{"response":"success"}'
request #2 ------->
17142:20170830:200945.107 trapper got '{"request":"history data"}'
17139:20170830:200945.113 End of zbx_recv_response():SUCCEED
17142:20170830:200945.113 In send_proxyhistory()
17139:20170830:200945.113 In proxy_set_lastid() [proxy_history.history_lastid:6709887]
query lastid #2 -->
17142:20170830:200945.114 In proxy_get_lastid() field:'proxy_history.history_lastid'
17139:20170830:200945.115 query [txnlev:0] [select 1 from ids where table_name='proxy_history' and field_name='history_lastid']
17142:20170830:200945.115 query [txnlev:0] [select nextid from ids where table_name='proxy_history' and field_name='history_lastid']
17142:20170830:200945.116 End of proxy_get_lastid():6709738
17142:20170830:200945.116 In proxy_get_history_data()
17142:20170830:200945.116 query [txnlev:0] [select id,itemid,clock,ns,timestamp,source,severity,value,logeventid,state,lastlogsize,mtime,flags from proxy_history where id>6709738 order by id limit 1000]
history data #2 -->
17142:20170830:200945.118 End of proxy_get_history_data():149 records_processed:149 lastid:6709887
waiting... #2 ---->
17142:20170830:200945.118 In zbx_recv_response()
17139:20170830:200945.120 query without transaction detected
update lastid #1 ->
17139:20170830:200945.120 query [txnlev:0] [update ids set nextid=6709887 where table_name='proxy_history' and field_name='history_lastid']
17139:20170830:200945.121 End of proxy_set_lastid()
confirmation #2 -->
17142:20170830:200945.131 zbx_recv_response() '{"response":"success"}'
17139:20170830:200945.133 End of send_proxyhistory()
17142:20170830:200945.134 End of zbx_recv_response():SUCCEED
17142:20170830:200945.134 In proxy_set_lastid() [proxy_history.history_lastid:6709887]
17139:20170830:200945.135 __zbx_zbx_setproctitle() title:'trapper #1 [processed data in 0.225406 sec, waiting for connection]'
17142:20170830:200945.135 query [txnlev:0] [select 1 from ids where table_name='proxy_history' and field_name='history_lastid']
17142:20170830:200945.135 query without transaction detected
update lastid #2 ->
17142:20170830:200945.135 query [txnlev:0] [update ids set nextid=6709887 where table_name='proxy_history' and field_name='history_lastid']
17142:20170830:200945.136 End of proxy_set_lastid()
17142:20170830:200945.136 End of send_proxyhistory()
17142:20170830:200945.136 __zbx_zbx_setproctitle() title:'trapper #4 [processed data in 0.051894 sec, waiting for connection]' | ||||||||||||
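(A purely illustrative note, not the actual ZBX-12281 fix: the log shows both trappers updating history_lastid without any guard. A conditional compare-and-set update would at least expose the race, because the second trapper's update, still carrying the old value it read, would match zero rows:)

```sql
-- Hypothetical illustration only: advance the pointer only if it still holds
-- the value this trapper read earlier; the losing trapper updates 0 rows.
UPDATE ids
   SET nextid = 6709887
 WHERE table_name = 'proxy_history'
   AND field_name = 'history_lastid'
   AND nextid = 6709738;
```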
Comment by Sergejs Paskevics [ 2017 Sep 29 ] | ||||||||||||
Successfully tested |