[ZBX-12281] SNMPv2 poller and/or proxy creates duplicate history items with same timestamp Created: 2017 Jun 14  Updated: 2018 Oct 09  Resolved: 2017 Oct 10

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P)
Affects Version/s: 3.2.5
Fix Version/s: 3.0.12rc1, 3.2.9rc1, 3.4.3rc1, 4.0.0alpha1, 4.0 (plan)

Type: Problem report Priority: Major
Reporter: Jori Hämäläinen Assignee: Vjaceslavs Bogdanovs
Resolution: Fixed Votes: 0
Labels: duplicates, history
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Zabbix server (external MySQL) + Zabbix proxy (SQLite, passive mode) collecting data
Standard ifHCInOctets item saved as "speed", value type float
Searched server/proxy logs with default logging; no entries at the timestamp in question.


Attachments: PNG File zabbix-proxy-poller-busy.png     PNG File zabbix-proxy-queue.png     GIF File zabbix_zero.gif     GIF File zabbix_zero_value.gif     GIF File zabbix_zero_value_via_grafana.gif     GIF File zabbix_zero_value_via_polling.gif    
Issue Links:
Duplicate
is duplicated by ZBX-8120 Duplicated history values with zeroes... Closed
Team: Team C
Sprint: Sprint 16, Sprint 17, Sprint 18
Story Points: 2

 Description   

I noticed drops to zero on graphs and, investigating further, saw the following values in Latest data.

Here are the Latest data values behind the graph showing zero. Note timestamp "2017-06-14 06:51:28", where the database returns two values: one with zero (false) and one with the real value.

2017-06-14 07:17:14	270769855.7631
2017-06-14 07:11:33	284481628.6498
2017-06-14 07:06:01	289079965.9833
2017-06-14 07:02:37	277084788.7457
2017-06-14 06:56:01	228357732.88
2017-06-14 06:51:28	0
2017-06-14 06:51:28	249467524.7403
2017-06-14 06:46:40	263783339.0135
2017-06-14 06:41:27	243063043.2119
2017-06-14 06:36:00	218321687.2289
2017-06-14 06:31:45	222512429.1539


 Comments   
Comment by Jori Hämäläinen [ 2017 Jun 14 ]

The Grafana staircase line shows the same behaviour, and I took 4 items to demonstrate that it happens from time to time with all values.

Comment by Ingus Vilnis [ 2017 Jun 14 ]

Just a few observations first.
You are using a very short update interval for those items - just five seconds. Do you really need that many values to be collected?

Then check Monitoring -> Graphs -> select your Zabbix server -> "Zabbix data gathering process busy" and see how busy the poller processes are.

Please also check the zabbix_server.log file for that time period to see if there are any messages regarding those hosts.

Comment by Jori Hämäläinen [ 2017 Jun 15 ]

The polling interval is the "standard" 300 s / 5 minutes. Only "Update interval" is set, no custom interval.

Logs (server & proxy) do not reveal anything on timeframe like I wrote to initial ticket. But I have only default debug level (2?) active on production Zabbix.

Attached graph:
Server: 2d avg poller busy: 0.23%
Proxy: 2d avg poller busy: 0.23%

Far-fetched but perhaps worth mentioning:

  • I poll bps delta (speed) every 5 minutes as an ifHC*Octet key
  • I poll the same SNMP items for delta (change) every 5 minutes as an ifHC*OctetCounter key
  • could that trigger some weird problem? Most likely not, but..
  • 100% of monitored interfaces are LLD-discovered & prototyped - no manual if* monitoring
  • the item key's unique part is ifAlias, not ifIndex
Comment by Jori Hämäläinen [ 2017 Jun 15 ]

Server & proxy data gathering busy graphs.. The left side spike is that I removed 1/2 of polling items ~500k (573nvps) to ~200k - now at 200nvps. But the system could handle that ~600nvps well on 3 different VMWARE machines, master, proxy and mysql-db.

Comment by Jori Hämäläinen [ 2017 Jun 15 ]

I ran a direct MySQL query to evaluate the problem:

select *, count(*)
from history
where itemid in (select itemid from items where name like 'ifHCOutOctets %')
  and clock > unix_timestamp(now() - interval 48 hour)
group by itemid, clock
having count(*) > 1;

For the latest 48 h my database returned roughly 1300 duplicate rows spread over all the items I have (compared by second-precision timestamp only, not by nanosecond). ifHCInOctets had a similar number of duplicate rows.

CounterChange ifHCInOctets (item for delta simple change) returned 2300 duplicate database entries.

ifInUcastPkts (delta speed per second) returned ~2300 duplicate entries.

ifOperStatus (stored "as is") also returned duplicate rows, ~2400.

Template_ping (ping status & RTT) returned an empty set (stored "as is"). So is this an SNMP-related issue?

All items are SNMPv2 with bulk mode enabled on the host.

Comment by Jori Hämäläinen [ 2017 Jun 15 ]

sysUpTime returned an empty set for duplicates.

ENTITY-SENSOR-MIB sensor thresholds (1-4) returned 35000 duplicates in the history database.

Zabbix OS Linux Interfaces discovery prototypes:
'Incoming traffic on interface %' = 0 duplicate lines

  • not affecting Zabbix Agent items??
Comment by Oleksii Zagorskyi [ 2017 Jun 15 ]

Just theoretically - maybe duplicated OIDs exist in the bulk responses?
I'd try disabling the bulk option to see how it affects the issue.

DebugLevel should definitely be set to 3 - to include warnings about network errors, which can be very relevant.

Comment by Jori Hämäläinen [ 2017 Jun 16 ]

Here are 2 days of graphs. Somehow, for "yesterday", Zabbix has fixed itself, but the day before yesterday it shows odd behaviour, mostly for packets-per-second KPIs but also some drops on the Mbps side as well.

Another thing comes to mind: as Zabbix deleted ~300k LLD items, could this be due to housekeeping database load? Is that possible?

Server housekeeping settings: interval 12 h and 15k max delete.

As the graphs seem normal now, I am not increasing the debug level to 3. But if I see this behaviour again I'll enable the additional debug levels.

Comment by Jori Hämäläinen [ 2017 Jun 16 ]

I ran a query that also returns the nanosecond timestamp, and it seems that the ns values are different.

71575 1497330360 281217022.2940 643449884
71575 1497330660 289600151.8936 163363119
71575 1497331032 282349070.4332 64528010
71575 1497331032 0.0000 64568792
71575 1497331261 272841949.5103 15460457
71575 1497331560 269095363.6536 816515669

Is it possible that Zabbix queried the item twice within a period so short that the router's ifHCInOctets had not had time to increase? The difference here is 40782 ns = 41 µs.

Another case
71575 1497471095 686412656.2109 242346753
71575 1497471435 701657084.1854 642659885
71575 1497471683 727075549.3186 78616163
71575 1497471683 0.0000 78800521
71575 1497471960 703503365.4248 859805469
71575 1497472363 610708695.0233 986508341
71575 1497472560 673117942.1261 904039455

Diff ~15us.

Oleksii, disabling bulk would be an option, but right now the system has become more stable, so I will keep watching it.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 02 ]

Is server -> proxy connection stable?

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 23 ]

ZBXNEXT-3280 may be related.

Comment by Jori Hämäläinen [ 2017 Jul 31 ]

The server - proxy connection should be fairly stable. I just returned from my yearly vacation, so I have not followed this incident for 4 weeks. At a quick look the graphs seem fine and data is being collected, so I consider the proxy -> server uploads to be fine.

ZBXNEXT-3280 - I took a quick glance at it. To me it sounds like a connectivity issue. The issue in ZBX-12281 is that one uploaded value is valid (like 12345 bps) while another is invalid (zero = 0 bps). So could ZBXNEXT-3280 cause this?

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ]

Yes, I think it can. If the proxy collected values x, y, z but sent y to the server twice, the server will process the values x, y, y, z, and if the item is stored as Delta you will end up with y-x, y-y=0, z-y in the database.
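
For illustration, a minimal Python sketch (not Zabbix code; the counter values are made up) of how a value that arrives twice turns into a zero once delta (simple change) processing is applied:

# Sketch only: how a duplicated raw value becomes a zero under delta (simple change).
# Zabbix does this in C inside the history syncer; the values below are hypothetical.

raw_counter_values = [100_000, 350_000, 350_000, 600_000]  # x, y, y (sent twice), z

def delta_simple_change(values):
    """Return current - previous for every received value after the first one."""
    return [current - previous for previous, current in zip(values, values[1:])]

print(delta_simple_change(raw_counter_values))  # [250000, 0, 250000] - the duplicate shows up as 0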

What is proxypoller process busyness level?

Comment by Jori Hämäläinen [ 2017 Jul 31 ]

Zero value - that sounds logical as well. I was thinking the proxy was doing the "As speed" calculation, but if it is done by the server against the master database, then a zero value is more than logical.

I looked at the proxy KPI data for the reported incident period (June 14-16).

Zabbix-cache Usage % Free: min value >99% (history index & write cache)
Zabbix busy * in %: max value <2% on all but HTTP-poller and icmp pinger. Average poller process busy for 2d was 0.19%
Zabbix internal process busy, history syncer: avg 0.61% and max 4.56%.
Zabbix proxy performance: Zabbix queue is rather high, max 142K / avg 15.8k

I added a Zabbix proxy queue graph covering 2 weeks. It shows a jump caused by discovery & low-level discovery on the 12th of June.

Comment by Jori Hämäläinen [ 2017 Jul 31 ]

Proxy values per second and queue graph of 2 weeks.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ]

Could you take proxypoller process busyness item from zabbix_zero_value_via_polling.gif and plot it on a separate graph? Please make it not just [avg], but [all] (or at least add another line for [max]).

Comment by Jori Hämäläinen [ 2017 Jul 31 ]

Proxy poller busy graph added with [all] from 10/6 to 17/6.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ]

Sorry, it's [avg].

Comment by Jori Hämäläinen [ 2017 Jul 31 ]

New iteration of proxy poller busy with avg & max as a graph

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jul 31 ]

Hm, nothing special here... There are quite substantial load spikes on June 13, but the zeros/duplicates continue on June 14 when the proxypoller is pretty relaxed.

It seems that on zabbix_zero.gif many items are having issues at exactly the same time. Could you please verify this (if you still have historical data)? I would expect to see that the zeros/duplicates happen to multiple itemids at exactly the same clock and ns in the server's DB.

Comment by Jori Hämäläinen [ 2017 Jul 31 ]

As posted, the run on Jun 15th 13:04 (covering Jun 13th 13:00 -> Jun 15th 13:00) of direct database queries revealed many thousands of item IDs with same-timestamp data. There were different types of data (as is, delta value and delta speed).

From the Jun 15th posting:
> ifOperStatus (stored "as is") also returned duplicate rows, ~2400.

This brings me back to the earlier point: "Yes, I think it can. If the proxy collected values x, y, z but sent y to the server twice, the server will process the values x, y, y, z, and if the item is stored as Delta you will end up with y-x, y-y=0, z-y in the database."

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 29 ]

Dear jhgrc, what is the value of the ProxyDataFrequency parameter in the server's configuration file?

Comment by Jori Hämäläinen [ 2017 Aug 30 ]
### Option: ProxyDataFrequency
ProxyDataFrequency=1

I am wondering: the Zabbix installation was new when the "0"-data problems occurred (a lot of low-level discoveries etc.) and I was running the proxy with SQLite (I upgraded to MariaDB later) - could the problem have been the high load combined with the "wrong database" for that load? I think I had ~700 nvps over the proxy for a while. Now nvps has settled around 300 after making better LLD filters.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 30 ]

Thank you! No, nothing is wrong with the database in this case. We currently suspect the combination of a too-low ProxyDataFrequency, network latencies, and the absence of proper concurrency handling on the Zabbix side when polling passive proxies.

Comment by Jori Hämäläinen [ 2017 Aug 30 ]

With Wireshark I measured >3 ms of network + Linux stack latency (measured DATA-ACK) between server and proxy on the Zabbix protocol port. I have no reason to believe it was different at the beginning.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 31 ]

vjaceslavs proposed a scenario in which a passive proxy may send the same historical data twice without any network or DB errors:

server                                                          proxy
proxypoller #1 requests data                                    trapper #1 receives request, sends data and waits for response to mark data as sent by updating lastid
proxypoller #1 sends confirmation...                            trapper #1 still waits...
proxypoller #1 processes data and returns proxy to the queue    trapper #1 still waits...
proxypoller #2 requests data                                    trapper #2 receives request, sends the same data as trapper #1; trapper #1 still waits...
proxypoller #2 sends confirmation...                            trapper #1 and trapper #2 receive confirmations and finally update lastid
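
A rough Python sketch of this race (purely illustrative, not Zabbix code; the row contents and the get_history_batch helper are made up): both trappers read the same history_lastid before either of them has received its confirmation and updated it, so the same batch of proxy_history rows goes to the server twice:

# Illustrative sketch of the lastid race described above (not Zabbix code).
# Both trappers read lastid before either one has updated it, so the same
# batch of proxy_history rows is returned (and sent to the server) twice.

proxy_history = [(rowid, f"value-{rowid}") for rowid in range(1, 11)]  # pretend proxy_history rows
ids = {"history_lastid": 0}                                            # pretend 'ids' table

def get_history_batch(lastid, limit=1000):
    """Roughly: select ... from proxy_history where id > lastid order by id limit <limit>."""
    return [row for row in proxy_history if row[0] > lastid][:limit]

# trapper #1 handles the request from proxypoller #1
lastid_seen_by_1 = ids["history_lastid"]       # proxy_get_lastid() -> 0
batch_1 = get_history_batch(lastid_seen_by_1)  # rows 1..10, then waits for the confirmation

# before trapper #1 gets its confirmation, proxypoller #2 asks again
lastid_seen_by_2 = ids["history_lastid"]       # still 0 - lastid has not been updated yet!
batch_2 = get_history_batch(lastid_seen_by_2)  # the very same rows 1..10

# both confirmations arrive, both trappers update lastid to the same value
ids["history_lastid"] = batch_1[-1][0]
ids["history_lastid"] = batch_2[-1][0]

print(batch_1 == batch_2)  # True - the server has received every row twice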

And here is a debug log of proxy trappers when this is happening in the real world:

connection #1 ----> 17139:20170830:200944.906 In zbx_tls_accept()
                    17139:20170830:200944.907 zbx_psk_server_cb() requested PSK identity "P21"
                    17139:20170830:200944.907 End of zbx_tls_accept():SUCCEED (established TLSv1.2 PSK-AES128-CBC-SHA)
                    17139:20170830:200944.907 __zbx_zbx_setproctitle() title:'trapper #1 [processing data]'
request #1 -------> 17139:20170830:200944.908 trapper got '{"request":"history data"}'
                    17139:20170830:200944.908 In send_proxyhistory()
query lastid #1 --> 17139:20170830:200944.908 In proxy_get_lastid() field:'proxy_history.history_lastid'
                    17139:20170830:200944.908 query [txnlev:0] [select nextid from ids where table_name='proxy_history' and field_name='history_lastid']
                    17139:20170830:200944.910 End of proxy_get_lastid():6709738
                    17139:20170830:200944.910 In proxy_get_history_data()
                    17139:20170830:200944.910 query [txnlev:0] [select id,itemid,clock,ns,timestamp,source,severity,value,logeventid,state,lastlogsize,mtime,flags from proxy_history where id>6709738 order by id limit 1000]
history data #1 --> 17139:20170830:200944.916 End of proxy_get_history_data():149 records_processed:149 lastid:6709887
waiting... #1 ----> 17139:20170830:200944.917 In zbx_recv_response()
connection #2 ----> 17142:20170830:200945.070 In zbx_tls_accept()
                    17142:20170830:200945.072 zbx_psk_server_cb() requested PSK identity "P21"
                    17142:20170830:200945.075 End of zbx_tls_accept():SUCCEED (established TLSv1.2 PSK-AES128-CBC-SHA)
                    17142:20170830:200945.084 __zbx_zbx_setproctitle() title:'trapper #4 [processing data]'
confirmation #1 --> 17139:20170830:200945.103 zbx_recv_response() '{"response":"success"}'
request #2 -------> 17142:20170830:200945.107 trapper got '{"request":"history data"}'
                    17139:20170830:200945.113 End of zbx_recv_response():SUCCEED
                    17142:20170830:200945.113 In send_proxyhistory()
                    17139:20170830:200945.113 In proxy_set_lastid() [proxy_history.history_lastid:6709887]
query lastid #2 --> 17142:20170830:200945.114 In proxy_get_lastid() field:'proxy_history.history_lastid'
                    17139:20170830:200945.115 query [txnlev:0] [select 1 from ids where table_name='proxy_history' and field_name='history_lastid']
                    17142:20170830:200945.115 query [txnlev:0] [select nextid from ids where table_name='proxy_history' and field_name='history_lastid']
                    17142:20170830:200945.116 End of proxy_get_lastid():6709738
                    17142:20170830:200945.116 In proxy_get_history_data()
                    17142:20170830:200945.116 query [txnlev:0] [select id,itemid,clock,ns,timestamp,source,severity,value,logeventid,state,lastlogsize,mtime,flags from proxy_history where id>6709738 order by id limit 1000]
history data #2 --> 17142:20170830:200945.118 End of proxy_get_history_data():149 records_processed:149 lastid:6709887
waiting... #2 ----> 17142:20170830:200945.118 In zbx_recv_response()
                    17139:20170830:200945.120 query without transaction detected
update lastid #1 -> 17139:20170830:200945.120 query [txnlev:0] [update ids set nextid=6709887 where table_name='proxy_history' and field_name='history_lastid']
                    17139:20170830:200945.121 End of proxy_set_lastid()
confirmation #2 --> 17142:20170830:200945.131 zbx_recv_response() '{"response":"success"}'
                    17139:20170830:200945.133 End of send_proxyhistory()
                    17142:20170830:200945.134 End of zbx_recv_response():SUCCEED
                    17142:20170830:200945.134 In proxy_set_lastid() [proxy_history.history_lastid:6709887]
                    17139:20170830:200945.135 __zbx_zbx_setproctitle() title:'trapper #1 [processed data in 0.225406 sec, waiting for connection]'
                    17142:20170830:200945.135 query [txnlev:0] [select 1 from ids where table_name='proxy_history' and field_name='history_lastid']
                    17142:20170830:200945.135 query without transaction detected
update lastid #2 -> 17142:20170830:200945.135 query [txnlev:0] [update ids set nextid=6709887 where table_name='proxy_history' and field_name='history_lastid']
                    17142:20170830:200945.136 End of proxy_set_lastid()
                    17142:20170830:200945.136 End of send_proxyhistory()
                    17142:20170830:200945.136 __zbx_zbx_setproctitle() title:'trapper #4 [processed data in 0.051894 sec, waiting for connection]'
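
One conceptual way to close this window (a sketch only, assuming a per-proxy lock; the poll_proxy_history and exchange names are hypothetical and this is not necessarily how the actual fix is implemented) is to make sure a passive proxy is never handed to a second poller before the previous request/confirm/lastid-update cycle has completed:

# Sketch of the idea only (not the actual Zabbix fix): serialize the history
# exchange per proxy, so a second poller cannot read the same rows before
# history_lastid has been updated.
import threading

proxy_locks = {}                 # one lock per proxy (hypothetical structure)
locks_guard = threading.Lock()   # protects creation of per-proxy locks

def poll_proxy_history(proxy_name, exchange):
    """Run the request -> confirm -> lastid-update exchange for one proxy at a time."""
    with locks_guard:
        lock = proxy_locks.setdefault(proxy_name, threading.Lock())
    with lock:
        # 'exchange' performs the whole cycle; while the lock is held no other
        # poller can start a second, overlapping request to the same proxy.
        return exchange(proxy_name)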
Comment by Sergejs Paskevics [ 2017 Sep 29 ]

Successfully tested
