[ZBX-6249] Zabbix Proxy drop to send some data Created: 2013 Feb 12 Updated: 2017 May 30 Resolved: 2013 Aug 16 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P) |
Affects Version/s: | 1.8.16, 2.0.8rc2 |
Fix Version/s: | 1.8.18rc1, 2.0.9rc1, 2.1.3 |
Type: | Incident report | Priority: | Critical |
Reporter: | MATSUDA Daiki | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 1 |
Labels: | proxy, synchronization | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: | zabbix-1.8.16-proxy_transaction_num_check_v3.patch zabbix-2.0.4-proxy_transaction_num_check_v3.patch zabbix-2.0.6-memorize_datasender_time.patch zabbix-2.0.6-proxy_transaction_num_check_v2.patch |
Description |
Zabbix Proxy 1.8.x drop to send some data. It depends on transaction delay to DB. In Zabbix Proxy... So, I think that for resolving this problem, you shuld insert the WRITE Table lock for DCmass_proxy_add_history() and DCmass_proxy_update_items(). |
Comments |
Comment by Bashman [ 2013 May 08 ] |
In my case, I'm using version 1.8.15, Zabbix proxy doesn't send all data to server. 9817:20130508:121800.348 Trapper got [{ "request":"history data"}] len 28 9817:20130508:121800.348 In send_proxyhistory() 9817:20130508:121800.348 In proxy_get_history_data() table:'proxy_history' 9817:20130508:121800.348 In proxy_get_lastid() [proxy_history.history_lastid] 9817:20130508:121800.349 query [txnlev:0] [select nextid from ids where table_name='proxy_history' and field_name='history_lastid'] 9814:20130508:121800.349 zbx_recv_response() '{ "response":"success"}' 9814:20130508:121800.349 End of zbx_recv_response():SUCCEED 9814:20130508:121800.349 End of send_host_availability() 9817:20130508:121800.349 End of proxy_get_lastid():601142242 9817:20130508:121800.349 query [txnlev:0] [select p.id,h.host,i.key_,p.clock,p.timestamp,p.source,p.severity,p.value,p.logeventid from hosts h,items i,proxy_history p where h.hostid=i.hostid and i.itemid=p.itemid and p.id>601142242 order by p.id limit 1000] 9817:20130508:121800.350 End of proxy_get_history_data():36 lastid:601142278 9817:20130508:121800.350 In zbx_recv_response() 9816:20130508:121800.351 Trapper got [{ "request":"discovery data"}] len 30 9816:20130508:121800.352 In send_discovery_data() 9816:20130508:121800.352 In proxy_get_history_data() table:'proxy_dhistory' 9816:20130508:121800.352 In proxy_get_lastid() [proxy_dhistory.dhistory_lastid] 9816:20130508:121800.352 query [txnlev:0] [select nextid from ids where table_name='proxy_dhistory' and field_name='dhistory_lastid'] 9817:20130508:121800.352 zbx_recv_response() '{ "response":"success"}' 9817:20130508:121800.352 End of zbx_recv_response():SUCCEED 9817:20130508:121800.352 In proxy_set_lastid() [proxy_history.history_lastid:601142278] 9817:20130508:121800.352 query [txnlev:0] [select 1 from ids where table_name='proxy_history' and field_name='history_lastid'] 9816:20130508:121800.352 End of proxy_get_lastid():777 9816:20130508:121800.352 query [txnlev:0] [select p.id,p.clock,p.druleid,p.dcheckid,p.type,p.ip,p.port,p.key_,p.value,p.status from proxy_dhistory p where p.id>777 order by p.id limit 1000] 9817:20130508:121800.353 query without transaction detected 9817:20130508:121800.353 query [txnlev:0] [update ids set nextid=601142278 where table_name='proxy_history' and field_name='history_lastid'] 9816:20130508:121800.353 End of proxy_get_history_data():0 lastid:0 9816:20130508:121800.353 In zbx_recv_response() 9818:20130508:121800.353 Trapper got [{ "request":"auto registration"}] len 33 9818:20130508:121800.353 In send_areg_data() 9818:20130508:121800.353 In proxy_get_history_data() table:'proxy_autoreg_host' 9818:20130508:121800.353 In proxy_get_lastid() [proxy_autoreg_host.autoreg_host_lastid] 9818:20130508:121800.353 query [txnlev:0] [select nextid from ids where table_name='proxy_autoreg_host' and field_name='autoreg_host_lastid'] 9816:20130508:121800.353 zbx_recv_response() '{ "response":"success"}' 9816:20130508:121800.353 End of zbx_recv_response():SUCCEED 9816:20130508:121800.353 End of send_discovery_data() 9817:20130508:121800.354 End of proxy_set_lastid() 9817:20130508:121800.354 End of send_proxyhistory() 9818:20130508:121800.354 End of proxy_get_lastid():10306 9818:20130508:121800.354 query [txnlev:0] [select p.id,p.clock,p.host from proxy_autoreg_host p where p.id>10306 order by p.id limit 1000] 9818:20130508:121800.354 End of proxy_get_history_data():0 lastid:0 9818:20130508:121800.354 In zbx_recv_response() 9818:20130508:121800.354 zbx_recv_response() '{ "response":"success"}' 9818:20130508:121800.355 End of zbx_recv_response():SUCCEED 9818:20130508:121800.355 End of send_areg_data() mysql> select count(*) from proxy_history; +----------+ | count(*) | +----------+ | 15447 | +----------+ 1 row in set (0.01 sec) Could be related to |
Comment by Bashman [ 2013 May 08 ] |
It doesn't happen with icmp. |
Comment by Bashman [ 2013 May 08 ] |
Which is the socket size limit?, I created a new zabbix_proxy with less hosts and it works fine. |
Comment by MATSUDA Daiki [ 2013 May 09 ] |
The important thing is that it occurs after commit delay as I wrote. So, did you see commit delay ? |
Comment by Bashman [ 2013 May 09 ] |
I don't see any delay after commitment: 9825:20130508:121929.942 Syncing ... 9825:20130508:121929.942 In DCsync_history() history_first:1419 history_num:27 9825:20130508:121929.942 In DCinit_nextchecks() 9825:20130508:121929.942 query [txnlev:1] [begin;] 9824:20130508:121929.943 Syncing ... 9824:20130508:121929.943 In DCsync_history() history_first:1446 history_num:0 9824:20130508:121929.943 history syncer #3 spent 0.000023 seconds while processing 0 items 9824:20130508:121929.943 sleeping for 5 seconds 9825:20130508:121929.943 In DCmass_proxy_add_history() 9825:20130508:121929.943 query [txnlev:1] [insert into proxy_history (itemid,clock,value) values (49192,1368008365,'33.000000'),(58183,1368008366,'1469584703.000000'),(53185,1368008368,'0.209273'); insert into proxy_history (itemid,clock,value) values (51935,1368008364,'3937898075'),(39832,1368008365,'0'),(38753,1368008365,'5537877'),(49221,1368008362,'1'),(60111,1368008362,'1'),(44452,1368008362,'1'),(42562,1368008362,'1'),(44272,1368008362,'1'),(49852,1368008362,'1'),(50032,1368008362,'1'),(66052,1368008362,'1'),(48951,1368008362,'1'),(50031,1368008362,'1'),(50570,1368008362,'0'),(49184,1368008366,'26'),(66005,1368008367,'0'),(39065,1368008367,'3570903802'),(43824,1368008365,'1'),(50574,1368008365,'0'),(50035,1368008365,'1'),(50575,1368008365,'0'),(52014,1368008365,'1'),(50034,1368008365,'1'),(50033,1368008365,'1'); ] 9823:20130508:121929.943 Syncing ... 9823:20130508:121929.943 In DCsync_history() history_first:1446 history_num:0 9823:20130508:121929.943 history syncer #2 spent 0.000034 seconds while processing 0 items 9823:20130508:121929.943 sleeping for 5 seconds 9825:20130508:121929.944 End of DCmass_proxy_add_history() 9825:20130508:121929.944 In DCmass_proxy_update_items() 9825:20130508:121929.944 End of DCmass_proxy_update_items() 9825:20130508:121929.944 query [txnlev:1] [commit;] 9825:20130508:121929.944 In DCflush_nextchecks() 9825:20130508:121929.944 history syncer #4 spent 0.001907 seconds while processing 27 items 9825:20130508:121929.944 sleeping for 5 seconds 9822:20130508:121929.996 Syncing ... 9822:20130508:121929.997 In DCsync_history() history_first:1446 history_num:0 9822:20130508:121929.997 history syncer #1 spent 0.000033 seconds while processing 0 items 9822:20130508:121929.997 sleeping for 5 seconds The problem I see is that not all the proxy data is sent to the server. May be it is due to the high amount of data to be sent: Host count: 381 I tried with a new proxy: Host count: 4 And everything is ok. |
Comment by MATSUDA Daiki [ 2013 May 10 ] |
What do you think about this problem? You only paste successful logs and do not try to confirm source code! In fact, the problem occured in our customer's environment and DB often delays for a process to commit. And I confirmed the source code with their logs, 1.8.16 and 2.0.5. Because History syncers work asynchronous, there is a possiblity. And I attatch the patches. |
Comment by Bashman [ 2013 May 10 ] |
All I know is that apparently in the logs everything is correct, but the data that proxy must send to the server is not complete, there is missing data. However, if proxy has fewer hosts, it is able to send all the data to server. |
Comment by MATSUDA Daiki [ 2013 May 10 ] |
No, as I wrote at the first, 1. process A transaction starts and takes lastvalue N Do you understand? And have you seen this phonomena? |
Comment by Bashman [ 2013 May 16 ] |
Yeah, I understand. You're totally right: mysql> SELECT max(id) from proxy_history; +---------+ | max(id) | +---------+ | 2906356 | +---------+ 1 row in set (0.00 sec) mysql> select * from ids; +--------+--------------------+---------------------+---------+ | nodeid | table_name | field_name | nextid | +--------+--------------------+---------------------+---------+ | 0 | proxy_autoreg_host | autoreg_host_lastid | 66285 | | 0 | proxy_history | history_lastid | 2906383 | +--------+--------------------+---------------------+---------+ 2 rows in set (0.00 sec) |
Comment by AndreaConsadori [ 2013 Jul 04 ] |
i also see this issue because i've an ustable dsl connection that loose packet every 5 minutes |
Comment by Andris Mednis [ 2013 Aug 06 ] |
I looked into proposed patch for 2.0.6. It sets up a small shared memory segment for keeping a counting semaphore "How many DB syncers are doing their transactions at the moment". Every "DB syncer" process before starting a transaction:
The "Data sender" process observes the semaphore and waits until it becomes 0 (no one DB syncer is in transaction). Waiting is a sleeping for 0.1 second. |
Comment by Andris Mednis [ 2013 Aug 07 ] |
Proposed solution - modify functions proxy_get_history_data() and, maybe, proxy_get_history_data_simple():
Data sender cannot send missing records later and cannot wait too long for them. |
Comment by MATSUDA Daiki [ 2013 Aug 08 ] |
> I looked into proposed patch for 2.0.6. It sets up a small shared memory segment for keeping a counting semaphore "How many DB syncers are doing their transactions at the moment". Every "DB syncer" process before starting a transaction: > The "Data sender" process observes the semaphore and waits until it becomes 0 (no one DB syncer is in transaction). Waiting is a sleeping for 0.1 second. Yes. > I wonder how this will behave on a heavy loaded proxies ? Maybe yes. But there is a usual behaviour that 2nd transaction is commited faster than 1st transaction. > Is there a guarantee that "Data sender" will get access to data in a reasonable time ? Maybe no, because I do not understand 'reasonable time'. But there is no problem when Data sender accesses the DB, no DB syncer does not start transaction. > Some time passes between the "Data sender" seeing that the semaphore is 0 and proceeding with reading data until the data is acquired by the "Data sender". As it is not an atomic operation, in the meantime "DB syncer" processes can do something, maybe even a transaction, who knows. I know the potential problem that Data sender wait for forever. So, I attached the patch to limit waiting time. |
Comment by Alexander Vladishev [ 2013 Aug 14 ] |
Changes in the development branch svn://svn.zabbix.com/branches/dev/ZBX-6249-trunk is successfully reviewed with a small fix. Please review r37798. |
Comment by Andris Mednis [ 2013 Aug 16 ] |
Available in development branches:
|
Comment by Alexander Vladishev [ 2013 Aug 19 ] |
svn://svn.zabbix.com/branches/dev/ZBX-6249-18 was successfully tested! |
Comment by Alexander Vladishev [ 2013 Aug 19 ] |
svn://svn.zabbix.com/branches/dev/ZBX-6249-trunk was successfully tested! |
Comment by Andris Mednis [ 2013 Aug 20 ] |
Available in pre-1.8.18 r37965, pre-2.0.9 r38027 and pre-2.1.3 (trunk) r37968. |
Comment by Anton Samets [ 2013 Aug 27 ] |
Guys, this patch is awesome! Right now in "Administration - Queue" we see only green zeroes at all. |
Comment by Andris Mednis [ 2013 Aug 27 ] |
Are you using a patch from |