[ZBX-6249] Zabbix Proxy drop to send some data Created: 2013 Feb 12  Updated: 2017 May 30  Resolved: 2013 Aug 16

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P)
Affects Version/s: 1.8.16, 2.0.8rc2
Fix Version/s: 1.8.18rc1, 2.0.9rc1, 2.1.3

Type: Incident report Priority: Critical
Reporter: MATSUDA Daiki Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: proxy, synchronization
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File zabbix-1.8.16-proxy_transaction_num_check_v3.patch     File zabbix-2.0.4-proxy_transaction_num_check_v3.patch     File zabbix-2.0.6-memorize_datasender_time.patch     File zabbix-2.0.6-proxy_transaction_num_check_v2.patch    

 Description   

Zabbix Proxy 1.8.x fails to send some data to the server. It depends on transaction commit delays in the DB.

In Zabbix Proxy...
Process A: DCsync_history() runs a transaction containing DCmass_proxy_add_history() and DCmass_proxy_update_items().
But DCmass_proxy_update_items() is delayed for some reason.
Process B: also calls DCmass_proxy_add_history() and commits, while process A's transaction is still open.
Process C: tries to send data to the server. It reads the data committed by process B, but since process A's transaction has not yet committed, it cannot read process A's data.
Afterwards, process A commits. But because lastid has already been updated, process A's data is never sent to the server.

So, I think that to resolve this problem, you should insert a WRITE table lock around DCmass_proxy_add_history() and DCmass_proxy_update_items().



 Comments   
Comment by Bashman [ 2013 May 08 ]

In my case, using version 1.8.15, the Zabbix proxy doesn't send all data to the server.

  9817:20130508:121800.348 Trapper got [{
        "request":"history data"}] len 28
  9817:20130508:121800.348 In send_proxyhistory()
  9817:20130508:121800.348 In proxy_get_history_data() table:'proxy_history'
  9817:20130508:121800.348 In proxy_get_lastid() [proxy_history.history_lastid]
  9817:20130508:121800.349 query [txnlev:0] [select nextid from ids where table_name='proxy_history' and field_name='history_lastid']
  9814:20130508:121800.349 zbx_recv_response() '{
        "response":"success"}'
  9814:20130508:121800.349 End of zbx_recv_response():SUCCEED
  9814:20130508:121800.349 End of send_host_availability()
  9817:20130508:121800.349 End of proxy_get_lastid():601142242
  9817:20130508:121800.349 query [txnlev:0] [select p.id,h.host,i.key_,p.clock,p.timestamp,p.source,p.severity,p.value,p.logeventid from hosts h,items i,proxy_history p where h.hostid=i.hostid and i.itemid=p.itemid and p.id>601142242 order by p.id limit 1000]
  9817:20130508:121800.350 End of proxy_get_history_data():36 lastid:601142278
  9817:20130508:121800.350 In zbx_recv_response()
  9816:20130508:121800.351 Trapper got [{
        "request":"discovery data"}] len 30
  9816:20130508:121800.352 In send_discovery_data()
  9816:20130508:121800.352 In proxy_get_history_data() table:'proxy_dhistory'
  9816:20130508:121800.352 In proxy_get_lastid() [proxy_dhistory.dhistory_lastid]
  9816:20130508:121800.352 query [txnlev:0] [select nextid from ids where table_name='proxy_dhistory' and field_name='dhistory_lastid']
  9817:20130508:121800.352 zbx_recv_response() '{
        "response":"success"}'
  9817:20130508:121800.352 End of zbx_recv_response():SUCCEED
  9817:20130508:121800.352 In proxy_set_lastid() [proxy_history.history_lastid:601142278]
  9817:20130508:121800.352 query [txnlev:0] [select 1 from ids where table_name='proxy_history' and field_name='history_lastid']
  9816:20130508:121800.352 End of proxy_get_lastid():777
  9816:20130508:121800.352 query [txnlev:0] [select p.id,p.clock,p.druleid,p.dcheckid,p.type,p.ip,p.port,p.key_,p.value,p.status from proxy_dhistory p where p.id>777 order by p.id limit 1000]
  9817:20130508:121800.353 query without transaction detected
  9817:20130508:121800.353 query [txnlev:0] [update ids set nextid=601142278 where table_name='proxy_history' and field_name='history_lastid']
  9816:20130508:121800.353 End of proxy_get_history_data():0 lastid:0
  9816:20130508:121800.353 In zbx_recv_response()
  9818:20130508:121800.353 Trapper got [{
        "request":"auto registration"}] len 33
  9818:20130508:121800.353 In send_areg_data()
  9818:20130508:121800.353 In proxy_get_history_data() table:'proxy_autoreg_host'
  9818:20130508:121800.353 In proxy_get_lastid() [proxy_autoreg_host.autoreg_host_lastid]
  9818:20130508:121800.353 query [txnlev:0] [select nextid from ids where table_name='proxy_autoreg_host' and field_name='autoreg_host_lastid']
  9816:20130508:121800.353 zbx_recv_response() '{
        "response":"success"}'
  9816:20130508:121800.353 End of zbx_recv_response():SUCCEED
  9816:20130508:121800.353 End of send_discovery_data()
  9817:20130508:121800.354 End of proxy_set_lastid()
  9817:20130508:121800.354 End of send_proxyhistory()
  9818:20130508:121800.354 End of proxy_get_lastid():10306
  9818:20130508:121800.354 query [txnlev:0] [select p.id,p.clock,p.host from proxy_autoreg_host p where p.id>10306 order by p.id limit 1000]
  9818:20130508:121800.354 End of proxy_get_history_data():0 lastid:0
  9818:20130508:121800.354 In zbx_recv_response()
  9818:20130508:121800.354 zbx_recv_response() '{
        "response":"success"}'
  9818:20130508:121800.355 End of zbx_recv_response():SUCCEED
  9818:20130508:121800.355 End of send_areg_data()
mysql> select count(*) from proxy_history;
+----------+
| count(*) |
+----------+
|    15447 |
+----------+
1 row in set (0.01 sec)

Could be related to ZBX-5448.

Comment by Bashman [ 2013 May 08 ]

It doesn't happen with icmp.

Comment by Bashman [ 2013 May 08 ]

What is the socket size limit? I created a new zabbix_proxy with fewer hosts and it works fine.

Comment by MATSUDA Daiki [ 2013 May 09 ]

The important thing is that it occurs after a commit delay, as I wrote. So, did you see a commit delay?
At least Zabbix Proxy 1.8.16, and probably also 2.0.x, does not have a good algorithm for this; it should wait.

Comment by Bashman [ 2013 May 09 ]

I don't see any delay after commit:

 9825:20130508:121929.942 Syncing ...
  9825:20130508:121929.942 In DCsync_history() history_first:1419 history_num:27
  9825:20130508:121929.942 In DCinit_nextchecks()
  9825:20130508:121929.942 query [txnlev:1] [begin;]
  9824:20130508:121929.943 Syncing ...
  9824:20130508:121929.943 In DCsync_history() history_first:1446 history_num:0
  9824:20130508:121929.943 history syncer #3 spent 0.000023 seconds while processing 0 items
  9824:20130508:121929.943 sleeping for 5 seconds
  9825:20130508:121929.943 In DCmass_proxy_add_history()
  9825:20130508:121929.943 query [txnlev:1] [insert into proxy_history (itemid,clock,value) values (49192,1368008365,'33.000000'),(58183,1368008366,'1469584703.000000'),(53185,1368008368,'0.209273');
insert into proxy_history (itemid,clock,value) values (51935,1368008364,'3937898075'),(39832,1368008365,'0'),(38753,1368008365,'5537877'),(49221,1368008362,'1'),(60111,1368008362,'1'),(44452,1368008362,'1'),(42562,1368008362,'1'),(44272,1368008362,'1'),(49852,1368008362,'1'),(50032,1368008362,'1'),(66052,1368008362,'1'),(48951,1368008362,'1'),(50031,1368008362,'1'),(50570,1368008362,'0'),(49184,1368008366,'26'),(66005,1368008367,'0'),(39065,1368008367,'3570903802'),(43824,1368008365,'1'),(50574,1368008365,'0'),(50035,1368008365,'1'),(50575,1368008365,'0'),(52014,1368008365,'1'),(50034,1368008365,'1'),(50033,1368008365,'1');
]
  9823:20130508:121929.943 Syncing ...
  9823:20130508:121929.943 In DCsync_history() history_first:1446 history_num:0
  9823:20130508:121929.943 history syncer #2 spent 0.000034 seconds while processing 0 items
  9823:20130508:121929.943 sleeping for 5 seconds
  9825:20130508:121929.944 End of DCmass_proxy_add_history()
  9825:20130508:121929.944 In DCmass_proxy_update_items()
  9825:20130508:121929.944 End of DCmass_proxy_update_items()
  9825:20130508:121929.944 query [txnlev:1] [commit;]
  9825:20130508:121929.944 In DCflush_nextchecks()
  9825:20130508:121929.944 history syncer #4 spent 0.001907 seconds while processing 27 items
  9825:20130508:121929.944 sleeping for 5 seconds
  9822:20130508:121929.996 Syncing ...
  9822:20130508:121929.997 In DCsync_history() history_first:1446 history_num:0
  9822:20130508:121929.997 history syncer #1 spent 0.000033 seconds while processing 0 items
  9822:20130508:121929.997 sleeping for 5 seconds

The problem I see is that not all the proxy data is sent to the server. Maybe it is due to the high amount of data to be sent:

Host count: 381
Item count: 2378
Required performance (vps): 10.61

I tried with a new proxy:

Host count: 4
Item count: 80
Required performance (vps): 0.37

And everything is ok.

Comment by MATSUDA Daiki [ 2013 May 10 ]

What do you think about this problem? You only paste successful logs and do not try to confirm it in the source code!

In fact, the problem occurred in our customer's environment, where the DB often delays a process's commit. And I confirmed it in the source code, for 1.8.16 and 2.0.5, against their logs. Because the history syncers work asynchronously, the race is possible. I attach the patches.

Comment by Bashman [ 2013 May 10 ]

All I know is that apparently in the logs everything is correct, but the data that proxy must send to the server is not complete, there is missing data. However, if proxy has fewer hosts, it is able to send all the data to server.

Comment by MATSUDA Daiki [ 2013 May 10 ]

No, as I wrote at first:

1. process A's transaction starts and takes lastid N
2. process B's transaction starts and takes lastid N' (N' > N)
3. process A's transaction has not ended when process B's transaction commits
4. process C starts to send the data from N + 1 to N' to the server
5. process A's transaction commits
6. the data up to N is never sent

Do you understand? And have you seen this phenomenon?
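The numbered steps above can be simulated in a few lines. This is a hypothetical Python sketch, not Zabbix code: plain integers stand in for proxy_history row ids, and a list stands in for the committed table contents.

```python
# Minimal sketch of the lastid race: two syncers insert rows inside
# transactions; the sender reads only committed rows and advances lastid
# past the still-uncommitted gap, so those rows are silently skipped.

db = []          # committed proxy_history rows (ids visible to the sender)
lastid = 0

# Process A starts a transaction and reserves ids 1..3; its commit is delayed.
a_rows = [1, 2, 3]            # not yet visible to the sender

# Process B reserves ids 4..5 and commits immediately.
db += [4, 5]

# Process C (data sender) reads everything committed with id > lastid.
sent = [r for r in db if r > lastid]
lastid = max(sent)            # lastid jumps to 5

# Process A finally commits, but its ids are now below the new lastid.
db += a_rows
missed = [r for r in db if r <= lastid and r not in sent]
print(sorted(missed))         # prints [1, 2, 3]: these rows are never sent
```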

Comment by Bashman [ 2013 May 16 ]

Yeah, I understand. You're totally right:

mysql> SELECT max(id) from proxy_history;
+---------+
| max(id) |
+---------+
| 2906356 |
+---------+
1 row in set (0.00 sec)

mysql> select * from ids;
+--------+--------------------+---------------------+---------+
| nodeid | table_name         | field_name          | nextid  |
+--------+--------------------+---------------------+---------+
|      0 | proxy_autoreg_host | autoreg_host_lastid |   66285 |
|      0 | proxy_history      | history_lastid      | 2906383 |
+--------+--------------------+---------------------+---------+
2 rows in set (0.00 sec)
Comment by AndreaConsadori [ 2013 Jul 04 ]

I also see this issue, because I have an unstable DSL connection that loses packets every 5 minutes.

Comment by Andris Mednis [ 2013 Aug 06 ]

I looked into proposed patch for 2.0.6. It sets up a small shared memory segment for keeping a counting semaphore "How many DB syncers are doing their transactions at the moment". Every "DB syncer" process before starting a transaction:

  • obtains a mutex lock, increments the semaphore by 1, releases the lock,
  • does its transaction,
  • obtains the mutex lock, decrements the semaphore by 1, releases the lock.

The "Data sender" process observes the semaphore and waits until it becomes 0 (no DB syncer is in a transaction). Waiting is done by sleeping in 0.1-second steps.
I wonder how this will behave on heavily loaded proxies? Is there a guarantee that the "Data sender" will get access to the data in a reasonable time? Some time passes between the "Data sender" seeing that the semaphore is 0 and the data actually being acquired by the "Data sender". As this is not an atomic operation, in the meantime "DB syncer" processes can do something, maybe even a whole transaction, who knows.

Comment by Andris Mednis [ 2013 Aug 07 ]

Proposed solution - modify functions proxy_get_history_data() and, maybe, proxy_get_history_data_simple():

  • when reading data from the 'proxy_history' table, check whether "id" grows in steps of 1.
  • if yes, good: there are no missing records, and we can proceed with sending data to the master server.
  • if no, some records are missing. Maybe they will be committed in a moment by one of the DB syncers, or maybe they will never arrive, if a rollback occurred in a DB syncer. In this case the Data sender sleeps for a short time (0.1 second?) and re-reads the data from the 'proxy_history' table starting from the old "lastid". This gives delayed records a chance to be included in the send to the master server. If records are still missing, we "write them off" and proceed with sending what is available.

The Data sender cannot send missing records later and cannot wait too long for them.
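The read-check-retry logic described above might look like the sketch below. This is an illustrative Python fragment, not the committed C code; `fetch()` is a hypothetical stand-in for the proxy_history query.

```python
# Sketch: after reading a batch, verify ids are contiguous; on a gap,
# sleep briefly and re-read once from the old lastid, then send what exists.
import time

def read_batch(fetch, lastid, retry_sleep=0.1):
    """fetch(lastid) returns committed ids > lastid in ascending order."""
    ids = fetch(lastid)
    expected = list(range(lastid + 1, lastid + 1 + len(ids)))
    if ids != expected:             # a gap: some ids are still uncommitted
        time.sleep(retry_sleep)     # give delayed transactions a chance
        ids = fetch(lastid)         # one re-read from the old lastid
    return ids                      # send whatever is available now
```

For example, if the first read returns ids [2, 3] (id 1 still uncommitted) and the retry returns [1, 2, 3], the delayed row is included; if it never arrives, the second read's result is sent as-is and the gap is written off.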

Comment by MATSUDA Daiki [ 2013 Aug 08 ]

> I looked into proposed patch for 2.0.6. It sets up a small shared memory segment for keeping a counting semaphore "How many DB syncers are doing their transactions at the moment". Every "DB syncer" process before starting a transaction:
>
> obtains a mutex lock, increments the semaphore by 1, releases the lock,
> does its transaction,
> obtains the mutex lock, decrements the semaphore by 1, releases the lock.

> The "Data sender" process observes the semaphore and waits until it becomes 0 (no one DB syncer is in transaction). Waiting is a sleeping for 0.1 second.

Yes.

> I wonder how this will behave on a heavy loaded proxies ?

Maybe yes. But the usual behaviour is that the 2nd transaction commits faster than the 1st transaction.

> Is there a guarantee that "Data sender" will get access to data in a reasonable time ?

Maybe no, because I do not understand 'reasonable time'. But there is no problem when the Data sender accesses the DB while no DB syncer has a transaction open.

> Some time passes between the "Data sender" seeing that the semaphore is 0 and proceeding with reading data until the data is acquired by the "Data sender". As it is not an atomic operation, in the meantime "DB syncer" processes can do something, maybe even a transaction, who knows.

I know the potential problem that the Data sender may wait forever. So, I attached a patch that limits the waiting time.

Comment by Alexander Vladishev [ 2013 Aug 14 ]

Changes in the development branch svn://svn.zabbix.com/branches/dev/ZBX-6249-trunk have been successfully reviewed with a small fix. Please review r37798.
andris Thanks! Reviewed and accepted.

Comment by Andris Mednis [ 2013 Aug 16 ]

Available in development branches:

  • svn://svn.zabbix.com/branches/dev/ZBX-6249-18 (for version 1.8)
  • svn://svn.zabbix.com/branches/dev/ZBX-6249-trunk
Comment by Alexander Vladishev [ 2013 Aug 19 ]

svn://svn.zabbix.com/branches/dev/ZBX-6249-18 was successfully tested!

Comment by Alexander Vladishev [ 2013 Aug 19 ]

svn://svn.zabbix.com/branches/dev/ZBX-6249-trunk was successfully tested!

Comment by Andris Mednis [ 2013 Aug 20 ]

Available in pre-1.8.18 r37965, pre-2.0.9 r38027 and pre-2.1.3 (trunk) r37968.

Comment by Anton Samets [ 2013 Aug 27 ]

Guys, this patch is awesome! Right now in "Administration - Queue" we see only green zeroes.
Before, there were 10-30 thousand items with a delay of more than 1 minute. The permanent IO on the main zabbix-server (visible in htop) is gone.
Recommended. Right now we are using 2.0.8 with this patch.

Comment by Andris Mednis [ 2013 Aug 27 ]

Are you using a patch from the ZBX-6249 attachments contributed by MATSUDA Daiki, or the one from 2.0.9rc1?
