[ZBX-20187] Zabbix Proxy is not sending data fast enough when using Galera cluster as storage backend Created: 2021 Nov 08  Updated: 2024 Apr 12  Resolved: 2023 Mar 16

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: None
Affects Version/s: 5.4.6
Fix Version/s: None

Type: Problem report Priority: Trivial
Reporter: Majd Sarhan Assignee: Igor Gorbach
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2021-11-08-17-34-45-704.png     File zabbix-proxy.rar    

 Description   

Steps to reproduce:

  1. Deploy a Galera cluster on 3 nodes (Master-Master).
  2. Run Zabbix Proxy as a container, with one of the cluster nodes as the DB server (or a virtual IP via HAProxy).
  3. Check the data sender process: ps -ef | grep "data sender"

Result:

When the debug level is increased to 5, the following log line is repeated:

221:20211104:131507.233 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.

The function that fails is proxy_get_history_data().

Because of this, the proxy cannot send data fast enough, so data stays stuck in the queue for hours or even days.

Expected:
Data is sent within milliseconds.

The data sender process is not 100% busy.



 Comments   
Comment by Igor Gorbach [ 2021 Nov 09 ]

Hello!

Does the same issue occur on older Zabbix versions, or do you face it on 5.4.6 only?

Comment by Majd Sarhan [ 2021 Nov 09 ]

Hi, thanks a lot.

Actually, this is the first time we have used the Zabbix proxy, so yes, only on 5.4.6. Previously we used only the Zabbix server.

 

Comment by Majd Sarhan [ 2021 Nov 10 ]

Hi,
Any update on this issue?

 

Comment by Igor Gorbach [ 2021 Nov 10 ]

Hello!
Could you provide debug level logs for server trapper processes and proxy data sender process?

Comment by Majd Sarhan [ 2021 Nov 11 ]

Please find attached the proxy logs at debug level 5:
265:20211111:140129.813 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140129.916 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140129.916 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.019 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.019 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.122 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.122 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.224 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.224 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.327 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.327 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.430 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.430 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.532 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.532 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.635 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.660 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
   265:20211111:140130.762 proxy_get_history_data() 2 record(s) missing. No more retries.
   265:20211111:140130.762 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
These are the log lines I keep seeing, and I think they are related.

When I run a single MariaDB instance instead, it works fine.
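
To quantify how much time the data sender spends in this retry cycle, the pasted lines can be summarized with a short script. This is a minimal sketch that assumes the log format is exactly as shown above; the function name `summarize` and the regular expression are illustrative and not part of Zabbix.

```python
import re
from collections import Counter

# Matches lines such as:
#   265:20211111:140129.813 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.
#   265:20211111:140129.916 proxy_get_history_data() 2 record(s) missing. No more retries.
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3})\s+"
    r"proxy_get_history_data\(\) (?P<missing>\d+) record\(s\) missing\. "
    r"(?:Waiting (?P<wait>[\d.]+) sec, retrying|No more retries)\.?$"
)

def summarize(lines):
    """Count retry/give-up messages and sum the announced wait time."""
    counts = Counter()
    waited = 0.0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # unrelated log line
        if match.group("wait") is not None:
            counts["retrying"] += 1
            waited += float(match.group("wait"))
        else:
            counts["no_more_retries"] += 1
    return counts, waited

sample = [
    "265:20211111:140129.813 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.",
    "   265:20211111:140129.916 proxy_get_history_data() 2 record(s) missing. No more retries.",
    "   265:20211111:140129.916 proxy_get_history_data() 2 record(s) missing. Waiting 0.100000 sec, retrying.",
    "   265:20211111:140130.019 proxy_get_history_data() 2 record(s) missing. No more retries.",
]
counts, waited = summarize(sample)
print(dict(counts), waited)
```

Feeding it the full attached log instead of `sample` would show what fraction of wall-clock time is spent waiting.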

Comment by Majd Sarhan [ 2021 Nov 17 ]

Any update on this issue?

Comment by Igor Gorbach [ 2023 Jan 17 ]

Tried to reproduce on a 3-node Master-Master Galera cluster: no issues in 6.0/6.2.

Zabbix 5.4 is an unsupported version. If the problem also occurs on 6.0/6.2, please let us know the exact steps to reproduce.

Comment by Ash Crosby [ 2023 Feb 01 ]

I hit exactly this issue with 6.2.6, against a 3-node MariaDB 10.5.18 Galera cluster, using an identical config for my servers. The servers run fine, but the proxies deliver a steady 10 items / second, which causes the proxy queues to rapidly fill up. After confirming that the bottleneck wasn't in the network or the databases, increasing the logging level for the data sender process revealed logs identical to the above - 10 sets of:

2 record(s) missing. No more retries.

2 record(s) missing. Waiting 0.100000 sec, retrying.

After that, it appears to send 10 records. I've now replaced the databases with single instances and the issue has disappeared, but it was consistently reproducible until I dropped Galera. If others can't reproduce it, I'll spin up some more test instances and get more complete logs.
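
To illustrate why a fixed send rate of about 10 items per second fills the queue so quickly, here is a rough back-of-the-envelope sketch. The send rate comes from the observation above; the incoming rate of 100 values per second is a purely hypothetical figure, and the function name `backlog` is made up for this example.

```python
def backlog(seconds, incoming_per_sec, sent_per_sec=10.0):
    """Items waiting in the proxy after `seconds`, starting from an empty queue."""
    return max(0.0, (incoming_per_sec - sent_per_sec) * seconds)

# Hypothetical load: 100 new values per second arriving at the proxy.
print(backlog(3600, incoming_per_sec=100.0))  # queue size after one hour
```

Any incoming rate above the send rate makes the backlog grow linearly, which would be consistent with data sitting in the queue for hours or days.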

Comment by Igor Gorbach [ 2023 Feb 02 ]

Please also provide details about your Galera cluster: its configuration, and what you are using to redirect requests in case of failover.

Comment by Rath [ 2024 Apr 12 ]

Greetings!

We see the same issue with a MariaDB Galera setup.

The proxy is pulling data from its agents (5k+ items), but never sending it to the server.

 

Zabbix-Proxy: 6.4.13-1

MariaDB Server: 10.11.7

 

Even with just one node in the cluster, we hit this error (at DebugLevel=4):

> journalctl -u zabbix-proxy.service -f | grep proxy_get_history_data

> proxy_get_history_data() 1 record(s) missing. No more retries

 

We traced its source to this area in the code: https://github.com/zabbix/zabbix/tree/master/src/libs/zbxproxybuffer

I think it occurs in pb_history.c => pb_history_get_rows_db

It feels like the proxy is stuck in an endless retry loop, maybe because of some insert delay caused by Galera? (As far as I understand it, the proxy writes data to the DB and checks that all written records actually exist before sending them to the server.)

See also the comment inside the code:

> At least one record is missing. It can happen if some DB syncer process has started but not yet committed a transaction or a rollback occurred in a DB syncer.
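
The kind of check that this code comment describes can be modelled in a few lines. This is a simplified sketch, not the Zabbix implementation: it assumes that a "missing" record means a gap between consecutive row ids, and that the reader waits once per gap before giving up and accepting the row. The id step of 3 in the example is a hypothetical pattern chosen to produce "2 record(s) missing"; whether Galera's id allocation is what produces such gaps here is an assumption, not something confirmed in this thread.

```python
def simulate_reader(ids, last_id=0, wait=0.1):
    """Model: on each id gap, wait once, then give up and accept the row."""
    log = []
    waited = 0.0
    for row_id in ids:
        missing = row_id - last_id - 1
        if missing > 0:
            log.append(f"{missing} record(s) missing. Waiting {wait:f} sec, retrying.")
            waited += wait  # a real reader would sleep and re-query here
            log.append(f"{missing} record(s) missing. No more retries.")
        last_id = row_id
    return log, waited

# Hypothetical ids advancing in steps of 3: two ids are skipped between rows.
gapped_ids = list(range(3, 33, 3))   # 10 rows
contiguous_ids = list(range(1, 11))  # 10 rows, no gaps

for name, ids in (("gapped", gapped_ids), ("contiguous", contiguous_ids)):
    log, waited = simulate_reader(ids)
    print(name, "rows:", len(ids), "log lines:", len(log), "seconds waited:", round(waited, 3))
```

In this model every gapped row costs one 0.1 second wait, so 10 rows take about one second, while contiguous ids incur no wait at all.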

 

For now we have rolled the proxy back to a standalone MariaDB instance, as the proxy's data is not (that) important.

Generated at Tue Jan 07 22:27:18 EET 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.