[ZBX-9362] Zabbix server queued after upgrade to 2.4.4 Created: 2015 Mar 03  Updated: 2022 Oct 08  Resolved: 2015 Mar 24

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.4.2, 2.4.3, 2.4.4
Fix Version/s: None

Type: Incident report Priority: Critical
Reporter: Artur Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: queue
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: CentOS 6.4, Storage: 4x320GB RAID10, RAM: 8GB, CPU: 8 cores, running on VMware; PostgreSQL 9.3, partitioned.
Number of hosts: 5865, Number of items: 298750, Number of triggers: 86861, Number of users: 74, Required server performance, new values per second: 1795.2


Attachments: PNG File 2.4.4 Bulk queue.png     PNG File 2.4.4 bulk.png     PNG File 2.4.4-AQ.png     PNG File 2.4.4-AQS.png     PNG File 2.4.4-patch.png     File 2.4.4.-with-patch.log     PNG File 2.4.4.png     File DumpWireshark.7z     PNG File chart2.png     File dumpget.7z     Text File log-after-upgrade.txt     Text File log2.4.4Bulk.txt     File log2.4.4BulkDump2     File no-net-snmp-retries.patch     File tcpdump.7z     HTML File tcpdump2     PNG File tcpdump2-queue.png     PNG File timeout1.png     XML File zbx_export_templates.xml    
Issue Links:
Duplicate
duplicates ZBXNEXT-1096 Configurable Timeout per item (host i... Closed

 Description   

After updating Zabbix from version 2.4.1 to 2.4.2 or higher, the Zabbix queue graph looks quite anomalous. A picture is in the attachment.

Values processed by Zabbix server per second decreased from 1500 to 1300,
and the number of queued items increased from 600 to 90,000.

What information do I need to provide?



 Comments   
Comment by Aleksandrs Saveljevs [ 2015 Mar 03 ]

According to "Administration" -> "Queue", which item types are queueing? Are the queued items monitored by proxies?

Comment by Artur [ 2015 Mar 03 ]

A screenshot of "Administration" -> "Queue" is in the attachment.
I don't use proxies.

Comment by Oleksii Zagorskyi [ 2015 Mar 04 ]

As I recall, there were changes after 2.4.1 (or 2.4.2?) related to queue calculation and display.

This issue looks more like a support request than a bug report.

Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ]

To clarify, in the issue description you mention upgrade from 2.4.1 to 2.4.2, but the graphs compare 2.4.1 and 2.4.4. Do you experience the same issue with both 2.4.2 and 2.4.4? If so, we can concentrate on the changes made between 2.4.1 and 2.4.2.

The vast majority of the queueing items are SNMP items. Does Zabbix server log have any errors related to them?

Comment by Artur [ 2015 Mar 04 ]

The problem also arises in version 2.4.2.
The zabbix-server log is in the attachment.

SNMP bulk get is disabled on all hosts. The hosts do not always answer SNMP requests because the CPU on many of them is highly loaded.

Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ]

Are you using SNMP items with dynamic indices ( https://www.zabbix.com/documentation/2.4/manual/config/items/itemtypes/snmp/dynamicindex )? How many SNMP items are there on each host? Does it take a long time to snmpwalk them?
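
A rough way to measure that, assuming SNMPv2c with a placeholder community string and host address:

time snmpwalk -v2c -c public 192.0.2.1 .1.3.6.1.2.1    # time a walk over the mib-2 subtree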

Comment by Artur [ 2015 Mar 04 ]

I don't use dynamic indices ( https://www.zabbix.com/documentation/2.4/manual/config/items/itemtypes/snmp/dynamicindex ).
I only use low-level discovery on all hosts with SNMP items.

Example templates are in the attachments.
The number of SNMP items per host ranges from 100 to 3000.

Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ]

Here is one idea on what could have caused this. One ChangeLog entry for Zabbix 2.4.2 speaks about the following:

.......PS. [ZBX-8538] added Net-SNMP retry of 1 for cases where Zabbix will not be retrying itself

This means that, with bulk disabled, it will take 2 * Timeout seconds for a poller to find out that a host is unreachable. Previously, in Zabbix 2.4.1, it would only take Timeout seconds. Considering that in 2.4.1 pollers were over 80% busy, this could have led to them being 100% busy in 2.4.2.
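
For illustration, the same behaviour can be reproduced with the Net-SNMP command-line tools against an unresponsive host (192.0.2.1 and the "public" community below are placeholders; -t is the timeout in seconds, -r the number of retries):

time snmpget -v2c -c public -t 10 -r 0 192.0.2.1 .1.3.6.1.2.1.1.1.0    # ~10 s, 2.4.1 behaviour
time snmpget -v2c -c public -t 10 -r 1 192.0.2.1 .1.3.6.1.2.1.1.1.0    # ~20 s, with the ZBX-8538 retry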

Since you write that "hosts does not always answer to the snmp request because CPU of many hosts is the high-load", this looks plausible. If you increase the number of pollers, does it solve the problem?

Comment by Artur [ 2015 Mar 04 ]

Yes, there are no more queues, but StartPollers was 400 at 81% poller utilization on version 2.4.1, while on version 2.4.4 with StartPollers=800 utilization is 68%.

Graphs are in the attachments.

Is it possible to make this option configurable?

Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ]

If we provide you with a patch for Zabbix 2.4.4 that reverts ZBX-8538, will it be possible for you to recompile Zabbix with the patch applied and test?

Comment by Artur [ 2015 Mar 04 ]

Yes, it will be possible.

Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ]

Artur, please find "no-net-snmp-retries.patch" attached.
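
For readers without the attachment: conceptually, the patch reverts ZBX-8538 by setting the Net-SNMP session retry count for non-bulk requests back to zero. A hedged sketch of that kind of change is below ("retries" is the Net-SNMP session field; the actual file and context are in the attached patch and may differ):

-	session.retries = 1;	/* one Net-SNMP retry when Zabbix will not retry itself (ZBX-8538) */
+	session.retries = 0;	/* no Net-SNMP retries, as in 2.4.1 */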

Comment by Artur [ 2015 Mar 04 ]

After applying the patch, everything is fine with StartPollers=400.

The graph and log file are in the attachments.

Comment by Aleksandrs Saveljevs [ 2015 Mar 05 ]

This proves that the poller load increase in your case is due to ZBX-8538. However, ZBX-8538 is a good improvement, because it makes Zabbix more resilient to network errors. In "log-after-upgrade.txt", there are approximately 276 / 17 = 16.2 network errors per minute, whereas in "2.4.4.-with-patch.log" there are around 214 / 5 = 42.8 network errors per minute.

What is the value of the Timeout parameter in the Zabbix server configuration file? It is advised to keep it relatively small, so that operations that time out do not have a very significant effect on Zabbix performance.

Also, why do you have SNMP bulk disabled? Unless your hardware does not support bulk properly, you should use bulk, because it should lower Zabbix poller load significantly.
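
For reference, the parameters discussed here live in zabbix_server.conf; the values below only illustrate the "keep Timeout small" advice and are not a recommendation for this particular setup:

# zabbix_server.conf excerpt (illustrative values)
# Timeout applies to most checks, including SNMP and external scripts,
# so a large value makes every unanswered request expensive for a poller.
Timeout=3
# Number of pre-forked poller processes performing the checks.
StartPollers=400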

Comment by Artur [ 2015 Mar 05 ]

Timeout=10 because there are external scripts.

When I enable SNMP bulk, the pollers are always loaded at 100% even though StartPollers=800.
I thought that bulk was to blame for the dropped SNMP packets, so I disabled SNMP bulk.

Comment by Aleksandrs Saveljevs [ 2015 Mar 06 ]

Do you know the reason why pollers are fully loaded with bulk enabled? Is the reason that devices do not answer to bulk requests or answer very slowly? Does tcpdump give any clue?

Comment by Artur [ 2015 Mar 10 ]

I do not know the reason why the pollers are fully loaded.
I can make a tcpdump and attach it.

Comment by Artur [ 2015 Mar 17 ]

A tcpdump, log, and graphs are in the attachments.
This happened after enabling bulk.

Tcpdump2 was made after the command "ethtool --offload eth0 rx off tx off".

Comment by Aleksandrs Saveljevs [ 2015 Mar 17 ]

Neither of the attached tcpdump files open with Wireshark: it says "The capture file appears to be damaged or corrupt." in both cases. Could you please check?

Comment by Artur [ 2015 Mar 18 ]

The first two dumps are in text format.
DumpWireshark.7z is in Wireshark format.

Comment by Aleksandrs Saveljevs [ 2015 Mar 18 ]

Oh, it did not occur to me to check file contents. Thank you!

If we look at tcpdump.7z, we will see that there are 13759 requests and 8859 responses. This means that 4900 requests are without replies, which is around 35.6%. This is a very big loss percentage. Looking at the request packets, response loss does not seem to be connected to request size - requests with just a few variables are lost, too. So it might be that the loss is due to all requests being made simultaneously, which is handled in ZBXNEXT-2200.
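
These counts can be reproduced from the text-form dump in the same way as the grep commands further below (tcpdump.txt is a placeholder name for the text dump inside tcpdump.7z):

grep -ci 'GetRequest'  tcpdump.txt     # 13759 requests
grep -ci 'GetResponse' tcpdump.txt     # 8859 responses
# unanswered: 13759 - 8859 = 4900, i.e. 4900 / 13759 = ~35.6% loss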

With bulk enabled, Zabbix retries either one or two times, and with Timeout=10 this additional retrying seems to have a bigger impact than ZBX-8538, which introduces just one retry with bulk disabled. This is how I can explain the increased poller load currently.

Would it be possible to somehow reduce loss percentage in your network? For instance, by using Zabbix proxies with a smaller Timeout and putting them closer to the monitored SNMP devices?

Comment by Artur [ 2015 Mar 18 ]

tcpdump.7z was made before the command "ethtool --offload eth0 rx off tx off"; there were a lot of "bad udp cksum" errors, and, as I said, I have many hosts with CPU load > 80% that do not respond to all requests.

For example, dumpget.7z was made without bulk and shows about 4% loss (45593 - 43709 = 1884 requests without a response, i.e. roughly 4.1%):
cat ./tcpdumpGET | grep -i GetRequest | wc -l
45593
cat ./tcpdumpGET | grep -i GetResponse | wc -l
43709

Comment by Aleksandrs Saveljevs [ 2015 Mar 23 ]

As we have discovered above, there is no bug in Zabbix. It is just that the change in ZBX-8538, which introduced a single retry for non-bulk requests to make Zabbix more resilient to network errors, has a negative effect with a large Timeout in networks where hosts often do not respond. Therefore, it is desirable to be able to customize the number of retries. It is thus proposed to close this issue as a duplicate of ZBXNEXT-1096.

Comment by Oleksii Zagorskyi [ 2015 Mar 23 ]

I agree with Aleksandrs.

Comment by Artur [ 2015 Mar 23 ]

I also agree the ticket can be closed, but the ideal variant would be separate timeouts for external scripts and SNMP get, because I have only four hosts with a script that requires Timeout=10s.

Comment by Oleksii Zagorskyi [ 2015 Mar 23 ]

Unexpectedly, it seems we don't have a separate ZBXNEXT for individual timeout handling.

Should we edit/extend ZBXNEXT-1096 to include a configurable timeout feature? (preferred, IMO)
Or would it be better to create a separate request?

Here is a list of related issues to reference in the proper request:
ZBX-7862 - for IPMI
ZBXNEXT-2252 - for SSH
ZBXNEXT-2721 - a bit different thing, but still - for network discovery

Comment by Artur [ 2015 Mar 23 ]

Whichever is more convenient for you.
If you want, I can create a new ZBXNEXT.

Comment by Oleksii Zagorskyi [ 2015 Mar 23 ]

Artur, thanks, but do not hurry; let us discuss and decide what to do.

Comment by Artur [ 2015 Mar 23 ]

OK. I can try to reduce the timeout and attach graphs, but the external scripts will stop working.

The graph with Timeout=1 is in the attachment.

Comment by Aleksandrs Saveljevs [ 2015 Mar 24 ]

Oleksiy, feel free to extend ZBXNEXT-1096 with timeout configuration.

zalex_ua: Done!

Comment by Artur [ 2015 Mar 24 ]

Oleksiy, could a per-interface timeout be added to ZBXNEXT-1096? And a per-item timeout as well?

Comment by Oleksii Zagorskyi [ 2015 Mar 24 ]

Artur, feel free to post your point of view in the ZBXNEXT-1096.
