[ZBX-9362] Zabbix server queued after upgrade to 2.4.4 Created: 2015 Mar 03 Updated: 2022 Oct 08 Resolved: 2015 Mar 24 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.4.2, 2.4.3, 2.4.4 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Critical |
Reporter: | Artur | Assignee: | Unassigned |
Resolution: | Duplicate | Votes: | 0 |
Labels: | queue | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
OS: CentOS 6.4. Storage: 4x320GB RAID10, RAM: 8GB, CPU: 8 cores. VMware. PostgreSQL 9.3, partitioned. |
Attachments: |
Queue graphs, poller utilization graphs, example templates, no-net-snmp-retries.patch, tcpdump captures and log files (image previews omitted). |
Issue Links: |
|
Description |
After upgrading Zabbix from version 2.4.1 to 2.4.2 or higher, the Zabbix queue graph looks quite anomalous. A picture is in the attachment. Values processed by Zabbix server per second decreased from 1500 to 1300. What information do I need to provide? |
Comments |
Comment by Aleksandrs Saveljevs [ 2015 Mar 03 ] |
According to "Administration" -> "Queue", which item types are queueing? Are the queued items monitored by proxies? |
Comment by Artur [ 2015 Mar 03 ] |
A picture of "Administration" -> "Queue" is in the attachment. |
Comment by Oleksii Zagorskyi [ 2015 Mar 04 ] |
As I recall, there were changes after 2.4.1 (or 2.4.2?) related to queue calculation and display. This issue looks more like a support request than a bug report. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ] |
To clarify, in the issue description you mention upgrade from 2.4.1 to 2.4.2, but the graphs compare 2.4.1 and 2.4.4. Do you experience the same issue with both 2.4.2 and 2.4.4? If so, we can concentrate on the changes made between 2.4.1 and 2.4.2. The vast majority of the queueing items are SNMP items. Does Zabbix server log have any errors related to them? |
Comment by Artur [ 2015 Mar 04 ] |
The problem also arises in version 2.4.2. Bulk get is disabled on all hosts. The hosts do not always answer SNMP requests because the CPUs of many hosts are highly loaded. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ] |
Are you using SNMP items with dynamic indices ( https://www.zabbix.com/documentation/2.4/manual/config/items/itemtypes/snmp/dynamicindex )? How many SNMP items are there on each host? Does it take a long time to snmpwalk them? |
Comment by Artur [ 2015 Mar 04 ] |
I don't use dynamic indices ( https://www.zabbix.com/documentation/2.4/manual/config/items/itemtypes/snmp/dynamicindex ). Example templates in the attachments. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ] |
Here is one idea on what could have caused this. One ChangeLog entry for Zabbix 2.4.2 speaks about the following:

.......PS. [ZBX-8538] added Net-SNMP retry of 1 for cases where Zabbix will not be retrying itself

This means that, with bulk disabled, it will take 2 * Timeout seconds for a poller to find out that a host is unreachable. Previously, in Zabbix 2.4.1, it would only take Timeout seconds. Considering that in 2.4.1 the pollers were over 80% busy, this could have led to them being 100% busy in 2.4.2. Since you write that hosts do not always answer SNMP requests because many of them are highly CPU-loaded, this looks plausible.

If you increase the number of pollers, does it solve the problem? |
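For readers who want to see this timing effect outside Zabbix, below is a minimal Net-SNMP sketch (an illustration, not Zabbix source code); the host address, community string and OID are placeholders. With session.retries = 1, a synchronous GET against an unreachable host blocks for roughly (retries + 1) * timeout, which is the 2 * Timeout behaviour described above.

```c
/*
 * Minimal sketch (not Zabbix code): a Net-SNMP synchronous GET with
 * retries = 1 blocks for roughly (retries + 1) * timeout against an
 * unreachable host, i.e. 2 * Timeout with Timeout=10.
 *
 * Build (assumption): gcc retry_demo.c -o retry_demo `net-snmp-config --libs`
 */
#include <stdio.h>
#include <string.h>
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>

int main(void)
{
    netsnmp_session session, *ss;
    netsnmp_pdu *pdu, *response = NULL;
    oid anOID[MAX_OID_LEN];
    size_t anOID_len = MAX_OID_LEN;

    init_snmp("retry-demo");

    snmp_sess_init(&session);
    session.peername = "192.0.2.1";             /* placeholder: unreachable test host */
    session.version = SNMP_VERSION_2c;
    session.community = (u_char *)"public";     /* placeholder community */
    session.community_len = strlen("public");
    session.retries = 1;                        /* the retry added in 2.4.2 (ZBX-8538) */
    session.timeout = 10 * 1000000L;            /* 10 s, like Timeout=10, in microseconds */

    if ((ss = snmp_open(&session)) == NULL) {
        snmp_perror("snmp_open");
        return 1;
    }

    pdu = snmp_pdu_create(SNMP_MSG_GET);
    read_objid(".1.3.6.1.2.1.1.1.0", anOID, &anOID_len);   /* SNMPv2-MIB::sysDescr.0 */
    snmp_add_null_var(pdu, anOID, anOID_len);

    /* With no response this returns STAT_TIMEOUT after about 20 s (2 * 10 s). */
    if (snmp_synch_response(ss, pdu, &response) == STAT_SUCCESS)
        printf("got a response\n");
    else
        printf("timed out after (retries + 1) * timeout\n");

    if (response != NULL)
        snmp_free_pdu(response);
    snmp_close(ss);

    return 0;
}
```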
Comment by Artur [ 2015 Mar 04 ] |
Yes, no more queues. On version 2.4.1 StartPollers was 400 at 81% poller utilization; on version 2.4.4, with StartPollers=800, utilization is 68%. Graphs are in the attachments. Is it possible to make this option configurable? |
Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ] |
If we provide you with a patch for Zabbix 2.4.4 that reverts this change, is there a possibility for you to check whether it solves the problem? |
Comment by Artur [ 2015 Mar 04 ] |
Yes there is. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ] |
Artur, please find "no-net-snmp-retries.patch" attached. |
Comment by Artur [ 2015 Mar 04 ] |
After the patch everything is good with StartPollers=400. Graph and log file are in the attachments. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 05 ] |
This proves that the poller load increase in your case is due to the Net-SNMP retry introduced by ZBX-8538.

What is the value of the Timeout parameter in the Zabbix server configuration file? It is advised to keep it relatively small, so that timing-out operations do not have a very significant effect on Zabbix performance.

Also, why do you have SNMP bulk disabled? Unless your hardware does not support bulk properly, you should use bulk, because it should lower Zabbix poller load significantly. |
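To illustrate why bulk normally lowers poller load, here is another minimal Net-SNMP sketch (again an illustration, not Zabbix code, with placeholder host, community and OID): a single GETBULK request can return up to max_repetitions variable bindings in one round trip, instead of one GET round trip per item.

```c
/* Minimal sketch (not Zabbix code): one SNMPv2 GETBULK request retrieves
 * many values in a single request/response exchange. */
#include <stdio.h>
#include <string.h>
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>

int main(void)
{
    netsnmp_session session, *ss;
    netsnmp_pdu *pdu, *response = NULL;
    netsnmp_variable_list *var;
    oid ifDescr[MAX_OID_LEN];
    size_t ifDescr_len = MAX_OID_LEN;

    init_snmp("bulk-demo");

    snmp_sess_init(&session);
    session.peername = "192.0.2.2";             /* placeholder device */
    session.version = SNMP_VERSION_2c;          /* GETBULK needs SNMPv2c or v3 */
    session.community = (u_char *)"public";     /* placeholder community */
    session.community_len = strlen("public");

    if ((ss = snmp_open(&session)) == NULL) {
        snmp_perror("snmp_open");
        return 1;
    }

    pdu = snmp_pdu_create(SNMP_MSG_GETBULK);
    pdu->non_repeaters = 0;
    pdu->max_repetitions = 25;                  /* up to 25 values per request */
    read_objid(".1.3.6.1.2.1.2.2.1.2", ifDescr, &ifDescr_len);   /* IF-MIB::ifDescr */
    snmp_add_null_var(pdu, ifDescr, ifDescr_len);

    if (snmp_synch_response(ss, pdu, &response) == STAT_SUCCESS &&
        response->errstat == SNMP_ERR_NOERROR) {
        /* One request, many interface descriptions in the reply. */
        for (var = response->variables; var != NULL; var = var->next_variable)
            print_variable(var->name, var->name_length, var);
    }

    if (response != NULL)
        snmp_free_pdu(response);
    snmp_close(ss);

    return 0;
}
```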
Comment by Artur [ 2015 Mar 05 ] |
Timeout=10 because there are external scripts. When I enable SNMP bulk, the pollers are always 100% loaded even though StartPollers=800. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 06 ] |
Do you know the reason why pollers are fully loaded with bulk enabled? Is the reason that devices do not answer to bulk requests or answer very slowly? Does tcpdump give any clue? |
Comment by Artur [ 2015 Mar 10 ] |
I do not know why the pollers are fully loaded. |
Comment by Artur [ 2015 Mar 17 ] |
Tcpdump, log and graphs are in the attachments. Tcpdump2 was made after the command "ethtool --offload eth0 rx off tx off". |
Comment by Aleksandrs Saveljevs [ 2015 Mar 17 ] |
Neither of the attached tcpdump files open with Wireshark: it says "The capture file appears to be damaged or corrupt." in both cases. Could you please check? |
Comment by Artur [ 2015 Mar 18 ] |
The first two dumps are in text format. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 18 ] |
Oh, it did not occur to me to check the file contents. Thank you!

If we look at tcpdump.7z, we will see that there are 13759 requests and 8859 responses. This means that 4900 requests are without replies, which is around 35.6%. This is a very big loss percentage. Looking at the request packets, the response loss does not seem to be connected to request size - requests with just a few variables are lost, too. So it might be that the loss is due to all requests being made simultaneously, which is handled in ZBXNEXT-2200. With bulk enabled, Zabbix retries either one or two times, and with Timeout=10 this additional retrying seems to have a bigger impact than the single Net-SNMP retry discussed above.

Would it be possible to somehow reduce the loss percentage in your network? For instance, by using Zabbix proxies with a smaller Timeout and putting them closer to the monitored SNMP devices? |
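A small back-of-the-envelope sketch of the numbers above (an illustration, not Zabbix code): the request/response counts come from tcpdump.7z, Timeout=10 from the server configuration discussed earlier, and (retries + 1) * Timeout is the worst-case time a poller is blocked by one unanswered request.

```c
/* Back-of-the-envelope illustration of the loss and retry numbers (not Zabbix code). */
#include <stdio.h>

int main(void)
{
    double requests  = 13759.0;   /* SNMP requests seen in tcpdump.7z */
    double responses = 8859.0;    /* responses seen in tcpdump.7z */
    double timeout   = 10.0;      /* Timeout=10 in zabbix_server.conf */

    double lost = requests - responses;   /* 4900 */
    double loss = lost / requests;        /* ~0.356 */

    printf("lost requests: %.0f (%.1f%%)\n", lost, 100.0 * loss);

    /* Worst-case blocking per unanswered request: (retries + 1) * Timeout.
     * With the one or two internal retries used when bulk is enabled, that
     * is 2 or 3 timeouts instead of 1. */
    for (int retries = 0; retries <= 2; retries++)
        printf("retries=%d: up to %.0f s blocked per unanswered request\n",
               retries, (retries + 1) * timeout);

    return 0;
}
```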
Comment by Artur [ 2015 Mar 18 ] |
tcpdump.7z was made before the command "ethtool --offload eth0 rx off tx off", so there were a lot of "bad udp cksum" errors. And, as I said, I have many hosts with CPU load > 80% and they do not respond to all requests. For example, dumpget.7z, made without bulk, shows about 4% loss. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 23 ] |
As we have discovered above, there is no bug in Zabbix. It is just that the change introduced by ZBX-8538 makes pollers spend up to twice as long on hosts that do not respond, so the number of pollers (or the response loss in the network) has to be adjusted accordingly. |
Comment by Oleksii Zagorskyi [ 2015 Mar 23 ] |
I agree with Aleksandrs. |
Comment by Artur [ 2015 Mar 23 ] |
I also agree the ticket can be closed, but the ideal solution would be separate timeouts for external scripts and SNMP gets, because I have only four hosts with a script that requires timeout=10s. |
Comment by Oleksii Zagorskyi [ 2015 Mar 23 ] |
Unexpected, but it seems we don't have a separate ZBXNEXT for individual timeout handling. Should we edit/extend the existing ZBXNEXT-1096? Here is a list to post to a proper issue: |
Comment by Artur [ 2015 Mar 23 ] |
Which way would be more convenient for you? |
Comment by Oleksii Zagorskyi [ 2015 Mar 23 ] |
Artur, thanks, but do not hurry, let us discuss and decide what to do |
Comment by Artur [ 2015 Mar 23 ] |
OK. I can try to reduce the Timeout and attach graphs, but the external scripts will stop working. A graph with Timeout=1 is in the attachment. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 24 ] |
Oleksiy, feel free to extend the existing issue.

zalex_ua: Done! |
Comment by Artur [ 2015 Mar 24 ] |
Oleksiy, could a timeout per interface be added to ZBXNEXT-1096? And a timeout per item as well? |
Comment by Oleksii Zagorskyi [ 2015 Mar 24 ] |
Artur, feel free to post your point of view in ZBXNEXT-1096. |