[ZBX-8528] randomly lost UDP packets lead to non-bulk SNMP requests and, as a result, increased CPU usage etc. Created: 2014 Jul 27 Updated: 2024 May 13 Resolved: 2024 May 13 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 2.2.5rc1, 2.3.2 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Trivial |
Reporter: | Oleksii Zagorskyi | Assignee: | Unassigned |
Resolution: | Won't fix | Votes: | 9 |
Labels: | bulk, network, retry, snmp, timeout | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
(screenshot attachments, including 0.all_on_screen.png) |
Issue Links: |
|
Description |
Will be described in the first comment. |
Comments |
Comment by Oleksii Zagorskyi [ 2014 Jul 28 ] |
This is a follow-up from the linked issue.

In general, right after a proxy restart its CPU load and other metrics look better than they were on 2.2.2, but after some uptime they become worse, as I just said. I had a guess that it is related to the bulk SNMP operations introduced in 2.2.3, and I was right.

I have two proxies: the 1st one (on the left side) is doing different checks, including SNMPv3; the 2nd one (on the right) is doing only SNMPv3 checks. The upgrade to 2.2.3 was done on 2014-04-10. On 2014-06-07 I applied different patches to both proxies (using 2.2.4 sources). I had to wait one week to make sure they had any effect.

Analysis (each point refers to one of the attached graphs):
1. A user parameter which collects the number of SNMP hosts in network-error state ("select count ...").
2. These are just the points in time when the proxies were restarted.
3. This is interesting: we see that after some uptime the CPU load is increasing; when it drops, the zabbix proxy daemon was restarted.
4. CPU utilization is a bit different from CPU load.
5. This correlates with the CPU load above; especially on proxy002 we see half as many CPU interrupts with SNMP bulk.
6. Poller load correlates with the CPU load.
7. NVPS, just in case.

All these graphs on one picture can be found in the attachments as "0.all_on_screen.png".

Consequences and suggestions are in the next comment. |
Comment by Oleksii Zagorskyi [ 2014 Jul 28 ] |
Why this happens:
According to the logic described at https://www.zabbix.com/documentation/2.2/manual/config/items/itemtypes/snmp#internal_workings_of_bulk_processing, when an SNMP GET request fails while "discovering the maximal number of supported objects", Zabbix halves the number of objects and then increases it by 1 until it reaches min_fail-1 again.

For example, I have 10 items on an interface (monitoring a local snmpd), then I: ...

As we can see, every time a single UDP packet is lost (similarly to my experiment), Zabbix will use fewer and fewer objects until it reaches 1 object per GET request, and it will stay in this state until the daemon is restarted. |
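Below is a minimal C sketch of the behaviour described above, for illustration only. It is not the actual Zabbix poller code; the names bulk_state, max_succeed, next_object_count and record_outcome are assumptions made here, while min_fail is the term used in the comment and the documentation.

```c
/*
 * Illustrative sketch (not the real Zabbix source) of the bulk "object
 * count" discovery: for every SNMP interface two values are remembered,
 * the largest request size that ever succeeded and the smallest one that
 * ever failed. A failed request itself is retried right away with half
 * as many objects.
 */
struct bulk_state
{
	int	max_succeed;	/* largest object count that got a response (starts at 0) */
	int	min_fail;	/* smallest object count that timed out (starts above the maximum, e.g. 129) */
};

/* pick the number of objects for the next GET request */
static int	next_object_count(const struct bulk_state *s)
{
	int	n;

	if (s->max_succeed + 1 < s->min_fail)
		n = s->max_succeed + 1;	/* grow by 1 towards min_fail - 1 */
	else
		n = s->max_succeed;	/* limits have converged */

	return n < 1 ? 1 : n;		/* never send an empty request */
}

/* record the outcome of a request with "num" objects */
static void	record_outcome(struct bulk_state *s, int num, int timed_out)
{
	if (0 == timed_out)
	{
		if (num > s->max_succeed)
			s->max_succeed = num;
		return;
	}

	/*
	 * The problem reported here: a single randomly lost UDP packet is
	 * indistinguishable from "device cannot handle num objects", so
	 * min_fail is lowered permanently (until the daemon restarts).
	 */
	if (num < s->min_fail)
		s->min_fail = num;

	if (s->max_succeed >= s->min_fail)
		s->max_succeed = s->min_fail - 1;
}
```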
Comment by Oleksii Zagorskyi [ 2014 Jul 28 ] |
Suggestion:
For example, we could introduce a number of retries to reach the maximum number of supported objects.
Another way is to introduce a sort of TTL for the cached interface entries for which Zabbix performed retries and decreased the number of supported objects. After the TTL expires, Zabbix should try to discover the maximal number of objects again using the 2nd strategy (increasing by 1). |
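As a hedged sketch of the TTL idea (hypothetical: neither BULK_LIMIT_TTL nor last_decrease exists in Zabbix, and the values are arbitrary), the cached limit could simply be forgotten after some time:

```c
#include <time.h>

#define BULK_LIMIT_TTL	3600	/* seconds; arbitrary value for illustration */
#define MAX_OBJECTS	128	/* assumed upper bound on objects per request */

struct bulk_cache_entry
{
	int	max_succeed;	/* see the sketch in the previous comment */
	int	min_fail;
	time_t	last_decrease;	/* when min_fail was last lowered, 0 = never */
};

/*
 * If the limit was lowered long enough ago, forget it so that the request
 * size can climb again using the "+1" strategy mentioned above.
 */
static void	expire_bulk_limits(struct bulk_cache_entry *e)
{
	if (0 != e->last_decrease && time(NULL) - e->last_decrease > BULK_LIMIT_TTL)
	{
		e->min_fail = MAX_OBJECTS + 1;	/* "no known failure" again */
		e->last_decrease = 0;
	}
}
```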
Comment by Oleksii Zagorskyi [ 2014 Jul 28 ] |
(1) documentation: such text is a bit imprecise for me |
Comment by Oleksii Zagorskyi [ 2014 Jul 28 ] |
One more thing I wanted to describe.
Since 2.2.3 we have bulk SNMP support, so Zabbix now performs its own retries, halving the number of objects (provided it initially discovered some supported number of objects successfully). In other words, in my experiment with snmpd we will NOT see the "first network error" message in the server log file. |
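A simplified illustration of why the log stays silent (the function and parameter names here are invented; only the quoted "first network error" wording comes from the original comment):

```c
/*
 * Sketch, not the real poller code: a timeout on a multi-object request
 * is absorbed by an internal retry with half as many objects instead of
 * being treated as a network error.
 */
static int	on_bulk_timeout(int num_objects, int *min_fail)
{
	if (1 < num_objects)
	{
		if (num_objects < *min_fail)
			*min_fail = num_objects;	/* the lost packet lowers the limit */

		return num_objects / 2;			/* retry immediately with fewer objects */
	}

	/*
	 * Only a timeout on a single-object request would fall through to
	 * the unreachability handling that logs the "first network error"
	 * message.
	 */
	return 0;	/* 0 = give up, register a network error */
}
```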
Comment by richlv [ 2014 Jul 28 ] |
it sounds like adding a separate parameter to control the snmp retry count (in zabbix) and setting it to 1 by default could help, too. zalex_ua, I wanted to ask for that, as a partial and fast solution, in a separate ZBX issue. Now done in the linked issue. |
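For illustration, a hedged Net-SNMP sketch of what such a parameter would control; the configuration variable name is hypothetical, while session.retries and session.timeout are the real Net-SNMP session fields involved:

```c
#include <string.h>

#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>

/* hypothetical value read from a new "SNMPRetries"-style config option */
static int	config_snmp_retries = 1;

static netsnmp_session	*open_snmp_session(const char *address, const char *community, int timeout_sec)
{
	netsnmp_session	session;

	snmp_sess_init(&session);	/* fill in library defaults */

	session.version = SNMP_VERSION_2c;
	session.peername = (char *)address;
	session.community = (u_char *)community;
	session.community_len = strlen(community);

	/* the retrying done by the Net-SNMP library itself */
	session.retries = config_snmp_retries;
	session.timeout = (long)timeout_sec * 1000000L;	/* microseconds */

	return snmp_open(&session);	/* snmp_open() copies the session struct */
}
```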
Comment by Raimonds Treimanis [ 2014 Aug 07 ] |
Eventually all monitored hosts in my setup end up getting only 1 value per request after some time (hours to days). |
Comment by Aleksandrs Saveljevs [ 2014 Aug 18 ] |
I have spent the last several working days thinking about the approach we could take on this issue and would now like to share the considerations I have in mind. Some of them seem to be along the lines of what zalex_ua is proposing. If you have any other ideas or elaborations on the general idea below, please share.

So, when we send a request and get a timeout, we need to distinguish between 3 cases:
The current approach described at https://www.zabbix.com/documentation/2.2/manual/config/items/itemtypes/snmp#internal_workings_of_bulk_processing deals with cases #1 and #2, and it seems successful at that, but it has so far disregarded case #3. Introducing retries, as suggested above, addresses case #3.

One of the ideas that I had is to divide the workings of bulk processing into two phases: before the optimal number of OIDs is discovered and after. During the first phase, when we do not yet know the characteristics of the device, we can use a large number of Net-SNMP retries (say, 4). This will make (almost) sure that if we get a timeout during this phase, it is not due to case #3. However, once the optimal number of OIDs has been discovered, we can lower the number of Net-SNMP retries (say, to 1, or even to 0), because we also do the retrying ourselves, and if we get a timeout now, it will not affect the optimal number of OIDs (assuming the characteristics of the device do not change).

So, in short, two phases of bulk processing are proposed: one before the number of OIDs has stabilized and one after. During the first phase, a large Net-SNMP retry setting; during the second phase, a small one.

However, it is a bit more difficult than that. In the internal workings of bulk processing we limit the size of the request by the number of OIDs in it. However, the actual limit on the size of the response that a device can provide is device-specific and is (most probably) not measurable in the number of OIDs - the size of the UDP packet, perhaps. We do not know the exact reason why a device cannot handle large requests; we only estimate it using the number of OIDs.

A practical manifestation of the above is that there is a certain number M of OIDs up to which a device can always handle the general kind of requests, and a certain number N, starting from which the device can never handle them. However, there is also a small range between M and N where the device may or may not answer. For instance, a device can always answer requests of 48 OIDs, can answer requests of 49 to 50 OIDs with a varying degree of success, and can never answer requests with 51 OIDs. The approach that we take should be able to arrive at 48 OIDs in the example above, to reduce timeouts.

There are other open questions, too.
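A minimal C sketch of the proposed two-phase retry selection (hypothetical helper; the retry values 4 and 1 come from the comment above, everything else is an assumption):

```c
#define DISCOVERY_RETRIES	4	/* generous while the device limits are unknown */
#define STABLE_RETRIES		1	/* minimal once the optimal request size is known */

struct interface_limits
{
	int	max_succeed;	/* largest request size that succeeded */
	int	min_fail;	/* smallest request size that failed */
};

/*
 * Phase 1 (limits not yet converged): use many Net-SNMP retries so that a
 * randomly lost packet is unlikely to be mistaken for a device limit.
 * Phase 2 (limits converged): rely on Zabbix's own retrying and keep the
 * library retries low.
 */
static int	choose_netsnmp_retries(const struct interface_limits *limits)
{
	if (limits->max_succeed + 1 < limits->min_fail)
		return DISCOVERY_RETRIES;

	return STABLE_RETRIES;
}
```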
|
Comment by Aleksandrs Saveljevs [ 2014 Oct 23 ] |
There is quite a number of considerations described above and it might be non-trivial to address all of them at once. So we have decided to start with a simple solution by introducing two ideas:
This will be implemented under a separate development issue. |
Comment by Aleksandrs Saveljevs [ 2015 Mar 24 ] |
Issue |
Comment by Oleksii Zagorskyi [ 2015 Apr 10 ] |
A year has passed since snmp bulk was implemented ... A short list of changes, marked by lines on the screenshot:
2014-04-10 (proxy001, proxy002) -> 2.2.3
2014-07-06 (proxy002), 2014-07-13 (proxy001) -> patched with one libsnmp retry for the proxies
2014-12-24 (proxy001), 2015-01-14 (proxy002) -> 2.2.8
2015-03-16 (proxy001, proxy002) -> 2.2.9
Here is a screenshot, the same as the attached 0.all_on_screen.png but with a 1-year period: |
Comment by Oleksii Zagorskyi [ 2024 May 13 ] |
Many things have changed since the time I troubleshot this. |