[ZBX-7798] after upgrade to 2.2.2 zabbix queue graph looks anomalous due to icmp ping items Created: 2014 Feb 13  Updated: 2024 Apr 10  Resolved: 2019 Sep 16

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.2
Fix Version/s: 4.0.13rc1, 4.2.7rc1, 4.4.0alpha3, 4.4 (plan)

Type: Problem report Priority: Minor
Reporter: Robert Jerzak Assignee: Michael Veksler
Resolution: Fixed Votes: 3
Labels: icmpping, queue
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2014-02-13 at 00.14.20.png, PNG File Screen Shot 2014-02-13 at 00.16.18.png, PNG File Screen Shot 2014-02-13 at 13.44.11.png, PNG File Screen Shot 2014-02-13 at 13.52.01.png, File debugging-1.patch
Issue Links:
Duplicate
duplicates ZBX-5062 icmppingloss (may be other pinger bas... Closed
Sub-task
depends on ZBX-16368 fping double call Closed
Team: Team A
Sprint: Sprint 56 (Sep 2019), Sprint 55 (Aug 2019), Sprint 53 (Jun 2019), Sprint 54 (Jul 2019)
Story Points: 1

 Description   

After updating Zabbix from version 2.2.1 to 2.2.2, the Zabbix queue graph looks quite anomalous. The picture in the attachment shows it; I performed the update around 14:00.
The queue on the graph is taken from the zabbix[queue] item.

Before the update the average value was around 1.4. After the update the average is around 32, but what is more interesting is that it looks very unnatural: it periodically jumps from a very low value to around 100.

There are no other drawbacks I'm aware of.

I've confirmed this behaviour on two of our Zabbix servers.

If you consider the description or symptoms too general or irrelevant, feel free to close the ticket.



 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Feb 13 ]

Would it be possible for you to identify item types that are queueing periodically? For instance, in "Administration" -> "Queue"?

Comment by Oleksii Zagorskyi [ 2014 Feb 13 ]

You need to investigate the server log, performance graphs and other details.
Closed as requested.

Comment by Oleksii Zagorskyi [ 2014 Feb 13 ]

This project is for bug reports only.

Comment by Aleksandrs Saveljevs [ 2014 Feb 13 ]

In this particular case, I am very interested in the cause. Simply upgrading from 2.2.1 to 2.2.2 should not have such drastic consequences.

Comment by Robert Jerzak [ 2014 Feb 13 ]

According to "Administration" -> "Queue" it's Simple check type of item. It jumps from 0 to around 100.

Comment by Aleksandrs Saveljevs [ 2014 Feb 13 ]

Do you have an idea which simple checks might be queueing? Which simple checks do you use a lot? Do you use VMware monitoring?

Comment by Aleksandrs Saveljevs [ 2014 Feb 13 ]

Do you use ping items a lot on a single host? Since ZBX-7649 all ping items with the same parameters are checked together, so that might have had this effect.

Comment by Robert Jerzak [ 2014 Feb 13 ]

The majority of my simple checks are icmpping and icmppingsec, so I would guess these are the most suspicious. On a single host I usually have one icmpping item; on some hosts there are both icmpping and icmppingsec.

Comment by Aleksandrs Saveljevs [ 2014 Feb 13 ]

How busy is the pinger process? Could you please attach the graph which shows how busy Zabbix processes are, before and after the upgrade?

Comment by Robert Jerzak [ 2014 Feb 13 ]

Before the update (about 14:00) the Zabbix icmp pinger processes were about 24% busy; after the update it is around 28%. I've added a graph with Zabbix process usage.

Comment by Robert Jerzak [ 2014 Feb 13 ]

In zabbix_server.conf I have:

StartPingers=12

Comment by Aleksandrs Saveljevs [ 2014 Feb 13 ]

Do you use the default settings for pinging? By how much are these simple checks delayed?

Comment by Robert Jerzak [ 2014 Feb 13 ]

Yes, default settings. These are literally "icmpping" and "icmppingsec" without additional parameters. The interval is 60s. Response time for the hosts is relatively low, usually around 1-2 ms.

Comment by Aleksandrs Saveljevs [ 2014 Feb 14 ]

Reopening, so that the issue is not forgotten.

You mentioned previously that according to "Administration" -> "Queue" it is simple checks that are queueing. Could you please show by how much they are delayed (i.e., which column, "5 seconds", "10 seconds", "30 seconds", ..., they are in)?

Comment by Robert Jerzak [ 2014 Feb 14 ]

The queue value jumps from 0 to around 100 only in the "5 seconds" column. On almost every browser refresh of the "Queue" page the value in "5 seconds" is either 0 or ~100.

Comment by Aleksandrs Saveljevs [ 2014 Feb 17 ]

How many hosts do you have and how many of them have ping items? I shall try to reproduce the issue in our environment with the same settings.

Comment by Aleksandrs Saveljevs [ 2014 Feb 18 ]

I have now tried approximately 100 ICMP ping values per second, and the queue is always 0.

If we provide a patch for you, for instance, one that adds some debug logging to "zabbix[queue]" item to print out items that are delayed, would it be possible to recompile and run this patched server?

Comment by Robert Jerzak [ 2014 Feb 18 ]

I have about 1900 hosts; most of them have one "icmpping" item, and some of them have a second "icmppingsec" item. The interval of those items is 60s.

Sure, I can run zabbix_server with your patch on my testing environment.

Comment by Aleksandrs Saveljevs [ 2014 Feb 18 ]

Robert, I have attached debugging-1.patch.

It does two things:

  1. when processing "zabbix[queue]" item, it prints out the list of items that are delayed;
  2. it sets logging level of pinger processes to LOG_LEVEL_DEBUG.

The log it will produce might contain private information. Either try to strip it out or send it to me by email at [my-first-name].[my-last-name]@zabbix.com.

Comment by Robert Jerzak [ 2014 Feb 19 ]

I've sent you an email with logs.

Comment by Aleksandrs Saveljevs [ 2014 Feb 19 ]

The logs that Robert sent us were very useful. There seems to be no problem with Zabbix 2.2.2 compared to 2.2.1; it is just that the changes in ZBX-7649 led to hosts being pinged in a different pattern, and that exposed an issue that has always existed.

During the investigation we uncovered a behavior of fping that we were not aware of.

Suppose we have 1 host to ping, the default interval between pings is 1000 ms and the default timeout is 500 ms. The fping invocation with "-C3" (three pings) in the worst case takes around 2500 ms.

Now, suppose we have 10 hosts to ping. We thought that it should also take 2500 ms, however, that is not true. The reason is that apart from the "-p" (interval) and "-t" (timeout) options, fping also has an "-i" option with a default value of 25 ms, which specifies the interval between successive ping packets (not just to one host, but to all hosts). So pinging 10 hosts in the worst case takes 2000 + 9 * 25 + 500 = 2725 ms.

With around 100 hosts, as in Robert's case, it took nearly 7 seconds to ping all hosts. The problem is then doubled by the fact that we launch both fping and fping6, so the two invocations can take 14 seconds in total, and that is why there are spikes on the queue.
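
For reference, here is a minimal back-of-the-envelope sketch (my own arithmetic, not Zabbix code) that applies the worst-case formula above, using the fping defaults mentioned in this comment; passing "-i 10" would shrink the (hosts - 1) * spacing term accordingly, and real runs can still be slower because of lost packets and scheduling:

  #include <stdio.h>

  int main(void)
  {
      const int pings = 3;          /* -C3: three pings per host                       */
      const int period_ms = 1000;   /* -p: interval between pings to the same host     */
      const int timeout_ms = 500;   /* -t: timeout for the last ping                   */
      const int spacing_ms = 25;    /* -i: default spacing between packets to any host */
      int hosts;

      for (hosts = 1; hosts <= 128; hosts *= 2)
      {
          int worst_ms = (pings - 1) * period_ms + (hosts - 1) * spacing_ms + timeout_ms;
          printf("%3d host(s): worst case ~%d ms per fping run\n", hosts, worst_ms);
      }
      return 0;
  }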

There are two obvious ways to fix that:

  1. pass "-i 10" option to fping (10 ms is the minimum value);
  2. reduce MAX_ITEMS in pinger.c from 128 to a smaller value.

Ideally, the solution is a combination of both, or a different approach altogether.

Comment by Cicero Silva [ 2014 Mar 13 ]

After upgrading from 2.2.1 to 2.2.2 my queue increased, and the monitored packet loss values are all negative (-100%).

ICMP loss 12 Mar 2014 21:33:25 -100 %

Comment by Aleksandrs Saveljevs [ 2014 Mar 13 ]

The icmppingloss issue was fixed already in ZBX-7840.

Comment by Tomasz ChruĊ›ciel [ 2014 May 29 ]

Hi all, look at this output. It seems like a kind of 2 s timeout is added to the pinger's total execution time. I'm not 100% sure, but before 2.2.2 the pinger execution time was proportional to the number of pinged items.
In my case (200 hosts) it causes a queue of 200+ simple checks (delayed up to 1 minute).

zabbix(/root)# ps -ef|grep icmp

zabbix 8681 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #1 [got 0 values in 0.000003 sec, idle 1 s
zabbix 8682 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #2 [got 0 values in 0.000006 sec, idle 1 s
zabbix 8682 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #2 [got 9 values in 2.055720 sec, idle 1 s
zabbix 8682 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #2 [got 6 values in 2.030326 sec, idle 1 s
zabbix 8681 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #1 [got 3 values in 2.005456 sec, idle 1 s
zabbix 8681 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #1 [got 9 values in 2.054962 sec, idle 1 s
zabbix 8682 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #2 [got 0 values in 0.000007 sec, idle 1 s
zabbix 8681 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #1 [got 6 values in 2.029924 sec, idle 1 s
zabbix 8682 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #2 [got 6 values in 2.031920 sec, idle 1 s
zabbix 8683 8649 0 09:03 ? 00:00:01 /usr/sbin/zabbix_server: icmp pinger #3 [got 12 values in 2.080042 sec, idle 1

Regards
-TCH

Comment by Aleksandrs Saveljevs [ 2014 May 29 ]

Tomasz, the output you provided is correct and expected. If it takes a pinger 2 seconds to do its job, that comes from 3 pings with a 1 second delay between successive pings of the same host (the first ping at 0 s, the second at 1 s, the third at 2 s).

Comment by peter erbst [ 2014 Oct 03 ]

We plan to monitor about 4000 devices, about 2000 of them also with ping delay and ping loss items.

Currently, with ~1300 devices:
StartPingers = 50, but only 10 of them are really active; the others are not doing anything, and the Zabbix queue is non-zero (usually 60-110 items with a 5 second delay).
icmp pinger #50 [got 0 values in 0.000002 sec, idle 1 sec]

Regarding "reduce MAX_ITEMS in pinger.c from 128 to a smaller value": that variable is no longer in the file (Zabbix 2.2.4).
Is there any way to utilize all 50 StartPingers?
Would an upgrade to Zabbix 2.4 help in this matter?

Comment by Aleksandrs Saveljevs [ 2014 Oct 03 ]

Peter, an upgrade to Zabbix 2.4 will probably not help in this matter. However, the variable you are looking to reduce is MAX_PINGER_ITEMS in include/dbcache.h (currently 128).
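
For anyone who wants to experiment with that workaround, here is a minimal sketch of the edit, assuming the 2.2-era include/dbcache.h layout; the value 32 below is only an illustration, not a tested recommendation, and the server has to be rebuilt after the change:

  /* include/dbcache.h: cap on items handed to a single fping/fping6 invocation */
  /* #define MAX_PINGER_ITEMS 128 */   /* previous value   */
  #define MAX_PINGER_ITEMS 32          /* example value only */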

Comment by Vladislavs Sokurenko [ 2019 Jun 28 ]

If anyone still experiences the issue, please share your configuration.

Comment by Vladislavs Sokurenko [ 2019 Jun 28 ]

One pinger process can handle up to 128 items at a time (only items with the same configuration are processed in bulk). The time needed to process those items depends on the configuration, but it can be estimated with the following command if we wish to send 3 pings to each IP address:

time fping -c3 -g 127.0.0.1/25

If fping6 is also used, this will take twice as long.
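
As a rough illustration (my own arithmetic, reusing the worst-case formula from the 2014 Feb 19 comment above and assuming the fping defaults of a 1000 ms interval between pings to the same host, 25 ms spacing between targets and a 500 ms timeout), a full 128-item batch works out to roughly:

  #include <stdio.h>

  int main(void)
  {
      /* 3 pings per target, 128 targets, fping defaults assumed */
      int one_run_ms = (3 - 1) * 1000 + (128 - 1) * 25 + 500;

      printf("one fping run:            ~%d ms\n", one_run_ms);
      printf("fping followed by fping6: ~%d ms\n", 2 * one_run_ms);
      return 0;
  }

This prints roughly 5.7 s for one run and about twice that when fping6 is launched as well, which is in the same ballpark as the delays reported earlier in this issue.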

Comment by Michael Veksler [ 2019 Sep 09 ]

Available in:

  • 4.0.13rc1 b9ce9ee3a0
  • 4.2.7rc1 784fa1090e
  • 4.4.0beta1 (master) 57abe5a1f2