[ZBX-19614] native proc.num support to Zabbix agent 2 sometimes return 0 even the process is actually running Created: 2021 Jun 30 Updated: 2024 Apr 10 Resolved: 2021 Sep 17 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 5.0.13, 5.4.2 |
Fix Version/s: | 5.0.16rc1, 5.4.5rc1, 6.0.0alpha3, 6.0 (plan) |
Type: | Problem report | Priority: | Trivial |
Reporter: | Khachain Wangthammang | Assignee: | Eriks Sneiders |
Resolution: | Fixed | Votes: | 42 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
Team: | |||||||||||||
Sprint: | Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021) | ||||||||||||
Story Points: | 1 |
Description |
Steps to reproduce:
Result: proc.num sometimes returns 0. Here is a 2-day graph before the upgrade from 5.0.12 to 5.0.13 for proc.num[redis]. The actual process has been running without interruption for several months.
Expected: proc.num returns the correct number of processes matching the conditions specified in the item, as before.
reference change: ZBXNEXT-6596 |
Comments |
Comment by Tim Harman [ 2021 Jun 30 ] |
I have also experienced this issue on two hosts since upgrading to 5.4.2. I am using Zabbix Agent 2 with active checks. I am checking the status of my ntpd process every minute, and every now and then I'll get told it's not running. I go and check and it's still there, running, with the same PID. Restarting zabbix-agent2 doesn't help. It's not always the ntpd process that gets reported; sometimes it's Redis, sometimes it's Apache2. All of them make me nervous that something is broken!
My systems:
Debian 10 hosts (2) running Zabbix Agent 2 5.4.2
Debian 10 server running Zabbix Server 5.4.2
I have resolved this issue by downgrading Zabbix Agent on my two Debian hosts to 5.4.1 and setting the package to be held so it doesn't upgrade accidentally again. The Server is still running 5.4.2, so it definitely appears to be the Zabbix Agent 2. I'm happy to provide graphs (mine look like the poster's above) or any other details that might help pinpoint the issue. Thanks! |
Comment by Ferry Boender [ 2021 Jun 30 ] |
Also experiencing this issue on all systems where we upgraded the agent from 5.4.1 to 5.4.2. Downgrading manually back to 5.4.1 fixes the issue. The changelog for 5.4.2 mentions proc.num: " Looks like a rewrite of the code for proc.num? |
Comment by Ferry Boender [ 2021 Jun 30 ] |
Some additional info:
Graph of a few days of history for a proc.num check:
The dip in the graph starts when we upgraded to 5.4.2 and it goes back up to '2' after downgrading the agent back to 5.4.1. The graph has been averaged out since yesterday, but the problem would be intermittent, with the output of the check going from 2 to 0 every few minutes. Sometimes it would be okay for 5 to 10 minutes, at other times it would fluctuate every minute.
Item configuration looks like this:
I confirmed that the processes were running for more than 13h according to the output of `ps -p <PID> -o etime`. According to the Go code for the change to proc.num in 5.4.2, it uses `/proc/PID/status`. I put a simple watch on the two java processes to see if any errors occurred in permissions or something weird:
$ sudo -u zabbix -s /bin/sh
$ while true; do cat /proc/26406/status > /dev/null; cat /proc/1758/status > /dev/null; sleep 1; done
No errors or anything showed up in about an hour of running this. It does seem to be a problem in the agent. |
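For readers who want to repeat the check above from Go rather than from the shell, here is a minimal sketch that reads a `/proc/<pid>/status` file and extracts its `Name:` field. It is an illustration only, not the agent's actual code: `procName` is a hypothetical helper, and the idea that the process name is taken from the `Name:` field is an assumption, not something stated in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// procName is a hypothetical helper (not taken from the Zabbix source).
// It extracts the value of the "Name:" field from the text of a
// /proc/<pid>/status file and reports whether the field was found.
func procName(status string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Name:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Name:")), true
		}
	}
	return "", false
}

func main() {
	// /proc/self/status exists only on Linux; elsewhere this prints the error.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	if name, ok := procName(string(data)); ok {
		fmt.Println("process name:", name)
	}
}
```

Reading a single status file like this never failed in the shell test above either, which is consistent with the fault lying in how the list of PIDs is obtained rather than in reading an individual file.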
Comment by Robert Masztalerz [ 2021 Jul 01 ] |
We have also experienced this problem since upgrading zabbix-agent2 to version 5.0.13. It's pretty severe for us, because it generates a lot of false positives. |
Comment by Renats Valiahmetovs (Inactive) [ 2021 Jul 01 ] |
Hello Robert! |
Comment by Juri Malinovski [ 2021 Jul 14 ] |
This kind of bug should be covered by QA team first, imho |
Comment by De Beuckelaer Donovan [ 2021 Jul 19 ] |
This is a very severe, annoying bug that deserves the highest priority. I don't understand why it is not fixed in 5.0.14. |
Comment by Raoel Oomen [ 2021 Jul 22 ] |
Not fixed in 5.4.3 either. Please vote for this issue if you have not done so yet. |
Comment by Jordan Barnartt [ 2021 Jul 29 ] |
I can also confirm that we are experiencing this issue with Zabbix Agent 2 5.4.3 on Ubuntu 18.04 and 20.04 hosts. |
Comment by Sascha Glade [ 2021 Aug 02 ] |
Any updates on this? It's already been a month since the first report. A bit annoying to tell our customers the monitoring is flaky. |
Comment by Tim Harman [ 2021 Aug 03 ] |
@Sascha Glade: You realise it's open source software, right? You've got a few options:
1) Pay for support and see what they suggest
2) Roll back to 5.4.1, which doesn't have the problem
3) Stop using Zabbix Agent 2 and instead go back to Zabbix Agent 1
4) Examine the code, find the bug and patch it.
If you have to tell your customers monitoring is flaky, that's on you, not Zabbix.
|
Comment by John Gelnaw [ 2021 Aug 05 ] |
I also have this problem on Agent 5.4.3 on RHEL/CentOS 7.x, from Zabbix repositories. @Tim Harman: This is basic functionality for a monitoring package. Open source or not, I can't do my intended dog+pony demonstration of Zabbix to my upper management to convince them to move to Zabbix enterprise, if the open source version can't get a basic function working. And since this is open source, you left out 5) "Switch to a different system". |
Comment by Tim Harman [ 2021 Aug 05 ] |
@John: Do you have something in agent2 that you urgently need? The problem is only in agent2; if you use the original Agent you don't have the problem. Even if there's something in Agent2 that you DO urgently need, do you urgently need 5.4.2/.3? Why can't you roll back to 5.4.1?
I agree, it's an annoying bug, but I think you'd have to have a pretty amazing edge case where you can't showcase 5.4.1 agent2 working for your management team. |
Comment by John Gelnaw [ 2021 Aug 05 ] |
@Tim: The agent2 built-in modules simplify my deployment by a significant factor. Of course, rolling back to 5.4.1 is an option (and I'm in the process of doing just that, now that I know it's a bug), but the bug is labeled "trivial" and "unresolved". Personally, in spite of being a decent coder with decades of experience with open source (certainly before the term was coined), I find the attitude of "well, fix it yourself" objectionable, especially in a product which is offered as both open source and commercial. But this isn't really a discussion forum. I would be happy to help debug, however, so if there is any information I can provide, I'd be more than willing. |
Comment by Yurii Polenok [ 2021 Aug 05 ] |
We have temporarily replaced the key with something like this:
system.run["pgrep -cx 'nginx'"]
Although not ideal, it works without problems and the version of the agent did not have to be changed.
|
Comment by Jean Gau [ 2021 Aug 09 ] |
I also have this issue on zabbix-agent2 5.4.3 on CentOS 6/7. Good luck finding a solution. |
Comment by Arda Beyazoglu [ 2021 Aug 20 ] |
I have the same issue on Ubuntu 18.04 and 20.04 with zabbix-agent2 5.4.3, for redis and nginx monitoring. |
Comment by Raoel Oomen [ 2021 Aug 30 ] |
https://support.zabbix.com/browse/ZBX-19689 is this a duplicate? |
Comment by Mathew [ 2021 Aug 30 ] |
@raoel It indeed looks like the same issue.
|
Comment by Mathew [ 2021 Aug 30 ] |
I ran zabbix-agent2 at debug log level and found no notable log entries around the time of the failures. |
Comment by Artem [ 2021 Sep 03 ] |
Sorry for posting to a resolved issue, but I want to leave some comments. We have many servers with a normal process count of about 6000, and we have the same issue. We found two solutions. The first is to use Readdir(-1) as in the current commit, but that reverts the commit "[DEV-1192] improved performance when reading directory". So we settled on the second solution: rebuilding zabbix-agent2 with the latest golang, 1.17.
Standard zabbix-agent2:
zabbix-agent2 with golang 1.17: |
Comment by Vladislavs Sokurenko [ 2021 Sep 08 ] |
Regarding the issue not being reproducible with the latest versions: is a supported Go version used to build the Zabbix agent when the issue occurs? Note from https://golang.org/doc/devel/release#policy
|
Comment by Vladislavs Sokurenko [ 2021 Sep 09 ] |
The issue was in Go's os.Readdir() function, which calls lstat(), fails if a file is removed in the meantime, and then handles this error incorrectly. This is fixed in Go 1.16. Starting from Go 1.16 it is also encouraged to use the new function os.ReadDir() instead of os.Readdir(), as it does not call lstat(). So os.Readdir() was also replaced with os.ReadDir(), but simply recompiling the older version with Go 1.16 also solves the issue. |
Comment by Eriks Sneiders [ 2021 Sep 09 ] |
Fixed in:
Documentation updated
|