[ZBX-16460] Zabbix daemons do not exit on FreeBSD Created: 2019 Aug 04 Updated: 2024 Apr 10 Resolved: 2019 Sep 05 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G), Proxy (P), Server (S) |
Affects Version/s: | 4.0.11, 4.0.12, 4.2.5, 4.2.6 |
Fix Version/s: | 4.0.13rc1, 4.2.7rc1, 4.4.0alpha3, 4.4 (plan) |
Type: | Problem report | Priority: | Major |
Reporter: | Christian Ullrich | Assignee: | Vladislavs Sokurenko |
Resolution: | Fixed | Votes: | 1 |
Labels: | SIGTERM | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
FreeBSD 12.0-RELEASE-p8 |
Issue Links: |
|
||||||||
Team: | |||||||||
Sprint: | Sprint 55 (Aug 2019) | ||||||||
Story Points: | 0.125 |
Description |
Since 4.0.11, the Zabbix daemons (at least the agent and server; I'm not running the proxy) cannot be stopped using normal means on FreeBSD. (The "normal means" is sending SIGTERM to the process in the pid file.) Until 4.0.10, the reaction of the parent process to receiving SIGTERM was to SIGTERM its children, wait for them to exit, then exit itself: kill(54207,SIGTERM) = 0 (0x0) Since 4.0.11, it instead sends SIGUSR2, immediately followed by SIGABRT, which it fails to deliver: kill(82225,SIGUSR2) = 0 (0x0) It does nothing after that, i.e. it does not wait for children to exit. Bisecting between the 4.0.10 and 4.0.11 tags leads to commit a5ab84490f2046b65de5b8e53b5452acd2f37c52 as the guilty commit, which unfortunately is a merge. My git-fu is not good enough to determine where to look further. Steps to reproduce:
Result: The processes do not exit. The processes exit. |
Comments |
Comment by Vladislavs Sokurenko [ 2019 Aug 04 ] |
Do you know why it fails to deliver SIGABRT? |
Comment by Christian Ullrich [ 2019 Aug 04 ] |
No idea, sorry. I tried to figure it out, but there is no mention that this can happen in any signals-related man pages, there is nothing on Google, and I could not find anything in the kernel source either. The processes are all running as the same user, are not jailed, and are in the same session. There is no apparent reason why it should fail. The only thing I can think of is that perhaps the error happens when the target process has the signal blocked? Also, there is no delay between sending SIGUSR2 and SIGABRT. The Zabbix source looks like SIGUSR2 is a gentle invitation to exit, while SIGABRT forces the exit. That the two signals are sent immediately after one another does not look right to me either. |
Comment by Christian Ullrich [ 2019 Aug 04 ] |
Running as the same user as the Zabbix processes, sending SIGABRT to a subprocess works: [zabbix@infra /tmp]$ sleep 10 & |
Comment by Christian Ullrich [ 2019 Aug 07 ] |
OK, I'm absolutely positive that my truss output above used to say SIGABRT where it now has five SIGQUITs fail to send. I just retested it, and the signal after the SIGUSR2 is definitely SIGABRT. Here is the same operation as seen by ktrace. The incoming SIGABRT in the second line from the bottom is me manually putting the agent out of its misery. {{ 15409 zabbix_agentd 0.000000 RET wait4 -1 errno 4 Interrupted system call}} |
Comment by DaneT [ 2019 Aug 09 ] |
Hello Christian |
Comment by Christian Ullrich [ 2019 Aug 09 ] |
Hello Marcis, my build is from the FreeBSD Ports Collection, net-mgmt/zabbix4-{server,agent}. There are no source code patches currently associated with the two ports. The arguments to configure used to build the agent are: --enable-agent --sysconfdir=/usr/local/etc/zabbix4 --datadir=/usr/local/etc/zabbix4 --with-iconv=/usr --with-libcurl --without-gnutls --enable-ipv6 --with-openssl --prefix=/usr/local The problem occurs both when using the port-provided rc script (which is not identical to yours, but has the same effect), and when manually stopping the agent (kill -TERM pid). FreeBSD's /bin/sh when using the script, Bash when not. |
Comment by Christian Ullrich [ 2019 Aug 09 ] |
Found it. The cause is a FreeBSD sysctl variable. From kill(2): For a process to have permission to send a signal to a process designated Once I disable the sysctl, the problem is gone. This sysctl has existed since 2003, and is enabled by default. |
Comment by DaneT [ 2019 Aug 09 ] |
Thank you Christian. Would appreciate if you would share your findings. |
Comment by Vladislavs Sokurenko [ 2019 Aug 09 ] |
Do you think replacing SIGABRT to SIGTSTP would help with the issue ? |
Comment by Christian Ullrich [ 2019 Aug 09 ] |
Probably, but it's not really a good match for the purpose (from the point of view of understanding the code). I'm not sure what - Also, do you know why there is no delay between SIGUSR2 and SIGABRT now? I have a case here where the SIGABRT even arrived _before _the SIGUSR2. That can't be right. # kdump -E | egrep 'kill|PSIG' | egrep '68236|0x10a8c' |
Comment by Vladislavs Sokurenko [ 2019 Aug 09 ] |
Currently all child processes ignore SIGTERM, only parent process can receive it and then ask child processes to terminate politely by sending SIGUSR2. Then parent process will wait for history syncers processes to finish their work and send SIGABRT to remaining processes. Not sure why the order is wrong some times but it should be harmless. |
Comment by DaneT [ 2019 Aug 09 ] |
Could it be some issue with particular FreeBSD patch lvl? Christian's system has P8, latest is P9. |
Comment by Christian Ullrich [ 2019 Aug 09 ] |
More information, but this is really getting confusing. Short version: The immediate cause is in the FreeBSD port. The problem occurs whenever the agent (I assume it's the same for the others) is started as a different user from the one it is configured to run and therefore setuid()s itself to the user from the configuration. In this case, it cannot send the signal. FreeBSD has a facility to start services as a different user, by setting (in this case) zabbix_agentd_user="zabbix" in /etc/rc.conf. The default implementation of the "start" subcommand turns this into starting the service via /usr/bin/su. The port's replacement rc script, however, overrides that implementation with one that ignores this option and always starts the service as the user that runs the script, i.e. root. Because the service itself notices and setuid()s itself, this was not really noticeable before the signal changes in 4.0.11. I have filed a FreeBSD bug report (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239746) to find out what the port's maintainer thinks. |
Comment by Danilo G. Baio [ 2019 Aug 10 ] |
@mbuhholcs steps to reproduce in a fresh install with Vagrant freebsd/FreeBSD-12.0-RELEASE box.
pw useradd -n zabbix -c "Zabbix User" -s /bin/sh -m pkg install curl pcre fetch -o zabbix-4.2.5.tar.gz https://sourceforge.net/projects/zabbix/files/ZABBIX%20Latest%20Stable/4.2.5/zabbix-4.2.5.tar.gz/download extract/cd ./configure --enable-agent \ --prefix=/usr/local/zabbix42manual \ --with-iconv=/usr --with-libcurl \ --without-gnutls --enable-ipv6 \ --with-openssl make / install As root: /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/zabbix_agentd.conf ps axu | grep zabbix user | pid | command zabbix 27449 /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/z zabbix 27639 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd) zabbix 27658 zabbix_agentd: listener #1 [waiting for connection] (zabbix_agentd) zabbix 27929 zabbix_agentd: listener #2 [waiting for connection] (zabbix_agentd) zabbix 28119 zabbix_agentd: listener #3 [waiting for connection] (zabbix_agentd) zabbix 28242 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd) kill 27449 ps axu | grep zabbix user | pid | command zabbix 27449 /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/z zabbix 27639 zabbix_agentd: collector #1 [terminated] (zabbix_agentd) zabbix 27658 zabbix_agentd: listener #1 [terminated] (zabbix_agentd) zabbix 27929 zabbix_agentd: listener #2 [terminated] (zabbix_agentd) zabbix 28119 zabbix_agentd: listener #3 [terminated] (zabbix_agentd) zabbix 28242 zabbix_agentd: active checks #1 [terminated] (zabbix_agentd) ... As zabbix user: /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/zabbix_agentd.conf ps axu | grep zabbix user | pid | command zabbix 87240 /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/zabbix_agentd.conf zabbix 87257 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd) zabbix 87289 zabbix_agentd: listener #1 [waiting for connection] (zabbix_agentd) zabbix 87547 zabbix_agentd: listener #2 [waiting for connection] (zabbix_agentd) zabbix 87658 zabbix_agentd: listener #3 [waiting for connection] (zabbix_agentd) zabbix 87755 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd) kill 87240 $ ps axu | grep zabbix <none>
|
Comment by Danilo G. Baio [ 2019 Aug 29 ] |
Hi. Issue still on the latest release. Affects Version/s: >= 4.0.11, >= 4.2.5 Regards. |
Comment by Vladislavs Sokurenko [ 2019 Aug 29 ] |
Fixed in pull request: |
Comment by Vladislavs Sokurenko [ 2019 Aug 29 ] |
Fixed in:
|
Comment by Danilo G. Baio [ 2019 Aug 29 ] |
Thank you |