[ZBX-16460] Zabbix daemons do not exit on FreeBSD Created: 2019 Aug 04  Updated: 2024 Apr 10  Resolved: 2019 Sep 05

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G), Proxy (P), Server (S)
Affects Version/s: 4.0.11, 4.0.12, 4.2.5, 4.2.6
Fix Version/s: 4.0.13rc1, 4.2.7rc1, 4.4.0alpha3, 4.4 (plan)

Type: Problem report Priority: Major
Reporter: Christian Ullrich Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 1
Labels: SIGTERM
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

FreeBSD 12.0-RELEASE-p8


Issue Links:
Duplicate
is duplicated by ZBX-16549 Zabbix Agent problem on FreeBSD 11.2 Closed
Team: Team A
Sprint: Sprint 55 (Aug 2019)
Story Points: 0.125

 Description   

Since 4.0.11, the Zabbix daemons (at least the agent and server; I'm not running the proxy) cannot be stopped using normal means on FreeBSD. (The "normal means" is sending SIGTERM to the process in the pid file.)

Until 4.0.10, the reaction of the parent process to receiving SIGTERM was to SIGTERM its children, wait for them to exit, then exit itself:

kill(54207,SIGTERM) = 0 (0x0)
kill(54208,SIGTERM) = 0 (0x0)
kill(54209,SIGTERM) = 0 (0x0)
kill(54210,SIGTERM) = 0 (0x0)
kill(54211,SIGTERM) = 0 (0x0)
wait4(54207,{ EXITED,val=1 },0x0,0x0) = 54207 (0xd3bf)
wait4(54208,{ EXITED,val=1 },0x0,0x0) = 54208 (0xd3c0)
wait4(54209,{ EXITED,val=1 },0x0,0x0) = 54209 (0xd3c1)
wait4(54210,{ EXITED,val=1 },0x0,0x0) = 54210 (0xd3c2)
wait4(54211,{ EXITED,val=1 },0x0,0x0) = 54211 (0xd3c3)

Since 4.0.11, it instead sends SIGUSR2, immediately followed by SIGABRT, which it fails to deliver:

kill(82225,SIGUSR2) = 0 (0x0)
kill(82226,SIGUSR2) = 0 (0x0)
kill(82227,SIGUSR2) = 0 (0x0)
kill(82228,SIGUSR2) = 0 (0x0)
kill(82229,SIGUSR2) = 0 (0x0)
kill(82225,SIGQUIT) ERR#1 'Operation not permitted'
kill(82226,SIGQUIT) ERR#1 'Operation not permitted'
kill(82227,SIGQUIT) ERR#1 'Operation not permitted'
kill(82228,SIGQUIT) ERR#1 'Operation not permitted'
kill(82229,SIGQUIT) ERR#1 'Operation not permitted'

It does nothing after that, i.e. it does not wait for children to exit.

Bisecting between the 4.0.10 and 4.0.11 tags leads to 

commit a5ab84490f2046b65de5b8e53b5452acd2f37c52
Merge: effc9419b6 8d515dacf0
Author: Miks Kronkalns <[email protected]>
Date: Fri Jul 5 17:30:44 2019 +0300

as the guilty commit, which unfortunately is a merge. My git-fu is not good enough to determine where to look further.

Steps to reproduce:

  1. service zabbix_agentd start
  2. service zabbix_agentd stop
  3. Sit and gather dust

Result:

The processes do not exit.
Expected:

The processes exit.



 Comments   
Comment by Vladislavs Sokurenko [ 2019 Aug 04 ]

Do you know why it fails to deliver SIGABRT?

Comment by Christian Ullrich [ 2019 Aug 04 ]

No idea, sorry. I tried to figure it out, but there is no mention that this can happen in any signals-related man pages, there is nothing on Google, and I could not find anything in the kernel source either.

The processes are all running as the same user, are not jailed, and are in the same session. There is no apparent reason why it should fail. The only thing I can think of is that perhaps the error happens when the target process has the signal blocked?

Also, there is no delay between sending SIGUSR2 and SIGABRT. The Zabbix source looks like SIGUSR2 is a gentle invitation to exit, while SIGABRT forces the exit. That the two signals are sent immediately after one another does not look right to me either.

Comment by Christian Ullrich [ 2019 Aug 04 ]

Running as the same user as the Zabbix processes, sending SIGABRT to a subprocess works:

[zabbix@infra /tmp]$ sleep 10 &
[1] 90996
[zabbix@infra /tmp]$ kill -ABRT 90996
[zabbix@infra /tmp]$
[1]+ Abort trap sleep 10

Comment by Christian Ullrich [ 2019 Aug 07 ]

OK, I'm absolutely positive that my truss output above used to say SIGABRT where it now has five SIGQUITs fail to send. I just retested it, and the signal after the SIGUSR2 is definitely SIGABRT.

Here is the same operation as seen by ktrace. The incoming SIGABRT in the second line from the bottom is me manually putting the agent out of its misery.

{{ 15409 zabbix_agentd 0.000000 RET wait4 -1 errno 4 Interrupted system call}}
{{ 15409 zabbix_agentd 0.000009 PSIG SIGTERM caught handler=0x800ba48e0 mask=0x0>}}
{{ 15409 zabbix_agentd 0.000021 CALL sigprocmask(SIG_SETMASK,0x7fffffffd54c,0)}}
{{ 15409 zabbix_agentd 0.000028 RET sigprocmask 0}}
{{ 15409 zabbix_agentd 0.000035 CALL getpid}}
{{ 15409 zabbix_agentd 0.000040 RET getpid 15409/0x3c31}}
{{ 15409 zabbix_agentd 0.000076 CALL sigprocmask(SIG_BLOCK,0x7ffffffecfe0,0x26b8>}}
{{ 15409 zabbix_agentd 0.000088 RET sigprocmask 0}}
{{ 15409 zabbix_agentd 0.000095 CALL semop(0x90009,0x7ffffffecfc8,0x1)}}
{{ 15409 zabbix_agentd 0.000101 RET semop 0}}
{{ 15409 zabbix_agentd 0.000108 CALL fstatat(AT_FDCWD,0x26ac60,0x7ffffffecf10,0)}}
{{ 15409 zabbix_agentd 0.000114 NAMI "/var/log/zabbix/agentd.log"}}
{{ 15409 zabbix_agentd 0.000128 STRU struct stat {dev=16458170496707336697, ino=>}}
{{ 15409 zabbix_agentd 0.000144 RET fstatat 0}}
{{ 15409 zabbix_agentd 0.000162 CALL open(0x26ac60,0x20a<O_RDWR|O_APPEND|O_CREAT>}}
{{ 15409 zabbix_agentd 0.000170 NAMI "/var/log/zabbix/agentd.log"}}
{{ 15409 zabbix_agentd 0.000204 RET open 6}}
{{ 15409 zabbix_agentd 0.000212 CALL lseek(0x6,0,SEEK_END)}}
{{ 15409 zabbix_agentd 0.000219 RET lseek 106052/0x19e44}}
{{ 15409 zabbix_agentd 0.000235 CALL getpid}}
{{ 15409 zabbix_agentd 0.000242 RET getpid 15409/0x3c31}}
{{ 15409 zabbix_agentd 0.000249 CALL fstat(0x6,0x7ffffffec9e0)}}
{{ 15409 zabbix_agentd 0.000256 STRU struct stat {dev=16458170496707336697, ino=>}}
{{ 15409 zabbix_agentd 0.000274 RET fstat 0}}
{{ 15409 zabbix_agentd 0.000323 CALL write(0x6,0x800f52280,0x73)}}
{{ 15409 zabbix_agentd 0.000351 GIO fd 6 wrote 115 bytes}}
{{ " 15409:20190807:221724.625 Got signal [signal:15(SIGTERM),sender_pid:1}}
{{ 5417,sender_uid:0,reason:65537]. Exiting ...}}
{{ "}}
{{ 15409 zabbix_agentd 0.000362 RET write 115/0x73}}
{{ 15409 zabbix_agentd 0.000368 CALL close(0x6)}}
{{ 15409 zabbix_agentd 0.000375 RET close 0}}
{{ 15409 zabbix_agentd 0.000384 CALL semop(0x90009,0x7ffffffecff8,0x1)}}
{{ 15409 zabbix_agentd 0.000390 RET semop 0}}
{{ 15409 zabbix_agentd 0.000396 CALL sigprocmask(SIG_SETMASK,0x7ffffffed010,0)}}
{{ 15409 zabbix_agentd 0.000406 RET sigprocmask 0}}
{{ 15409 zabbix_agentd 0.000412 CALL sigprocmask(SIG_BLOCK,0x7fffffffd0d0,0)}}
{{ 15409 zabbix_agentd 0.000418 RET sigprocmask 0}}
{{ 15409 zabbix_agentd 0.000437 CALL kill(0x3c32,SIGUSR2)}}
{{ 15409 zabbix_agentd 0.000452 RET kill 0}}
{{ 15409 zabbix_agentd 0.000459 CALL kill(0x3c33,SIGUSR2)}}
{{ 15409 zabbix_agentd 0.000469 RET kill 0}}
{{ 15409 zabbix_agentd 0.000479 CALL kill(0x3c34,SIGUSR2)}}
{{ 15409 zabbix_agentd 0.000489 RET kill 0}}
{{ 15409 zabbix_agentd 0.000495 CALL kill(0x3c35,SIGUSR2)}}
{{ 15409 zabbix_agentd 0.000503 RET kill 0}}
{{ 15409 zabbix_agentd 0.000508 CALL kill(0x3c36,SIGUSR2)}}
{{ 15409 zabbix_agentd 0.000515 RET kill 0}}
{{ 15409 zabbix_agentd 0.000521 CALL kill(0x3c32,SIGABRT)}}
{{ 15409 zabbix_agentd 0.000535 RET kill -1 errno 1 Operation not permitted}}
{{ 15409 zabbix_agentd 0.000541 CALL kill(0x3c33,SIGABRT)}}
{{ 15409 zabbix_agentd 0.000547 RET kill -1 errno 1 Operation not permitted}}
{{ 15409 zabbix_agentd 0.000552 CALL kill(0x3c34,SIGABRT)}}
{{ 15409 zabbix_agentd 0.000558 RET kill -1 errno 1 Operation not permitted}}
{{ 15409 zabbix_agentd 0.000567 CALL kill(0x3c35,SIGABRT)}}
{{ 15409 zabbix_agentd 0.000578 RET kill -1 errno 1 Operation not permitted}}
{{ 15409 zabbix_agentd 0.000587 CALL kill(0x3c36,SIGABRT)}}
{{ 15409 zabbix_agentd 0.000595 RET kill -1 errno 1 Operation not permitted}}
{{ 15409 zabbix_agentd 0.000606 CALL wait4(0x3c32,0x7fffffffd0fc,0,0)}}
{{ 15409 zabbix_agentd 6.409242 RET wait4 -1 errno 4 Interrupted system call}}
{{ 15409 zabbix_agentd 6.409280 PSIG SIGABRT caught handler=0x800ba48e0 mask=0x8>}}
{{ 15409 zabbix_agentd 6.409292 CALL sigprocmask(SIG_SETMASK,0x7fffffffca8c,0)}}

Comment by DaneT [ 2019 Aug 09 ]

Hello Christian
I am sorry, I tried to reproduce your issue but without success. I compiled Zabbix 4.0.11 from sources and was able to stop both agent and server using SIGTERM and process ID from .pid file. Maybe i am doing something differently from you.
Could you please elaborate a bit further - did you tried to stop from the same user which started it?
Any modifications to the sources? Which options you used during compilation?
Which shell?

Comment by Christian Ullrich [ 2019 Aug 09 ]

Hello Marcis,

my build is from the FreeBSD Ports Collection, net-mgmt/zabbix4-{server,agent}. There are no source code patches currently associated with the two ports.

The arguments to configure used to build the agent are:

--enable-agent --sysconfdir=/usr/local/etc/zabbix4 --datadir=/usr/local/etc/zabbix4 --with-iconv=/usr --with-libcurl --without-gnutls --enable-ipv6 --with-openssl --prefix=/usr/local

The problem occurs both when using the port-provided rc script (which is not identical to yours, but has the same effect), and when manually stopping the agent (kill -TERM pid).

FreeBSD's /bin/sh when using the script, Bash when not.

Comment by Christian Ullrich [ 2019 Aug 09 ]

Found it. The cause is a FreeBSD sysctl variable.

From kill(2):

For a process to have permission to send a signal to a process designated
by pid, the user must be the super-user, or the real or saved user ID of
the receiving process must match the real or effective user ID of the
sending process. A single exception is the signal SIGCONT, which may
always be sent to any process with the same session ID as the sender. In
addition, if the security.bsd.conservative_signals sysctl(9) is set to 1,
the user is not a super-user, and the receiver is set-uid, then only job
control and terminal control signals may be sent (in particular, only
SIGKILL, SIGINT, SIGTERM, SIGALRM, SIGSTOP, SIGTTIN, SIGTTOU, SIGTSTP,
SIGHUP, SIGUSR1, SIGUSR2).

Once I disable the sysctl, the problem is gone. This sysctl has existed since 2003, and is enabled by default.

Comment by DaneT [ 2019 Aug 09 ]

Thank you Christian. Would appreciate if you would share your findings.

Comment by Vladislavs Sokurenko [ 2019 Aug 09 ]

Do you think replacing SIGABRT to SIGTSTP would help with the issue ?

Comment by Christian Ullrich [ 2019 Aug 09 ]

Probably, but it's not really a good match for the purpose (from the point of view of understanding the code). I'm not sure what -ZBX-9867- was about (that is the one that changed this, right?). Just for the aesthetics, I'd prefer SIGINT or SIGTERM.

Also, do you know why there is no delay between SIGUSR2 and SIGABRT now? I have a case here where the SIGABRT even arrived _before _the SIGUSR2. That can't be right.

# kdump -E | egrep 'kill|PSIG' | egrep '68236|0x10a8c'
68231 zabbix_agentd 6.666932 CALL kill(0x10a8c,SIGUSR2)
68231 zabbix_agentd 6.667182 CALL kill(0x10a8c,SIGABRT)
68236 zabbix_agentd 6.667359 PSIG SIGABRT caught handler=0x800bb2a10 mask=0x0 code=SI_USER
68236 zabbix_agentd 6.667396 PSIG SIGUSR2 caught handler=0x800bb2a10 mask=0x20 code=SI_USER

Comment by Vladislavs Sokurenko [ 2019 Aug 09 ]

Currently all child processes ignore SIGTERM, only parent process can receive it and then ask child processes to terminate politely by sending SIGUSR2. Then parent process will wait for history syncers processes to finish their work and send SIGABRT to remaining processes.

Not sure why the order is wrong some times but it should be harmless.

Comment by DaneT [ 2019 Aug 09 ]

Could it be some issue with particular FreeBSD patch lvl? Christian's system has P8, latest is P9.
I wasn't able to reproduce it on "fresh" 12.0 without patches either.

Comment by Christian Ullrich [ 2019 Aug 09 ]

More information, but this is really getting confusing. Short version: The immediate cause is in the FreeBSD port.

The problem occurs whenever the agent (I assume it's the same for the others) is started as a different user from the one it is configured to run and therefore setuid()s itself to the user from the configuration. In this case, it cannot send the signal.

FreeBSD has a facility to start services as a different user, by setting (in this case) zabbix_agentd_user="zabbix" in /etc/rc.conf. The default implementation of the "start" subcommand turns this into starting the service via /usr/bin/su.

The port's replacement rc script, however, overrides that implementation with one that ignores this option and always starts the service as the user that runs the script, i.e. root. Because the service itself notices and setuid()s itself, this was not really noticeable before the signal changes in 4.0.11.

I have filed a FreeBSD bug report (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239746) to find out what the port's maintainer thinks.

Comment by Danilo G. Baio [ 2019 Aug 10 ]

@mbuhholcs  steps to reproduce in a fresh install with Vagrant freebsd/FreeBSD-12.0-RELEASE box.

 

pw useradd -n zabbix -c "Zabbix User" -s /bin/sh -m

pkg install curl pcre

fetch -o zabbix-4.2.5.tar.gz https://sourceforge.net/projects/zabbix/files/ZABBIX%20Latest%20Stable/4.2.5/zabbix-4.2.5.tar.gz/download

extract/cd

./configure --enable-agent \
--prefix=/usr/local/zabbix42manual \
--with-iconv=/usr --with-libcurl \
--without-gnutls --enable-ipv6 \
--with-openssl

make / install

As root:

/usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/zabbix_agentd.conf

ps axu | grep zabbix
user | pid | command
zabbix 27449 /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/z
zabbix 27639 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd)
zabbix 27658 zabbix_agentd: listener #1 [waiting for connection] (zabbix_agentd)
zabbix 27929 zabbix_agentd: listener #2 [waiting for connection] (zabbix_agentd)
zabbix 28119 zabbix_agentd: listener #3 [waiting for connection] (zabbix_agentd)
zabbix 28242 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd)

kill 27449

ps axu | grep zabbix
user | pid | command
zabbix 27449 /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/z
zabbix 27639 zabbix_agentd: collector #1 [terminated] (zabbix_agentd)
zabbix 27658 zabbix_agentd: listener #1 [terminated] (zabbix_agentd)
zabbix 27929 zabbix_agentd: listener #2 [terminated] (zabbix_agentd)
zabbix 28119 zabbix_agentd: listener #3 [terminated] (zabbix_agentd)
zabbix 28242 zabbix_agentd: active checks #1 [terminated] (zabbix_agentd)


...

As zabbix user:

/usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/zabbix_agentd.conf

ps axu | grep zabbix
user | pid | command
zabbix 87240 /usr/local/zabbix42manual/sbin/zabbix_agentd -c /usr/local/zabbix42manual/etc/zabbix_agentd.conf
zabbix 87257 zabbix_agentd: collector [idle 1 sec] (zabbix_agentd)
zabbix 87289 zabbix_agentd: listener #1 [waiting for connection] (zabbix_agentd)
zabbix 87547 zabbix_agentd: listener #2 [waiting for connection] (zabbix_agentd)
zabbix 87658 zabbix_agentd: listener #3 [waiting for connection] (zabbix_agentd)
zabbix 87755 zabbix_agentd: active checks #1 [idle 1 sec] (zabbix_agentd)

kill 87240

$ ps axu | grep zabbix
<none>

 

Comment by Danilo G. Baio [ 2019 Aug 29 ]

Hi.

Issue still on the latest release.
Please, change the issue details to:

Affects Version/s: >= 4.0.11, >= 4.2.5
Component/s: Agent ,Server and Proxy

Regards.

Comment by Vladislavs Sokurenko [ 2019 Aug 29 ]

Fixed in pull request:
https://git.zabbix.com/projects/ZBX/repos/zabbix/pull-requests/363/overview

Comment by Vladislavs Sokurenko [ 2019 Aug 29 ]

Fixed in:

  • pre-4.0.13rc1 10aee49918f
  • pre-4.2.7rc1 890ef645e5f
  • pre-4.4.0alpha3 (trunk) a62b1cfefc1
Comment by Danilo G. Baio [ 2019 Aug 29 ]

Thank you

Generated at Fri Apr 26 16:13:10 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.