ZBX-3788 investigationA test environment for reproducing the problemMachine: CPU AMD Athlon(tm) II X2 B24 (dual-core), 1.7 GB memory, Debian GNU/Linux 6.0 (64-bit).Zabbix 1.8.11rc1 source code was downloaded and compiled.For increasing the probability of daemon process hanging after sending "killall -15 zabbix_agentd" the file zabbix_agentd.conf was customized by setting:DebugLevel=4 StartAgents=100 For convenient automation a script test-ZBX-3788.sh was developed. The script starts the zabbix_agentd processes and applies "killall -15 zabbix_agentd" to them. The starting/stopping is repeated until at least one of the zabbix_agentd processes hangs.Here is a text of the script test-ZBX-3788.sh:---------------------------------------------------------------------------------------#!/bin/shZBX_AGENTD_PATH=~/ZBX-3788/sbinZBX_AGENTD_CONF=~/ZBX-3788-Andris/zabbix_agentd.confNPROC_TARGET=103 # If "StartAgents=100" is configured in # zabbix_agentd.conf,# 103 processes are started.OUT_DIR=~/ZBX-3788-AndrisWITH_TRACING=0 # 0 - run without tracing, 1 - run with tracing.USER=$(whoami)NRUN=1while [ 1 ]; doecho NRUN=$NRUN Starting ...$ZBX_AGENTD_PATH/zabbix_agentd -c $ZBX_AGENTD_CONFsleep 4# Log processes.ps -fu $USER > $OUT_DIR/ps_after_start_$NRUN# Count processes.NPROC=$(ps -fu $USER | grep -v grep \| grep './sbin/zabbix_agentd -c' | wc -l)echo "$NPROC" processes started.if [ $NPROC -ne $NPROC_TARGET ]; thenecho ERROR: "$NPROC process(es) started."fi # Find the 'zabbix_agentd' parent process PID.PID=$(ps -fu $USER \| awk '/zabbix_agentd -c /{ if ( $3 == "1" ) print $2 }')# Trace the 'zabbix_agentd' parent process.if [ $WITH_TRACING -eq 1 ]; thenltrace -tt -T -S -o $OUT_DIR/ltrace_$NRUN -p $PID &fi# Trace "killall" to see in what order processes were killed.echo NRUN=$NRUN Stopping $PIDif [ $WITH_TRACING -eq 1 ]; thenstrace -tt -T -v -o $OUT_DIR/strace_killall_$NRUN \killall -15 zabbix_agentdelsekillall -15 zabbix_agentdfisleep 4# Log processes.ps -fu $USER > $OUT_DIR/ps_after_stop__$NRUN# Count processes again.NPROC=$(ps -fu $USER | grep -v grep \| grep './sbin/zabbix_agentd -c' | wc -l)if [ $NPROC -gt 0 ]; thenecho GOT IT: "$NPROC process(es) hanging. See NRUN $NRUN"exitfi NRUN=$(( NRUN + 1 ))done---------------------------------------------------------------------------------------(You may want to adapt ZBX_AGENTD_PATH, ZBX_AGENTD_CONF and OUT_DIR to your environment).The process hang: simple reproductionLet's start the script and leave it running. Be patient, it can take hundreds of a attempts before the process hang occurs (it took half an hour in our test).$ ./test-ZBX-3788.sh NRUN=1 Starting ... 103 processes started. NRUN=1 Stopping 10128 NRUN=2 Starting ... 103 processes started. NRUN=2 Stopping 10252 ...NRUN=196 Starting ... 103 processes started. NRUN=196 Stopping 1884 GOT IT: 1 process(es) hanging. See NRUN 196 Let's look what processes are hanging:$ ps -fu zabbix18 | lessUID PID PPID C STIME TTY TIME CMD zabbix18 1884 1 0 16:40 ? 00:00:00 /home/zabbix18/ZBX-3788/sbin/zabbix_agentd -c /home/zabbix18/ZBX-3788-Andris/zabbix_agentd.conf zabbix18 1885 1884 0 16:40 ? 00:00:00 [zabbix_agentd] <defunct> zabbix18 1949 1884 0 16:40 ? 00:00:00 [zabbix_agentd] <defunct> ....Of 103 processes one process (parent zabbix_agentd) is hanging and the rest are <defunct> . The waiting is on futex: $ strace -p 1884 Process 1884 attached - interrupt to quit futex(0x7f386ddc49f4, FUTEX_WAIT, 2, NULL So, at least one problem case is reproduced.The process hang: reproduction with tracing enabledTerminate hanging processes with $ killall -9 zabbix_agentdEdit the script test-ZBX-3788.sh, set the parameter WITH_TRACING to 1.Start the script again:$ ./test-ZBX-3788.shNRUN=1 Starting ... 103 processes started. NRUN=1 Stopping 3666 NRUN=2 Starting ... 103 processes started. NRUN=2 Stopping 3790 NRUN=3 Starting ... ...NRUN=98 Starting ... 103 processes started. NRUN=98 Stopping 15842 GOT IT: 1 process(es) hanging. See NRUN 98 Let's examine normal and problem cases using log files and ltrace results.Log and trace file analysisA case of normal stop (without process hang) in log file looks like this:....3768:20120213:134343.929 Sleeping for 1 second(s) --- some zabbix_agentd child processes get SIGTERM from “killall”: ---3667:20120213:134344.094 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3668:20120213:134344.094 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3669:20120213:134344.095 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3670:20120213:134344.095 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3671:20120213:134344.096 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3672:20120213:134344.096 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3673:20120213:134344.096 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3674:20120213:134344.097 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3675:20120213:134344.097 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3676:20120213:134344.098 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3677:20120213:134344.098 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3678:20120213:134344.098 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... --- zabbix_agentd parent process gets SIGCHLD signal: ---3666:20120213:134344.096 One child process died (PID:3667,exitcode/signal:255). Exiting ... --- “killall” manages to terminate few more child processes: ---3679:20120213:134344.099 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3680:20120213:134344.099 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3681:20120213:134344.100 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3682:20120213:134344.100 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3683:20120213:134344.101 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3684:20120213:134344.101 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... --- parent process starts exiting: ---3666:20120213:134344.101 zbx_on_exit() called --- “killall” continues to terminate child processes: ---3685:20120213:134344.102 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3686:20120213:134344.102 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3687:20120213:134344.102 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3688:20120213:134344.103 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... 3689:20120213:134344.103 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting ... --- the rest of child processes get SIGTERM from the parent process: ---3690:20120213:134344.104 Got signal [signal:15(SIGTERM),sender_pid:3666,sender_uid:1001,reason:0]. Exiting ... 3692:20120213:134344.104 Got signal [signal:15(SIGTERM),sender_pid:3666,sender_uid:1001,reason:0]. Exiting ... 3691:20120213:134344.104 Got signal [signal:15(SIGTERM),sender_pid:3666,sender_uid:1001,reason:0]. Exiting ... The problem case (with process hang) in log file looks like this:....15945:20120213:135706.109 Sleeping for 1 second(s) --- all zabbix_agentd child processes get SIGTERM from “killall”: ---15844:20120213:135706.323 Got signal [signal:15(SIGTERM),sender_pid:15957,sender_uid:1001,reason:0]. Exiting ... 15845:20120213:135706.324 Got signal [signal:15(SIGTERM),sender_pid:15957,sender_uid:1001,reason:0]. Exiting ... 15846:20120213:135706.325 Got signal [signal:15(SIGTERM),sender_pid:15957,sender_uid:1001,reason:0]. Exiting ... ....In case of process hang there are no records from the parent process in the log file.Unlike ' strace', the 'ltrace' conveniently shows both function calls from dynamic libraries (e.g. libc) and system calls. 'ltrace' log for the parent process in normal case looks like this:3666 13:43:44.043310 SYS_wait4(-1, 0x7fff14a3f21c, 0, 0, 3666) = -512 <0.050239> --- zabbix_agentd parent process gets SIGTERM signal from “killall”: ---3666 13:43:44.093935 --- SIGTERM (Terminated) ------ and starts logging it: --- 3666 13:43:44.094040 __xstat(1, "/tmp/zabbix_agentd.log", 0x7fff14a3df50 <unfinished ...> 3666 13:43:44.094179 SYS_stat("/tmp/zabbix_agentd.log", 0x7fff14a3df50) = 0 <0.000035> 3666 13:43:44.094364 <... __xstat resumed> ) = 0 <0.000233> 3666 13:43:44.094432 fopen("/tmp/zabbix_agentd.log", "a+" <unfinished ...> --- zabbix_agentd parent process gets SIGCHLD signal from a child process: ---3666 13:43:44.094970 --- SIGCHLD (Child exited) --- --- and starts logging it, too: --- 3666 13:43:44.095051 getpid() = 3666 <0.000051> 3666 13:43:44.095217 __xstat(1, "/tmp/zabbix_agentd.log", 0x7fff14a2d2d0 <unfinished ...> 3666 13:43:44.095335 SYS_stat("/tmp/zabbix_agentd.log", 0x7fff14a2d2d0) = 0 <0.000379> 3666 13:43:44.095828 <... __xstat resumed> ) = 0 <0.000528> 3666 13:43:44.095890 fopen("/tmp/zabbix_agentd.log", "a+" <unfinished ...> 3666 13:43:44.096021 SYS_open("/tmp/zabbix_agentd.log", 1090, 0666) = 5 <0.000028> 3666 13:43:44.096155 <... fopen resumed> ) = 0xda0bf0 <0.000188> 3666 13:43:44.096582 gettimeofday(0x7fff14a2d380, NULL) = 0 <0.000054> 3666 13:43:44.096745 localtime(0x7fff14a2d380 <unfinished ...> 3666 13:43:44.096865 SYS_stat("/etc/localtime", 0x7fff14a1c900) = 0 <0.000026> 3666 13:43:44.097355 <... localtime resumed> ) = 0x7f5b0b8eb380 <0.000547> 3666 13:43:44.097420 getpid() = 3666 <0.000050> 3666 13:43:44.097559 fprintf(0xda0bf0, "%6li:%.4d%.2d%.2d:%.2d%.2d%.2d.%"..., 3666, 2012, 2, 13, 13, 43, 44 <unfinished ...> 3666 13:43:44.097726 SYS_fstat(5, 0x7fff14a1c170) = 0 <0.000365> 3666 13:43:44.098157 SYS_mmap(0, 4096, 3, 34, 0xffffffff) = 0x7f5b0bf9d000 <0.000024> 3666 13:43:44.098388 <... fprintf resumed> ) = 27 <0.000705> 3666 13:43:44.098446 vfprintf(0xda0bf0, "One child process died (PID:%d,e"..., 0x7fff14a2d360) = 66 <0.000399> 3666 13:43:44.098997 fputc('\n', 0xda0bf0) = 10 <0.000048> 3666 13:43:44.099165 fclose(0xda0bf0 <unfinished ...> 3666 13:43:44.099270 SYS_write(5, " 3666:20120213:134344.096 One c"..., 94) = 94 <0.000363> 3666 13:43:44.099730 SYS_close(5) = 0 <0.000019> 3666 13:43:44.099813 SYS_munmap(0x7f5b0bf9d000, 4096) = 0 <0.000029> 3666 13:43:44.099937 <... fclose resumed> ) = 0 <0.000697> 3666 13:43:44.099990 __xstat(1, "/tmp/zabbix_agentd.log", 0x7fff14a2d250 <unfinished ...> 3666 13:43:44.100101 SYS_stat("/tmp/zabbix_agentd.log", 0x7fff14a2d250) = 0 <0.000024> 3666 13:43:44.100577 <... __xstat resumed> ) = 0 <0.000507> 3666 13:43:44.100630 fopen("/tmp/zabbix_agentd.log", "a+" <unfinished ...> 3666 13:43:44.100739 SYS_open("/tmp/zabbix_agentd.log", 1090, 0666) = 5 <0.000025> 3666 13:43:44.100865 <... fopen resumed> ) = 0xda0bf0 <0.000161> 3666 13:43:44.100944 gettimeofday(0x7fff14a2d300, NULL) = 0 <0.000045> 3666 13:43:44.101084 localtime(0x7fff14a2d300 <unfinished ...> 3666 13:43:44.101218 SYS_stat("/etc/localtime", 0x7fff14a1c880) = 0 <0.000021> 3666 13:43:44.101348 <... localtime resumed> ) = 0x7f5b0b8eb380 <0.000208> 3666 13:43:44.101404 getpid() = 3666 <0.000152> 3666 13:43:44.101607 fprintf(0xda0bf0, "%6li:%.4d%.2d%.2d:%.2d%.2d%.2d.%"..., 3666, 2012, 2, 13, 13, 43, 44 <unfinished ...> 3666 13:43:44.101653 SYS_fstat(5, 0x7fff14a1c0f0) = 0 <0.000005> 3666 13:43:44.101675 SYS_mmap(0, 4096, 3, 34, 0xffffffff) = 0x7f5b0bf9d000 <0.000005> 3666 13:43:44.101733 <... fprintf resumed> ) = 27 <0.000090> 3666 13:43:44.101749 vfprintf(0xda0bf0, "zbx_on_exit() called", 0x7fff14a2d2e0) = 20 <0.000018> 3666 13:43:44.101929 fputc('\n', 0xda0bf0) = 10 <0.000014> 3666 13:43:44.101977 fclose(0xda0bf0 <unfinished ...> 3666 13:43:44.102007 SYS_write(5, " 3666:20120213:134344.101 zbx_o"..., 48) = 48 <0.000006> 3666 13:43:44.102037 SYS_close(5) = 0 <0.000007> 3666 13:43:44.102060 SYS_munmap(0x7f5b0bf9d000, 4096) = 0 <0.000008> 3666 13:43:44.102093 <... fclose resumed> ) = 0 <0.000097> 3666 13:43:44.102107 sigemptyset(0x7fff14a2d400) = 0 <0.000012> 3666 13:43:44.102152 sigaddset(0x7fff14a2d400, 17) = 0 <0.000015> 3666 13:43:44.102198 sigprocmask(0, 0x7fff14a2d400, NULL <unfinished ...> 3666 13:43:44.102229 SYS_rt_sigprocmask(0, 0x7fff14a2d400, 0, 8, 0x7f5b0b8e7e40) = 0 <0.000005> 3666 13:43:44.102276 <... sigprocmask resumed> ) = 0 <0.000056> 3666 13:43:44.102291 kill(3667, 15 <unfinished ...> 3666 13:43:44.102328 SYS_kill(3667, 15) = 0 <0.000006> 3666 13:43:44.102357 <... kill resumed> ) = 0 <0.000045> 3666 13:43:44.102371 kill(3668, 15 <unfinished ...> 3666 13:43:44.102399 SYS_kill(3668, 15) = 0 <0.000005> 3666 13:43:44.102427 <... kill resumed> ) = 0 <0.000036> 3666 13:43:44.102444 kill(3669, 15 <unfinished ...> 3666 13:43:44.102470 SYS_kill(3669, 15) = 0 <0.000005> 3666 13:43:44.102501 <... kill resumed> ) = 0 <0.000037> ...--- zabbix_agentd parent process releases resources and terminates: ---...3666 13:43:46.155592 free(0x00d9eb60) = <void> <0.000085> 3666 13:43:46.155807 free(0x00d9f000) = <void> <0.000086> 3666 13:43:46.156086 shmctl(294918, 0, 0, 123936, 0x7f5b0bf8e700 <unfinished ...> 3666 13:43:46.156328 SYS_shmctl(294918, 0, 0, 0, 0x7f5b0bf8e700) = 0 <0.000113> 3666 13:43:46.156720 <... shmctl resumed> ) = 0 <0.000483> 3666 13:43:46.156845 exit(0 <unfinished ...> 3666 13:43:46.157268 SYS_exit_group(0 <no return ...> 3666 13:43:46.157805 +++ exited (status 0) +++ In problem case – process hang – things look differently:15842 13:57:06.169775 SYS_wait4(-1, 0x7fffb698ff8c, 0, 0, 15842) = -512 <0.151751> --- zabbix_agentd parent process gets SIGTERM signal from “killall”: ---15842 13:57:06.322104 --- SIGTERM (Terminated) --- --- and starts logging it: ---15842 13:57:06.322346 __xstat(1, "/tmp/zabbix_agentd.log", 0x7fffb698ecd0 <unfinished ...> 15842 13:57:06.322527 SYS_stat("/tmp/zabbix_agentd.log", 0x7fffb698ecd0) = 0 <0.000026> 15842 13:57:06.322659 <... __xstat resumed> ) = 0 <0.000178> 15842 13:57:06.322728 fopen("/tmp/zabbix_agentd.log", "a+" <unfinished ...> 15842 13:57:06.322875 SYS_open("/tmp/zabbix_agentd.log", 1090, 0666) = 5 <0.000032> 15842 13:57:06.323027 <... fopen resumed> ) = 0x1ce0bf0 <0.000200> 15842 13:57:06.323085 gettimeofday(0x7fffb698ed80, NULL) = 0 <0.000046> 15842 13:57:06.323229 localtime(0x7fffb698ed80 <unfinished ...> 15842 13:57:06.323342 SYS_stat("/etc/localtime", 0x7fffb697e300) = 0 <0.000023> --- logging of SIGTERM is in progress, but another signal arrives: ---15842 13:57:06.323440 --- SIGCHLD (Child exited) --- 15842 13:57:06.323502 getpid() = 15842 <0.000054> --- ... and another logging is started while the first is somewhere in localtime(). ---15842 13:57:06.323666 __xstat(1, "/tmp/zabbix_agentd.log", 0x7fffb697dc70 <unfinished ...> 15842 13:57:06.323775 SYS_stat("/tmp/zabbix_agentd.log", 0x7fffb697dc70) = 0 <0.000023> 15842 13:57:06.323893 <... __xstat resumed> ) = 0 <0.000154> 15842 13:57:06.323947 fopen("/tmp/zabbix_agentd.log", "a+" <unfinished ...> 15842 13:57:06.324071 SYS_open("/tmp/zabbix_agentd.log", 1090, 0666) = 6 <0.000027> 15842 13:57:06.324208 <... fopen resumed> ) = 0x1ce0e30 <0.000176> 15842 13:57:06.324261 gettimeofday(0x7fffb697dd20, NULL) = 0 <0.000044> 15842 13:57:06.324399 localtime(0x7fffb697dd20 <unfinished ...> --- here the parent process hangs, its child processes are <defunct>... ------ ... until we terminate it with “killall -9”.------ Then 'ltrace' reveals the hanging was on futex. --- 15842 13:57:06.324486 SYS_futex(0x7f6a133689f4, 0, 2, 0, 4096 <no return ...> 15842 15:12:49.992621 +++ killed by SIGKILL +++ You could say: localtime() is not a reentrant function. Right. But why does it hang ?Let's look into libc source code. On our Debian GNU/Linux 6.0 machine the interesting part is in eglibc-2.11.3/time/localtime.c:-------------------------------------------------------------------------...struct tm * __localtime_r (t, tp) const time_t *t; struct tm *tp; { return __tz_convert (t, 1, tp); } weak_alias (__localtime_r, localtime_r) /* Return the `struct tm' representation of *T in local time. */ struct tm * localtime (t) const time_t *t; { return __tz_convert (t, 1, &_tmbuf); } libc_hidden_def (localtime) -------------------------------------------------------------------------Here localtime() and localtime_r() are almost the same, they both call __tz_convert().The __tz_convert() “lives” in eglibc-2.11.3/time/tzset.c:-------------------------------------------------------------------------.../* This locks all the state variables in tzfile.c and this file. */ __libc_lock_define_initialized (static, tzset_lock) .../* Return the `struct tm' representation of *TIMER in the local timezone. Use local time if USE_LOCALTIME is nonzero, UTC otherwise. */ struct tm * __tz_convert (const time_t *timer, int use_localtime, struct tm *tp) { long int leap_correction; int leap_extra_secs; if (timer == NULL) { __set_errno (EINVAL); return NULL; } __libc_lock_lock (tzset_lock); <-- I think blocking on futex() starts here./* Update internal database according to current TZ setting. POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname. This is a good idea since this allows at least a bit more parallelism. */ tzset_internal (tp == &_tmbuf && use_localtime, 1); if (__use_tzfile) __tzfile_compute (*timer, use_localtime, &leap_correction, &leap_extra_secs, tp); else { if (! __offtime (timer, 0, tp)) tp = NULL; else __tz_compute (*timer, tp, use_localtime); leap_correction = 0L; leap_extra_secs = 0; } if (tp) { if (! use_localtime) { tp->tm_isdst = 0; tp->tm_zone = "GMT"; tp->tm_gmtoff = 0L; } if (__offtime (timer, tp->tm_gmtoff - leap_correction, tp)) tp->tm_sec += leap_extra_secs; else tp = NULL; } __libc_lock_unlock (tzset_lock); <-- Could be unblocking of futex.return tp; } -------------------------------------------------------------------------I think that blocking on futex is related with this__libc_lock_lock (tzset_lock)...__libc_lock_unlock (tzset_lock)hidden in localtime().Proposed explanationMy explanation of hanging:When some event is to be logged (e.g. SIGTERM), the localtime() is invoked. It sets up an internal locking while setting time zone information.While the localtime() is in progress, a signal arrives (e.g. SIGCHLD). The signal handler is immediately invoked, it tries to log the signal information and invokes the localtime().Now there is a dead-lock: the first invocation of localtime() was interrupted, but still keeps locking, another localtime() was invoked from the signal handler and cannot obtain locking.Should we replace localtime() with localtime_r() ?“man localtime” on Linux says:“The ...functions .... localtime() return a pointer to static data and hence are not thread-safe. Thread-safe versions ...localtime_r() are specified by SUSv2, ...” Similarly AIX documentationhttp://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.basetechref%2Fdoc%2Fbasetrf1%2Fctime.htm:says:“...Do not use the ....localtime ... subroutine in a multithreaded environment. See the multithread alternatives in the ....localtime_r .... subroutine article.”But Zabbix on UNIX/Linux is not multithreaded. It's multi-process. The process hang occurs when a process execution flow is interrupted by it's own signal handler and the localtime() is used in both at the same time.So, replacing localtime() with localtime_r() will not solve the problem.The right solution: using only asynchronous-signal-safe functions in signal handlers Number of documents list the functions which can be used in signal handlers without restrictions. Aleksandrs Saveljevs already pointed to http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html .There is also CERT C Secure Coding Standard athttps://www.securecoding.cert.org/confluence/display/seccode/CERT+C+Secure+Coding+Standardwhich puts forward recommendations:SIG00-C. Mask signals handled by noninterruptible signal handlers SIG01-C. Understand implementation-specific details regarding signal handler persistence SIG02-C. Avoid using signals to implement normal functionalityand rules:SIG30-C. Call only asynchronous-safe functions within signal handlers (see the list of safe functions)SIG31-C. Do not access or modify shared objects in signal handlers SIG32-C. Do not call longjmp() from inside a signal handler SIG33-C. Do not recursively invoke the raise() function SIG34-C. Do not call signal() from within interruptible signal handlers SIG35-C. Do not return from SIGSEGV, SIGILL, or SIGFPE signal handlersViolation of the rule SIG30-C. Call only asynchronous-safe functions within signal handlers is rated as a high severity security risk.Unfortunately, the list of asynchronous-signal-safe functions does not include many popular functions, even printf() is not safe. Obviously, localtime() is not in the list, too.This brings a question about redesigning our signal handlers to meet secure coding best practice.Quick mitigation: a small change in signal handler logicChanging signal handlers to use only safe functions is a larger task, which requires redesigning our logging architecture. There is a quicker practical solution how to significantly reduce probability of process hanging.Let's turn to our source code, file src/libs/zbxnix/daemon.c:-------------------------------------------------------------------------....static void child_signal_handler(int sig, siginfo_t *siginfo, void *context) { ...switch (sig) { ...case SIGQUIT: case SIGINT: case SIGTERM: zabbix_log(parent_pid == CHECKED_FIELD(siginfo, si_pid) ? LOG_LEVEL_DEBUG : LOG_LEVEL_WARNING, "Got signal [signal:%d(%s),sender_pid:%d,sender_uid:%d,reason:%d]. Exiting ...", sig, get_signal_name(sig), CHECKED_FIELD(siginfo, si_pid), CHECKED_FIELD(siginfo, si_uid), CHECKED_FIELD(siginfo, si_code)); if (!PARENT_PROCESS) exit(FAIL); if (0 == exiting) { exiting = 1; zbx_on_exit(); } break; ...} } static void parent_signal_handler(int sig, siginfo_t *siginfo, void *context) { switch (sig) { case SIGCHLD: if (!PARENT_PROCESS) exit(FAIL); if (0 == exiting) { zabbix_log(LOG_LEVEL_CRIT, "One child process died (PID:%d,exitcode/signal:%d). Exiting ...", CHECKED_FIELD(siginfo, si_pid), CHECKED_FIELD(siginfo, si_status)); exiting = 1; zbx_on_exit(); } break; default: child_signal_handler(sig, siginfo, context); } } ....-------------------------------------------------------------------------This code fragment shows that in parent process, when SIGTERM or SIGCHLD is received:- the signal reception is logged (zabbix_log()),- “exiting” variable is checked and, if necessary, set and exit sequence started (zbx_on_exit()).The proposed mitigation is to change the logic to:When the parent process receives SIGTERM or SIGCHLD:- check “exiting” variable.- If it is 1, this means that the exit sequence is already in progress and no further actions are taken in signal handler (simply return).- if it is 0, then:- set it to 1 immediately,- log the signal reception (zabbix_log()),- start exit sequence (zbx_on_exit()).This will significantly reduce probability of invoking localtime() multiple times and blocking.(Although it will not completely eliminate it as there is still a small time interval between checking of “exiting” variable and setting it to 1.)The proposed mitigation is successfully tested with zabbix_agentd on a server with 4 CPU cores.