[ZBX-2634] proxy trapper gets SIGSEGV in write on solaris Created: 2010 Jul 01  Updated: 2017 May 30  Resolved: 2010 Jul 12

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P)
Affects Version/s: 1.8.2
Fix Version/s: 1.8.3, 1.9.0 (alpha)

Type: Incident report Priority: Major
Reporter: frankg gleason Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Solaris 10, SPARC. Unknown whether this happens on x86.


Issue Links:
Duplicate
is duplicated by ZBX-1942 zabbix_agentd instability with Solari... Closed

 Description   

I send application performance data to Zabbix via zabbix_sender and the proxy. About half the time I stop my script, one of the proxy trapper child processes dies. I have done enough debugging to convince myself it is crashing in zbx_tcp_send_ext in the write system calls (not always the same one; there are 3). This is a SPARC Solaris 10 system. Upgrading to 1.8.3 did not fix the problem.

10318:20100629:084210.117 Timeout while answering request
10318:20100629:084210.117 Got signal [signal:11(SIGSEGV),reason:1,refaddr:c]. Crashing ...
10318:20100629:084210.117 ====== Fatal information: ======
10318:20100629:084210.117 program counter not available for this architecture
10318:20100629:084210.117 === Registers: ===
10318:20100629:084210.117 register dump not available for this architecture
10318:20100629:084210.117 === Backtrace: ===
10318:20100629:084210.117 backtrace not available for this platform
10318:20100629:084210.118 === Memory map: ===
10318:20100629:084210.118 memory map not available for this platform
10318:20100629:084210.118 ================================
10291:20100629:084210.126 One child process died (PID:10318,exitcode/signal:-1). Exiting ...
10291:20100629:084210.126 zbx_on_exit() called

Output from runme_on_app_crash
Program: zabbix_proxy
Process ID: 10318
Received signal: 11

Application Debugging Data
--------------------------

> /bin/pstack 10318
10318: /opt/thunder/sbin/zabbix_proxy -c /opt/thunder/etc/zabbix_proxy.conf
0003ab68 child_signal_handler (d, 622a0, ffbeb8f0, 0, 0, 0) + 100
fec44b4c __sighndlr (d, 0, ffbeb8f0, 3aa68, 0, 1) + c
fec39b24 call_user_handler (d, 0, 8, 0, feed2a00, ffbeb8f0) + 3b8
fec45f64 _write (6, ffbebc68, 8, 0, 0, 0) + c
000448e4 zbx_tcp_send_ext (ffbfef40, ffbebce4, 0, 0, ffbebc70, a39b0) + 9c
00045668 zbx_send_response (ffbfef40, 0, ffffefe8, 3, fa, 73800) + cc
00022750 process_trapper_child (ffbfef40, 5c8b8, 0, e, 10b4, fec73a80) + 534
00022cf4 child_trapper_main (5c800, ffbfef40, 9f000, 9e800, ffbfeed0, a39b0) + b4
00018d2c MAIN_ZABBIX_ENTRY (1a, fec39cac, 85c00, 85c00, 85c00, 85c00) + 570
0003af04 daemon_start (ffbffbd0, ffbffc6c, ffbffc7c, a3cfc, feed0140, feed0180) + 2fc
00017edc _start (0, 0, 0, 0, 0, 0) + 5c

> /bin/pmap -x 10318
10318: /opt/thunder/sbin/zabbix_proxy -c /opt/thunder/etc/zabbix_proxy.conf
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 408 352 - - r-x-- zabbix_proxy
00084000 104 24 8 - rwx-- zabbix_proxy
0009E000 16 8 8 - rwx-- zabbix_proxy
000A2000 744 680 648 - rwx-- [ heap ]
FA000000 13928 96 - - rwxs- [ shmid=0x1000001 ]
FB000000 16392 16 - - rwxs- [ shmid=0x1000000 ]
FC400000 16392 16 - - rwxs- [ shmid=0x7f ]
FD580000 2464 16 - - rwxs- [ shmid=0x1000002 ]
FD800000 16416 24 - - rwxs- [ shmid=0x7e ]
FEA40000 32 32 - - r-x-- nss_files.so.1
FEA58000 8 8 - - rwx-- nss_files.so.1
FEA5C000 8 8 - - rwxs- [ anon ]
FEA70000 64 64 8 - rwx-- [ anon ]
FEA90000 64 32 - - rwx-- [ anon ]
FEAB0000 8 8 - - rwx-- [ anon ]
FEAC0000 128 88 - - r-x-- libelf.so.1
FEAE0000 8 8 - - rwx-- libelf.so.1
FEAF0000 80 16 - - r-x-- libmd.so.1
FEB14000 8 8 - - rwx-- libmd.so.1
FEB20000 32 16 - - r-x-- libaio.so.1
FEB38000 8 8 8 - rwx-- libaio.so.1
FEB40000 40 24 - - r-x-- libintl.so.3.4.3
FEB58000 8 8 - - rwx-- libintl.so.3.4.3
FEB60000 40 32 - - r-x-- libgcc_s.so.1
FEB78000 8 8 - - rwx-- libgcc_s.so.1
FEB80000 888 832 - - r-x-- libc.so.1
FEC6E000 32 32 24 - rwx-- libc.so.1
FEC76000 8 8 8 - rwx-- libc.so.1
FEC80000 920 32 - - r-x-- libiconv.so.2.4.0
FED74000 16 16 - - rwx-- libiconv.so.2.4.0
FED80000 248 208 - - r-x-- libresolv.so.2
FEDCE000 16 16 - - rwx-- libresolv.so.2
FEDE0000 8 8 - - r-x-- libkstat.so.1
FEDF2000 8 8 - - rwx-- libkstat.so.1
FEE00000 680 144 - - r-x-- libm.so.2
FEEB8000 32 24 - - rwx-- libm.so.2
FEED0000 24 16 8 - rwx-- [ anon ]
FEEE0000 16 8 - - r-x-- libkvm.so.1
FEEF4000 8 8 - - rwx-- libkvm.so.1
FEF00000 584 304 - - r-x-- libnsl.so.1
FEFA2000 40 40 8 - rwx-- libnsl.so.1
FEFAC000 24 16 - - rwx-- libnsl.so.1
FEFC0000 8 8 - - rwx-- [ anon ]
FEFD0000 8 8 - - rwx-- [ anon ]
FEFE0000 72 24 - - r-x-- libz.so.1.0.2
FF000000 8 8 - - rwx-- libz.so.1.0.2
FF010000 8 8 - - r-x-- libdl.so.1
FF022000 8 8 - - rwx-- libdl.so.1
FF030000 48 40 - - r-x-- libsocket.so.1
FF04C000 8 8 8 - rwx-- libsocket.so.1
FF060000 24 24 - - r-x-- librt.so.1
FF076000 8 8 - - rwx-- librt.so.1
FF080000 1168 744 - - r-x-- libcrypto.so.0.9.8
FF1B2000 96 96 - - rwx-- libcrypto.so.0.9.8
FF1CA000 8 - - - rwx-- libcrypto.so.0.9.8
FF1E0000 256 256 - - r-x-- libssl.so.0.9.8
FF22E000 24 24 - - rwx-- libssl.so.0.9.8
FF240000 192 32 - - r-x-- libidn.so.11.6.1
FF27E000 16 16 - - rwx-- libidn.so.11.6.1
FF290000 8 8 - - rwx-- [ anon ]
FF2A0000 296 64 - - r-x-- libcurl.so.4.2.0
FF2F8000 16 16 - - rwx-- libcurl.so.4.2.0
FF300000 488 456 - - r-x-- libsqlite3.so.0.8.6
FF388000 16 16 16 - rwx-- libsqlite3.so.0.8.6
FF3A0000 16 16 - - r-x-- libc_psr.so.1
FF3B0000 208 208 - - r-x-- ld.so.1
FF3F4000 8 8 8 - rwx-- ld.so.1
FF3F6000 8 8 8 - rwx-- ld.so.1
FFBC8000 224 224 176 - rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb 74208 5656 944 -

> /bin/pfiles 10318
10318: /opt/thunder/sbin/zabbix_proxy -c /opt/thunder/etc/zabbix_proxy.conf
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0666 dev:311,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_RDONLY
/devices/pseudo/mm@0:null
1: S_IFREG mode:0664 dev:136,8 ino:552348 uid:125 gid:1 size:5330396
O_WRONLY|O_APPEND|O_CREAT
/opt/thunder/var/log/zabbix_proxy.log
2: S_IFREG mode:0664 dev:136,8 ino:552348 uid:125 gid:1 size:5330396
O_WRONLY|O_APPEND|O_CREAT
/opt/thunder/var/log/zabbix_proxy.log
3: S_IFREG mode:0664 dev:136,8 ino:734536 uid:125 gid:1 size:5
O_WRONLY|O_CREAT|O_TRUNC
advisory write lock set by process 10291
/opt/thunder/var/run/zabbix_proxy.pid
4: S_IFSOCK mode:0666 dev:317,0 ino:57069 uid:0 gid:0 size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname: AF_INET 0.0.0.0 port: 10052
5: S_IFREG mode:0644 dev:136,8 ino:552349 uid:125 gid:1 size:336896
O_RDWR|O_LARGEFILE FD_CLOEXEC
/opt/thunder/var/data/zabbix_proxy.db
6: S_IFSOCK mode:0666 dev:317,0 ino:21831 uid:0 gid:0 size:0
O_RDWR
SOCK_STREAM
SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.0.192.0)
sockname: AF_INET 0.0.0.0 port: 0



 Comments   
Comment by Aleksandrs Saveljevs [ 2010 Jul 02 ]

We could not reproduce it yet, but the debugging data you posted has also convinced me it is crashing in write() system calls, so we can think about the possible solutions.

In that particular case, the proxy is crashing in the call on the following line:

if( ZBX_TCP_ERROR == ZBX_TCP_WRITE(s->socket, (char *) &len64, sizeof(len64)) )

This is a perfectly legitimate system call with no danger of NULL pointers. So, then it is probably crashing because it receives SIGALRM (as the log message suggests) while doing the write() and the system does not happen to like it. If so, then this is not a very nice behavior from the system. And if so, what could we possibly do about it?

Which update (/etc/release) and kernel patch (uname -X) are you running? Maybe we could go from there.

Comment by frankg gleason [ 2010 Jul 02 ]

I had come to the same conclusion. I did enough debugging with zabbix_log to show it can happen in the other write statements as well, which makes sense if the problem is not the particular system call but the handling of the SIGALRM. I would not be surprised if it is some kind of thread bug.

This is happening on our production system (Netra-T12) and my dev box (Blade-100)

System = SunOS
Node = bthmindur01
Release = 5.10
KernelID = Generic_137111-08
Machine = sun4u
BusType = <unknown>
Serial = <unknown>
Users = <unknown>
OEM# = 0
Origin# = 1
NumCPU = 8

System = SunOS
Node = ulysses
Release = 5.10
KernelID = Generic_127111-07
Machine = sun4u
BusType = <unknown>
Serial = <unknown>
Users = <unknown>
OEM# = 0
Origin# = 1
NumCPU = 1

This is the script I am running to collect the data and send it to the server via a proxy. The perl and awk steps reformat the data. Changing the zabbix_sender arguments does not make any difference. I just run this and press Ctrl-C to stop it a couple of times, and the crash occurs.

THUNDER=/opt/thunder

$THUNDER/local/bin/perl $THUNDER/bin/tdp.pl /opwv/imail/log/imdircacheserv.stat | $THUNDER/local/bin/mawk -f $THUNDER/bin/ldap-stats.awk | $THUNDER/bin/zabbix_sender -vv -r -c $THUNDER/etc/zabbix_agentd.conf -T -i -

Comment by Aleksandrs Saveljevs [ 2010 Jul 06 ]

We have yet to reproduce the issue on the Solaris boxes we have at our disposal.

However, there is a related issue where the agent crashes during zbx_tcp_send() on Solaris 9:

27344: 46.1479 write(5, " Z B X D01", 5) = 5
27344: 46.1484 write(5, "02\0\0\0\0\0\0\0", 8) = 8
27344: 46.1486 write(5, " O K", 2) Err#32 EPIPE
27344: 46.1488 Received signal #13, SIGPIPE [caught]
27344: 46.1490 Incurred fault #6, FLTBOUNDS %pc = 0x00026398
27344: siginfo: SIGSEGV SEGV_MAPERR addr=0x0000000C
27344: 46.1501 Received signal #11, SIGSEGV [default]
27344: siginfo: SIGSEGV SEGV_MAPERR addr=0x0000000C
27340: 46.1554 Received signal #18, SIGCLD, in waitid() [caught]
27340: siginfo: SIGCLD CLD_KILLED pid=27344 status=0x000B

The error looks highly similar to what we have here. The information on the Web hints that it might or might not be a compiler or Solaris bug; however, none of the sources I have found are clear on this point.

To somehow advance on this issue, there are two ideas I am willing to try:

(1) Agent and proxy handle SIGPIPE signal. Somehow, during the crash, they do not fully get there. So I wish to know where that instruction pointed to by %pc from "Incurred fault #6, FLTBOUNDS %pc = 0x00026398" is located: namely, is it in our code and, if so, what it does. To help find this out, could you please run proxy under truss and disassemble zabbix_proxy with "dis -n zabbix_proxy" or similar?

(2) What compiler and what version are you using? Does proxy crash if compiled with a different compiler (e.g., gcc or Sun Studio)?

Sorry for the burden. If you have other ideas, please let us know.

Comment by frankg gleason [ 2010 Jul 06 ]

I'm happy to help. I'll work on this today.

Comment by Aleksandrs Saveljevs [ 2010 Jul 07 ]

Aha! Based on "%pc = 0x00026398" and the disassembly of Solaris 9 agent available in the download area on Zabbix website, we can see that siginfo argument to child_signal_handler() in src/libs/zbxnix/daemon.c can be NULL, and http://hackage.haskell.org/trac/ghc/ticket/3790 confirms it.

I will prepare a patch for this, but you can also try checking for siginfo being NULL independently and see whether proxy still crashes. There is probably no need to work on (1) and (2) mentioned in the previous post.

Comment by Aleksandrs Saveljevs [ 2010 Jul 07 ]

Could you please install proxy from svn://svn.zabbix.com/branches/dev/zbx-2634-solaris-signals and see whether it works for you?

Comment by frankg gleason [ 2010 Jul 07 ]

Thank you. I will test the patched version today.

Comment by frankg gleason [ 2010 Jul 10 ]

I tested the patched version and was unable to reproduce the crash. Looks like it's fixed. Thanks very much.

Comment by Aleksandrs Saveljevs [ 2010 Jul 12 ]

Thanks for the help!

Comment by Aleksandrs Saveljevs [ 2010 Jul 12 ]

Fixed in pre-1.8.3 in r13256.

Generated at Fri Apr 26 11:34:13 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.