[ZBX-2634] proxy trapper gets SIGSEGV in write on solaris Created: 2010 Jul 01 Updated: 2017 May 30 Resolved: 2010 Jul 12 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P) |
Affects Version/s: | 1.8.2 |
Fix Version/s: | 1.8.3, 1.9.0 (alpha) |
Type: | Incident report | Priority: | Major |
Reporter: | frankg gleason | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Solaris 10, SPARC. Unknown if this happens with x86 |
Description |
I send application performance data to zabbix via sender and the proxy. About half the time, when I stop my script, one of the proxy trapper child processes dies. I have done enough debugging to convince myself it is crashing in zbx_tcp_send_ext in the write system calls (not always the same one, there are 3). This is a SPARC Solaris 10 system. Upgrading to 1.8.3 did not fix the problem. 10318:20100629:084210.117 Timeout while answering request Output from runme_on_app_crash Application Debugging Data > /bin/pstack 10318 > /bin/pmap -x 10318 > /bin/pfiles 10318 |
Comments |
Comment by Aleksandrs Saveljevs [ 2010 Jul 02 ] |
We could not reproduce it yet, but the debugging data you posted has also convinced me it is crashing in write() system calls, so we can think about the possible solutions. In that particular case, the proxy is crashing in the call on the following line: if( ZBX_TCP_ERROR == ZBX_TCP_WRITE(s->socket, (char *) &len64, sizeof(len64)) ) This is a perfectly legitimate system call with no danger of NULL pointers. So, then it is probably crashing because it receives SIGALRM (as the log message suggests) while doing the write() and the system does not happen to like it. If so, then this is not a very nice behavior from the system. And if so, what could we possibly do about it? Which update (/etc/release) and kernel patch (uname -X) are you running? Maybe we could go from there. |
Comment by frankg gleason [ 2010 Jul 02 ] |
I had come to the same conclusion. I did enough debugging with zabbix_log to show it can happen in the other write statements also, which makes sense if it's not the particular system call but the handling of the SIGALRM. I would not be surprised if it is some kind of thread bug. This is happening on our production system (Netra-T12) and my dev box (Blade-100) System = SunOS System = SunOS This is the script I am running to collect the data and send it to the server via a proxy. The perl and awk reformat the data. Changing the zabbix_sender args does not make any difference. I just run this and Ctrl-C out a couple of times and the crash occurs. THUNDER=/opt/thunder $THUNDER/local/bin/perl $THUNDER/bin/tdp.pl /opwv/imail/log/imdircacheserv.stat | $THUNDER/local/bin/mawk -f $THUNDER/bin/ldap-stats.awk | $THUNDER/bin/zabbix_sender -vv -r -c $THUNDER/etc/zabbix_agentd.conf -T -i - |
Comment by Aleksandrs Saveljevs [ 2010 Jul 06 ] |
We have yet to reproduce the issue on the Solaris boxes we have at our disposal. However, there is a related issue where the agent crashes during zbx_tcp_send() on Solaris 9: 27344: 46.1479 write(5, " Z B X D01", 5) = 5 The error looks highly similar to what we have here. The information on the Web hints that it might or might not be a compiler or Solaris bug; however, none of the sources I have found are clear on this point. To make some progress on this issue, there are two ideas I am willing to try: (1) Agent and proxy handle the SIGPIPE signal. Somehow, during the crash, they do not fully get there. So I wish to know where the instruction pointed to by %pc from "Incurred fault #6, FLTBOUNDS %pc = 0x00026398" is located: namely, whether it is in our code and, if so, what it does. To help find this out, could you please run proxy under truss and disassemble zabbix_proxy with "dis -n zabbix_proxy" or similar? (2) What compiler and what version are you using? Does proxy crash if compiled with a different compiler (e.g., gcc or Sun Studio)? Sorry for the burden. If you have other ideas, please let us know. |
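For readers following idea (1): a minimal sketch of how a write to a peer that has gone away delivers SIGPIPE to a handler and then fails with EPIPE. This is not Zabbix code. The function names (write_to_closed_peer, sigpipe_was_delivered) are hypothetical, and a pipe with its read end closed stands in for the TCP socket whose sender was stopped with Ctrl-C. The five bytes written merely echo the payload seen in the truss output above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_sigpipe = 0;

/* Plain (non-SA_SIGINFO) handler: only records that SIGPIPE arrived. */
static void on_sigpipe(int sig)
{
    (void)sig;
    got_sigpipe = 1;
}

/*
 * Hypothetical helper: writes to a pipe whose read end is already closed.
 * Returns the errno left by write(), 0 if the write unexpectedly succeeded,
 * or -1 if the setup itself failed.
 */
int write_to_closed_peer(void)
{
    int fds[2];
    struct sigaction sa;
    ssize_t n;
    int err;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigpipe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPIPE, &sa, NULL) != 0)
        return -1;

    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);                      /* the "peer" goes away */
    n = write(fds[1], "ZBXD\1", 5);     /* kernel raises SIGPIPE, then write fails */
    err = (n == -1) ? errno : 0;
    close(fds[1]);
    return err;
}

/* Hypothetical helper: reports whether the handler ran. */
int sigpipe_was_delivered(void)
{
    return got_sigpipe != 0;
}
```

With a handler installed, the process survives the signal and the caller sees an ordinary write error, so a crash at this point suggests a problem inside the handler rather than in write() itself.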
Comment by frankg gleason [ 2010 Jul 06 ] |
I'm happy to help. I'll work on this today. |
Comment by Aleksandrs Saveljevs [ 2010 Jul 07 ] |
Aha! Based on "%pc = 0x00026398" and the disassembly of the Solaris 9 agent available in the download area on the Zabbix website, we can see that the siginfo argument to child_signal_handler() in src/libs/zbxnix/daemon.c can be NULL, and http://hackage.haskell.org/trac/ghc/ticket/3790 confirms it. I will prepare a patch for this, but you can also try checking for siginfo being NULL independently and see whether the proxy still crashes. There is probably no need to work on (1) and (2) mentioned in the previous post. |
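A minimal sketch of the kind of NULL check described above, assuming an SA_SIGINFO-style handler. It is not the actual patch and not the real child_signal_handler(). All names here (guarded_handler, install_guarded_handler, simulate_null_siginfo, last_signal_seen) are hypothetical, and the handler only records values instead of logging, to stay async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Values recorded by the handler for later inspection. */
static volatile sig_atomic_t last_signal = 0;
static volatile sig_atomic_t last_sender_pid = -1;

/*
 * SA_SIGINFO-style handler. The siginfo pointer is checked before it is
 * dereferenced, because (as noted in this ticket) it can arrive as NULL.
 */
static void guarded_handler(int sig, siginfo_t *siginfo, void *context)
{
    (void)context;
    last_signal = sig;
    last_sender_pid = (NULL != siginfo) ? (sig_atomic_t)siginfo->si_pid : -1;
}

/* Hypothetical helper: installs the guarded handler for one signal. */
int install_guarded_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = guarded_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/*
 * Hypothetical helper: calls the handler directly with a NULL siginfo,
 * imitating the situation described above. Returns the recorded sender
 * pid, which is -1 when no siginfo was available.
 */
int simulate_null_siginfo(int sig)
{
    guarded_handler(sig, NULL, NULL);
    return (int)last_sender_pid;
}

/* Hypothetical helper: returns the number of the last signal handled. */
int last_signal_seen(void)
{
    return (int)last_signal;
}
```

Without the NULL check, the dereference of siginfo->si_pid would be the faulting instruction, which would be consistent with a crash that appears to happen "in write()" but is really in the handler that interrupts it.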
Comment by Aleksandrs Saveljevs [ 2010 Jul 07 ] |
Could you please install proxy from svn://svn.zabbix.com/branches/dev/zbx-2634-solaris-signals and see whether it works for you? |
Comment by frankg gleason [ 2010 Jul 07 ] |
Thank you. I will test the patched version today. |
Comment by frankg gleason [ 2010 Jul 10 ] |
I tested the patched version and was unable to reproduce the crash. Looks like it's fixed. Thanks very much. |
Comment by Aleksandrs Saveljevs [ 2010 Jul 12 ] |
Thanks for your help! |
Comment by Aleksandrs Saveljevs [ 2010 Jul 12 ] |
Fixed in pre-1.8.3 in r13256. |