[ZBX-2467] Zabbix server dies (logrt trigger related ?) Created: 2010 May 26 Updated: 2017 May 30 Resolved: 2014 Jun 02 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 1.8.2 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Major |
Reporter: | Jean-Denis Girard | Assignee: | Unassigned |
Resolution: | Cannot Reproduce | Votes: | 0 |
Labels: | crash, server |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Environment: |
Linux Mandriva 2008.0 32 bits, PostgreSQL |
Attachments: | analysis.asm no-sigpipe-crash.diff no-stack-frame.diff sigpipe.c zabbix_server.log.Fatal.bz2 zabbix_server.log.bz2 zabbix_server.objdump.bz2 zabbix_server.objdump2.bz2 zabbix_server.zip |
Description |
27637:20100521:045025.363 Sending [net.if.out[lo,bytes] ]}] |
Comments |
Comment by Alexei Vladishev [ 2010 May 28 ] |
Can you reproduce it? Please could you try running the latest pre-1.8.3 from the nightly builds. It's quite likely the problem was |
Comment by Jean-Denis Girard [ 2010 May 29 ] |
It died again on Thursday (trace below), and it had already died a couple of times before. I will try pre-1.8.3 as soon as possible, but I don't have access to the server now. 32487:20100527:091116.919 Sending [system.cpu.util[,user,avg1] ] |
Comment by Jean-Denis Girard [ 2010 May 29 ] |
Forgot to add that logrt was probably not the problem: I deactivated it on the server after the previous crash. |
Comment by Aleksandrs Saveljevs [ 2010 Jul 13 ] |
We would appreciate more information on the issue. Could you please try with the latest pre-1.8.3? |
Comment by Jean-Denis Girard [ 2010 Jul 13 ] |
I installed pre-zabbix-1.8.3.x-12444 on June 2nd, and it crashed in the same way. |
Comment by Aleksandrs Saveljevs [ 2010 Jul 14 ] |
If you already have pre-1.8.3@r12444 then you can stay with it, although trying the latest pre-1.8.3 would be nice too. r12444 is already capable of producing a lot of useful debugging information on 32-bit Linux when it crashes. The next time it goes down, could you please attach the server log along with the disassembly of the server (objdump -D -S zabbix_server)? |
Comment by Aleksandrs Saveljevs [ 2010 Aug 23 ] |
Zabbix 1.8.3 has been released recently. Does 1.8.3 still crash? We need its debugging output in order to continue working on this issue. |
Comment by Jean-Denis Girard [ 2010 Aug 27 ] |
I was on vacation and only returned to my customer yesterday. Zabbix-1.8.3.x-12444 has not crashed for at least 6 weeks. I don't know what to say; maybe close the issue, as it could have been a temporary problem with the server. |
Comment by richlv [ 2010 Aug 29 ] |
Thanks. Please reopen if this happens again and you have a backtrace. |
Comment by Jean-Denis Girard [ 2011 Jun 06 ] |
After upgrading to 1.8.5, crashes started again. I managed to get backtraces on two zabbix threads: Process X: Process Y: |
Comment by Jean-Denis Girard [ 2011 Jun 06 ] |
Zabbix server 1.8.5 log |
Comment by Aleksandrs Saveljevs [ 2011 Jun 07 ] |
Thanks for the log! However, I could not find any evidence of a crash there. The only lines in that log related to Zabbix server termination are the following: $ grep SIGTERM zabbix_server.log They clearly show that Zabbix server was killed by root (sender_uid:0). When Zabbix server crashes, due to SIGSEGV for instance, it prints a backtrace, the contents of the stack and registers, and some other useful information (called "fatal information"). There is no such information in this log file. So the question is: are you sure Zabbix terminated without root's help, and do you have a log file with fatal information? When a log file has fatal information, "grep Fatal.information zabbix_server.log" should output something. |
Comment by Jean-Denis Girard [ 2011 Jun 09 ] |
Here is a new log with "fatal information" |
Comment by Aleksandrs Saveljevs [ 2011 Jun 10 ] |
Thanks! Could you please also attach the disassembly of the crashing zabbix_server (can be obtained with "objdump -D -S zabbix_server")? We have noticed that in all crashes the "ebp" register is 0, so zabbix_server crashes a second time, inside its own crash handling, while dumping the stack frame. This means it does not give us any backtrace information. If you would be willing to recompile the server with the no-stack-frame.diff patch applied, run it, and attach another log file with a crash, that might be useful, too. |
Comment by Jean-Denis Girard [ 2011 Jun 10 ] |
I don't have access to this system until next week, but I will send the disassembly, and apply the patch. |
Comment by Jean-Denis Girard [ 2011 Jun 23 ] |
Here are the objdump and the log of a crash with patch applied |
Comment by Aleksandrs Saveljevs [ 2011 Jun 27 ] |
Thanks, but are you sure you have disassembled zabbix_server that crashed (the one with the patch applied, not the old one)? I can see no instructions at 80850d2 (call 14 in the backtrace) and at 8090240 (call 7 in the backtrace). The last address is also in function zbx_tcp_listen(), not in zbx_tcp_send_ext(), as the backtrace claims. Could you please check? |
Comment by Jean-Denis Girard [ 2011 Jun 27 ] |
Yes, the first disassembly I sent was for the unpatched version of zabbix_server; I thought you needed it for the older traces. Here is the objdump for the patched version. |
Comment by Aleksandrs Saveljevs [ 2011 Jun 28 ] |
Thanks, that looks more like it. Sorry if my previous request was misleading. |
Comment by Aleksandrs Saveljevs [ 2011 Jun 28 ] |
Attaching a slightly annotated excerpt of disassembly with references to Zabbix 1.8.5 source code (see analysis.asm). So the scenario goes like this. Agent connects to the server, asking for active checks. Server obtains those active checks, prepares a JSON, and tries to send it by calling zbx_tcp_send_ext(). There, while performing a write() on a socket, server receives SIGPIPE (probably because something happened to the connection). In child_signal_handler(), it tries to print some debugging information about SIGPIPE being received and crashes, because "siginfo" pointer (%ebx register) has value 0x33 (can be seen from the register dump), which is not exactly the most valid pointer out there. Provided the analysis is correct, it is then highly interesting why an operating system would give such a bad pointer to a signal handler. |
Comment by Aleksandrs Saveljevs [ 2011 Jun 28 ] |
Not sure where to start now, but maybe you could provide the Linux kernel version you are using (the output of "uname -a")? |
Comment by Jean-Denis Girard [ 2011 Jun 29 ] |
Linux Supervision_DDI.edt.pf 2.6.22.9-desktop-1mdv #1 SMP Thu Sep 27 04:07:04 CEST 2007 i686 Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz GNU/Linux |
Comment by Aleksandrs Saveljevs [ 2011 Jun 30 ] |
I have prepared a small program to print out the pointers our signal handler receives along with SIGPIPE on your system. Could you please compile and run this program using the following commands, and tell us what it prints to stderr? $ gcc sigpipe.c -o sigpipe |
Comment by Jean-Denis Girard [ 2011 Jun 30 ] |
Here is what I get: |
Comment by Aleksandrs Saveljevs [ 2011 Jul 14 ] |
Apparently, the issue does not manifest itself with a simple program, so I have prepared another patch for Zabbix 1.8.5 server for you. See no-sigpipe-crash.diff. The patch does two things. First, it incorporates the previous patch no-stack-frame.diff, so that if the server crashes, we get some useful debugging information. Second, if the crash analysis is correct, it should protect your Zabbix server from crashing again, because it does not try to dereference the "siginfo" pointer upon receiving SIGPIPE signal. However, it will now try to log a warning message with a substring " There is no reason to run the server with DebugLevel=4, please run it with DebugLevel=3, as usual. |
Comment by Jean-Denis Girard [ 2011 Aug 24 ] |
Sorry for the delay, but there were vacations, so I didn't have access to the server. I applied the latest patch to zabbix-1.8.6 last week. Now it works fine. The following message appears quite randomly in the logs: |
Comment by richlv [ 2011 Oct 18 ] |
Has the server with the debug patch applied crashed at least once? |
Comment by Aleksandrs Saveljevs [ 2011 Oct 18 ] |
Oh, I have been on vacation, too, so I did not notice the comment by Jean-Denis on August 24. It confirms that our analysis was correct: the system gives us a bad pointer in a signal handler. Now we have to figure out why it happens. |
Comment by Aleksandrs Saveljevs [ 2014 Jun 02 ] |
There has been no progress on the issue since 2011 and no further complaints from Jean-Denis regarding Zabbix, so it is proposed to close this issue. Please reopen if the problem is still relevant. |