[ZBX-2467] Zabbix server dies (logrt trigger related ?) Created: 2010 May 26  Updated: 2017 May 30  Resolved: 2014 Jun 02

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Jean-Denis Girard Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: crash, server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux Mandriva 2008.0 32 bits, PostgreSQL


Attachments: File analysis.asm     File no-sigpipe-crash.diff     File no-stack-frame.diff     File sigpipe.c     File zabbix_server.log.Fatal.bz2     File zabbix_server.log.bz2     File zabbix_server.objdump.bz2     File zabbix_server.objdump2.bz2     Zip Archive zabbix_server.zip    

 Description   

27637:20100521:045025.363 Sending [net.if.out[lo,bytes]
]
27638:20100521:045025.371 Sending [vfs.fs.inode[/usr,free]
]
27646:20100521:045025.401 Query [txnlev:0] [select i.itemid,i.key_,h.host,h.port,i.delay,i.description,i.type,h.useip,h.ip,i.history,i.lastvalue,i.prevvalue,i.hostid,i.value_type,i.delta,i.prevorgvalue,i.lastclock,i.units,i.multiplier,i.formula,i.status,i.valuemapid,h.dns,i.trends,i.lastlogsize,i.data_type,i.mtime from hosts h,items i where i.hostid=h.hostid and h.status=0 and i.type=7 and h.hostid=10017 and h.proxy_hostid=0 and (i.status=0 or (i.status=3 and i.lastclock+600<=1274453425))]
27646:20100521:045025.402 Item 'logrt[/var/log/messages,.*\slogger:\s.*,,100]' was successfully found in the server cache. Sending.
27646:20100521:045025.402 In substitute_simple_macros (data:'logrt[/var/log/messages,.*\slogger:\s.*,,100]')
27646:20100521:045025.402 Sending [{
"response":"success",
"data":[

{ "key":"logrt[\/var\/log\/messages,.*\\slogger:\\s.*,,100]", "delay":"30", "lastlogsize":"89219", "mtime":"1273171363"}

]}]
27625:20100521:045025.402 One child process died (PID:27646). Exiting ...
27625:20100521:045027.651 Connect to the database
27625:20100521:045027.730 Query [txnlev:0] [select oid from pg_type where typname = 'bytea']
27625:20100521:045028.268 PostgreSQL Server version: 80213
27625:20100521:045028.268 Query [txnlev:0] [set escape_string_warning to off]
27625:20100521:045028.268 In free_database_cache()
27625:20100521:045028.268 In DCsync_all()
27625:20100521:045028.268 In DCsync_history(history_first:54305 history_num:187)
27625:20100521:045028.268 Syncing history data...
27625:20100521:045028.268 Query [txnlev:1] [begin;]
27625:20100521:045028.268 In DCmass_update_item()



 Comments   
Comment by Alexei Vladishev [ 2010 May 28 ]

Can you reproduce it? Could you please try running the latest pre-1.8.3 from the nightly builds? It's quite likely the problem was fixed recently; however, I can find no evidence in the ChangeLog.

Comment by Jean-Denis Girard [ 2010 May 29 ]

It died again on Thursday (trace below), and it had already died a couple of times before. I will try pre-1.8.3 as soon as possible, but I don't have access to the server right now.

32487:20100527:091116.919 Sending [system.cpu.util[,user,avg1]
]
32487:20100527:091116.919 Get value from agent result: '6.430118'
32487:20100527:091116.919 End of get_value():SUCCEED
32487:20100527:091116.919 In calculate_item_nextcheck (18473,10,"",1274987476)
32487:20100527:091116.919 End calculate_item_nextcheck (result:1274987483)
32487:20100527:091116.919 In substitute_simple_macros (data:'system.cpu.util[,user,avg1]')
32487:20100527:091116.919 In get_value() key:'system.cpu.util[,user,avg1]'
32487:20100527:091116.919 In get_value_agent() host:'Bora Bora' addr:'10.158.19.220' key:'system.cpu.util[,user,avg1]'
32484:20100527:091116.925 Sending [vfs.fs.size[/usr,used]
]
32487:20100527:091116.939 Sending [system.cpu.util[,user,avg1]
]
32489:20100527:091116.939 Sending [

{ "response":"success", "data":[]}

]
32476:20100527:091116.940 One child process died (PID:32489). Exiting ...
32476:20100527:091119.022 Connect to the database
32476:20100527:091120.023 Query [txnlev:0] [select oid from pg_type where typname = 'bytea']
32476:20100527:091122.421 PostgreSQL Server version: 80213

Comment by Jean-Denis Girard [ 2010 May 29 ]

Forgot to add that logrt was probably not the problem: I had deactivated it on the server after the previous crash.

Comment by Aleksandrs Saveljevs [ 2010 Jul 13 ]

We would appreciate more information on the issue. Could you please try with the latest pre-1.8.3?

Comment by Jean-Denis Girard [ 2010 Jul 13 ]

I installed pre-zabbix-1.8.3.x-12444 on June 2nd, and it crashed in the same way.
I will have access to the system next Thursday; I'll upgrade the server to a newer version then. Is there anything else you recommend?

Comment by Aleksandrs Saveljevs [ 2010 Jul 14 ]

If you already have pre-1.8.3@r12444 then you can stay with it, although trying the latest pre-1.8.3 would be nice too.

r12444 is already capable of producing a lot of useful debugging information on 32-bit Linux when it crashes. The next time it goes down, could you please attach the server log along with the disassembly of the server (objdump -D -S zabbix_server)?

Comment by Aleksandrs Saveljevs [ 2010 Aug 23 ]

Zabbix 1.8.3 has been released recently. Does 1.8.3 still crash? We need its debugging output in order to continue working on this issue.

Comment by Jean-Denis Girard [ 2010 Aug 27 ]

I was on vacation and only returned to my customer yesterday. Zabbix-1.8.3.x-12444 has not crashed for at least six weeks. I don't know what to say; maybe close the issue, as it could have been a temporary problem with the server.

Comment by richlv [ 2010 Aug 29 ]

Thanks, please reopen if this happens again and you have a backtrace.

Comment by Jean-Denis Girard [ 2011 Jun 06 ]

After upgrading to 1.8.5, the crashes started again. I managed to get backtraces of two Zabbix server processes:

Process X:
----------
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7bb03fb in poll () from /lib/i686/libc.so.6
#2 0xb7ee5328 in ?? () from /usr/lib/libpq.so.5
#3 0xb7ee54b6 in ?? () from /usr/lib/libpq.so.5
#4 0xb7ee554f in ?? () from /usr/lib/libpq.so.5
#5 0xb7ee3dba in PQgetResult () from /usr/lib/libpq.so.5
#6 0xb7ee3ed8 in ?? () from /usr/lib/libpq.so.5
#7 0x080acf91 in zbx_db_vexecute (fmt=0x80c49c1 "%s",
args=0xbfb30794 "\030�\025\b\004\b���\030\v\b �\025\b\003") at db.c:755
#8 0x080a50fc in __zbx_DBexecute (fmt=0x80c49c1 "%s") at db.c:249
#9 0x0807c3a2 in DCflush_nextchecks () at nextchecks.c:326
#10 0x0805bb64 in main_poller_loop (p=1 '\001', poller_type=0 '\0') at poller.c:683
#11 0x080556bb in MAIN_ZABBIX_ENTRY () at server.c:618
#12 0x0808499c in daemon_start (allow_root=0) at daemon.c:252
#13 0x08054d98 in main (argc=Cannot access memory at address 0x1
) at server.c:437

Process Y:
----------
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7bb03fb in poll () from /lib/i686/libc.so.6
#2 0xb7ee5328 in ?? () from /usr/lib/libpq.so.5
#3 0xb7ee54b6 in ?? () from /usr/lib/libpq.so.5
#4 0xb7ee554f in ?? () from /usr/lib/libpq.so.5
#5 0xb7ee3dba in PQgetResult () from /usr/lib/libpq.so.5
#6 0xb7ee3ed8 in ?? () from /usr/lib/libpq.so.5
#7 0x080acf91 in zbx_db_vexecute (fmt=0x80b21b0 "delete from %s where itemid=%llu and clock<%d",
args=0xbfb7f684 "28\f\br[") at db.c:755
#8 0x080a50fc in __zbx_DBexecute (fmt=0x80b21b0 "delete from %s where itemid=%llu and clock<%d")
at db.c:249
#9 0x08060e72 in delete_history (table=0x80c3832 "history", itemid=23410, keep_history=7,
now=1306882807) at housekeeper.c:295
#10 0x08060fbe in main_housekeeper_loop () at housekeeper.c:335
#11 0x0805575e in MAIN_ZABBIX_ENTRY () at server.c:687
#12 0x0808499c in daemon_start (allow_root=0) at daemon.c:252
#13 0x08054d98 in main (argc=Cannot access memory at address 0x1
) at server.c:437

Comment by Jean-Denis Girard [ 2011 Jun 06 ]

Zabbix server 1.8.5 log

Comment by Aleksandrs Saveljevs [ 2011 Jun 07 ]

Thanks for the log! However, I could not find any evidence of a crash there.

Namely, the only lines related to Zabbix server termination in that log are the following:

$ grep SIGTERM zabbix_server.log
10087: 2012:20110531:143433.008 Got signal [signal:15(SIGTERM),sender_pid:2944,sender_uid:0,reason:0]. Exiting ...
10088: 2007:20110531:143433.008 Got signal [signal:15(SIGTERM),sender_pid:2944,sender_uid:0,reason:0]. Exiting ...
10089: 2011:20110531:143433.008 Got signal [signal:15(SIGTERM),sender_pid:2944,sender_uid:0,reason:0]. Exiting ...
10090: 2009:20110531:143433.008 Got signal [signal:15(SIGTERM),sender_pid:2944,sender_uid:0,reason:0]. Exiting ...
10093: 1972:20110531:143433.009 Got signal [signal:15(SIGTERM),sender_pid:1965,sender_uid:507,reason:0]. Exiting ...
10094: 1973:20110531:143433.010 Got signal [signal:15(SIGTERM),sender_pid:1965,sender_uid:507,reason:0]. Exiting ...
...
207201: 3075:20110531:144819.802 Got signal [signal:15(SIGTERM),sender_pid:3765,sender_uid:0,reason:0]. Exiting ...
207202: 3065:20110531:144819.802 Got signal [signal:15(SIGTERM),sender_pid:3765,sender_uid:0,reason:0]. Exiting ...
207203: 3067:20110531:144819.802 Got signal [signal:15(SIGTERM),sender_pid:3765,sender_uid:0,reason:0]. Exiting ...
207204: 3068:20110531:144819.802 Got signal [signal:15(SIGTERM),sender_pid:3765,sender_uid:0,reason:0]. Exiting ...
207205: 3077:20110531:144819.802 Got signal [signal:15(SIGTERM),sender_pid:3765,sender_uid:0,reason:0]. Exiting ...
207208: 2973:20110531:144819.803 Got signal [signal:15(SIGTERM),sender_pid:2965,sender_uid:507,reason:0]. Exiting ...
...

They clearly show that Zabbix server was killed by root (sender_uid:0).

When Zabbix server crashes due to SIGSEGV, for instance, it prints a backtrace, contents of stack and registers, and some other useful information (called "fatal information"). There is no such information in this log file.

So the question is: are you sure Zabbix terminated without root's help and do you have a log file with fatal information? When a log file has fatal information, "grep Fatal.information zabbix_server.log" should output something.

Comment by Jean-Denis Girard [ 2011 Jun 09 ]

Here is a new log with "fatal information"

Comment by Aleksandrs Saveljevs [ 2011 Jun 10 ]

Thanks! Could you please also attach the disassembly of the crashing zabbix_server (can be obtained with "objdump -D -S zabbix_server")?

We have noticed that in all crashes the "ebp" register is 0, so zabbix_server crashes again (after the original crash) while dumping a stack frame. This means it does not give us any backtrace information. If you would be willing to recompile the server with the no-stack-frame.diff patch applied, run it, and attach another log file with a crash, that might be useful, too.
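
For context, here is a minimal sketch (not the actual Zabbix code or the attached patch) of how a frame-pointer-based backtrace typically walks the stack on 32-bit x86, and why a saved %ebp of 0 or garbage is fatal to an unguarded dumper: every step dereferences the previous frame pointer, so a bad value makes the crash handler itself crash.

/* Hypothetical sketch of a frame-pointer stack walk on 32-bit x86 (GCC).
 * Not the actual Zabbix implementation; it only illustrates why a zero or
 * garbage %ebp makes the fatal-information dumper crash unless checked. */
#include <stdio.h>

struct stack_frame
{
	struct stack_frame	*next;		/* saved %ebp of the caller */
	void			*ret_addr;	/* return address */
};

static void	dump_backtrace(struct stack_frame *frame)
{
	int	depth = 0;

	/* without the NULL check, frame->ret_addr would dereference address 0 */
	while (NULL != frame && depth < 64)
	{
		fprintf(stderr, "#%d %p\n", depth++, frame->ret_addr);
		frame = frame->next;
	}
}

int	main(void)
{
	dump_backtrace((struct stack_frame *)__builtin_frame_address(0));

	return 0;
}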

Comment by Jean-Denis Girard [ 2011 Jun 10 ]

I don't have access to this system until next week, but I will send the disassembly and apply the patch then.

Comment by Jean-Denis Girard [ 2011 Jun 23 ]

Here are the objdump and the log of a crash with the patch applied.

Comment by Aleksandrs Saveljevs [ 2011 Jun 27 ]

Thanks, but are you sure you have disassembled the zabbix_server that crashed (the one with the patch applied, not the old one)? I can see no instructions at 80850d2 (call 14 in the backtrace) or at 8090240 (call 7 in the backtrace). The last address is also in function zbx_tcp_listen(), not in zbx_tcp_send_ext() as the backtrace claims. Could you please check?

Comment by Jean-Denis Girard [ 2011 Jun 27 ]

Yes, the first disassembly I sent was for the unpatched version of zabbix_server; I thought you needed it for the older traces. Here is the objdump for the patched version.

Comment by Aleksandrs Saveljevs [ 2011 Jun 28 ]

Thanks, that looks more like it. Sorry if my previous request was misleading.

Comment by Aleksandrs Saveljevs [ 2011 Jun 28 ]

Attaching a slightly annotated excerpt of disassembly with references to Zabbix 1.8.5 source code (see analysis.asm).

So the scenario goes like this. The agent connects to the server, asking for active checks. The server obtains those active checks, prepares a JSON response, and tries to send it by calling zbx_tcp_send_ext(). There, while performing a write() on the socket, the server receives SIGPIPE (probably because something happened to the connection). In child_signal_handler(), it tries to print some debugging information about the SIGPIPE being received and crashes, because the "siginfo" pointer (the %ebx register) has the value 0x33 (as can be seen from the register dump), which is not exactly the most valid pointer out there.

Provided the analysis is correct, it is then highly interesting why an operating system would give such a bad pointer to a signal handler.
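
To make the failure mode concrete, here is a hypothetical sketch (not the real child_signal_handler() code, whose message format may differ) of an SA_SIGINFO handler that builds its log message from siginfo fields; invoking it with 0x33 instead of a valid pointer faults on the very first field access, which is a crash inside the signal handler itself.

/* Hypothetical sketch; reading si_pid/si_uid/si_code dereferences the
 * "siginfo" pointer, so a bogus value such as 0x33 causes SIGSEGV. */
#include <signal.h>
#include <stdio.h>

static void	handler_sketch(int sig, siginfo_t *siginfo, void *context)
{
	(void)context;	/* unused in this sketch */

	fprintf(stderr, "Got signal [signal:%d,sender_pid:%d,sender_uid:%d,reason:%d]\n",
			sig, (int)siginfo->si_pid, (int)siginfo->si_uid, siginfo->si_code);
}

int	main(void)
{
	/* simulate what the analysis describes: the handler invoked with a
	 * bad pointer instead of a valid siginfo_t -> immediate crash */
	handler_sketch(SIGPIPE, (siginfo_t *)0x33, NULL);

	return 0;
}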

Comment by Aleksandrs Saveljevs [ 2011 Jun 28 ]

Not sure where to start now, but maybe you could provide the Linux kernel version you are using (the output of "uname -a")?

Comment by Jean-Denis Girard [ 2011 Jun 29 ]

Linux Supervision_DDI.edt.pf 2.6.22.9-desktop-1mdv #1 SMP Thu Sep 27 04:07:04 CEST 2007 i686 Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz GNU/Linux

Comment by Aleksandrs Saveljevs [ 2011 Jun 30 ]

I have prepared a small program that prints out the pointers our signal handler receives along with SIGPIPE on your system.

Could you please compile and run this program using the following commands, and tell us what it prints to stderr?

$ gcc sigpipe.c -o sigpipe
$ ./sigpipe | head -5
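
(The exact source is in the attached sigpipe.c. As a rough idea only, a test like this usually looks like the following reconstruction, which is an assumption and may differ from the attachment: it installs an SA_SIGINFO handler for SIGPIPE, keeps writing lines to stdout, and reports the received pointers to stderr once head exits and closes the pipe.)

/* Rough reconstruction of what the attached sigpipe.c likely does; the real
 * attachment may differ. Once "head -5" exits, further writes to stdout
 * raise SIGPIPE and the handler reports the pointers it was given. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void	handler(int sig, siginfo_t *siginfo, void *context)
{
	/* fprintf() is not async-signal-safe, but is acceptable for this test */
	fprintf(stderr, "received signal:%d siginfo:%p context:%p\n",
			sig, (void *)siginfo, context);
}

int	main(void)
{
	int			i;
	struct sigaction	sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGPIPE, &sa, NULL);

	for (i = 0; i < 20; i++)
	{
		write(STDOUT_FILENO, "yes\n", 4);
		usleep(100000);
	}

	return 0;
}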

Comment by Jean-Denis Girard [ 2011 Jun 30 ]

Here is what I get:
$ ./sigpipe | head -5
yes
yes
yes
yes
yes
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec
received signal:13 siginfo:0xbfbdac6c context:0xbfbdacec

Comment by Aleksandrs Saveljevs [ 2011 Jul 14 ]

Apparently, the issue does not manifest itself with a simple program, so I have prepared another patch for Zabbix 1.8.5 server for you. See no-sigpipe-crash.diff.

The patch does two things. First, it incorporates the previous patch, no-stack-frame.diff, so that if the server crashes, we get some useful debugging information. Second, if the crash analysis is correct, it should protect your Zabbix server from crashing again, because it no longer tries to dereference the "siginfo" pointer upon receiving a SIGPIPE signal. However, it will now log a warning message containing the substring "ZBX-2467" every time it receives SIGPIPE. If you see a suspicious log message, please send the log to us. A suspicious log message is one that reports the "siginfo" pointer as having a low value, like 0x33.

There is no reason to run the server with DebugLevel=4; please run it with DebugLevel=3, as usual.
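
For reference, the idea behind the guard can be sketched like this (a self-contained approximation, not the actual no-sigpipe-crash.diff, which uses Zabbix's own logging): on SIGPIPE the handler logs only the raw pointer value with the "ZBX-2467" marker and never dereferences it, while other signals are handled as before.

/* Self-contained sketch of the guarded-handler idea described above; the
 * real patch is in the attachment and may differ in detail. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void	guarded_handler(int sig, siginfo_t *siginfo, void *context)
{
	(void)context;	/* unused in this sketch */

	if (SIGPIPE == sig)
	{
		/* log the pointer value itself; never dereference it */
		fprintf(stderr, "ZBX-2467 signal:%d(SIGPIPE) siginfo:%p\n", sig, (void *)siginfo);
		return;
	}

	/* for other signals the structure is still dereferenced as before */
	fprintf(stderr, "Got signal [signal:%d,sender_pid:%d,sender_uid:%d,reason:%d]. Exiting ...\n",
			sig, (int)siginfo->si_pid, (int)siginfo->si_uid, siginfo->si_code);
	exit(EXIT_FAILURE);
}

int	main(void)
{
	struct sigaction	sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = guarded_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGPIPE, &sa, NULL);

	raise(SIGPIPE);

	return 0;
}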

Comment by Jean-Denis Girard [ 2011 Aug 24 ]

Sorry for the delay, but there were vacations, so I didn't have access to the server. I applied the latest patch to zabbix-1.8.6 last week. Now it works fine. The following message appears quite randomly in the logs:
ZBX-2467 signal:13(SIGPIPE) siginfo:0x33
siginfo always has the same value.

Comment by richlv [ 2011 Oct 18 ]

Has the server with the debug patch applied crashed at least once?

Comment by Aleksandrs Saveljevs [ 2011 Oct 18 ]

Oh, I have been on vacation too, so I did not notice Jean-Denis's comment from August 24. It confirms that our analysis was correct: the system gives us a bad pointer in a signal handler. Now we have to figure out why that happens.

Comment by Aleksandrs Saveljevs [ 2014 Jun 02 ]

There has been no progress on the issue since 2011 and no further complaints from Jean-Denis regarding Zabbix, so it is proposed to close this issue. Please reopen if the problem is still relevant.
