[ZBX-8060] zabbix_server crash - in poller (when calculating queue) Created: 2014 Apr 10  Updated: 2018 Jul 20  Resolved: 2014 Apr 11

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.3
Fix Version/s: 2.2.4rc1, 2.3.0

Type: Incident report Priority: Major
Reporter: Robert Jerzak Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: crash, dm, queue, regression, squashable
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: crash-fix-plus-debug.patch (file), zabbix_server-crash.log (text file)
Issue Links:
Duplicate
duplicates ZBX-14627 Zabbix services crash Closed
is duplicated by ZBX-8084 Zabbix Server 2.2.3 CRASH Closed
is duplicated by ZBX-8154 Zabbix server crashing in trapper DCg... Closed
is duplicated by ZBX-8155 Zabbix server crashing in poller DCge... Closed
is duplicated by ZBX-8215 Zabbix Server crashes with segmentati... Closed
is duplicated by ZBX-8303 Zabbix server crash Closed

 Description   

zabbix_server crashed, log in attachment.



 Comments   
Comment by richlv [ 2014 Apr 10 ]

Backtrace:

 16252:20140410:111729.683 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0x48]. Crashing ...
 16252:20140410:111729.684 === Backtrace: ===
 16252:20140410:111729.690 10: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](print_fatal_info+0x9a) [0x449e58]
 16252:20140410:111729.690 9: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values]() [0x44a07e]
 16252:20140410:111729.690 8: /usr/lib/libc.so.6(+0x35240) [0x7ff5e7b25240]
 16252:20140410:111729.691 7: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](DCget_item_queue+0x80) [0x4422a8]
 16252:20140410:111729.691 6: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](get_value_internal+0x368) [0x41e594]
 16252:20140410:111729.691 5: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values]() [0x41d625]
 16252:20140410:111729.691 4: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](main_poller_loop+0x9e) [0x41e15a]
 16252:20140410:111729.691 3: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](MAIN_ZABBIX_ENTRY+0x545) [0x414fb8]
 16252:20140410:111729.691 2: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](daemon_start+0x211) [0x449978]
 16252:20140410:111729.691 1: /usr/lib/libc.so.6(__libc_start_main+0xf5) [0x7ff5e7b11a15]
 16252:20140410:111729.691 0: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values]() [0x4111c9]
Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ]

Might be a regression introduced in ZBX-5778.

Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ]

According to "refaddr:0x48", it crashes when host is NULL and we reference "host->maintenance_status".
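
The reasoning above can be sketched in C. With a NULL struct pointer, reading a member faults at an address equal to that member's offset, which is why a small "refaddr" such as 0x48 points at a NULL dereference. The struct below is a hypothetical stand-in, not the real Zabbix host record: the padding exists only to place maintenance_status at offset 0x48, and the example_* names are made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the cached host record. The real Zabbix
 * structure has different members; the padding only places
 * maintenance_status at offset 0x48 for illustration. */
typedef struct
{
	unsigned char	other_fields[0x48];
	unsigned char	maintenance_status;
}
example_host_t;

/* Address that a read of host->maintenance_status would touch if host
 * were NULL: base address 0 plus the member offset. */
size_t	example_fault_address(void)
{
	return offsetof(example_host_t, maintenance_status);
}

/* Guarded access: a missing host is reported as "not in maintenance"
 * instead of being dereferenced. Returns 1 if in maintenance, else 0. */
int	example_host_in_maintenance(const example_host_t *host)
{
	if (NULL == host)
		return 0;

	return 0 != host->maintenance_status;
}
```

A guard of this kind only prevents the crash; it does not explain why the host record is missing in the first place.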

Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8060 .

Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ]

The problem was also reported by Tibor Pittich (future) on #zabbix on Freenode, and he says the server crashes reliably every 5 minutes. So the problem does not occur only in rare circumstances triggered by parallel user actions in the frontend, as we thought it did.

While the fix in the development branch would probably fix the crash, it looks like it would just hide the bigger problem. Reopening to investigate.

Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ]

Robert, how reliably does the server crash? Would it be possible for you to run a patched version of Zabbix server that adds a bit more debugging information?

Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ]

In case you have time to run a patched version, I am attaching "crash-fix-plus-debug.patch". It should protect your server from crashing and it also adds a debugging message when unexpected data is encountered. No need to run with DebugLevel=4 for now, DebugLevel=3 suffices.

The patch prints the item ID, its key and the host ID. The host with the specified ID should be in the configuration cache, but is not there for some reason. Once you know the IDs, please tell us more about this item and the host: what type the item is, whether the host is linked to a template, whether it is monitored by a proxy, whether it was discovered by network discovery, whether its enabled/disabled status changes frequently, etc.

Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ]

Tibor Pittich on Freenode helped to debug this problem. The patched server in his case produced the following output:

32262:20140411:115800.411 DEBUG: host is NULL for itemid:100100000028744 key:'icmpping' hostid:12312300000010087

It can be seen that the item ID belongs to one node (which is local), whereas the host ID belongs to another. In the configuration cache, we only keep records for the local node, which is why the host record is unavailable.
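
The node mismatch can be checked with a little arithmetic. The sketch below assumes the ID layout used by distributed monitoring in the 2.2 series, where the node ID occupies the decimal digits above 10^14; that divisor and the example_* helper names are assumptions made for this illustration, not code taken from the fix.

```c
#include <stdint.h>

/* Assumption: the node ID is stored in the decimal digits above 10^14,
 * as in Zabbix 2.2 distributed monitoring. */
#define EXAMPLE_NODE_DIVISOR	UINT64_C(100000000000000)

/* Extract the node ID from a 64-bit object ID. */
int	example_nodeid_from_id(uint64_t id)
{
	return (int)(id / EXAMPLE_NODE_DIVISOR);
}

/* Returns 1 if the item and the host belong to the same node, else 0. */
int	example_same_node(uint64_t itemid, uint64_t hostid)
{
	return example_nodeid_from_id(itemid) == example_nodeid_from_id(hostid);
}
```

Under that assumption, itemid 100100000028744 maps to node 1 and hostid 12312300000010087 maps to node 123, so example_same_node() returns 0 for the pair in the log line above.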

We discussed this with sasha and this is one of the major issues with distributed monitoring, which is being removed in 2.4 (see ZBXNEXT-1343). Apparently, the frontend or API created an item on a host from another node and gave it an incorrect ID.

No additional fixes in the development branch are necessary.

Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ]

Fixed in pre-2.2.4 r44330 and pre-2.3.0 (trunk) r44331.

Comment by Filipe Paternot [ 2014 Jun 13 ]

I have been running with the patch crash-fix-plus-debug.patch without problems for a few weeks now. You should try it or wait for 2.2.4 to come out, which should be soon.

Comment by Fusic [ 2014 Jun 13 ]

Can you give me a hint on how to work with this patch file? Or a website with a howto?
Thanks in advance!

Comment by Filipe Paternot [ 2014 Jun 13 ]

Sure...

This is what I did:

  1. cd zabbix-2.2.3/
  2. patch -p1 < /tmp/crash-fix-plus-debug.patch
  3. ./configure --with-mysql --enable-ipv6 --with-net-snmp --with-libcurl --with-ldap --with-openipmi --enable-server --prefix=/...../zabbix/
  4. make install