[ZBX-8060] zabbix_server crash - in poller (when calculating queue)
Created: 2014 Apr 10 | Updated: 2018 Jul 20 | Resolved: 2014 Apr 11 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.2.3 |
Fix Version/s: | 2.2.4rc1, 2.3.0 |
Type: | Incident report | Priority: | Major |
Reporter: | Robert Jerzak | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | crash, dm, queue, regression, squashable |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Attachments: | crash-fix-plus-debug.patch, zabbix_server-crash.log |
Issue Links: |
|
Description |
zabbix_server crashed, log in attachment. |
Comments |
Comment by richlv [ 2014 Apr 10 ] |
Backtrace:
16252:20140410:111729.683 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0x48]. Crashing ...
16252:20140410:111729.684 === Backtrace: ===
16252:20140410:111729.690 10: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](print_fatal_info+0x9a) [0x449e58]
16252:20140410:111729.690 9: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values]() [0x44a07e]
16252:20140410:111729.690 8: /usr/lib/libc.so.6(+0x35240) [0x7ff5e7b25240]
16252:20140410:111729.691 7: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](DCget_item_queue+0x80) [0x4422a8]
16252:20140410:111729.691 6: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](get_value_internal+0x368) [0x41e594]
16252:20140410:111729.691 5: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values]() [0x41d625]
16252:20140410:111729.691 4: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](main_poller_loop+0x9e) [0x41e15a]
16252:20140410:111729.691 3: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](MAIN_ZABBIX_ENTRY+0x545) [0x414fb8]
16252:20140410:111729.691 2: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values](daemon_start+0x211) [0x449978]
16252:20140410:111729.691 1: /usr/lib/libc.so.6(__libc_start_main+0xf5) [0x7ff5e7b11a15]
16252:20140410:111729.691 0: /usr/local/sbin/zabbix_server: poller #70 [got 9 values in 0.258612 sec, getting values]() [0x4111c9] |
Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ] |
Might be a regression introduced in |
Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ] |
According to "refaddr:0x48", it crashes when host is NULL and we reference "host->maintenance_status". |
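The reasoning behind that reading can be sketched in C: when a struct pointer is NULL, a member access touches the address equal to the member's offset, so the reported refaddr is simply the offset of the field. The struct below is a hypothetical stand-in, not the real Zabbix host record; its padding is chosen only so that `maintenance_status` lands at offset 0x48, as the crash log suggests.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached host record; NOT the real Zabbix struct. */
/* The padding is picked only so that maintenance_status sits at offset 0x48.  */
typedef struct
{
	uint64_t	hostid;			/* offset 0x00 */
	char		padding[0x40];		/* offset 0x08, illustrative filler */
	unsigned char	maintenance_status;	/* offset 0x48 */
}
demo_host_t;

/* Address that reading host->maintenance_status would touch. */
static uintptr_t	demo_faulting_address(const demo_host_t *host)
{
	return (uintptr_t)host + offsetof(demo_host_t, maintenance_status);
}
```

Calling `demo_faulting_address(NULL)` yields 0x48 on common platforms, which is how a small refaddr in a SIGSEGV report points at a NULL struct pointer rather than at a wild one.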
Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8060 . |
Comment by Aleksandrs Saveljevs [ 2014 Apr 10 ] |
The problem was also reported by Tibor Pittich (future) on #zabbix on Freenode, and he says the server crashes reliably every 5 minutes. So the problem does not seem to occur only in rare circumstances initiated by parallel user actions in the frontend, as we thought. While the fix in the development branch would probably fix the crash, it looks like it would just hide the bigger problem. Reopening to investigate. |
Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ] |
Robert, how reliably does the server crash? Would it be possible for you to run a patched version of Zabbix server that adds a bit more debugging information? |
Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ] |
In case you have time to run a patched version, I am attaching "crash-fix-plus-debug.patch". It should protect your server from crashing and it also adds a debugging message when unexpected data is encountered. No need to run with DebugLevel=4 for now, DebugLevel=3 suffices. The patch prints the item ID, its key and host ID. The host with the specified ID should be in the configuration cache, but is not there for some reason. Once you know their ID, please tell us more about this item and the host: what type the item is, is the host linked to a template, is it monitored by a proxy, is it discovered by network discovery, does it change its enabled/disabled status frequently, etc. |
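The general shape of such a guard can be sketched as follows. This is not the contents of the attached patch: the types, the linear lookup, and every `demo_` name are hypothetical, and the message goes to stderr rather than through the Zabbix logging facility. The sketch only shows the idea of reporting the item ID, key and host ID and then skipping the item instead of dereferencing a NULL host.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified records; not the real Zabbix cache structures. */
typedef struct
{
	uint64_t	hostid;
	unsigned char	maintenance_status;	/* 1 = host is in maintenance */
}
demo_host_t;

typedef struct
{
	uint64_t	itemid;
	uint64_t	hostid;
	const char	*key;
}
demo_item_t;

/* Linear search standing in for the cache lookup; returns NULL when absent. */
static const demo_host_t	*demo_find_host(const demo_host_t *hosts, size_t hosts_num, uint64_t hostid)
{
	size_t	i;

	for (i = 0; i < hosts_num; i++)
	{
		if (hosts[i].hostid == hostid)
			return &hosts[i];
	}

	return NULL;
}

/* Count items whose host is present and not in maintenance. */
static int	demo_count_queue(const demo_item_t *items, size_t items_num,
		const demo_host_t *hosts, size_t hosts_num)
{
	size_t	i;
	int	count = 0;

	for (i = 0; i < items_num; i++)
	{
		const demo_host_t	*host = demo_find_host(hosts, hosts_num, items[i].hostid);

		if (NULL == host)
		{
			/* report the inconsistency and skip instead of dereferencing NULL */
			fprintf(stderr, "host is NULL for itemid:%" PRIu64 " key:'%s' hostid:%" PRIu64 "\n",
					items[i].itemid, items[i].key, items[i].hostid);
			continue;
		}

		if (0 == host->maintenance_status)
			count++;
	}

	return count;
}
```

An item whose host is missing from the array is logged and left out of the count, so the caller keeps running and the log identifies which records to inspect.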
Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ] |
Tibor Pittich on Freenode helped to debug this problem. The patched server in his case produced the following output:
32262:20140411:115800.411 DEBUG: host is NULL for itemid:100100000028744 key:'icmpping' hostid:12312300000010087
It can be seen that the item ID belongs to one node (which is local), whereas the host ID belongs to another. In the configuration cache, we only keep records for the local node, which is why the host record is unavailable. We discussed this with sasha and this is one of the major issues with distributed monitoring, which is being removed in 2.4 (see No additional fixes in the development branch are necessary. |
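The node mismatch can be checked with a little arithmetic. The sketch below assumes the distributed-monitoring ID layout in which the node ID is the part of the identifier above 10^14; that divisor and the `demo_` names are assumptions made for illustration, not taken from this ticket.

```c
#include <stdint.h>

/* Assumed layout: node ID = id / 10^14 (illustrative, not confirmed here). */
#define DEMO_NODE_DIVISOR	UINT64_C(100000000000000)

/* Extract the node ID encoded in an identifier under the assumed layout. */
static unsigned int	demo_node_of(uint64_t id)
{
	return (unsigned int)(id / DEMO_NODE_DIVISOR);
}

/* Return 1 when both identifiers belong to the same node, 0 otherwise. */
static int	demo_same_node(uint64_t id_a, uint64_t id_b)
{
	return demo_node_of(id_a) == demo_node_of(id_b);
}
```

Under that assumption, itemid 100100000028744 maps to node 1 and hostid 12312300000010087 maps to node 123, so `demo_same_node()` returns 0 for the pair from the debug message.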
Comment by Aleksandrs Saveljevs [ 2014 Apr 11 ] |
Fixed in pre-2.2.4 r44330 and pre-2.3.0 (trunk) r44331. |
Comment by Filipe Paternot [ 2014 Jun 13 ] |
I'm running with the patch crash-fix-plus-debug.patch without problems for a few weeks now. You should try it or wait for 2.2.4 to come out, which should be soon. |
Comment by Fusic [ 2014 Jun 13 ] |
Can you give me a hint on how to work with this patch file? Or a website with a howto? |
Comment by Filipe Paternot [ 2014 Jun 13 ] |
Sure... This is what I did:
|