[ZBX-10692] Zabbix server crash in escalator in DCexpression_expand_user_macros() Created: 2016 Apr 21 Updated: 2017 May 30 Resolved: 2016 May 06 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 3.0.2 |
Fix Version/s: | 3.0.3rc1, 3.2.0alpha1 |
Type: | Incident report | Priority: | Blocker |
Reporter: | Ilya Kruchinin | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | crash, escalator, server | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Updated to 3.0.2 today. The server crashed within 5 hours of running |
Attachments: | ZBX-10692.patch objdump.dump.tgz zabbix segfault.txt |
Description |
The server was running fine for about 5 hours after updating to v3.0.2. |
Comments |
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 21 ] |
Backtrace for easier search: 6988:20160421:160051.920 === Backtrace: === 6988:20160421:160051.921 15: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](print_fatal_info+0xae) [0x47e06e] 6988:20160421:160051.921 14: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x47e332] 6988:20160421:160051.921 13: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7f783d92c0e0] 6988:20160421:160051.921 12: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x4634f4] 6988:20160421:160051.921 11: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](DCexpression_expand_user_macros+0x21) [0x46dda1] 6988:20160421:160051.921 10: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](substitute_simple_macros+0xde0) [0x456a40] 6988:20160421:160051.921 9: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](substitute_simple_macros+0x385e) [0x4594be] 6988:20160421:160051.921 8: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x4405d0] 6988:20160421:160051.921 7: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x440def] 6988:20160421:160051.921 6: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](escalator_thread+0x157) [0x4410f7] 6988:20160421:160051.921 5: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](zbx_thread_start+0x45) [0x47ed65] 6988:20160421:160051.921 4: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](MAIN_ZABBIX_ENTRY+0x53c) [0x41d04c] 6988:20160421:160051.921 3: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](daemon_start+0x1c3) [0x47d903] 6988:20160421:160051.921 2: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](main+0x2f8) [0x417d78] 6988:20160421:160051.921 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f783d918b45] 6988:20160421:160051.921 0: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x418043] Could you attach the result of the following command? objdump -DSswx zabbix_server |
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 21 ] |
It's difficult to say for sure without a binary, but most likely call stack looks like so: main escalator loop -> process_escalation() -> execute_escalations() -> process_recovery_msg() -> substitute_simple_macros() -> substitute_simple_macros(..., MACRO_TYPE_TRIGGER_DESCRIPTION, ...) -> DCexpression_expand_user_macro() -> dc_express_expand_user_macros() -> dc_expand_user_macros() -> strchr(). In this scenario event (with SOURCE_TRIGGERS) has somehow event.trigger.expression == NULL. Actually, get_event_info() guarantees only that event trigger fields are initialized if event object is EVENT_OBJECT_TRIGGER. If event source is SOURCE_TRIGGERS but event object is not EVENT_OBJECT_TRIGGER we may eventually pass NULL to strchr(). To check this please try to select * from events where source=0 and object<>0; |
Comment by Ilya Kruchinin [ 2016 Apr 26 ] |
The query you mentioned, here's the output (or lack of one): |
Comment by Ilya Kruchinin [ 2016 Apr 26 ] |
The "zabbix_server" I'm using is the binary from the official repo |
Comment by Ilya Kruchinin [ 2016 Apr 26 ] |
Added "objdump -DSswx /usr/sbin/zabbix_server > /root/objdump.dump" output (file attached as a tar.gz archive) Please let me know if you need any additional information. |
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 28 ] |
My initial guess was a bit off, but now I think I've localised the place where crash happens. Thank you for the provided information! Can you check what these commands give you? describe globalmacro; describe hostmacro; |
Comment by Ilya Kruchinin [ 2016 Apr 29 ] |
MariaDB [zabbix]> describe globalmacro; +---------------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+---------------------+------+-----+---------+-------+ | globalmacroid | bigint(20) unsigned | NO | PRI | NULL | | | macro | varchar(255) | NO | UNI | | | | value | varchar(255) | NO | | | | +---------------+---------------------+------+-----+---------+-------+ 3 rows in set (0.00 sec) MariaDB [zabbix]> describe hostmacro; +-------------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------+---------------------+------+-----+---------+-------+ | hostmacroid | bigint(20) unsigned | NO | PRI | NULL | | | hostid | bigint(20) unsigned | NO | MUL | NULL | | | macro | varchar(255) | NO | | | | | value | varchar(255) | NO | | | | +-------------+---------------------+------+-----+---------+-------+ 4 rows in set (0.00 sec) |
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 29 ] |
Thank you for response! So far everything looks fine. Can you see any suspicious entries in these tables? Like with empty macro field? Can you recall adding/deleting user/host macros and/or their context before the crash happened? |
Comment by Ilya Kruchinin [ 2016 May 02 ] |
Yes, I was changing the templates/hosts USERMACROs at about the same time when the crash happened. There are currently no MACROs with an empty MACRO field (but the server hasn't crashed since then, all looks good): MariaDB [zabbix]> select * from hostmacro where macro =''; Empty set (0.00 sec) There are 3 MACROS with empty VALUE fields (which, I guess, may not be a problem, just an empty value, and this is intended (undefined value in the template) from my part): MariaDB [zabbix]> select * from hostmacro where value =''; +-------------+--------+-----------------+-------+ | hostmacroid | hostid | macro | value | +-------------+--------+-----------------+-------+ | 187 | 10196 | {$IPMIIP} | | | 472 | 10355 | {$PLESKDNSNAME} | | | 474 | 10357 | {$PLESKDNSNAME} | | +-------------+--------+-----------------+-------+ There hasn't been an issue since the last crash - all working fine. |
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 02 ] |
Empty value shouldn't be an issue. Although I don't have a clear idea where and how Zabbix crashed in your case, there are couple of places where Zabbix can be reinforced. Thank you for cooperation and let me know if the crash happens again and/or new details become available. |
Comment by Oleksii Zagorskyi [ 2016 May 04 ] |
Got the same constant crash on escalator on 3.0.2, using postgres. 20822:20160504:213801.265 temporarily disabling Zabbix agent checks on host "AWS-L": host unavailable 20822:20160504:213801.359 temporarily disabling Zabbix agent checks on host "AWS-T": host unavailable 20822:20160504:213801.600 temporarily disabling Zabbix agent checks on host "Test-host": host unavailable 20853:20160504:213803.309 escalation cancelled: trigger id:14839 deleted. 20853:20160504:213803.325 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ... .... 20853:20160504:213803.325 === Backtrace: === 20853:20160504:213803.348 15: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](print_fatal_info+0x9e) [0x46b58e] 20853:20160504:213803.348 14: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x46b847] 20853:20160504:213803.348 13: /lib64/libc.so.6(+0x35640) [0x7f1c27af6640] 20853:20160504:213803.348 12: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x450fa4] 20853:20160504:213803.348 11: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](DCexpression_expand_user_macros+0x21) [0x45b751] 20853:20160504:213803.348 10: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](substitute_simple_macros+0xd50) [0x4444e0] 20853:20160504:213803.348 9: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](substitute_simple_macros+0x42a3) [0x447a33] 20853:20160504:213803.348 8: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x433432] 20853:20160504:213803.348 7: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x4341cf] 20853:20160504:213803.348 6: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](escalator_thread+0x143) [0x434463] 20853:20160504:213803.348 5: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](zbx_thread_start+0x45) [0x46c165] 20853:20160504:213803.348 4: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](MAIN_ZABBIX_ENTRY+0x52b) [0x41919b] 20853:20160504:213803.348 3: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](daemon_start+0x1ad) [0x46aecd] 20853:20160504:213803.348 2: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](main+0x336) [0x414246] 20853:20160504:213803.348 1: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1c27ae2af5] 20853:20160504:213803.348 0: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x4144b1] I have those 3 tables dump. To be able run zabbix server again I've cleaned out escalations table. |
Comment by Oleksii Zagorskyi [ 2016 May 05 ] |
Wow, I can reproduce it! CONFIRMED |
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 05 ] |
Can't wait! |
Comment by Oleksii Zagorskyi [ 2016 May 06 ] |
Well, while I've found a few issues and first of all I've faced with complex conditions how to reproduce the crash, now I've simplified the scenario:
When trigger will be recalculated - it will become unknown (state=1) with error message "Cannot evaluate expression: macro '{$FSPFREE}' is not found." 8122:20160506:093338.346 In process_escalations() 8122:20160506:093338.346 query [txnlev:0] [select escalationid,actionid,triggerid,eventid,r_eventid,nextcheck,esc_step,status,itemid from escalations where triggerid is not null order by actionid,triggerid,itemid,escalationid] 8122:20160506:093338.346 query [txnlev:1] [begin;] 8122:20160506:093338.346 In check_escalation() escalationid:1 status:active 8122:20160506:093338.346 query [txnlev:1] [select source,object from events where eventid=67] 8122:20160506:093338.346 query [txnlev:1] [select actionid,name,status,eventsource,esc_period,def_shortdata,def_longdata,r_shortdata,r_longdata,recovery_msg from actions where actionid=8] 8122:20160506:093338.347 End of check_escalation():SUCCEED error: '(null)' 8122:20160506:093338.347 In execute_escalation() escalationid:1 status:active 8122:20160506:093338.347 query [txnlev:1] [select eventid,source,object,objectid,clock,value,acknowledged,ns from events where eventid=67] 8122:20160506:093338.347 query [txnlev:1] [select description,expression,priority,comments,url from triggers where triggerid=13561] 8122:20160506:093338.347 In execute_operations() 8122:20160506:093338.347 query [txnlev:1] [select o.operationid,o.operationtype,o.esc_period,o.evaltype,m.operationid,m.default_msg,subject,message,mediatypeid from operations o left join opmessage m on m.operationid=o.operationid where o.actionid=8 ando.operationtype in (0,1) and o.esc_step_from<=1 and (o.esc_step_to=0 or o.esc_step_to>=1)] 8122:20160506:093338.347 In check_operation_conditions() operationid:8 8122:20160506:093338.347 query [txnlev:1] [select conditiontype,operator,value from opconditions where operationid=8 order by conditiontype] 8122:20160506:093338.347 End of check_operation_conditions():SUCCEED 8122:20160506:093338.347 Conditions match our event. Execute operation. 8122:20160506:093338.347 In add_object_msg() 8122:20160506:093338.347 query [txnlev:1] [select userid from opmessage_usr where operationid=8 union select g.userid from opmessage_grp m,users_groups g where m.usrgrpid=g.usrgrpid and m.operationid=8] 8122:20160506:093338.348 query [txnlev:1] [select count(*) from usrgrp g,users_groups ug where ug.userid=1 and g.usrgrpid=ug.usrgrpid and g.users_status=1] 8122:20160506:093338.348 In get_trigger_permission() 8122:20160506:093338.348 query [txnlev:1] [select distinct i.hostid from items i,functions f where i.itemid=f.itemid and f.triggerid=13561] 8122:20160506:093338.348 In get_host_permission() 8122:20160506:093338.348 query [txnlev:1] [select type from users where userid=1] 8122:20160506:093338.349 End of get_host_permission():rw 8122:20160506:093338.349 End of get_trigger_permission():rw 8122:20160506:093338.349 In substitute_simple_macros() data:'{TRIGGER.NAME}' 8122:20160506:093338.349 In substitute_simple_macros() data:'Some myname more than 500' 8122:20160506:093338.349 In get_N_functionid() expression:'{13250}>{$FSPFREE}' N_functionid:1 8122:20160506:093338.349 get_N_functionid() functionid:13250 8122:20160506:093338.349 End of get_N_functionid():SUCCEED 8122:20160506:093338.349 In get_N_functionid() expression:'>{$FSPFREE}' N_functionid:1 8122:20160506:093338.349 End of get_N_functionid():FAIL 8122:20160506:093338.349 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ... 8122:20160506:093338.349 === Backtrace: === 8122:20160506:093338.350 15: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](print_fatal_info+0x9e) [0x47898e] 8122:20160506:093338.350 14: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x478c32] 8122:20160506:093338.350 13: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7fe9ac6980e0] 8122:20160506:093338.350 12: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x45edf4] 8122:20160506:093338.350 11: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](DCexpression_expand_user_macros+0x21) [0x468bf1] 8122:20160506:093338.350 10: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](substitute_simple_macros+0xd68) [0x452ae8] 8122:20160506:093338.350 9: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](substitute_simple_macros+0x3f23) [0x455ca3] 8122:20160506:093338.350 8: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x43d3f0] 8122:20160506:093338.350 7: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x43dbdf] 8122:20160506:093338.350 6: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](escalator_thread+0x13f) [0x43decf] 8122:20160506:093338.350 5: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](zbx_thread_start+0x45) [0x479515] 8122:20160506:093338.350 4: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](MAIN_ZABBIX_ENTRY+0x598) [0x41b198] 8122:20160506:093338.350 3: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](daemon_start+0x1b0) [0x4782d0] 8122:20160506:093338.350 2: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](main+0x388) [0x4160d8] 8122:20160506:093338.350 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fe9ac684b45] 8122:20160506:093338.350 0: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x416365] 8094:20160506:093340.357 Zabbix Server stopped. Zabbix 3.0.3rc1 (revision 59873). |
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 06 ] |
Oh, finally! After so many hours... Long story short. To expand {TRIGGER.NAME} macro properly we need to take trigger name and resolve all macros there. Trigger name may contain positional macro ($1, $2, ...) and therefore we need to parse trigger expression in search for constants. To do this we must first resolve user macros in trigger expression. If there is an unknown macro dc_expand_user_macros() wants to provide an error message with *error = zbx_dsprintf(*error, "macro '%.*s' is not found", len, ptr);
But substitute_simple_macros() is not bothered and simply passes NULL as error: DCexpression_expand_user_macros(event->trigger.expression, NULL) Caused by Huge thanks to zalex_ua for so much effort put into investigation! |
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 06 ] |
Fix for version 3.0 is available in development branch svn://svn.zabbix.com/branches/dev/ZBX-10692 revision 59917. Patch attached for those who need it urgently. |
Comment by Oleksii Zagorskyi [ 2016 May 06 ] |
More complex scenario how it's possible to trigger such a crash (only once for each trigger prototype) described in |
Comment by Alexander Vladishev [ 2016 May 06 ] |
Successfully tested! Have a look at my changes in r59921. glebs.ivanovskis Changes look fine. Accepted! |
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 06 ] |
Fixed in pre-3.0.3rc1 r59924, pre-3.1.0 (trunk) r59926. |