[ZBX-10692] Zabbix server crash in escalator in DCexpression_expand_user_macros() Created: 2016 Apr 21  Updated: 2017 May 30  Resolved: 2016 May 06

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.0.2
Fix Version/s: 3.0.3rc1, 3.2.0alpha1

Type: Incident report Priority: Blocker
Reporter: Ilya Kruchinin Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: crash, escalator, server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Updated to 3.0.2 today. The server crashed within 5 hours of running


Attachments: File ZBX-10692.patch     File objdump.dump.tgz     Text File zabbix segfault.txt    

 Description   

The server was running fine for about 5 hours after updating to v3.0.2.
I was adding hosts/configuring the server, when all of a sudden the server crashed.



 Comments   
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 21 ]

Backtrace for easier search:

  6988:20160421:160051.920 === Backtrace: ===
  6988:20160421:160051.921 15: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](print_fatal_info+0xae) [0x47e06e]
  6988:20160421:160051.921 14: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x47e332]
  6988:20160421:160051.921 13: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7f783d92c0e0]
  6988:20160421:160051.921 12: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x4634f4]
  6988:20160421:160051.921 11: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](DCexpression_expand_user_macros+0x21) [0x46dda1]
  6988:20160421:160051.921 10: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](substitute_simple_macros+0xde0) [0x456a40]
  6988:20160421:160051.921 9: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](substitute_simple_macros+0x385e) [0x4594be]
  6988:20160421:160051.921 8: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x4405d0]
  6988:20160421:160051.921 7: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x440def]
  6988:20160421:160051.921 6: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](escalator_thread+0x157) [0x4410f7]
  6988:20160421:160051.921 5: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](zbx_thread_start+0x45) [0x47ed65]
  6988:20160421:160051.921 4: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](MAIN_ZABBIX_ENTRY+0x53c) [0x41d04c]
  6988:20160421:160051.921 3: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](daemon_start+0x1c3) [0x47d903]
  6988:20160421:160051.921 2: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations](main+0x2f8) [0x417d78]
  6988:20160421:160051.921 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f783d918b45]
  6988:20160421:160051.921 0: /usr/sbin/zabbix_server: escalator #4 [processed 0 escalations in 0.001460 sec, processing escalations]() [0x418043]

Could you attach the result of the following command?

objdump -DSswx zabbix_server
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 21 ]

It's difficult to say for sure without a binary, but most likely call stack looks like so: main escalator loop -> process_escalation() -> execute_escalations() -> process_recovery_msg() -> substitute_simple_macros() -> substitute_simple_macros(..., MACRO_TYPE_TRIGGER_DESCRIPTION, ...) -> DCexpression_expand_user_macro() -> dc_express_expand_user_macros() -> dc_expand_user_macros() -> strchr().

In this scenario event (with SOURCE_TRIGGERS) has somehow event.trigger.expression == NULL. Actually, get_event_info() guarantees only that event trigger fields are initialized if event object is EVENT_OBJECT_TRIGGER. If event source is SOURCE_TRIGGERS but event object is not EVENT_OBJECT_TRIGGER we may eventually pass NULL to strchr().

To check this please try to

select * from events where source=0 and object<>0;
Comment by Ilya Kruchinin [ 2016 Apr 26 ]

The query you mentioned, here's the output (or lack of one):
MariaDB [zabbix]> select * from events where source=0 and object<>0;
Empty set (0.00 sec)

Comment by Ilya Kruchinin [ 2016 Apr 26 ]

The "zabbix_server" I'm using is the binary from the official repo

Comment by Ilya Kruchinin [ 2016 Apr 26 ]

Added "objdump -DSswx /usr/sbin/zabbix_server > /root/objdump.dump" output (file attached as a tar.gz archive)
The zabbix_server binary is from the official repo, running on Debian 8 with Maria DB as the DB backend.

Please let me know if you need any additional information.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 28 ]

My initial guess was a bit off, but now I think I've localised the place where crash happens. Thank you for the provided information!

Can you check what these commands give you?

describe globalmacro;
describe hostmacro;
Comment by Ilya Kruchinin [ 2016 Apr 29 ]
MariaDB [zabbix]> describe globalmacro;
+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| globalmacroid | bigint(20) unsigned | NO   | PRI | NULL    |       |
| macro         | varchar(255)        | NO   | UNI |         |       |
| value         | varchar(255)        | NO   |     |         |       |
+---------------+---------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

MariaDB [zabbix]> describe hostmacro;
+-------------+---------------------+------+-----+---------+-------+
| Field       | Type                | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+-------+
| hostmacroid | bigint(20) unsigned | NO   | PRI | NULL    |       |
| hostid      | bigint(20) unsigned | NO   | MUL | NULL    |       |
| macro       | varchar(255)        | NO   |     |         |       |
| value       | varchar(255)        | NO   |     |         |       |
+-------------+---------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
Comment by Glebs Ivanovskis (Inactive) [ 2016 Apr 29 ]

Thank you for response! So far everything looks fine.

Can you see any suspicious entries in these tables? Like with empty macro field? Can you recall adding/deleting user/host macros and/or their context before the crash happened?

Comment by Ilya Kruchinin [ 2016 May 02 ]

Yes, I was changing the templates/hosts USERMACROs at about the same time when the crash happened.

There are currently no MACROs with an empty MACRO field (but the server hasn't crashed since then, all looks good):

MariaDB [zabbix]> select * from hostmacro where macro ='';
Empty set (0.00 sec)

There are 3 MACROS with empty VALUE fields (which, I guess, may not be a problem, just an empty value, and this is intended (undefined value in the template) from my part):

MariaDB [zabbix]> select * from hostmacro where value ='';
+-------------+--------+-----------------+-------+
| hostmacroid | hostid | macro           | value |
+-------------+--------+-----------------+-------+
|         187 |  10196 | {$IPMIIP}       |       |
|         472 |  10355 | {$PLESKDNSNAME} |       |
|         474 |  10357 | {$PLESKDNSNAME} |       |
+-------------+--------+-----------------+-------+

There hasn't been an issue since the last crash - all working fine.

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 02 ]

Empty value shouldn't be an issue. Although I don't have a clear idea where and how Zabbix crashed in your case, there are couple of places where Zabbix can be reinforced. Thank you for cooperation and let me know if the crash happens again and/or new details become available.

Comment by Oleksii Zagorskyi [ 2016 May 04 ]

Got the same constant crash on escalator on 3.0.2, using postgres.

 20822:20160504:213801.265 temporarily disabling Zabbix agent checks on host "AWS-L": host unavailable
 20822:20160504:213801.359 temporarily disabling Zabbix agent checks on host "AWS-T": host unavailable
 20822:20160504:213801.600 temporarily disabling Zabbix agent checks on host "Test-host": host unavailable
 20853:20160504:213803.309 escalation cancelled: trigger id:14839 deleted.
 20853:20160504:213803.325 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...
....
 20853:20160504:213803.325 === Backtrace: ===
 20853:20160504:213803.348 15: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](print_fatal_info+0x9e) [0x46b58e]
 20853:20160504:213803.348 14: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x46b847]
 20853:20160504:213803.348 13: /lib64/libc.so.6(+0x35640) [0x7f1c27af6640]
 20853:20160504:213803.348 12: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x450fa4]
 20853:20160504:213803.348 11: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](DCexpression_expand_user_macros+0x21) [0x45b751]
 20853:20160504:213803.348 10: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](substitute_simple_macros+0xd50) [0x4444e0]
 20853:20160504:213803.348 9: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](substitute_simple_macros+0x42a3) [0x447a33]
 20853:20160504:213803.348 8: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x433432]
 20853:20160504:213803.348 7: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x4341cf]
 20853:20160504:213803.348 6: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](escalator_thread+0x143) [0x434463]
 20853:20160504:213803.348 5: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](zbx_thread_start+0x45) [0x46c165]
 20853:20160504:213803.348 4: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](MAIN_ZABBIX_ENTRY+0x52b) [0x41919b]
 20853:20160504:213803.348 3: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](daemon_start+0x1ad) [0x46aecd]
 20853:20160504:213803.348 2: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations](main+0x336) [0x414246]
 20853:20160504:213803.348 1: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1c27ae2af5]
 20853:20160504:213803.348 0: /zabbix/sbin/zabbix_server: escalator #1 [processed 0 escalations in 0.705582 sec, processing escalations]() [0x4144b1]

I have those 3 tables dump.

To be able run zabbix server again I've cleaned out escalations table.
I feel that it could be related to context macro, which I'm starting to use.

Comment by Oleksii Zagorskyi [ 2016 May 05 ]

Wow, I can reproduce it! CONFIRMED
More details coming tomorrow morning.

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 05 ]

Can't wait!

Comment by Oleksii Zagorskyi [ 2016 May 06 ]

Well, while I've found a few issues and first of all I've faced with complex conditions how to reproduce the crash, now I've simplified the scenario:

  • have an action for internal events for triggers with a macro {TRIGGER.NAME} (will be enough), for example you can just enable existing action "Report unknown triggers".
  • have a trigger with expression {host:test.item.last(0)}>{$FSPFREE} and the trigger is in OK/PROBLEM state, i.e. it's not unknown (state =! 1)
  • make sure that the user macro from expression does NOT exist in your configuration
  • send a valid value to the item

When trigger will be recalculated - it will become unknown (state=1) with error message "Cannot evaluate expression: macro '{$FSPFREE}' is not found."
In the same time internal event and escalation will be generated.
After a few seconds escalator will process the record and will crash:

  8122:20160506:093338.346 In process_escalations()
  8122:20160506:093338.346 query [txnlev:0] [select escalationid,actionid,triggerid,eventid,r_eventid,nextcheck,esc_step,status,itemid from escalations where triggerid is not null order by actionid,triggerid,itemid,escalationid]
  8122:20160506:093338.346 query [txnlev:1] [begin;]
  8122:20160506:093338.346 In check_escalation() escalationid:1 status:active
  8122:20160506:093338.346 query [txnlev:1] [select source,object from events where eventid=67]
  8122:20160506:093338.346 query [txnlev:1] [select actionid,name,status,eventsource,esc_period,def_shortdata,def_longdata,r_shortdata,r_longdata,recovery_msg from actions where actionid=8]
  8122:20160506:093338.347 End of check_escalation():SUCCEED error: '(null)'
  8122:20160506:093338.347 In execute_escalation() escalationid:1 status:active
  8122:20160506:093338.347 query [txnlev:1] [select eventid,source,object,objectid,clock,value,acknowledged,ns from events where eventid=67]
  8122:20160506:093338.347 query [txnlev:1] [select description,expression,priority,comments,url from triggers where triggerid=13561]
  8122:20160506:093338.347 In execute_operations()
  8122:20160506:093338.347 query [txnlev:1] [select o.operationid,o.operationtype,o.esc_period,o.evaltype,m.operationid,m.default_msg,subject,message,mediatypeid from operations o left join opmessage m on m.operationid=o.operationid where o.actionid=8 ando.operationtype in (0,1) and o.esc_step_from<=1 and (o.esc_step_to=0 or o.esc_step_to>=1)]
  8122:20160506:093338.347 In check_operation_conditions() operationid:8
  8122:20160506:093338.347 query [txnlev:1] [select conditiontype,operator,value from opconditions where operationid=8 order by conditiontype]
  8122:20160506:093338.347 End of check_operation_conditions():SUCCEED
  8122:20160506:093338.347 Conditions match our event. Execute operation.
  8122:20160506:093338.347 In add_object_msg()
  8122:20160506:093338.347 query [txnlev:1] [select userid from opmessage_usr where operationid=8 union select g.userid from opmessage_grp m,users_groups g where m.usrgrpid=g.usrgrpid and m.operationid=8]
  8122:20160506:093338.348 query [txnlev:1] [select count(*) from usrgrp g,users_groups ug where ug.userid=1 and g.usrgrpid=ug.usrgrpid and g.users_status=1]
  8122:20160506:093338.348 In get_trigger_permission()
  8122:20160506:093338.348 query [txnlev:1] [select distinct i.hostid from items i,functions f where i.itemid=f.itemid and f.triggerid=13561]
  8122:20160506:093338.348 In get_host_permission()
  8122:20160506:093338.348 query [txnlev:1] [select type from users where userid=1]
  8122:20160506:093338.349 End of get_host_permission():rw
  8122:20160506:093338.349 End of get_trigger_permission():rw
  8122:20160506:093338.349 In substitute_simple_macros() data:'{TRIGGER.NAME}'
  8122:20160506:093338.349 In substitute_simple_macros() data:'Some myname more than 500'
  8122:20160506:093338.349 In get_N_functionid() expression:'{13250}>{$FSPFREE}' N_functionid:1
  8122:20160506:093338.349 get_N_functionid() functionid:13250
  8122:20160506:093338.349 End of get_N_functionid():SUCCEED
  8122:20160506:093338.349 In get_N_functionid() expression:'>{$FSPFREE}' N_functionid:1
  8122:20160506:093338.349 End of get_N_functionid():FAIL
  8122:20160506:093338.349 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...

  8122:20160506:093338.349 === Backtrace: ===
  8122:20160506:093338.350 15: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](print_fatal_info+0x9e) [0x47898e]
  8122:20160506:093338.350 14: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x478c32]
  8122:20160506:093338.350 13: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7fe9ac6980e0]
  8122:20160506:093338.350 12: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x45edf4]
  8122:20160506:093338.350 11: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](DCexpression_expand_user_macros+0x21) [0x468bf1]
  8122:20160506:093338.350 10: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](substitute_simple_macros+0xd68) [0x452ae8]
  8122:20160506:093338.350 9: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](substitute_simple_macros+0x3f23) [0x455ca3]
  8122:20160506:093338.350 8: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x43d3f0]
  8122:20160506:093338.350 7: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x43dbdf]
  8122:20160506:093338.350 6: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](escalator_thread+0x13f) [0x43decf]
  8122:20160506:093338.350 5: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](zbx_thread_start+0x45) [0x479515]
  8122:20160506:093338.350 4: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](MAIN_ZABBIX_ENTRY+0x598) [0x41b198]
  8122:20160506:093338.350 3: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](daemon_start+0x1b0) [0x4782d0]
  8122:20160506:093338.350 2: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations](main+0x388) [0x4160d8]
  8122:20160506:093338.350 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fe9ac684b45]
  8122:20160506:093338.350 0: ./zabbix_server3.0: escalator #1 [processed 0 escalations in 0.000768 sec, processing escalations]() [0x416365]

  8094:20160506:093340.357 Zabbix Server stopped. Zabbix 3.0.3rc1 (revision 59873).
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 06 ]

Oh, finally! After so many hours...

Long story short. To expand {TRIGGER.NAME} macro properly we need to take trigger name and resolve all macros there. Trigger name may contain positional macro ($1, $2, ...) and therefore we need to parse trigger expression in search for constants. To do this we must first resolve user macros in trigger expression. If there is an unknown macro dc_expand_user_macros() wants to provide an error message with

*error = zbx_dsprintf(*error, "macro '%.*s' is not found", len, ptr);

But substitute_simple_macros() is not bothered and simply passes NULL as error:

DCexpression_expand_user_macros(event->trigger.expression, NULL)

Caused by ZBXNEXT-3210.

Huge thanks to zalex_ua for so much effort put into investigation!

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 06 ]

Fix for version 3.0 is available in development branch svn://svn.zabbix.com/branches/dev/ZBX-10692 revision 59917. Patch attached for those who need it urgently.

Comment by Oleksii Zagorskyi [ 2016 May 06 ]

More complex scenario how it's possible to trigger such a crash (only once for each trigger prototype) described in ZBX-10759.
Initially current crash was discovered by me with that conditions.

Comment by Alexander Vladishev [ 2016 May 06 ]

Successfully tested! Have a look at my changes in r59921.

glebs.ivanovskis Changes look fine. Accepted!

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 06 ]

Fixed in pre-3.0.3rc1 r59924, pre-3.1.0 (trunk) r59926.

Generated at Fri Mar 29 09:52:01 EET 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.