-
Incident report
-
Resolution: Fixed
-
Blocker
-
None
-
1.8.11
-
Cassandra DB backend
I recently started having the zabbix server crash upon certain reads from Cassandra. I've attached my zabbix_server binary file. It happens to several of our installations, but not all, and some more frequently than others, which leads me to believe it's likely a trigger somewhere that's causing it.
In a DebugLevel=3 I see many crashes with identical backtrace.
In most cases it happens very soon (just seconds) after server start, but once it was after 39 minutes uptime.
Here is filtered lines from the log:
29518:20120629:165702.530 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision
{ZABBIX_REVISION}).30115:20120629:165750.440 server #134 started history syncer #10
30115:20120629:165835.919 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
29518:20120629:165905.980 Zabbix Server stopped. Zabbix 1.8.11rc1 (revision {ZABBIX_REVISION}
).
uptime - 45 s
30916:20120629:170228.129 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision
{ZABBIX_REVISION}).31588:20120629:170315.535 server #138 started history syncer #14
31588:20120629:174254.942 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
30916:20120629:174303.972 Zabbix Server stopped. Zabbix 1.8.11rc1 (revision {ZABBIX_REVISION}
).
uptime - 39 minutes
7066:20120629:174306.371 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision
{ZABBIX_REVISION}).7421:20120629:174327.726 server #135 started history syncer #11
7421:20120629:174353.292 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
7066:20120629:174357.001 Zabbix Server stopped. Zabbix 1.8.11rc1 (revision {ZABBIX_REVISION}
).
uptime - 26 s
7577:20120629:174359.157 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision
{ZABBIX_REVISION}).7907:20120629:174504.105 server #132 started history syncer #8
7907:20120629:174510.752 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
7577:20120629:174523.594 Zabbix Server stopped. Zabbix 1.8.11rc1 (revision {ZABBIX_REVISION}
).
uptime - 6 s
8132:20120629:174526.033 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision
{ZABBIX_REVISION}).8481:20120629:174559.426 server #140 started history syncer #16
8481:20120629:174751.111 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
8132:20120629:174825.357 Zabbix Server stopped. Zabbix 1.8.11rc1 (revision {ZABBIX_REVISION}
).
uptime - 111 s
8976:20120629:174828.279 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision
{ZABBIX_REVISION}).9448:20120629:174931.466 server #141 started history syncer #17
8976:20120629:174931.472 server #0 started [main process]
9448:20120629:174950.833 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
8976:20120629:175028.396 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
Note in last example a crash ALSO in a [main process] during caches flush on stop:
8976:20120629:174950.841 One child process died (PID:9448,exitcode/signal:255). Exiting ...
8976:20120629:174952.885 syncing history data...
8976:20120629:175008.969 syncing history data... 6.710058%
8976:20120629:175020.751 syncing history data... 49.204527%
8976:20120629:175028.396 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
with the same backtrace.
uptime - 19 s
9588:20120629:175031.218 Starting Zabbix Server. Zabbix 1.8.11rc1 (revision {ZABBIX_REVISION}
).
9931:20120629:175110.377 server #140 started history syncer #16
9931:20120629:175131.008 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
9588:20120629:175208.038 Zabbix Server stopped. Zabbix 1.8.11rc1 (revision
).
uptime - 19 s
In a DebugLevel=4 I saw that history syncer has crashed when being getting history to calculating
a 'dd785c11n1:system.cpu.load[,avg1].avg(3600)' trigger.
...
24947:20120629:201005.765 query [txnlev:1]
...
update items set lastclock=1341014976,prevvalue=lastvalue,lastvalue='6.760000' where itemid=8759204879414;
...
24947:20120629:201010.167 In evaluate_function() function:'dd785c11n1:system.cpu.load[,avg1].avg(3600)'
24947:20120629:201010.167 In evaluate_AVG()
24947:20120629:201010.167 In get_function_parameter_uint() parameters:'3600' Nparam:1
24947:20120629:201010.167 In substitute_simple_macros() data:'3600'
24947:20120629:201010.167 get_function_parameter_uint() flag:0 value:3600
24947:20120629:201010.167 End of get_function_parameter_uint():SUCCEED
24947:20120629:201010.167 In DBget_history()
24947:20120629:201010.190 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xfffffff8]. Crashing ...
24947:20120629:201010.190 ====== Fatal information: ======
24947:20120629:201010.190 Program counter: 0x2ac8b83e373e
24947:20120629:201010.190 === Registers: ===
24947:20120629:201010.190 r8 = 4 = 4 = 4
24947:20120629:201010.190 r9 = 0 = 0 = 0
24947:20120629:201010.190 r10 = 1 = 1 = 1
24947:20120629:201010.190 r11 = 2ac8b83e3720 = 47041572910880 = 47041572910880
24947:20120629:201010.190 r12 = 0 = 0 = 0
24947:20120629:201010.190 r13 = 7fff05fe2260 = 140733293929056 = 140733293929056
24947:20120629:201010.190 r14 = 0 = 0 = 0
24947:20120629:201010.190 r15 = 0 = 0 = 0
24947:20120629:201010.190 rdi = 100000000 = 4294967296 = 4294967296
24947:20120629:201010.190 rsi = 5fcf0dc00000000 = 431484492004196352 = 431484492004196352
24947:20120629:201010.190 rbp = 7fff05fcf470 = 140733293851760 = 140733293851760
24947:20120629:201010.191 rbx = 19ee1100 = 435032320 = 435032320
24947:20120629:201010.191 rdx = 4f = 79 = 79
24947:20120629:201010.191 rax = 0 = 0 = 0
24947:20120629:201010.191 rcx = 4 = 4 = 4
24947:20120629:201010.191 rsp = 7fff05fcf360 = 140733293851488 = 140733293851488
24947:20120629:201010.191 rip = 2ac8b83e373e = 47041572910910 = 47041572910910
24947:20120629:201010.191 efl = 10206 = 66054 = 66054
24947:20120629:201010.191 csgsfs = 33 = 51 = 51
24947:20120629:201010.191 err = 4 = 4 = 4
24947:20120629:201010.191 trapno = e = 14 = 14
24947:20120629:201010.191 oldmask = 0 = 0 = 0
24947:20120629:201010.191 cr2 = fffffff8 = 4294967288 = 4294967288
24947:20120629:201010.191 === Backtrace: ===
24947:20120629:201010.192 21: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(print_fatal_info+0x100) [0x450767]
24947:20120629:201010.192 20: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x44f67a]
24947:20120629:201010.192 19: /lib64/libc.so.6 [0x2ac8b83a12d0]
24947:20120629:201010.192 18: /lib64/libc.so.6(cfree+0x1e) [0x2ac8b83e373e]
24947:20120629:201010.192 17: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(cassandra_client_recv_get_slice+0x178) [0x4763b8]
24947:20120629:201010.192 16: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(cassandra_client_get_slice+0x7b) [0x476b98]
24947:20120629:201010.192 15: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x47253e]
24947:20120629:201010.192 14: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(zbx_cassandra_fetch_history_values+0x4ac) [0x472b93]
24947:20120629:201010.192 13: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(DBget_history+0x11f) [0x49fb6a]
24947:20120629:201010.192 12: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x465091]
24947:20120629:201010.192 11: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(evaluate_function+0x14a) [0x4674b1]
24947:20120629:201010.192 10: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x46feea]
24947:20120629:201010.192 9: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x47068a]
24947:20120629:201010.192 8: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(evaluate_expressions+0x1d7) [0x4708a6]
24947:20120629:201010.192 7: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x43fc9b]
24947:20120629:201010.192 6: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(DCsync_history+0x8a2) [0x442df1]
24947:20120629:201010.192 5: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(main_dbsyncer_loop+0xcf) [0x41a593]
24947:20120629:201010.192 4: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(MAIN_ZABBIX_ENTRY+0xa8c) [0x419a1f]
24947:20120629:201010.192 3: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(daemon_start+0x3bc) [0x44feb7]
24947:20120629:201010.192 2: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server(main+0x1f1) [0x418f91]
24947:20120629:201010.192 1: /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ac8b838e994]
24947:20120629:201010.192 0: /ms/dist/ipm/PROJ/zabbix/prod/sbin/zabbix_server [0x414c19]
24947:20120629:201010.192 === Memory map: ===
...
In the same time I have two success calculations for identical trigger function, but for *other* hosts:
one example:
24940:20120629:200902.778 In evaluate_function() function:'dd778c9n1:system.cpu.load[,avg1].avg(3600)'
24940:20120629:200902.778 In evaluate_AVG()
24940:20120629:200902.778 In get_function_parameter_uint() parameters:'3600' Nparam:1
24940:20120629:200902.778 In substitute_simple_macros() data:'3600'
24940:20120629:200902.779 get_function_parameter_uint() flag:0 value:3600
24940:20120629:200902.779 End of get_function_parameter_uint():SUCCEED
24940:20120629:200902.779 In DBget_history()
24940:20120629:200902.784 End of DBget_history()
24940:20120629:200902.784 In DBfree_history()
24940:20120629:200902.784 End of DBfree_history()
24940:20120629:200902.784 End of evaluate_AVG():SUCCEED
24940:20120629:200902.784 End of evaluate_function():SUCCEED value:'21.98'
...
24940:20120629:200902.867 zbx_substitute_functions_results() expression[278]:'
>15' => '21.98>15'
...
24940:20120629:200902.913 In evaluate() expression:'21.98>15'
24940:20120629:200902.913 End of evaluate() value:1.000000
later:
The crashes started occurring at about the same time this user entered this trigger. I have modified the trigger and hopefully this stop the crashes
....
later:
Disabling this trigger does indeed stop the processes from crashing