This case was observed because of Orabbix.
The Orabbix keys "audit" (I'm 100% sure) and "locks" (almost 100% sure) are potential killers.
As we know, Orabbix sends data using the old ZBX protocol, which uses an XML format with all values (hostname, key, value) encoded in base64.
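To make the wire format concrete, here is a minimal sketch of how such a request could be built and decoded. The element names (`req`, `host`, `key`, `data`) are an assumption based on the legacy XML sender format and are not taken from the attached traces; the host name `oratest` is only an example value.

```python
import base64
import re


def b64(text: str) -> str:
    """Base64-encode a clear-text string, as the old protocol does for every field."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_request(host: str, key: str, value: str) -> str:
    # Assumed layout: one <req> element with three base64-encoded children.
    return (
        f"<req><host>{b64(host)}</host>"
        f"<key>{b64(key)}</key>"
        f"<data>{b64(value)}</data></req>"
    )


def parse_request(xml: str) -> dict:
    """Extract and decode the three fields, roughly what a trapper has to do."""
    fields = {}
    for tag in ("host", "key", "data"):
        match = re.search(f"<{tag}>(.*?)</{tag}>", xml, re.S)
        fields[tag] = (
            base64.b64decode(match.group(1)).decode("utf-8") if match else None
        )
    return fields


request = build_request("oratest", "DBforBIX.Version", "DBforBIX Version 0.6")
print(request)
print(parse_request(request))
```

Each request carries its own host, key and data, so a correct decoder should never return a value that belongs to a previous request.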
When a value is very big (sizes here always refer to the clear-text value, i.e. before base64 encoding), two different behaviors are observed, and both are critical.
1) When trappers decode big base64-encoded values, the decoded value can be wrong. In my test lab I can reproduce this when a real DBforBix (Orabbix) sends two small values and one "24K" value in a row.
You can see it in the attached file "185_ltrace_1901_24K_wrong_decoded_value.out". The value "DBforBIX Version 0.6" for the key "DBforBIX.MySQL.oratest" comes from the previously decoded key "DBforBIX.Version".
At the same time, a bigger "48K" value is decoded and returned correctly; see the other attached file "185_ltrace_1901_48K_sucessfull+correct_value.out".
A very short excerpt of this comparison is in the attached "incorrect_encoded_value.txt".
In another, production environment (with a constant flow of different values from a real Orabbix instance) I was able to reproduce this case with even smaller values, as I recall 12-30K.
2) The trapper hangs (on x86_64) or dies (on i686) in the function "str_base64_decode".
For i686, see the attached "zabbix_server_trappers_crash_centos_i686.log".
On this 32-bit CentOS 6.2, zabbix_server dies with 48K and works OK with 36K.
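Since all sizes in this report are clear-text sizes, it may help to see what they become on the wire. The sketch below computes the base64-encoded length for the sizes mentioned in this report, assuming "K" means 1024 bytes.

```python
import base64


def encoded_len(clear_len: int) -> int:
    """Length of the padded base64 encoding of clear_len bytes."""
    return 4 * ((clear_len + 2) // 3)


for kib in (24, 36, 48, 52):
    clear_len = kib * 1024  # assumption: "K" in this report means 1024 bytes
    # Cross-check the formula against the real encoder.
    assert encoded_len(clear_len) == len(base64.b64encode(b"x" * clear_len))
    print(f"{kib}K clear text -> {encoded_len(clear_len)} base64 bytes")

# 24K clear text -> 32768 base64 bytes
# 36K clear text -> 49152 base64 bytes
# 48K clear text -> 65536 base64 bytes
# 52K clear text -> 71000 base64 bytes
```

Under that assumption, a 48K value encodes to exactly 65536 bytes (64 KiB). Whether this boundary is related to the failures is not established by these tests.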
To help find similar cases in Jira, here is a small excerpt of the log:
*** stack smashing detected ***: /usr/sbin/zabbix_server terminated
======= Backtrace: =========
/lib/libc.so.6(__fortify_fail+0x4d)[0xb8e59d]
/lib/libc.so.6(+0xf754a)[0xb8e54a]
/usr/sbin/zabbix_server[0x80dd0f4]
/usr/sbin/zabbix_server(str_base64_decode+0x51a)[0x80acdba]
/usr/sbin/zabbix_server(comms_parse_response+0x25b)[0x80a6b5b]
/usr/sbin/zabbix_server[0x806e69e]
/usr/sbin/zabbix_server(main_trapper_loop+0x12f)[0x806f41f]
/usr/sbin/zabbix_server(MAIN_ZABBIX_ENTRY+0x816)[0x8059416]
/usr/sbin/zabbix_server(daemon_start+0x2af)[0x809dcbf]
/usr/sbin/zabbix_server(main+0x2d0)[0x8059c40]
/lib/libc.so.6(__libc_start_main+0xe6)[0xaadce6]
/usr/sbin/zabbix_server[0x8053de1]
...
14314:20120503:115247.714 Zabbix Server stopped. Zabbix 2.0.0rc4 (revision 27142).
*** stack smashing detected ***: /usr/sbin/zabbix_server terminated
The "x86_64" platform was investigated in more detail on Debian 6.0.4 with zabbix_server 2.0.0rc3 (latest revisions).
If the sent value is 52K or more, the trapper gets stuck on a "futex"; with a sent value of 48K the server was able to decode it.
In the production environment I saw sent values of 90-120K.
See the attached "200rc3_ltrace_681_futex.out" and, just in case, a small strace "200rc3_strace_342_futex.out".
DBforBix (Orabbix) sends two small values and one "52K" value in a row.
Nothing can be seen in the Zabbix log because the trapper hangs.
The same will happen to all available trappers over some period of time. The frontend will then show that the Zabbix server is not running.
In the attached archive "script_to_reproduce" you can find a ready script and dummy files of different sizes to easily reproduce this case; you don't even need to create any hosts/items to reproduce the crash/futex.
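For readers without access to the archive, the reproduction could look roughly like the sketch below. This is not the attached script. The XML element names, the host and key names, sending the XML without any protocol header, and closing the write side of the socket are all assumptions; 10051 is the default trapper port.

```python
import base64
import socket


def build_request(host: str, key: str, value: bytes) -> bytes:
    """Build one old-protocol XML request (element names are an assumption)."""
    def enc(raw: bytes) -> bytes:
        return base64.b64encode(raw)

    return (
        b"<req><host>" + enc(host.encode()) + b"</host>"
        b"<key>" + enc(key.encode()) + b"</key>"
        b"<data>" + enc(value) + b"</data></req>"
    )


def dummy_value(kib: int) -> bytes:
    """Clear-text dummy value of kib * 1024 bytes."""
    return b"A" * (kib * 1024)


def build_sequence(big_kib: int) -> list:
    """Two small values followed by one big value, as in the tests above."""
    host = "dummy-host"  # hypothetical; the host does not need to exist
    return [
        build_request(host, "dummy.small.1", b"1"),
        build_request(host, "dummy.small.2", b"2"),
        build_request(host, "dummy.big", dummy_value(big_kib)),
    ]


def send_request(payload: bytes, server: str = "127.0.0.1",
                 port: int = 10051, timeout: float = 10.0) -> bytes:
    """Send one request on its own connection and return the raw reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(payload)
        sock.shutdown(socket.SHUT_WR)  # assumption: signal end of request
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)


# Example use against a TEST server only (a hang shows up as a timeout):
#   for payload in build_sequence(52):
#       print(send_request(payload))
```

Run this only against a test installation, since a hung or crashed trapper affects the whole server.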