[ZBX-3788] zabbix daemon processes hang on futex Created: 2011 May 06 Updated: 2017 May 30 Resolved: 2012 Feb 17 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G), Proxy (P), Server (S) |
Affects Version/s: | None |
Fix Version/s: | 1.8.11, 2.0.0rc1 |
Type: | Incident report | Priority: | Major |
Reporter: | richlv | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | crash | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: | ZBX-3788_investigation1.odt hanging-on-futex.c no-hanging-on-futex.c |
Description |
Sometimes the Zabbix agent leaves one process hanging on exit. So far this has always happened after sending "killall -15 zabbix_agentd". strace shows: "Process 10468 attached - interrupt to quit". What's confusing is that this PID does not appear in the agentd log at all. No idea whether this is in any way related to shared memory/semaphores, but the only thing owned by zabbix at that time is: ------ Shared Memory Segments -------- |
Comments |
Comment by richlv [ 2011 May 06 ] |
Additional information: this is trunk from April 29th 10:00 (~ revision 19247). It looks like the source of the problem might be a child process that the agent has: [sh] <defunct>. Of course, that one cannot be straced. It looks like the agent is not successfully killing child processes that time out or become dysfunctional. |
Comment by dimir [ 2011 Jun 03 ] |
Can not reproduce. Maybe it's already fixed by |
Comment by richlv [ 2011 Aug 22 ] |
just happened again with a debug build from this time, pid (30076) had some entries in the logfile : 10496:20110822:194101.619 End of zbx_popen():6 child:30076 '] |
Comment by Aleksandrs Saveljevs [ 2011 Nov 17 ] |
We have reproduced the problem with the small program in hanging-on-futex.c. The idea is that some functions are not safe to use in a signal handler. In our particular case, calling localtime() in a signal handler causes the process to hang on a futex() call. More details are available at http://www.fedoraforum.org/forum/showthread.php?t=187375 . The thread also links to the list of functions that are safe to use in signal handlers at http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html . |
Comment by Aleksandrs Saveljevs [ 2011 Nov 17 ] |
Note that the problem affects not only agent, but server and proxy, too. The problem can also manifest itself while the process is operating normally, not necessarily during shutdown. |
Comment by dimir [ 2011 Nov 23 ] |
Same example using reentrant version of localtime(), no hanging. See attachment no-hanging-on-futex.c . |
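For readers without the attachment, here is a minimal sketch of what calling the reentrant variant looks like. It is not the content of no-hanging-on-futex.c; the helper name format_timestamp() and the timestamp layout are assumptions made for illustration. Note that localtime_r() is not on the POSIX list of async-signal-safe functions either, so this shows the call shape only and should not be read as a guarantee that the function is safe inside a handler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose localtime_r() under strict -std modes */
#endif
#include <stddef.h>
#include <time.h>

/* Hypothetical helper, not taken from the attachment.
 * Formats `t` as YYYYMMDD:HHMMSS into `buf` using localtime_r(), which
 * writes into caller-supplied storage instead of the static buffer that
 * localtime() returns a pointer to.
 * Returns the number of characters written, or 0 on failure. */
size_t format_timestamp(time_t t, char *buf, size_t buflen)
{
    struct tm tm_buf;

    if (NULL == localtime_r(&t, &tm_buf))
        return 0;

    /* strftime() returns 0 when the result does not fit into buf */
    return strftime(buf, buflen, "%Y%m%d:%H%M%S", &tm_buf);
}
```

A caller would pass a buffer of at least 16 bytes, e.g. `char ts[32]; format_timestamp(time(NULL), ts, sizeof(ts));`.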
Comment by dimir [ 2011 Nov 23 ] |
One of the biggest problems, it seems to me, is localtime() in our logging function ( __zbx_zabbix_log() ). There is a lot of logging in our signal handlers. These calls should definitely be replaced with reentrant functions. |
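One possible direction for logging from a handler, sketched here under assumptions: emit a fixed message with write(), which POSIX lists as async-signal-safe, and skip timestamps and formatting entirely. The function name signal_safe_log() is hypothetical and is not part of Zabbix.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper, not Zabbix code.
 * Writes a NUL-terminated message to `fd` using only write(): no stdio
 * buffers, no localtime(), no formatting. A handler that calls this should
 * save errno on entry and restore it before returning.
 * Returns 0 on success, -1 on error. */
int signal_safe_log(int fd, const char *msg)
{
    size_t len = 0;

    /* compute the length by hand to stay clear of library calls */
    while ('\0' != msg[len])
        len++;

    while (0 < len)
    {
        ssize_t n = write(fd, msg, len);

        if (0 > n)
        {
            if (EINTR == errno)
                continue;   /* interrupted before writing anything: retry */
            return -1;
        }

        msg += n;           /* handle short writes */
        len -= (size_t)n;
    }

    return 0;
}
```

A handler could then call `signal_safe_log(STDERR_FILENO, "Got signal, exiting\n");` and leave the detailed, timestamped log entry to code that runs outside the handler.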
Comment by dimir [ 2012 Jan 31 ] |
Reproduced again with unknown zabbix_server process: $ ps u -p 3835 $ grep 3835 /tmp/zabbix_server.log $ strace -p 3835 $ tail /tmp/zabbix_server.log ). |
Comment by Andris Mednis [ 2012 Feb 10 ] |
Problem reproduced and investigated with zabbix_agentd from Zabbix 1.8.11rc1 on Debian GNU/Linux 6.0 (64-bit). A more detailed document with a proposed solution is in progress. |
Comment by Andris Mednis [ 2012 Feb 16 ] |
Signal handlers changed to reduce probability of process hang on exit in development branch svn://svn.zabbix.com/branches/dev/ZBX-3788 |
Comment by Andris Mednis [ 2012 Feb 16 ] |
How the bug was investigated and proposed solutions - see attached document |
Comment by Alexander Vladishev [ 2012 Feb 16 ] |
Great! Successfully tested! Please review my changes in r25427. |
Comment by Andris Mednis [ 2012 Feb 17 ] |
Fixed in version pre-1.8.11 (revision 25433) and pre-1.9.10 (revision 25434). |
Comment by richlv [ 2012 Feb 17 ] |
i believe correct resolution should be "fixed", not "incomplete" |
Comment by Oleksii Zagorskyi [ 2012 Feb 18 ] |
Andris, great analysis, it was interesting to read. |
Comment by Andris Mednis [ 2012 Feb 20 ] |
Thanks for the kind words! It was great to learn about "ltrace", which reveals much more than "strace". As for "Fixed" vs. "Incomplete": I believe the change, although it significantly reduces the probability of a process hang (especially for DebugLevel=4), does not completely eliminate it. We saw how evil localtime() can be in a signal handler, but even printf() is not acceptable there. Properly designed signal handlers do something simple and lightweight, leaving the real work to others. |
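The "simple and lightweight" handler described above is commonly written as a handler that only sets a flag, with the main loop doing the logging and cleanup. The following is a minimal sketch of that pattern, not the change committed for this issue; all names in it (stop_requested, on_terminate, install_terminate_handler, should_stop) are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() under strict -std modes */
#endif
#include <signal.h>
#include <string.h>

/* Hypothetical names throughout; this is not the Zabbix implementation. */

/* The only state the handler touches. volatile sig_atomic_t is the type
 * the C standard allows a signal handler to write to. */
static volatile sig_atomic_t stop_requested = 0;

/* The handler records that the signal arrived and does nothing else:
 * no logging, no localtime(), no printf(). */
static void on_terminate(int sig)
{
    (void)sig;
    stop_requested = 1;
}

/* Installs the handler for SIGTERM. Returns 0 on success, -1 on error. */
int install_terminate_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGTERM, &sa, NULL);
}

/* Polled from normal (non-handler) context, e.g. once per main loop pass. */
int should_stop(void)
{
    return 0 != stop_requested;
}
```

The main loop would check should_stop() on each iteration and, once it returns 1, write the log entry and release resources from ordinary code, where localtime() and printf() are fine to call.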