ZABBIX BUGS AND ISSUES

zabbix daemon processes hang on futex

Details

  • Zabbix ID:
    Reviewed 2.0

Description

sometimes zabbix agent will leave one process hanging upon exit. this has so far always happened after sending "killall -15 zabbix_agentd".

strace shows :

Process 10468 attached - interrupt to quit
futex(0xb783ca50, FUTEX_WAIT_PRIVATE, 2, NULL

what's confusing, this pid does not appear in the agentd log at all.

no idea whether this is in any way related to shared memory/semaphores, but the only thing owned by zabbix at that time is :

------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 3801088 zabbix 600 491612 1 dest

Activity

Hide
richlv added a comment - - edited

additional information : this is trunk from april 29th 10:00 (~ revision 19247)

looks like the source of the problem might be a child process that the agent has :

[sh] <defunct>

of course, that one can not be straced

looks like agent is not successfully killing child processes that time out or become dysfunctional

Show
richlv added a comment - - edited additional information : this is trunk from april 29th 10:00 (~ revision 19247) looks like the source of the problem might be a child process that the agent has : [sh] <defunct> of course, that one can not be straced looks like agent is not successfully killing child processes that time out or become dysfunctional
Hide
dimir added a comment -

Can not reproduce. Maybe it's already fixed by ZBX-3072 .

Show
dimir added a comment - Can not reproduce. Maybe it's already fixed by ZBX-3072 .
Hide
richlv added a comment -

just happened again with a debug build from ZBX-3994

this time, pid (30076) had some entries in the logfile :

10496:20110822:194101.619 End of zbx_popen():6 child:30076
30076:20110822:194101.620 zbx_popen(): executing script, parent:10496 [echo "show global status like 'Com_insert';" | HOME=/home/zabbix mysql -N | awk '{print $2}']

Show
richlv added a comment - just happened again with a debug build from ZBX-3994 this time, pid (30076) had some entries in the logfile : 10496:20110822:194101.619 End of zbx_popen():6 child:30076 30076:20110822:194101.620 zbx_popen(): executing script, parent:10496 [echo "show global status like 'Com_insert';" | HOME=/home/zabbix mysql -N | awk '{print $2}']
Hide
Aleksandrs Saveljevs added a comment -

We have reproduced the problem with the small program in hanging-on-futex.c.

The idea is that some functions are not safe to use in signal handler. In our particular case, the use of function localtime() causes the process to hang on futex() call.

More details are available at http://www.fedoraforum.org/forum/showthread.php?t=187375 . The thread also links to the list of functions that are safe to use in signal handlers at http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html .

Show
Aleksandrs Saveljevs added a comment - We have reproduced the problem with the small program in hanging-on-futex.c. The idea is that some functions are not safe to use in signal handler. In our particular case, the use of function localtime() causes the process to hang on futex() call. More details are available at http://www.fedoraforum.org/forum/showthread.php?t=187375 . The thread also links to the list of functions that are safe to use in signal handlers at http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html .
Hide
Aleksandrs Saveljevs added a comment -

Note that the problem affects not only agent, but server and proxy, too. The problem can also manifest itself while the process is operating normally, not necessarily during shutdown.

Show
Aleksandrs Saveljevs added a comment - Note that the problem affects not only agent, but server and proxy, too. The problem can also manifest itself while the process is operating normally, not necessarily during shutdown.
Hide
dimir added a comment - - edited

Same example using reentrant version of localtime(), no hanging. See attachment no-hanging-on-futex.c .

Show
dimir added a comment - - edited Same example using reentrant version of localtime(), no hanging. See attachment no-hanging-on-futex.c .
Hide
dimir added a comment - - edited

One of the biggest problems seems to me localtime() in our logging function ( __zbx_zabbix_log() ). There is lots of logging in our signal handlers. These should be replaced with reentrant functions for sure.

Show
dimir added a comment - - edited One of the biggest problems seems to me localtime() in our logging function ( __zbx_zabbix_log() ). There is lots of logging in our signal handlers. These should be replaced with reentrant functions for sure.
Hide
dimir added a comment -

Reproduced again with unknown zabbix_server process:

$ ps u -p 3835
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
vl 3835 0.0 0.0 52960 1048 ? SN Jan25 0:10 sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf

$ grep 3835 /tmp/zabbix_server.log
$

$ strace -p 3835
Process 3835 attached - interrupt to quit
futex(0xb72a6a6c, FUTEX_WAIT_PRIVATE, 2, NULL

$ tail /tmp/zabbix_server.log
3826:20120131:152550.161 In zbx_mem_destroy() descr:'configuration cache'
3826:20120131:152550.161 End of zbx_mem_destroy()
3826:20120131:152550.161 In zbx_strpool_destroy()
3826:20120131:152550.161 In zbx_mem_destroy() descr:'string pool'
3826:20120131:152550.161 End of zbx_mem_destroy()
3826:20120131:152550.161 End of zbx_strpool_destroy()
3826:20120131:152550.161 End of free_configuration_cache()
3826:20120131:152550.162 In free_selfmon_collector() collector:0xb4ba1000
3826:20120131:152550.162 End of free_selfmon_collector()
3826:20120131:152550.162 Zabbix Server stopped. Zabbix 1.9.9 (revision {ZABBIX_REVISION}).

Show
dimir added a comment - Reproduced again with unknown zabbix_server process: $ ps u -p 3835 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND vl 3835 0.0 0.0 52960 1048 ? SN Jan25 0:10 sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf $ grep 3835 /tmp/zabbix_server.log $ $ strace -p 3835 Process 3835 attached - interrupt to quit futex(0xb72a6a6c, FUTEX_WAIT_PRIVATE, 2, NULL $ tail /tmp/zabbix_server.log 3826:20120131:152550.161 In zbx_mem_destroy() descr:'configuration cache' 3826:20120131:152550.161 End of zbx_mem_destroy() 3826:20120131:152550.161 In zbx_strpool_destroy() 3826:20120131:152550.161 In zbx_mem_destroy() descr:'string pool' 3826:20120131:152550.161 End of zbx_mem_destroy() 3826:20120131:152550.161 End of zbx_strpool_destroy() 3826:20120131:152550.161 End of free_configuration_cache() 3826:20120131:152550.162 In free_selfmon_collector() collector:0xb4ba1000 3826:20120131:152550.162 End of free_selfmon_collector() 3826:20120131:152550.162 Zabbix Server stopped. Zabbix 1.9.9 (revision {ZABBIX_REVISION}).
Hide
Andris Mednis added a comment -

Problem reproduced and investigated.with zabbix_agentd from Zabbix 1.8.11rc1 on Debian GNU/Linux 6.0 (64-bit). More detailed document with proposed solution is in progress.

Show
Andris Mednis added a comment - Problem reproduced and investigated.with zabbix_agentd from Zabbix 1.8.11rc1 on Debian GNU/Linux 6.0 (64-bit). More detailed document with proposed solution is in progress.
Hide
Andris Mednis added a comment -

Signal handlers changed to reduce probability of process hang on exit in development branch svn://svn.zabbix.com/branches/dev/ZBX-3788

Show
Andris Mednis added a comment - Signal handlers changed to reduce probability of process hang on exit in development branch svn://svn.zabbix.com/branches/dev/ZBX-3788
Hide
Andris Mednis added a comment - - edited

How the bug was investigated and proposed solutions - see attached document ZBX-3788_investigation1.odt

Show
Andris Mednis added a comment - - edited How the bug was investigated and proposed solutions - see attached document ZBX-3788_investigation1.odt
Hide
Alexander Vladishev added a comment -

Great! Successfully tested!

Please review my changes in r25427.

Show
Alexander Vladishev added a comment - Great! Successfully tested! Please review my changes in r25427.
Hide
Andris Mednis added a comment -

Fixed in version pre-1.8.11 (revision 25433) and pre-1.9.10 (revision 25434).

Show
Andris Mednis added a comment - Fixed in version pre-1.8.11 (revision 25433) and pre-1.9.10 (revision 25434).
Hide
richlv added a comment -

i believe correct resolution should be "fixed", not "incomplete"

Show
richlv added a comment - i believe correct resolution should be "fixed", not "incomplete"
Hide
Oleksiy Zagorskyi added a comment -

Andris, great analysis, it was interesting to read.

Show
Oleksiy Zagorskyi added a comment - Andris, great analysis, it was interesting to read.
Hide
Andris Mednis added a comment -

Thanks for kind words! It was great to learn about "ltrace", which reveals much more than "strace". As of "Fixed" vs."Incomplete" - I believe the change, although significantly reduces probability of process hangup (especially for DebugLevel=4)), does not completely eliminate it. We saw how evil localtime() can be in a signal handler, but even printf() is not acceptable there. Properly designed sIgnal handlers do something simple and lightweight, leaving the real work to others.

Show
Andris Mednis added a comment - Thanks for kind words! It was great to learn about "ltrace", which reveals much more than "strace". As of "Fixed" vs."Incomplete" - I believe the change, although significantly reduces probability of process hangup (especially for DebugLevel=4)), does not completely eliminate it. We saw how evil localtime() can be in a signal handler, but even printf() is not acceptable there. Properly designed sIgnal handlers do something simple and lightweight, leaving the real work to others.

People

  • Assignee:
    Unassigned
    Reporter:
    richlv
Vote (0)
Watch (4)

Dates

  • Created:
    Updated:
    Resolved: