Details

      Description

      sometimes zabbix agent will leave one process hanging upon exit. this has so far always happened after sending "killall -15 zabbix_agentd".

      strace shows :

      Process 10468 attached - interrupt to quit
      futex(0xb783ca50, FUTEX_WAIT_PRIVATE, 2, NULL

      what's confusing, this pid does not appear in the agentd log at all.

      no idea whether this is in any way related to shared memory/semaphores, but the only thing owned by zabbix at that time is :

      ------ Shared Memory Segments --------
      key shmid owner perms bytes nattch status
      0x00000000 3801088 zabbix 600 491612 1 dest

      1. hanging-on-futex.c
        0.5 kB
        Aleksandrs Saveljevs
      2. no-hanging-on-futex.c
        0.6 kB
        dimir
      3. ZBX-3788_investigation1.odt
        36 kB
        Andris Mednis

        Activity

        Hide
        richlv added a comment - - edited

        additional information : this is trunk from april 29th 10:00 (~ revision 19247)

        looks like the source of the problem might be a child process that the agent has :

        [sh] <defunct>

        of course, that one can not be straced

        looks like agent is not successfully killing child processes that time out or become dysfunctional

        Show
        richlv added a comment - - edited additional information : this is trunk from april 29th 10:00 (~ revision 19247) looks like the source of the problem might be a child process that the agent has : [sh] <defunct> of course, that one can not be straced looks like agent is not successfully killing child processes that time out or become dysfunctional
        Hide
        dimir added a comment -

        Can not reproduce. Maybe it's already fixed by ZBX-3072 .

        Show
        dimir added a comment - Can not reproduce. Maybe it's already fixed by ZBX-3072 .
        Hide
        richlv added a comment -

        just happened again with a debug build from ZBX-3994

        this time, pid (30076) had some entries in the logfile :

        10496:20110822:194101.619 End of zbx_popen():6 child:30076
        30076:20110822:194101.620 zbx_popen(): executing script, parent:10496 [echo "show global status like 'Com_insert';" | HOME=/home/zabbix mysql -N | awk '

        {print $2}

        ']

        Show
        richlv added a comment - just happened again with a debug build from ZBX-3994 this time, pid (30076) had some entries in the logfile : 10496:20110822:194101.619 End of zbx_popen():6 child:30076 30076:20110822:194101.620 zbx_popen(): executing script, parent:10496 [echo "show global status like 'Com_insert';" | HOME=/home/zabbix mysql -N | awk ' {print $2} ']
        Hide
        Aleksandrs Saveljevs added a comment -

        We have reproduced the problem with the small program in hanging-on-futex.c.

        The idea is that some functions are not safe to use in signal handler. In our particular case, the use of function localtime() causes the process to hang on futex() call.

        More details are available at http://www.fedoraforum.org/forum/showthread.php?t=187375 . The thread also links to the list of functions that are safe to use in signal handlers at http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html .

        Show
        Aleksandrs Saveljevs added a comment - We have reproduced the problem with the small program in hanging-on-futex.c. The idea is that some functions are not safe to use in signal handler. In our particular case, the use of function localtime() causes the process to hang on futex() call. More details are available at http://www.fedoraforum.org/forum/showthread.php?t=187375 . The thread also links to the list of functions that are safe to use in signal handlers at http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html .
        Hide
        Aleksandrs Saveljevs added a comment -

        Note that the problem affects not only agent, but server and proxy, too. The problem can also manifest itself while the process is operating normally, not necessarily during shutdown.

        Show
        Aleksandrs Saveljevs added a comment - Note that the problem affects not only agent, but server and proxy, too. The problem can also manifest itself while the process is operating normally, not necessarily during shutdown.
        Hide
        dimir added a comment - - edited

        Same example using reentrant version of localtime(), no hanging. See attachment no-hanging-on-futex.c .

        Show
        dimir added a comment - - edited Same example using reentrant version of localtime(), no hanging. See attachment no-hanging-on-futex.c .
        Hide
        dimir added a comment - - edited

        One of the biggest problems seems to me localtime() in our logging function ( __zbx_zabbix_log() ). There is lots of logging in our signal handlers. These should be replaced with reentrant functions for sure.

        Show
        dimir added a comment - - edited One of the biggest problems seems to me localtime() in our logging function ( __zbx_zabbix_log() ). There is lots of logging in our signal handlers. These should be replaced with reentrant functions for sure.
        Hide
        dimir added a comment -

        Reproduced again with unknown zabbix_server process:

        $ ps u -p 3835
        USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        vl 3835 0.0 0.0 52960 1048 ? SN Jan25 0:10 sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf

        $ grep 3835 /tmp/zabbix_server.log
        $

        $ strace -p 3835
        Process 3835 attached - interrupt to quit
        futex(0xb72a6a6c, FUTEX_WAIT_PRIVATE, 2, NULL

        $ tail /tmp/zabbix_server.log
        3826:20120131:152550.161 In zbx_mem_destroy() descr:'configuration cache'
        3826:20120131:152550.161 End of zbx_mem_destroy()
        3826:20120131:152550.161 In zbx_strpool_destroy()
        3826:20120131:152550.161 In zbx_mem_destroy() descr:'string pool'
        3826:20120131:152550.161 End of zbx_mem_destroy()
        3826:20120131:152550.161 End of zbx_strpool_destroy()
        3826:20120131:152550.161 End of free_configuration_cache()
        3826:20120131:152550.162 In free_selfmon_collector() collector:0xb4ba1000
        3826:20120131:152550.162 End of free_selfmon_collector()
        3826:20120131:152550.162 Zabbix Server stopped. Zabbix 1.9.9 (revision

        {ZABBIX_REVISION}

        ).

        Show
        dimir added a comment - Reproduced again with unknown zabbix_server process: $ ps u -p 3835 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND vl 3835 0.0 0.0 52960 1048 ? SN Jan25 0:10 sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf $ grep 3835 /tmp/zabbix_server.log $ $ strace -p 3835 Process 3835 attached - interrupt to quit futex(0xb72a6a6c, FUTEX_WAIT_PRIVATE, 2, NULL $ tail /tmp/zabbix_server.log 3826:20120131:152550.161 In zbx_mem_destroy() descr:'configuration cache' 3826:20120131:152550.161 End of zbx_mem_destroy() 3826:20120131:152550.161 In zbx_strpool_destroy() 3826:20120131:152550.161 In zbx_mem_destroy() descr:'string pool' 3826:20120131:152550.161 End of zbx_mem_destroy() 3826:20120131:152550.161 End of zbx_strpool_destroy() 3826:20120131:152550.161 End of free_configuration_cache() 3826:20120131:152550.162 In free_selfmon_collector() collector:0xb4ba1000 3826:20120131:152550.162 End of free_selfmon_collector() 3826:20120131:152550.162 Zabbix Server stopped. Zabbix 1.9.9 (revision {ZABBIX_REVISION} ).
        Hide
        Andris Mednis added a comment -

        Problem reproduced and investigated.with zabbix_agentd from Zabbix 1.8.11rc1 on Debian GNU/Linux 6.0 (64-bit). More detailed document with proposed solution is in progress.

        Show
        Andris Mednis added a comment - Problem reproduced and investigated.with zabbix_agentd from Zabbix 1.8.11rc1 on Debian GNU/Linux 6.0 (64-bit). More detailed document with proposed solution is in progress.
        Hide
        Andris Mednis added a comment -

        Signal handlers changed to reduce probability of process hang on exit in development branch svn://svn.zabbix.com/branches/dev/ZBX-3788

        Show
        Andris Mednis added a comment - Signal handlers changed to reduce probability of process hang on exit in development branch svn://svn.zabbix.com/branches/dev/ZBX-3788
        Hide
        Andris Mednis added a comment - - edited

        How the bug was investigated and proposed solutions - see attached document ZBX-3788_investigation1.odt

        Show
        Andris Mednis added a comment - - edited How the bug was investigated and proposed solutions - see attached document ZBX-3788 _investigation1.odt
        Hide
        Alexander Vladishev added a comment -

        Great! Successfully tested!

        Please review my changes in r25427.

        Show
        Alexander Vladishev added a comment - Great! Successfully tested! Please review my changes in r25427.
        Hide
        Andris Mednis added a comment -

        Fixed in version pre-1.8.11 (revision 25433) and pre-1.9.10 (revision 25434).

        Show
        Andris Mednis added a comment - Fixed in version pre-1.8.11 (revision 25433) and pre-1.9.10 (revision 25434).
        Hide
        richlv added a comment -

        i believe correct resolution should be "fixed", not "incomplete"

        Show
        richlv added a comment - i believe correct resolution should be "fixed", not "incomplete"
        Hide
        Oleksiy Zagorskyi added a comment -

        Andris, great analysis, it was interesting to read.

        Show
        Oleksiy Zagorskyi added a comment - Andris, great analysis, it was interesting to read.
        Hide
        Andris Mednis added a comment -

        Thanks for kind words! It was great to learn about "ltrace", which reveals much more than "strace". As of "Fixed" vs."Incomplete" - I believe the change, although significantly reduces probability of process hangup (especially for DebugLevel=4)), does not completely eliminate it. We saw how evil localtime() can be in a signal handler, but even printf() is not acceptable there. Properly designed sIgnal handlers do something simple and lightweight, leaving the real work to others.

        Show
        Andris Mednis added a comment - Thanks for kind words! It was great to learn about "ltrace", which reveals much more than "strace". As of "Fixed" vs."Incomplete" - I believe the change, although significantly reduces probability of process hangup (especially for DebugLevel=4)), does not completely eliminate it. We saw how evil localtime() can be in a signal handler, but even printf() is not acceptable there. Properly designed sIgnal handlers do something simple and lightweight, leaving the real work to others.

          People

          • Assignee:
            Unassigned
            Reporter:
            richlv
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: