[ZBX-3788] zabbix daemon processes hang on futex Created: 2011 May 06  Updated: 2017 May 30  Resolved: 2012 Feb 17

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G), Proxy (P), Server (S)
Affects Version/s: None
Fix Version/s: 1.8.11, 2.0.0rc1

Type: Incident report Priority: Major
Reporter: richlv Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File ZBX-3788_investigation1.odt     File hanging-on-futex.c     File no-hanging-on-futex.c    

 Description   

sometimes zabbix agent will leave one process hanging upon exit. this has so far always happened after sending "killall -15 zabbix_agentd".

strace shows :

Process 10468 attached - interrupt to quit
futex(0xb783ca50, FUTEX_WAIT_PRIVATE, 2, NULL

what's confusing, this pid does not appear in the agentd log at all.

no idea whether this is in any way related to shared memory/semaphores, but the only thing owned by zabbix at that time is :

------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 3801088 zabbix 600 491612 1 dest



 Comments   
Comment by richlv [ 2011 May 06 ]

additional information : this is trunk from april 29th 10:00 (~ revision 19247)

looks like the source of the problem might be a child process that the agent has :

[sh] <defunct>

of course, that one can not be straced

looks like agent is not successfully killing child processes that time out or become dysfunctional

Comment by dimir [ 2011 Jun 03 ]

Can not reproduce. Maybe it's already fixed by ZBX-3072 .

Comment by richlv [ 2011 Aug 22 ]

just happened again with a debug build from ZBX-3994

this time, pid (30076) had some entries in the logfile :

10496:20110822:194101.619 End of zbx_popen():6 child:30076
30076:20110822:194101.620 zbx_popen(): executing script, parent:10496 [echo "show global status like 'Com_insert';" | HOME=/home/zabbix mysql -N | awk '

{print $2}

']

Comment by Aleksandrs Saveljevs [ 2011 Nov 17 ]

We have reproduced the problem with the small program in hanging-on-futex.c.

The idea is that some functions are not safe to use in signal handler. In our particular case, the use of function localtime() causes the process to hang on futex() call.

More details are available at http://www.fedoraforum.org/forum/showthread.php?t=187375 . The thread also links to the list of functions that are safe to use in signal handlers at http://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html .

Comment by Aleksandrs Saveljevs [ 2011 Nov 17 ]

Note that the problem affects not only agent, but server and proxy, too. The problem can also manifest itself while the process is operating normally, not necessarily during shutdown.

Comment by dimir [ 2011 Nov 23 ]

Same example using reentrant version of localtime(), no hanging. See attachment no-hanging-on-futex.c .

Comment by dimir [ 2011 Nov 23 ]

One of the biggest problems seems to me localtime() in our logging function ( __zbx_zabbix_log() ). There is lots of logging in our signal handlers. These should be replaced with reentrant functions for sure.

Comment by dimir [ 2012 Jan 31 ]

Reproduced again with unknown zabbix_server process:

$ ps u -p 3835
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
vl 3835 0.0 0.0 52960 1048 ? SN Jan25 0:10 sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf

$ grep 3835 /tmp/zabbix_server.log
$

$ strace -p 3835
Process 3835 attached - interrupt to quit
futex(0xb72a6a6c, FUTEX_WAIT_PRIVATE, 2, NULL

$ tail /tmp/zabbix_server.log
3826:20120131:152550.161 In zbx_mem_destroy() descr:'configuration cache'
3826:20120131:152550.161 End of zbx_mem_destroy()
3826:20120131:152550.161 In zbx_strpool_destroy()
3826:20120131:152550.161 In zbx_mem_destroy() descr:'string pool'
3826:20120131:152550.161 End of zbx_mem_destroy()
3826:20120131:152550.161 End of zbx_strpool_destroy()
3826:20120131:152550.161 End of free_configuration_cache()
3826:20120131:152550.162 In free_selfmon_collector() collector:0xb4ba1000
3826:20120131:152550.162 End of free_selfmon_collector()
3826:20120131:152550.162 Zabbix Server stopped. Zabbix 1.9.9 (revision

{ZABBIX_REVISION}

).

Comment by Andris Mednis [ 2012 Feb 10 ]

Problem reproduced and investigated.with zabbix_agentd from Zabbix 1.8.11rc1 on Debian GNU/Linux 6.0 (64-bit). More detailed document with proposed solution is in progress.

Comment by Andris Mednis [ 2012 Feb 16 ]

Signal handlers changed to reduce probability of process hang on exit in development branch svn://svn.zabbix.com/branches/dev/ZBX-3788

Comment by Andris Mednis [ 2012 Feb 16 ]

How the bug was investigated and proposed solutions - see attached document ZBX-3788_investigation1.odt

Comment by Alexander Vladishev [ 2012 Feb 16 ]

Great! Successfully tested!

Please review my changes in r25427.

Comment by Andris Mednis [ 2012 Feb 17 ]

Fixed in version pre-1.8.11 (revision 25433) and pre-1.9.10 (revision 25434).

Comment by richlv [ 2012 Feb 17 ]

i believe correct resolution should be "fixed", not "incomplete"

Comment by Oleksii Zagorskyi [ 2012 Feb 18 ]

Andris, great analysis, it was interesting to read.

Comment by Andris Mednis [ 2012 Feb 20 ]

Thanks for kind words! It was great to learn about "ltrace", which reveals much more than "strace". As of "Fixed" vs."Incomplete" - I believe the change, although significantly reduces probability of process hangup (especially for DebugLevel=4)), does not completely eliminate it. We saw how evil localtime() can be in a signal handler, but even printf() is not acceptable there. Properly designed sIgnal handlers do something simple and lightweight, leaving the real work to others.

Generated at Sat Apr 20 16:00:11 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.