[ZBX-5289] 2.0.1 agent on Solaris 10 throws "Got signal [signal:10(SIGBUS),reason:1,refaddr:fec0e4e4]. Crashing ..." Created: 2012 Jul 08  Updated: 2020 Mar 21  Resolved: 2012 Nov 06

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.0.1
Fix Version/s: 2.0.4rc1, 2.1.0

Type: Incident report Priority: Major
Reporter: Bruce Misc Assignee: Unassigned
Resolution: Fixed Votes: 5
Labels: crash, solaris
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SunOS nodename 5.10 Generic_127111-06 sun4v sparc SUNW,Sun-Fire-T1000


Attachments: File zabbix-2.0.x-solaris10-SIGBUS-crash-ZBX-5289-structforcealign1.patch     File zabbix-2.0.x-solaris10-SIGBUS-crash-ZBX-5289-structforcealign2.patch     File zabbix-2.0.x-solaris10-SIGBUS-crash-ZBX-5289-structpad.patch     Text File zabbix_agentd_truss-f.log     Text File zabbix_agentd_truss_output.txt    
Issue Links:
Duplicate
is duplicated by ZBX-5382 Zabbix agent 2.0 crashes on HP-UX Ita... Closed

 Description   

This bug appears to be similar to ZBX-2634.

$ CC=gcc CFLAGS=-O2 ./configure --prefix="/tmp/zabbix/agent" --enable-agent --enable-ipv6
$ make
$ make install

23454:20120707:154526.434 Starting Zabbix Agent [Zabbix server]. Zabbix 2.0.1 (revision 28455).
23455:20120707:154526.439 agent #0 started [collector]
23456:20120707:154526.441 agent #1 started [listener]
23457:20120707:154526.442 agent #2 started [listener]
23458:20120707:154526.444 agent #3 started [listener]
23455:20120707:154526.458 Got signal [signal:10(SIGBUS),reason:1,refaddr:fec0e4e4]. Crashing ...
23455:20120707:154526.458 ====== Fatal information: ======
23455:20120707:154526.459 program counter not available for this architecture
23455:20120707:154526.459 === Registers: ===
23455:20120707:154526.459 register dump not available for this architecture
23455:20120707:154526.460 === Backtrace: ===
23455:20120707:154526.460 backtrace not available for this platform
23455:20120707:154526.460 === Memory map: ===
23455:20120707:154526.460 memory map not available for this platform
23455:20120707:154526.461 ================================
23454:20120707:154526.464 One child process died (PID:23455,exitcode/signal:-1). Exiting ...
23454:20120707:154528.471 Zabbix Agent stopped. Zabbix 2.0.1 (revision 28455).



 Comments   
Comment by Bruce Misc [ 2012 Jul 08 ]

I should have included debug level log data.

23106:20120708:080050.609 Starting Zabbix Agent [Zabbix server]. Zabbix 2.0.1 (revision 28455).
23106:20120708:080050.612 In init_collector_data()
23106:20120708:080050.613 End of init_collector_data()
23107:20120708:080050.615 agent #0 started [collector]
23107:20120708:080050.616 In init_cpu_collector()
23108:20120708:080050.617 agent #1 started [listener]
23109:20120708:080050.618 agent #2 started [listener]
23110:20120708:080050.620 agent #3 started [listener]
23107:20120708:080050.630 End of init_cpu_collector():SUCCEED
23107:20120708:080050.630 In update_cpustats()
23107:20120708:080050.635 Got signal [signal:10(SIGBUS),reason:1,refaddr:fec0e4e4]. Crashing ...
23107:20120708:080050.635 ====== Fatal information: ======
23107:20120708:080050.635 program counter not available for this architecture
23107:20120708:080050.636 === Registers: ===
23107:20120708:080050.636 register dump not available for this architecture
23107:20120708:080050.636 === Backtrace: ===
23107:20120708:080050.636 backtrace not available for this platform
23107:20120708:080050.637 === Memory map: ===
23107:20120708:080050.637 memory map not available for this platform
23107:20120708:080050.637 ================================
23106:20120708:080050.640 One child process died (PID:23107,exitcode/signal:-1). Exiting ...
23106:20120708:080050.641 zbx_on_exit() called
23108:20120708:080050.641 Got signal [signal:15(SIGTERM),sender_pid:23106,sender_uid:10098,reason:0]. Exiting ...
23109:20120708:080050.641 Got signal [signal:15(SIGTERM),sender_pid:23106,sender_uid:10098,reason:0]. Exiting ...
23110:20120708:080050.641 Got signal [signal:15(SIGTERM),sender_pid:23106,sender_uid:10098,reason:0]. Exiting ...
23106:20120708:080052.648 Zabbix Agent stopped. Zabbix 2.0.1 (revision 28455).

Comment by Romeo Theriault [ 2012 Jul 24 ]

I am also seeing exactly the same issue on Solaris 9 with v2.0.1. I've not tried on Solaris 10 yet, but I'm guessing from the above that I'll see the same thing.

Comment by Romeo Theriault [ 2012 Aug 01 ]

This is the output of truss on the zabbix_agentd daemon (v2.0.1) when trying to start on solaris 9.

Comment by Tomasz Zielinski [ 2012 Sep 03 ]

The same on 2.0.2. Please do something.

Comment by Alexei Vladishev [ 2012 Sep 08 ]

Please try to test the latest nightly build and report back.

Comment by Romeo Theriault [ 2012 Sep 08 ]

On Solaris 9 (sparc) I am still seeing the issue:

bash-2.05# uname -a
SunOS epf01 5.9 Generic_118558-13 sun4u sparc SUNW,Sun-Fire-V240
  2624:20120907:113054.538 Starting Zabbix Agent [epf01]. Zabbix 2.0.3rc1 (revision 30147).
  2625:20120907:113054.539 agent #0 started [collector]
  2626:20120907:113054.540 agent #1 started [listener]
  2627:20120907:113054.542 agent #2 started [listener]
  2625:20120907:113054.543 Got signal [signal:10(SIGBUS),reason:1,refaddr:feebe4e4]. Crashing ...
  2625:20120907:113054.543 ====== Fatal information: ======
  2628:20120907:113054.543 agent #3 started [listener]
  2625:20120907:113054.544 program counter not available for this architecture
  2625:20120907:113054.544 === Registers: ===
  2625:20120907:113054.544 register dump not available for this architecture
  2625:20120907:113054.544 === Backtrace: ===
  2625:20120907:113054.544 backtrace not available for this platform
  2625:20120907:113054.544 === Memory map: ===
  2625:20120907:113054.544 memory map not available for this platform
  2625:20120907:113054.544 ================================
  2629:20120907:113054.545 agent #4 started [active checks]
  2624:20120907:113054.545 One child process died (PID:2625,exitcode/signal:-1). Exiting ...
  2624:20120907:113056.541 Zabbix Agent stopped. Zabbix 2.0.3rc1 (revision 30147).

I can test on solaris 10 (sparc) if you want.

Thanks.

Comment by Alexei Vladishev [ 2012 Sep 08 ]

Please test on solaris 10. Thanks for your help.

Comment by Romeo Theriault [ 2012 Sep 10 ]

NP, glad I can help. The problem seems to be the same on Solaris 10 (sparc). See output below. I'll try to test this on Solaris 10 (x64) later today and report back whether this is just a sparc issue.

$ uname -a
SunOS t2k10 5.10 Generic_127111-03 sun4v sparc SUNW,Sun-Fire-T200
29562:20120910:101002.902 Starting Zabbix Agent [Zabbix server]. Zabbix 2.0.3rc1 (revision 30147).
29563:20120910:101002.906 agent #0 started [collector]
29564:20120910:101002.907 agent #1 started [listener]
29565:20120910:101002.909 agent #2 started [listener]
29566:20120910:101002.911 agent #3 started [listener]
29567:20120910:101002.913 agent #4 started [active checks]
29563:20120910:101002.927 Got signal [signal:10(SIGBUS),reason:1,refaddr:fec0e4e4]. Crashing ...
29563:20120910:101002.927 ====== Fatal information: ======
29563:20120910:101002.927 program counter not available for this architecture
29563:20120910:101002.927 === Registers: ===
29563:20120910:101002.927 register dump not available for this architecture
29563:20120910:101002.927 === Backtrace: ===
29563:20120910:101002.928 backtrace not available for this platform
29563:20120910:101002.928 === Memory map: ===
29563:20120910:101002.928 memory map not available for this platform
29563:20120910:101002.928 ================================
29562:20120910:101003.270 One child process died (PID:29563,exitcode/signal:-1). Exiting ...
29562:20120910:101005.275 Zabbix Agent stopped. Zabbix 2.0.3rc1 (revision 30147).

Comment by Romeo Theriault [ 2012 Sep 11 ]

I tested this version on Solaris 10 x86 (64-bit) and it works fine. It starts up and runs without problems. This is the first time I have tested on Solaris x86, though, so it may have worked fine with earlier versions as well. It seems this is an issue with the sparc arch only (for Solaris anyway).

Comment by Romeo Theriault [ 2012 Sep 11 ]

If there is anything else I can do to help move this ticket along please let me know. We'd love to be able to upgrade our zabbix agents on solaris to 2.x.

Thanks!

Comment by Romeo Theriault [ 2012 Sep 15 ]

I was playing around with this a bit more and found a way to get it to run without crashing. On my Solaris sparc boxes the compiler flags picked up by default (I'm using gcc 3.4.2) are "-g -O2" (debugging and optimizing the code). I found that if I override these with:

export CFLAGS=""; ./configure --enable-agent

the resulting binary builds and runs fine. I've not yet narrowed down whether it's the debugging flag or the code optimization flag that is causing the crash. I'll play with it more later today and report back.

Comment by Romeo Theriault [ 2012 Sep 15 ]

This appears to be related to the compiler optimizations. When I build with just the '-O2' compiler flag I still get the crash. When I build with the '-O1' compiler flag (fewer optimizations), I still get the crash. When I remove the compiler optimization flags entirely, the resulting binary seems to work fine.

Is building without the compiler optimizations a reasonable workaround at this point? How much is the lack of these optimizations likely to affect the speed of the agent?

Thanks

Comment by Romeo Theriault [ 2012 Sep 15 ]

I also just tested this with Sun's 'cc' compiler which used the following compiler flags:

CFLAGS="-xO3 -m32 -xarch=v8"

and the resulting binary works fine. So it looks like this is specific to something in gcc's optimizations. I'm not sure if there are any other options to pass to gcc that might get it to work, but for my own purposes I'm going to go ahead and use Sun's C compiler to build my agent binaries.

Comment by Jairo Eduardo Lopez Fuentes Nacarino [ 2012 Sep 24 ]

Hello all,

I've been working on this bug as I have parties interested in the Zabbix agent working on Solaris 10.

I have been able to replicate all the issues posted on the board, all exclusively on the SPARC architecture with the Zabbix agent source code included in version 2.0.3rc1: the agent crashes when compiled with gcc at any optimization level, works when compiled with gcc and only the -g flag, and works when compiled with Oracle/Sun's cc compiler at any optimization level.

I have been working on Solaris 10 10/08 s10s_u6wos_07b for SPARC, using gcc 3.4.3 (csl-sol210-3_4-branch+sol_rpath) on a Sun Fire V120 with an UltraSPARC-IIe 648MHz processor.

The error seems to occur when the SPARC processor executes the std instruction, a doubleword store that requires an 8-byte aligned address, while updating structs, specifically in the update_cpu_counter function of src/zabbix_agent/cpustat.c. The offending structure seems to be the ZBX_COLLECTOR_DATA struct defined in src/zabbix_agent/stats.h, which is not memory aligned for the SPARC architecture.

When the agent is compiled without modifications, the size of the ZBX_COLLECTOR_DATA struct is 12 bytes, which is not a multiple of 8 and is what leads to the SIGBUS when the std instruction is used.

We have apparently been able to fix the issue using two methods, neither of which we consider particularly pretty. We can pad the ZBX_COLLECTOR_DATA struct to a size of 16, either with a char between the ZBX_CPUS_STAT_DATA struct and the diskstat_shmid int or with any size 4 variable of choice. We can also force the gcc compiler to align the ZBX_COLLECTOR_DATA struct to 8 bytes using __attribute__((aligned(8))). We found that forcing the alignment on the ZBX_SINGLE_CPU_STAT_DATA struct and the ZBX_CPUS_STAT_DATA struct also forces the alignment of the ZBX_COLLECTOR_DATA struct.
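
As a rough illustration of the padding approach described above, the following C sketch uses hypothetical stand-in types. The names demo_header_unpadded_t, demo_header_padded_t and next_object_is_aligned are not Zabbix names, and the fields do not mirror the real ZBX_COLLECTOR_DATA layout. It assumes that data containing 8-byte members is stored directly after the header in the same 8-byte aligned memory block, which is one way a 12-byte header could lead to a misaligned doubleword store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte header: three 4-byte fields, no padding. */
typedef struct {
    int32_t count;
    int32_t flags;
    int32_t shmid;
} demo_header_unpadded_t;

/* Same header with one extra 4-byte member, giving a size of 16. */
typedef struct {
    int32_t count;
    int32_t flags;
    int32_t padding;   /* explicit padding, never read */
    int32_t shmid;
} demo_header_padded_t;

/*
 * Returns 1 if an object stored immediately after a header of the given
 * size, in a buffer that itself starts on an 8-byte boundary, would also
 * start on an 8-byte boundary; returns 0 otherwise.
 */
static int next_object_is_aligned(size_t header_size)
{
    return header_size % 8 == 0;
}
```

Under these assumptions, an object following the 12-byte header starts at offset 12, so an 8-byte store into it is misaligned, while the padded header pushes it to offset 16.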

The issue might be resolved if we provided a simple memory alignment check before getting the shared memory for the agent, specifically in the function zbx_shmget defined in src/libs/zbxnix/ipc.c.
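
A check of that kind might look like the sketch below. The helper names demo_align_up and demo_is_aligned and the DEMO_ALIGN constant are hypothetical and do not exist in src/libs/zbxnix/ipc.c. The sketch only shows how a size can be rounded up and how an address can be tested, under the assumption that 8 bytes is the strictest alignment the data needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed strictest alignment required by the stored data (hypothetical). */
#define DEMO_ALIGN 8u

/* Round a size up to the next multiple of DEMO_ALIGN (a power of two). */
static size_t demo_align_up(size_t size)
{
    return (size + (DEMO_ALIGN - 1)) & ~(size_t)(DEMO_ALIGN - 1);
}

/* Return 1 if the address is a multiple of DEMO_ALIGN, 0 otherwise. */
static int demo_is_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % DEMO_ALIGN) == 0;
}
```

For example, demo_align_up(12) yields 16, so sub-blocks carved out of one shared memory segment at rounded-up offsets would each start on an 8-byte boundary.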

Since changing any memory alignment has implications depending on the architecture used, I have no real idea as to which way would be best. I am submitting my current workaround patches to help find a much nicer solution.

I thank everyone for their time and hope to get feedback.

Comment by richlv [ 2012 Sep 26 ]

just a non-dev thinking out loud - shouldn't gcc avoid optimisations that result in crashes ?

Comment by Takanori Suzuki [ 2012 Sep 26 ]

> shouldn't gcc avoid optimisations that result in crashes ?
No.
It's definitely a memory alignment problem.
Avoiding the crash by changing the optimization level is just luck,
because unaligned access is undefined behavior in C.
Changing the optimization level is not a solution.

On SPARC, C developers have to take care of memory alignment,
because unaligned memory access causes a crash on SPARC.
The original ZBX_COLLECTOR_DATA structure does not take memory alignment into account.

On SPARC, if there is a SIGBUS crash, we should suspect a memory alignment problem.
x86 CPUs don't crash, because the CPU specification allows unaligned memory access.
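
For cases where data really can sit at an arbitrary address, a common portable pattern is to copy the bytes with memcpy rather than dereference a cast pointer. The sketch below only illustrates that general technique and is not the fix applied in this ticket; demo_load_u64 and demo_store_u64 are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from any address, aligned or not. */
static uint64_t demo_load_u64(const void *src)
{
    uint64_t value;

    memcpy(&value, src, sizeof(value));   /* byte copy, no alignment requirement */
    return value;
}

/* Write a 64-bit value to any address, aligned or not. */
static void demo_store_u64(void *dst, uint64_t value)
{
    memcpy(dst, &value, sizeof(value));
}
```

This trades a possible small cost per access for code that behaves the same on x86 and on strict-alignment CPUs.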

Comment by richlv [ 2012 Sep 26 ]

ah, cool, thanks for the info

Comment by Romeo Theriault [ 2012 Sep 26 ]

Out of interest, Takanori, does it work with Sun's 'cc' compiler because cc automatically detects these memory alignment issues and pads them?

Comment by Takanori Suzuki [ 2012 Sep 27 ]

> Out of interest Takanori, does it work with Sun's 'cc' compiler because cc automatically detects these memory alignment issues and pad them?
That is also just luck.
The original ZBX_COLLECTOR_DATA structure can end up 12 bytes in size with some compilers.
So with some compilers, like Sun's 'cc', the agent doesn't crash, and with others, like gcc, it does.
We have to add padding to the structure to rule out that possibility with every compiler and avoid the crash.

I think programs should not depend on the behavior of a particular compiler.

Comment by Jairo Eduardo Lopez Fuentes Nacarino [ 2012 Sep 27 ]

The interesting thing is that the cc compiler doesn't use the SPARC std instruction for the offending function. That is just how the compiler has been designed.

By default Sun's cc compiler assumes at most an 8 byte alignment and raises a SIGBUS signal if the program tries to access misaligned data.

You can force the cc compiler to interpret the access to misaligned data while assuming at most an 8 byte alignment using the -xmemalign=8i flag but that is forcing the compiler to use information provided by the user.

This is actually equivalent to using __attribute__((aligned(8))) when defining the structs, since the macros involved are specifically for the gcc compiler.

I agree with Takanori that the error not being produced by Sun's cc compiler is mostly luck and think it would be nice to have a solution that is not compiler specific.

Comment by Arli [ 2012 Oct 04 ]

I encountered the same thing when trying to start 2.0.3 agent on HP-UX B.11.23, B.11.23.0812.076, compiled with cc.

 1394:20121004:134153.446 Starting Zabbix Agent [myserver.mydomain]. Zabbix 2.0.3 (revision 30485).
  1395:20121004:134153.450 agent #0 started [collector]
  1395:20121004:134153.450 Got signal [signal:10(SIGBUS),reason:1,refaddr:c2ec000c]. Crashing ...
  1395:20121004:134153.450 ====== Fatal information: ======
  1395:20121004:134153.450 program counter not available for this architecture
  1395:20121004:134153.450 === Registers: ===
  1395:20121004:134153.450 register dump not available for this architecture
  1395:20121004:134153.450 === Backtrace: ===
  1395:20121004:134153.450 backtrace not available for this platform
  1395:20121004:134153.450 === Memory map: ===
  1395:20121004:134153.450 memory map not available for this platform
  1395:20121004:134153.450 ================================
  1394:20121004:134153.451 One child process died (PID:1395,exitcode/signal:-1). Exiting ...
  1394:20121004:134155.459 Zabbix Agent stopped. Zabbix 2.0.3 (revision 30485).

Comment by Oleksii Zagorskyi [ 2012 Oct 08 ]

ZBX-5382 looks closely related; linked so that it is easy to notice.

Comment by Jeff Shingara [ 2012 Oct 22 ]

Still encountering this issue on Solaris 10 with the 2.0.3 agent.

$
19965:20121022:092855.664 Starting Zabbix Agent [xxxxxxxx]. Zabbix 2.0.3 (revision 30485).
19966:20121022:092855.666 agent #0 started [collector]
19968:20121022:092855.666 agent #2 started [listener]
19967:20121022:092855.666 agent #1 started [listener]
19969:20121022:092855.667 agent #3 started [listener]
19970:20121022:092855.668 agent #4 started [listener]
19972:20121022:092855.669 agent #6 started [active checks]
19971:20121022:092855.668 agent #5 started [listener]
19966:20121022:092855.673 Got signal [signal:10(SIGBUS),reason:1,refaddr:fed0e4e4]. Crashing ...
19966:20121022:092855.673 ====== Fatal information: ======
19966:20121022:092855.673 program counter not available for this architecture
19966:20121022:092855.673 === Registers: ===
19966:20121022:092855.673 register dump not available for this architecture
19966:20121022:092855.673 === Backtrace: ===
19966:20121022:092855.673 backtrace not available for this platform
19966:20121022:092855.673 === Memory map: ===
19966:20121022:092855.674 memory map not available for this platform
19966:20121022:092855.674 ================================
19965:20121022:092855.675 One child process died (PID:19966,exitcode/signal:-1). Exiting ...
19965:20121022:092857.675 Zabbix Agent stopped. Zabbix 2.0.3 (revision 30485).

$ uname -a
SunOS 5.10 Generic_147440-19 sun4u sparc SUNW,SPARC-Enterprise

Comment by Alexander Vladishev [ 2012 Oct 26 ]

Similar issue: ZBX-5741

Comment by Andris Mednis [ 2012 Nov 01 ]

Thanks for the valuable comments and special thanks to Jairo and Takanori for explaining the root cause and proposing a solution!
At http://bytes.com/topic/c/answers/587942-isnt-time-there-standard-align-statement the commenter <artifact one at googlemail com> shows that memory alignment directives differ between GCC, Sun C, Intel C, HP C and IBM XL compilers. It seems better to avoid compiler vendor-specific syntax in Zabbix codebase.
I'm working on a solution where the padding required for 8-byte alignment is built into the Zabbix agent data structures.
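
One compiler-neutral way to guard such padding is a compile-time check written in plain C. The following is only a sketch under assumptions: demo_collector_t, its fields and the DEMO_STATIC_ASSERT macro are hypothetical and are not taken from the Zabbix sources or from the actual fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compile-time assertion without compiler-specific syntax: the array
 * size becomes -1, which is a compile error, when the condition is false.
 */
#define DEMO_STATIC_ASSERT(cond, name) \
    typedef char demo_static_assert_##name[(cond) ? 1 : -1]

/* Hypothetical structure with explicit padding before a 64-bit member. */
typedef struct {
    int32_t  shmid;
    int32_t  padding;   /* keeps 'counter' on an 8-byte offset */
    uint64_t counter;
} demo_collector_t;

/* The build fails if a later edit breaks the intended layout. */
DEMO_STATIC_ASSERT(sizeof(demo_collector_t) % 8 == 0, size_multiple_of_8);
DEMO_STATIC_ASSERT(offsetof(demo_collector_t, counter) % 8 == 0, counter_offset);
```

A check like this would turn a layout regression into a build error on every platform instead of a runtime SIGBUS on the few that enforce alignment.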

Comment by Andris Mednis [ 2012 Nov 06 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-5289

Comment by Alexander Vladishev [ 2012 Nov 07 ]

Great work! Successfully tested.

Comment by Andris Mednis [ 2012 Nov 07 ]

Fixed in versions pre-2.0.4 rev. 31309 and pre-2.1.0 rev. 31312.

Comment by Paul Surgeon [ 2012 Nov 29 ]

I can confirm that the fix also works for HP-UX 11.31 on Itanium2 using GCC.
I'm using the zabbix-2.0.4rc1 pre-release.

Comment by Andris Mednis [ 2012 Nov 29 ]

Thanks, Paul!
Yesterday zabbix-2.0.4rc1 was released.

Comment by Gene Liverman [ 2012 Dec 08 ]

Has anyone by chance already made some pre-compiled agents for Solaris 9 / 10 SPARC with the 2.0.4rc1 code?

Comment by Andris Mednis [ 2012 Dec 10 ]

Version 2.0.4 was released on Dec 8. Pre-compiled agents for Solaris are expected in a few days at http://www.zabbix.com/download.php

Comment by Andris Mednis [ 2012 Dec 10 ]

Version 2.0.4 pre-compiled agents are at http://www.zabbix.com/download.php

Comment by Leon Guo [ 2020 Mar 21 ]

I also have the same issue on Oracle Sparc S7 with zabbix agent 4.4.6.

Generated at Wed Apr 24 09:50:57 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.