[ZBX-5289] 2.0.1 agent on Solaris 10 throws "Got signal [signal:10(SIGBUS),reason:1,refaddr:fec0e4e4]. Crashing ..." Created: 2012 Jul 08 Updated: 2020 Mar 21 Resolved: 2012 Nov 06 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 2.0.1 |
Fix Version/s: | 2.0.4rc1, 2.1.0 |
Type: | Incident report | Priority: | Major |
Reporter: | Bruce Misc | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 5 |
Labels: | crash, solaris | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
SunOS nodename 5.10 Generic_127111-06 sun4v sparc SUNW,Sun-Fire-T1000 |
Description |
This bug appears to be similar to
$ CC=gcc CFLAGS=-O2 ./configure --prefix="/tmp/zabbix/agent" --enable-agent --enable-ipv6
23454:20120707:154526.434 Starting Zabbix Agent [Zabbix server]. Zabbix 2.0.1 (revision 28455). |
Comments |
Comment by Bruce Misc [ 2012 Jul 08 ] |
I should have included debug level log data.
23106:20120708:080050.609 Starting Zabbix Agent [Zabbix server]. Zabbix 2.0.1 (revision 28455). |
Comment by Romeo Theriault [ 2012 Jul 24 ] |
I am also seeing the exact same issue on Solaris 9 with v.2.0.1. I've not tried on solaris 10 yet but I'm guessing from the above I'll see the same thing. |
Comment by Romeo Theriault [ 2012 Aug 01 ] |
This is the output of truss on the zabbix_agentd daemon (v2.0.1) when trying to start on solaris 9. |
Comment by Tomasz Zielinski [ 2012 Sep 03 ] |
The same on 2.0.2. Please do something. |
Comment by Alexei Vladishev [ 2012 Sep 08 ] |
Please try to test the latest nightly build and report back. |
Comment by Romeo Theriault [ 2012 Sep 08 ] |
On Solaris 9 (sparc) I am still seeing the issue:
bash-2.05# uname -a
SunOS epf01 5.9 Generic_118558-13 sun4u sparc SUNW,Sun-Fire-V240
2624:20120907:113054.538 Starting Zabbix Agent [epf01]. Zabbix 2.0.3rc1 (revision 30147).
2625:20120907:113054.539 agent #0 started [collector]
2626:20120907:113054.540 agent #1 started [listener]
2627:20120907:113054.542 agent #2 started [listener]
2625:20120907:113054.543 Got signal [signal:10(SIGBUS),reason:1,refaddr:feebe4e4]. Crashing ...
2625:20120907:113054.543 ====== Fatal information: ======
2628:20120907:113054.543 agent #3 started [listener]
2625:20120907:113054.544 program counter not available for this architecture
2625:20120907:113054.544 === Registers: ===
2625:20120907:113054.544 register dump not available for this architecture
2625:20120907:113054.544 === Backtrace: ===
2625:20120907:113054.544 backtrace not available for this platform
2625:20120907:113054.544 === Memory map: ===
2625:20120907:113054.544 memory map not available for this platform
2625:20120907:113054.544 ================================
2629:20120907:113054.545 agent #4 started [active checks]
2624:20120907:113054.545 One child process died (PID:2625,exitcode/signal:-1). Exiting ...
2624:20120907:113056.541 Zabbix Agent stopped. Zabbix 2.0.3rc1 (revision 30147).
I can test on Solaris 10 (sparc) if you want. Thanks. |
Comment by Alexei Vladishev [ 2012 Sep 08 ] |
Please test on solaris 10. Thanks for your help. |
Comment by Romeo Theriault [ 2012 Sep 10 ] |
NP, glad I can help. The problem seems to be the same on Solaris 10 (sparc). See output below. I'll try to test this on Solaris 10 (x64) later today and report back if this is just a sparc issue.
$ uname -a
SunOS t2k10 5.10 Generic_127111-03 sun4v sparc SUNW,Sun-Fire-T200
29562:20120910:101002.902 Starting Zabbix Agent [Zabbix server]. Zabbix 2.0.3rc1 (revision 30147).
29563:20120910:101002.906 agent #0 started [collector]
29564:20120910:101002.907 agent #1 started [listener]
29565:20120910:101002.909 agent #2 started [listener]
29566:20120910:101002.911 agent #3 started [listener]
29567:20120910:101002.913 agent #4 started [active checks]
29563:20120910:101002.927 Got signal [signal:10(SIGBUS),reason:1,refaddr:fec0e4e4]. Crashing ...
29563:20120910:101002.927 ====== Fatal information: ======
29563:20120910:101002.927 program counter not available for this architecture
29563:20120910:101002.927 === Registers: ===
29563:20120910:101002.927 register dump not available for this architecture
29563:20120910:101002.927 === Backtrace: ===
29563:20120910:101002.928 backtrace not available for this platform
29563:20120910:101002.928 === Memory map: ===
29563:20120910:101002.928 memory map not available for this platform
29563:20120910:101002.928 ================================
29562:20120910:101003.270 One child process died (PID:29563,exitcode/signal:-1). Exiting ...
29562:20120910:101005.275 Zabbix Agent stopped. Zabbix 2.0.3rc1 (revision 30147). |
Comment by Romeo Theriault [ 2012 Sep 11 ] |
I tested this version on Solaris 10 x86 (64-bit) and it works fine. It starts up and runs without problems. This is the first time I have tested on Solaris x86, though, so it may have worked fine with earlier versions as well. It seems this is an issue with the sparc arch only (for Solaris anyway). |
Comment by Romeo Theriault [ 2012 Sep 11 ] |
If there is anything else I can do to help move this ticket along please let me know. We'd love to be able to upgrade our zabbix agents on solaris to 2.x. Thanks! |
Comment by Romeo Theriault [ 2012 Sep 15 ] |
Was playing around with this a bit more and found how to get it to run without crashing. On my Solaris sparc boxes the default compiler flags picked up (I'm using gcc 3.4.2) are "-g -O2" (debugging and optimizing the code). I found that if I override these with: export CFLAGS=""; ./configure --enable-agent the resulting binary builds and runs fine. I've not yet narrowed down whether it's the debugging or the code optimization feature that causes the crash. I'll play with it more later today and report back. |
Comment by Romeo Theriault [ 2012 Sep 15 ] |
This appears to be related to the compiler optimizations. When I build with just the '-O2' compiler flag I still get the crash. When I build with the '-O1' compiler flag, which enables fewer optimizations, I still get the crash. When I remove the compiler optimization flags the resulting binary seems to work fine. Is building without the compiler optimizations a reasonable workaround at this point? How much is the lack of these optimizations likely to affect the speed of the agent? Thanks |
Comment by Romeo Theriault [ 2012 Sep 15 ] |
I also just tested this with Sun's 'cc' compiler which used the following compiler flags:
CFLAGS="-xO3 -m32 -xarch=v8"
and the resulting binary works fine. So it looks like this is specific to gcc's optimizations. I'm not sure whether there are other options to pass to gcc that might get it to work, but for my own purposes I'm going to go ahead and use Sun's C compiler to build my agent binaries. |
Comment by Jairo Eduardo Lopez Fuentes Nacarino [ 2012 Sep 24 ] |
Hello all, I've been working on this bug because I have parties interested in the Zabbix agent working on Solaris 10. I have been able to replicate all the issues posted here: a crashing agent when compiling with gcc at any optimization level, a working agent when compiling with gcc and the -g flag, and a working agent when compiling with Oracle/Sun's cc compiler at any optimization level. All of this happens exclusively on the SPARC architecture, with the Zabbix agent source code included in version 2.0.3rc1. I have been working on Solaris 10 10/08 s10s_u6wos_07b for SPARC, using gcc 3.4.3 (csl-sol210-3_4-branch+sol_rpath) on a Sun Fire V120 with an UltraSPARC-IIe 648MHz processor.
The error seems to occur when the SPARC processor uses the std instruction, which is a doubleword store, to update structs, specifically in the update_cpu_counter function of src/zabbix_agent/cpustat.c. The offending structure seems to be the ZBX_COLLECTOR_DATA struct defined in src/zabbix_agent/stats.h, which is not memory aligned for the SPARC architecture. When the agent is compiled without modifications, the size of the ZBX_COLLECTOR_DATA struct is 12, which is what triggers the SIGBUS when the std instruction is used.
We have been able to apparently fix the issue using two methods, neither of which we consider particularly pretty. We can pad the ZBX_COLLECTOR_DATA struct to a size of 16, either with a char between the ZBX_CPUS_STAT_DATA struct and the diskstat_shmid int or with any other size 4 variable of choice. We can also force the gcc compiler to align the ZBX_COLLECTOR_DATA struct to 8 bytes using __attribute__((aligned(8))). We found that forcing the alignment on the ZBX_SINGLE_CPU_STAT_DATA struct and the ZBX_CPUS_STAT_DATA struct also forces the alignment of the ZBX_COLLECTOR_DATA struct. 
The issue might be resolved if we provided a simple memory alignment check before getting the shared memory for the agent, specifically in the function zbx_shmget defined in src/libs/zbxnix/ipc.c. Since changing any memory alignment has implications depending on the architecture used, I have no real idea as to which way would be best. I am submitting my current workaround patches to help find a much nicer solution. I thank everyone for their time and hope to get feedback. |
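The two workarounds and the alignment check described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: every `demo_*` name is hypothetical, the structs are simplified stand-ins and not the real ZBX_COLLECTOR_DATA definition, the sizes quoted hold on common ABIs where `int32_t` is 4-byte aligned, and nothing here is claimed to be the fix that was eventually committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; NOT the real Zabbix structs. */

/* Three 4-byte members: 12 bytes in size, 4-byte alignment on common ABIs. */
typedef struct
{
	int32_t	a, b, c;
}
demo_plain_t;

/* Workaround 1 from the comment: pad the struct up to 16 bytes. */
typedef struct
{
	int32_t	a, b, c;
	int32_t	padding;	/* any 4-byte filler would do */
}
demo_padded_t;

/* Workaround 2 from the comment: GCC-specific forced 8-byte alignment. */
typedef struct
{
	int32_t	a, b, c;
}
__attribute__((aligned(8))) demo_aligned_t;

/* Round size up to the next multiple of alignment (must be a power of two). */
size_t	demo_align_up(size_t size, size_t alignment)
{
	assert(0 != alignment && 0 == (alignment & (alignment - 1)));

	return (size + alignment - 1) & ~(alignment - 1);
}

/* Return 1 if ptr is a multiple of alignment, 0 otherwise. */
int	demo_is_aligned(const void *ptr, size_t alignment)
{
	assert(0 != alignment);

	return 0 == (uintptr_t)ptr % alignment;
}
```

With helpers like these, a block that follows a 12-byte struct inside one shared memory segment would be placed at offset `demo_align_up(12, 8)`, which is 16, instead of offset 12, so an 8-byte store into it would land on an 8-byte boundary.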
Comment by richlv [ 2012 Sep 26 ] |
just a non-dev thinking out loud - shouldn't gcc avoid optimisations that result in crashes ? |
Comment by Takanori Suzuki [ 2012 Sep 26 ] |
> shouldn't gcc avoid optimisations that result in crashes ?
On SPARC, C developers have to take care of memory alignment themselves. If there is a SIGBUS crash on SPARC, we should suspect a memory alignment problem. |
Comment by richlv [ 2012 Sep 26 ] |
ah, cool, thanks for the info |
Comment by Romeo Theriault [ 2012 Sep 26 ] |
Out of interest, Takanori, does it work with Sun's 'cc' compiler because cc automatically detects these memory alignment issues and pads them? |
Comment by Takanori Suzuki [ 2012 Sep 27 ] |
> Out of interest Takanori, does it work with Sun's 'cc' compiler because cc automatically detects these memory alignment issues and pad them?
I think programs should not depend on the behavior of a particular compiler. |
Comment by Jairo Eduardo Lopez Fuentes Nacarino [ 2012 Sep 27 ] |
The interesting thing is that the cc compiler doesn't use the SPARC std instruction for the offending function. That is just how the compiler has been designed. By default Sun's cc compiler assumes at most an 8 byte alignment and raises a SIGBUS signal if the program tries to access misaligned data. You can force the cc compiler to interpret the access to misaligned data while assuming at most an 8 byte alignment using the -xmemalign=8i flag but that is forcing the compiler to use information provided by the user. This is actually equivalent to using __attribute__((aligned(8))) when defining the structs, since the macros involved are specifically for the gcc compiler. I agree with Takanori that the error not being produced by Sun's cc compiler is mostly luck and think it would be nice to have a solution that is not compiler specific. |
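For comparison, one widely used way to make a 64-bit store tolerate a misaligned address without any compiler-specific attribute or flag is to copy the bytes with `memcpy`. The sketch below is an assumption-laden illustration only: the `demo_*` helpers are hypothetical, they are not part of Zabbix, and this thread does not say that the eventual fix took this approach.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of Zabbix. */

/* Store a 64-bit value at an address that may not be 8-byte aligned. */
void	demo_store_u64(void *dst, uint64_t value)
{
	assert(NULL != dst);

	/* memcpy() makes no alignment assumption about dst */
	memcpy(dst, &value, sizeof(value));
}

/* Load a 64-bit value from an address that may not be 8-byte aligned. */
uint64_t	demo_load_u64(const void *src)
{
	uint64_t	value;

	assert(NULL != src);

	memcpy(&value, src, sizeof(value));

	return value;
}
```

The trade-off is that every access has to go through the helpers, whereas padding or aligning the structs keeps ordinary member access and lets the compiler use doubleword instructions safely.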
Comment by Arli [ 2012 Oct 04 ] |
I encountered the same thing when trying to start the 2.0.3 agent on HP-UX B.11.23, B.11.23.0812.076, compiled with cc.
1394:20121004:134153.446 Starting Zabbix Agent [myserver.mydomain]. Zabbix 2.0.3 (revision 30485).
1395:20121004:134153.450 agent #0 started [collector]
1395:20121004:134153.450 Got signal [signal:10(SIGBUS),reason:1,refaddr:c2ec000c]. Crashing ...
1395:20121004:134153.450 ====== Fatal information: ======
1395:20121004:134153.450 program counter not available for this architecture
1395:20121004:134153.450 === Registers: ===
1395:20121004:134153.450 register dump not available for this architecture
1395:20121004:134153.450 === Backtrace: ===
1395:20121004:134153.450 backtrace not available for this platform
1395:20121004:134153.450 === Memory map: ===
1395:20121004:134153.450 memory map not available for this platform
1395:20121004:134153.450 ================================
1394:20121004:134153.451 One child process died (PID:1395,exitcode/signal:-1). Exiting ...
1394:20121004:134155.459 Zabbix Agent stopped. Zabbix 2.0.3 (revision 30485). |
Comment by Oleksii Zagorskyi [ 2012 Oct 08 ] |
|
Comment by Jeff Shingara [ 2012 Oct 22 ] |
Still encountering this issue on Solaris10 with 2.0.3 agent $ $ uname -a |
Comment by Alexander Vladishev [ 2012 Oct 26 ] |
Similar issue: |
Comment by Andris Mednis [ 2012 Nov 01 ] |
Thanks for the valuable comments, and special thanks to Jairo and Takanori for explaining the root cause and proposing a solution! |
Comment by Andris Mednis [ 2012 Nov 06 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-5289 |
Comment by Alexander Vladishev [ 2012 Nov 07 ] |
Great work! Successfully tested. |
Comment by Andris Mednis [ 2012 Nov 07 ] |
Fixed in versions pre-2.0.4 rev. 31309 and pre-2.1.0 rev. 31312. |
Comment by Paul Surgeon [ 2012 Nov 29 ] |
I can confirm that the fix also works for HP-UX 11.31 on Itanium2 using GCC. |
Comment by Andris Mednis [ 2012 Nov 29 ] |
Thanks, Paul! |
Comment by Gene Liverman [ 2012 Dec 08 ] |
Has anyone by chance already made some pre-compiled agents for Solaris 9 / 10 SPARC with the 2.0.4rc1 code? |
Comment by Andris Mednis [ 2012 Dec 10 ] |
Version 2.0.4 was released on Dec 8. Pre-compiled agents for Solaris are expected in a few days at http://www.zabbix.com/download.php |
Comment by Andris Mednis [ 2012 Dec 10 ] |
Version 2.0.4 pre-compiled agents are at http://www.zabbix.com/download.php |
Comment by Leon Guo [ 2020 Mar 21 ] |
I also have the same issue on Oracle Sparc S7 with zabbix agent 4.4.6. |