[ZBX-21373] Zabbix Agent crashes every now and then on OpenBSD Created: 2022 Jul 21 Updated: 2024 Apr 10 Resolved: 2022 Dec 21 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 6.0.5, 6.2.0 |
Fix Version/s: | 6.0.13rc1, 6.2.7rc1, 6.4.0beta5, 6.4 (plan) |
Type: | Problem report | Priority: | Trivial |
Reporter: | Mark | Assignee: | Juris Lambda (Inactive) |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
OpenBSD 7.1 amd64 bare metal, VM |
Attachments: |
![]() ![]() |
Team: | |
Sprint: | Sprint 92 (Sep 2022), Sprint 93 (Oct 2022), Sprint 94 (Nov 2022), Sprint 95 (Dec 2022) |
Story Points: | 0.25 |
Description |
The zabbix_agentd crashes every now and then on OpenBSD 7.1 - amd64. This happens on bare metal systems, as well as VM (ESXi). I'm not 100% sure about the last working version. Surely it worked with 6.0.1, but crashes with 6.0.5 and now 6.2.0. The configs haven't touched for a while.
Last crash from tonight: Jul 21 01:26:07 s1 zabbix_agentd: vfprintf %s NULL in "%s " Jul 21 01:26:07 s1 zabbix_agentd[87595]: Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xa2d88e96081]. Crashing ... Jul 21 01:26:07 s1 zabbix_agentd[87595]: ====== Fatal information: ====== Jul 21 01:26:07 s1 zabbix_agentd[87595]: === Backtrace: === Jul 21 01:26:07 s1 zabbix_agentd[87595]: 1: 0xa2b2e716ee4 <zbx_backtrace+0x54> at /usr/local/sbin/zabbix_agentd Jul 21 01:26:07 s1 zabbix_agentd[87595]: 0: 0xa2b2e71705f <zbx_log_fatal_info+0x4f> at /usr/local/sbin/zabbix_agentd Jul 21 01:26:07 s1 zabbix_agentd[87595]: === Memory map: === Jul 21 01:26:07 s1 zabbix_agentd[87595]: memory map not available for this platform Jul 21 01:26:07 s1 zabbix_agentd[87595]: ================================
Steps to reproduce:
|
Comments |
Comment by Mark [ 2022 Jul 21 ] |
Perhaps to add to this. I've checked all items on this system and all values are absolutely normal (memory, cpu, disk, services). So no running out of resources issue. |
Comment by Vladislavs Sokurenko [ 2022 Jul 21 ] |
Could you please increase log level and provide part of the log before the crash, it seems that null is printed somewhere so it crashes on OpenBSD |
Comment by Mark [ 2022 Jul 25 ] |
Today, finally the first crash. Zabbix Agent log configuration: # for debugging purposes #LogType = syslog LogType = file LogFile = /var/log/zabbix/agentd.log LogFileSize = 20 DebugLevel = 4
Syslog /var/log/messages:
Jul 25 10:35:07 lynx zabbix_agentd: vfprintf %s NULL in "%s "
I see this in /var/log/zabbix/agentd.log 61027:20220725:103507.081 Requested [vfs.fs.size[/var/www/logs,free]] 61027:20220725:103507.081 In zbx_execute_threaded_metric() key:'vfs.fs.size' 6028:20220725:103507.082 executing in data process for key:'vfs.fs.size' 61027:20220725:103507.083 End of zbx_execute_threaded_metric():SYSINFO_SUCCEED '' 61027:20220725:103507.083 Sending back [30128128] 61027:20220725:103507.083 zbx_setproctitle() title:'listener #1 [waiting for connection]' 28968:20220725:103507.128 zbx_setproctitle() title:'listener #2 [processing request]' 28968:20220725:103507.129 Requested [proc.num[,,,ifstated]] 28968:20220725:103507.130 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xae31a26f081]. Crashing ... 28968:20220725:103507.130 ====== Fatal information: ====== 28968:20220725:103507.130 === Backtrace: === 28968:20220725:103507.131 1: 0xae0e398dee4 <zbx_backtrace+0x54> at /usr/local/sbin/zabbix_agentd 28968:20220725:103507.131 0: 0xae0e398e05f <zbx_log_fatal_info+0x4f> at /usr/local/sbin/zabbix_agentd 28968:20220725:103507.132 === Memory map: === 28968:20220725:103507.132 memory map not available for this platform 28968:20220725:103507.132 ================================ 32280:20220725:103507.138 One child process died (PID:0,exitcode/signal:0). Exiting ... 32280:20220725:103507.138 zbx_on_exit() called with ret:-1 32280:20220725:103507.142 In zbx_unload_modules() 32280:20220725:103507.142 End of zbx_unload_modules() 32280:20220725:103507.142 Zabbix Agent stopped. Zabbix 6.2.0 (revision db780895726). |
Comment by Mark [ 2022 Sep 01 ] |
Any news on this? I can confirm the issue persists with 6.2.2. It's crashing about once/day for several systems. |
Comment by Vladislavs Sokurenko [ 2022 Sep 01 ] |
Last metric was: |
Comment by Juris Lambda (Inactive) [ 2022 Sep 01 ] |
Is this on -release, -stable or -current? |
Comment by Mark [ 2022 Sep 01 ] |
I'm running on OpenBSD 7.2 -current now (so the upcoming 7.2 release), but it's been the same with 7.1 -current.
Thanks. |
Comment by Juris Lambda (Inactive) [ 2022 Sep 02 ] |
Hey, Uturn! As you're specifying only the cmdline parameter, which is a regex to match proc command-lines, can you check what other processes might be running on the boxes that could match it? Also, just for the sake of completeness, (1) any fancy ifstated_flags in rc.conf.local and (2) is this built from the ports tree (I've checked net/zabbix/patches and there doesn't appear to be anything that should really affect the proc.num item)? |
Comment by Mark [ 2022 Sep 02 ] |
Hi Juris, i'm using the default Zabbix OpenBSD template and it's only /usr/sbin/ifstated running...the one that comes with base. No special flags. $ ps ax | grep ifstated 3004 ?? IpU 0:00.00 /usr/sbin/ifstated 69772 p3 R+/2 0:00.00 grep ifstated I've disabled checks for ifstated and will see if the daily crashes stop. |
Comment by Juris Lambda (Inactive) [ 2022 Sep 02 ] |
Hello again, Uturn! Can you please apply the attached patch-src_libs_zbxsysinfo_openbsd_proc_c-6.2.0-trace.patch |
Comment by Mark [ 2022 Sep 05 ] |
Hi Juris, i'm running with the patch now, reactivated monitoring of the ifstated service and btw...i didn't have any crashes in the last days with item 'proc.num[,,,ifstated]' deactivated. |
Comment by Vladislavs Sokurenko [ 2022 Sep 05 ] |
Could you please be so kind to enable 'proc.num[,,,ifstated]' and provide part of the log after: Is ifstated sometimes run with arguments and sometimes not, thus could cause race condition in proc_argv() function between sysctl() ? |
Comment by Mark [ 2022 Sep 05 ] |
Hi Vladislavs, ifstated always runs without arguments in my configuration 95101:20220905:132707.619 Requested [proc.num[,,,ifstated]] 95101:20220905:132707.619 [*] Parameter count validated 95101:20220905:132707.619 [*] name: 0x42cd64dc310 95101:20220905:132707.619 [*] user: 0x42cd64c54a0 95101:20220905:132707.619 [*] state: 0x42cd64c5ce0 95101:20220905:132707.619 [*] cmdline: 0x42cd64d5350 95101:20220905:132707.619 [*] entered proc_argv 95101:20220905:132707.619 [*] argv: 0x7f7ffffdf328 95101:20220905:132707.619 [*] *argv: 0x0 95101:20220905:132707.619 [*] argv_alloc: 0x7f7ffffdf2e8 (0) 95101:20220905:132707.619 [*] argc: 0x7f7ffffdf324 (32639) 95101:20220905:132707.619 [*] leaving proc_argv 95101:20220905:132707.619 [*] entered collect_args 95101:20220905:132707.619 [*] argv: 0x42cd64cfec0 95101:20220905:132707.619 [*] *argv: 0x42cd64cfee0 95101:20220905:132707.619 [*] argc: 3 95101:20220905:132707.619 [*] args: -134352 95101:20220905:132707.619 [*] *args: 0 95101:20220905:132707.620 [*] procloop iter: 0 95101:20220905:132707.620 [*] procloop iter: 1 95101:20220905:132707.620 [*] procloop iter: 2 95101:20220905:132707.620 [*] entered proc_argv 95101:20220905:132707.620 [*] argv: 0x7f7ffffdf328 95101:20220905:132707.620 [*] *argv: 0x42cd64cfec0 95101:20220905:132707.620 [*] argv_alloc: 0x7f7ffffdf2e8 (64) 95101:20220905:132707.620 [*] argc: 0x7f7ffffdf324 (3) 95101:20220905:132707.620 [*] leaving proc_argv 95101:20220905:132707.620 [*] entered collect_args 95101:20220905:132707.620 [*] argv: 0x42cd64d0d00 95101:20220905:132707.620 [*] *argv: 0x42cd64d0d20 95101:20220905:132707.620 [*] argc: 3 95101:20220905:132707.620 [*] args: -134352 95101:20220905:132707.620 [*] *args: -699599232 |
Comment by Vladislavs Sokurenko [ 2022 Sep 05 ] |
Should try deleting redundant system calls and simply iterate while not NULL diff --git a/src/libs/zbxsysinfo/openbsd/proc.c b/src/libs/zbxsysinfo/openbsd/proc.c index 412c380f279..aa324ec8c99 100644 --- a/src/libs/zbxsysinfo/openbsd/proc.c +++ b/src/libs/zbxsysinfo/openbsd/proc.c @@ -172,12 +172,6 @@ retry: return FAIL; } - mib[3] = KERN_PROC_NARGV; - - sz = sizeof(int); - if (0 != sysctl(mib, 4, argc, &sz, NULL, 0)) - return FAIL; - return SUCCEED; } @@ -192,7 +186,7 @@ static void collect_args(char **argv, int argc, char **args, size_t *args_alloc) *args = zbx_malloc(*args, *args_alloc); } - for (i = 0; i < argc; i++) + while (NULL != argv[i]) zbx_snprintf_alloc(args, args_alloc, &args_offset, "%s ", argv[i]); if (0 != args_offset) |
Comment by Mark [ 2022 Sep 06 ] |
Hey Vladislavs, with your diff, zabbix_agentd dies after <1 min independently if proc.num[,,,ifstated] is activated or not 26504:20220906:090645.548 Starting Zabbix Agent [lynx]. Zabbix 6.2.2 (revision 35455866073). 26504:20220906:090645.548 **** Enabled features **** 26504:20220906:090645.549 IPv6 support: YES 26504:20220906:090645.549 TLS support: YES 26504:20220906:090645.549 ************************** 26504:20220906:090645.549 using configuration file: /etc/zabbix/zabbix_agentd.conf 26504:20220906:090645.549 In zbx_load_modules() 26504:20220906:090645.549 End of zbx_load_modules():SUCCEED 26504:20220906:090645.549 In init_collector_data() 26504:20220906:090645.549 End of init_collector_data() 26504:20220906:090645.549 agent #0 started [main process] 52311:20220906:090645.550 agent #1 started [collector] 71417:20220906:090645.551 agent #2 started [listener #1] 8494:20220906:090645.551 agent #4 started [listener #3] 36918:20220906:090645.551 agent #3 started [listener #2] 52311:20220906:090645.552 In init_cpu_collector() 8494:20220906:090645.552 In zbx_tls_init_child() ..... 75410:20220906:090706.780 End of need_meta_update():FAIL 75410:20220906:090706.780 In send_buffer() host:'10.0.118.68' port:10051 entries:4/100 75410:20220906:090706.780 send_buffer() now:1662448026 lastsent:1662448025 now-lastsent:1 BufferSend:5; will not send now 75410:20220906:090706.780 End of send_buffer():SUCCEED 75410:20220906:090706.780 End of process_active_checks() 75410:20220906:090706.780 In get_min_nextcheck() 75410:20220906:090706.780 End of get_min_nextcheck():1662448027 75410:20220906:090706.780 zbx_setproctitle() title:'active checks #1 [idle 1 sec]' 26504:20220906:090706.883 One child process died (PID:0,exitcode/signal:0). Exiting ... 26504:20220906:090706.884 zbx_on_exit() called with ret:-1 26504:20220906:090706.886 In zbx_unload_modules() 26504:20220906:090706.886 End of zbx_unload_modules() 26504:20220906:090706.886 Zabbix Agent stopped. Zabbix 6.2.2 (revision 35455866073). |
Comment by Vladislavs Sokurenko [ 2022 Sep 06 ] |
It was draft of what should be deleted, patch will be provided soon it will be nice if you will be able to test it. |
Comment by Juris Lambda (Inactive) [ 2022 Sep 06 ] |
Hello, Uturn! Good news, I've managed to fully reproduce the behaviour. It does appear vso was right and there is some form of race between the two sysctl(2) calls, where in the first one we ask for the list of proc args (which gets written to our local argument vector), and the second asks explicitly for their count. We don't do this on other platforms. I haven't spoken to tech@ yet, but I'm wondering if it's legal to expect data integrity across two sysctl(2) calls. I replaced the second call with setting the argc value from my internal counter, and the issue stopped manifesting. IMHO, we should just drop the extra sysctl(2) call and (1) determine the argv length in proc_argv ourselves, to maintain the interface, or (2) make callers iterate the returned argv themselves. |
Comment by Juris Lambda (Inactive) [ 2022 Sep 09 ] |
Hey, Uturn! I cooked up a quick patch-src_libs_zbxsysinfo_openbsd_proc_c-6.0.5-fix.patch |
Comment by Juris Lambda (Inactive) [ 2022 Sep 09 ] |
Fixed in development branch feature/ZBX-21373-6.0 and available for review. |
Comment by Mark [ 2022 Sep 09 ] |
Hey Juris. I'm running with your diff on 6.2.2 now and will report how it goes. Thanks. UPDATE: no issues after 4 hours UPDATE: no issues after 24 hours UPDATE: everything's fine after 3 days |
Comment by dimir [ 2022 Dec 12 ] |
Successfully tested. |
Comment by Juris Lambda (Inactive) [ 2022 Dec 13 ] |
Fixed in versions: |