[ZBX-21373] Zabbix Agent crashes every now and then on OpenBSD Created: 2022 Jul 21  Updated: 2024 Apr 10  Resolved: 2022 Dec 21

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 6.0.5, 6.2.0
Fix Version/s: 6.0.13rc1, 6.2.7rc1, 6.4.0beta5, 6.4 (plan)

Type: Problem report Priority: Trivial
Reporter: Mark Assignee: Juris Lambda (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OpenBSD 7.1 amd64 bare metal, VM


Attachments: Text File patch-src_libs_zbxsysinfo_openbsd_proc_c-6.0.5-fix.patch     Text File patch-src_libs_zbxsysinfo_openbsd_proc_c-6.2.0-trace.patch    
Team: Team B
Sprint: Sprint 92 (Sep 2022), Sprint 93 (Oct 2022), Sprint 94 (Nov 2022), Sprint 95 (Dec 2022)
Story Points: 0.25

 Description   

The zabbix_agentd crashes every now and then on OpenBSD 7.1 - amd64. This happens on bare metal systems, as well as VM (ESXi). I'm not 100% sure about the last working version. Surely it worked with 6.0.1, but crashes with 6.0.5 and now 6.2.0. The configs haven't touched for a while.

 

Last crash from tonight:

Jul 21 01:26:07 s1 zabbix_agentd: vfprintf %s NULL in "%s "
Jul 21 01:26:07 s1 zabbix_agentd[87595]: Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xa2d88e96081]. Crashing ...
Jul 21 01:26:07 s1 zabbix_agentd[87595]: ====== Fatal information: ======
Jul 21 01:26:07 s1 zabbix_agentd[87595]: === Backtrace: ===
Jul 21 01:26:07 s1 zabbix_agentd[87595]: 1: 0xa2b2e716ee4 <zbx_backtrace+0x54> at /usr/local/sbin/zabbix_agentd
Jul 21 01:26:07 s1 zabbix_agentd[87595]: 0: 0xa2b2e71705f <zbx_log_fatal_info+0x4f> at /usr/local/sbin/zabbix_agentd
Jul 21 01:26:07 s1 zabbix_agentd[87595]: === Memory map: ===
Jul 21 01:26:07 s1 zabbix_agentd[87595]: memory map not available for this platform
Jul 21 01:26:07 s1 zabbix_agentd[87595]: ================================

 

Steps to reproduce:

  • no easy way. Start zabbix_agentd an wait.


 Comments   
Comment by Mark [ 2022 Jul 21 ]

Perhaps to add to this. I've checked all items on this system and all values are absolutely normal (memory, cpu, disk, services). So no running out of resources issue.

Comment by Vladislavs Sokurenko [ 2022 Jul 21 ]

Could you please increase log level and provide part of the log before the crash, it seems that null is printed somewhere so it crashes on OpenBSD

Comment by Mark [ 2022 Jul 25 ]

Today, finally the first crash.

Zabbix Agent log configuration:

# for debugging purposes
#LogType = syslog
LogType = file
LogFile = /var/log/zabbix/agentd.log
LogFileSize = 20
DebugLevel = 4

 

Syslog /var/log/messages:

Jul 25 10:35:07 lynx zabbix_agentd: vfprintf %s NULL in "%s "

 

I see this in /var/log/zabbix/agentd.log

61027:20220725:103507.081 Requested [vfs.fs.size[/var/www/logs,free]]
 61027:20220725:103507.081 In zbx_execute_threaded_metric() key:'vfs.fs.size'
  6028:20220725:103507.082 executing in data process for key:'vfs.fs.size'
 61027:20220725:103507.083 End of zbx_execute_threaded_metric():SYSINFO_SUCCEED ''
 61027:20220725:103507.083 Sending back [30128128]
 61027:20220725:103507.083 zbx_setproctitle() title:'listener #1 [waiting for connection]'
 28968:20220725:103507.128 zbx_setproctitle() title:'listener #2 [processing request]'
 28968:20220725:103507.129 Requested [proc.num[,,,ifstated]]
 28968:20220725:103507.130 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0xae31a26f081]. Crashing ...
 28968:20220725:103507.130 ====== Fatal information: ======
 28968:20220725:103507.130 === Backtrace: ===
 28968:20220725:103507.131 1: 0xae0e398dee4 <zbx_backtrace+0x54> at /usr/local/sbin/zabbix_agentd
 28968:20220725:103507.131 0: 0xae0e398e05f <zbx_log_fatal_info+0x4f> at /usr/local/sbin/zabbix_agentd
 28968:20220725:103507.132 === Memory map: ===
 28968:20220725:103507.132 memory map not available for this platform
 28968:20220725:103507.132 ================================
 32280:20220725:103507.138 One child process died (PID:0,exitcode/signal:0). Exiting ...
 32280:20220725:103507.138 zbx_on_exit() called with ret:-1
 32280:20220725:103507.142 In zbx_unload_modules()
 32280:20220725:103507.142 End of zbx_unload_modules()
 32280:20220725:103507.142 Zabbix Agent stopped. Zabbix 6.2.0 (revision db780895726).
Comment by Mark [ 2022 Sep 01 ]

Any news on this?

I can confirm the issue persists with 6.2.2. It's crashing about once/day for several systems.

Comment by Vladislavs Sokurenko [ 2022 Sep 01 ]

Last metric was: 
Requested [proc.num[,,,ifstated]]
We should try to reproduce and get back to you

Comment by Juris Lambda (Inactive) [ 2022 Sep 01 ]

Is this on -release, -stable or -current?

Comment by Mark [ 2022 Sep 01 ]

I'm running on OpenBSD 7.2 -current now (so the upcoming 7.2 release), but it's been the same with 7.1 -current.

 

Thanks.

Comment by Juris Lambda (Inactive) [ 2022 Sep 02 ]

Hey, Uturn! As you're specifying only the cmdline parameter, which is a regex to match proc command-lines, can you check what other processes might be running on the boxes that could match it?

Also, just for the sake of completeness, (1) any fancy ifstated_flags in rc.conf.local and (2) is this built from the ports tree (I've checked net/zabbix/patches and there doesn't appear to be anything that should really affect the proc.num item)?

Comment by Mark [ 2022 Sep 02 ]

Hi Juris,

i'm using the default Zabbix OpenBSD template and it's only /usr/sbin/ifstated running...the one that comes with base. No special flags.

$ ps ax | grep ifstated
 3004 ??  IpU      0:00.00 /usr/sbin/ifstated
69772 p3  R+/2     0:00.00 grep ifstated

I've disabled checks for ifstated and will see if the daily crashes stop.

Comment by Juris Lambda (Inactive) [ 2022 Sep 02 ]

Hello again, Uturn!

Can you please apply the attached patch-src_libs_zbxsysinfo_openbsd_proc_c-6.2.0-trace.patch, rebuild the agent and run that, with the DebugLevel knob turned up to 5? I've sprinkled in some logging just to see the data that is being worked on when processing the proc.num item. It should give some insight for why this might be happening. 

Comment by Mark [ 2022 Sep 05 ]

Hi Juris,

i'm running with the patch now, reactivated monitoring of the ifstated service and btw...i didn't have any crashes in the last days with item 'proc.num[,,,ifstated]' deactivated.

Comment by Vladislavs Sokurenko [ 2022 Sep 05 ]

Could you please be so kind to enable 'proc.num[,,,ifstated]' and provide part of the log after:
Requested [proc.num[,,,ifstated]]

Is ifstated sometimes run with arguments and sometimes not, thus could cause race condition in proc_argv() function between sysctl() ?

Comment by Mark [ 2022 Sep 05 ]

Hi Vladislavs,

ifstated always runs without arguments in my configuration

95101:20220905:132707.619 Requested [proc.num[,,,ifstated]]
 95101:20220905:132707.619 [*] Parameter count validated
 95101:20220905:132707.619 [*] name: 0x42cd64dc310
 95101:20220905:132707.619 [*] user: 0x42cd64c54a0
 95101:20220905:132707.619 [*] state: 0x42cd64c5ce0
 95101:20220905:132707.619 [*] cmdline: 0x42cd64d5350
 95101:20220905:132707.619 [*] entered proc_argv
 95101:20220905:132707.619 [*]   argv:         0x7f7ffffdf328
 95101:20220905:132707.619 [*]  *argv:         0x0
 95101:20220905:132707.619 [*]   argv_alloc:   0x7f7ffffdf2e8 (0)
 95101:20220905:132707.619 [*]   argc:         0x7f7ffffdf324 (32639)
 95101:20220905:132707.619 [*] leaving proc_argv
 95101:20220905:132707.619 [*] entered collect_args
 95101:20220905:132707.619 [*]   argv:        0x42cd64cfec0
 95101:20220905:132707.619 [*]  *argv:        0x42cd64cfee0
 95101:20220905:132707.619 [*]   argc:        3
 95101:20220905:132707.619 [*]   args:        -134352
 95101:20220905:132707.619 [*]  *args:        0
 95101:20220905:132707.620 [*] procloop iter: 0
 95101:20220905:132707.620 [*] procloop iter: 1
 95101:20220905:132707.620 [*] procloop iter: 2
 95101:20220905:132707.620 [*] entered proc_argv
 95101:20220905:132707.620 [*]   argv:         0x7f7ffffdf328
 95101:20220905:132707.620 [*]  *argv:         0x42cd64cfec0
 95101:20220905:132707.620 [*]   argv_alloc:   0x7f7ffffdf2e8 (64)
 95101:20220905:132707.620 [*]   argc:         0x7f7ffffdf324 (3)
 95101:20220905:132707.620 [*] leaving proc_argv
 95101:20220905:132707.620 [*] entered collect_args
 95101:20220905:132707.620 [*]   argv:        0x42cd64d0d00
 95101:20220905:132707.620 [*]  *argv:        0x42cd64d0d20
 95101:20220905:132707.620 [*]   argc:        3
 95101:20220905:132707.620 [*]   args:        -134352
 95101:20220905:132707.620 [*]  *args:        -699599232
 
Comment by Vladislavs Sokurenko [ 2022 Sep 05 ]

Should try deleting redundant system calls and simply iterate while not NULL

diff --git a/src/libs/zbxsysinfo/openbsd/proc.c b/src/libs/zbxsysinfo/openbsd/proc.c
index 412c380f279..aa324ec8c99 100644
--- a/src/libs/zbxsysinfo/openbsd/proc.c
+++ b/src/libs/zbxsysinfo/openbsd/proc.c
@@ -172,12 +172,6 @@ retry:
                return FAIL;
        }
 
-       mib[3] = KERN_PROC_NARGV;
-
-       sz = sizeof(int);
-       if (0 != sysctl(mib, 4, argc, &sz, NULL, 0))
-               return FAIL;
-
        return SUCCEED;
 }
 
@@ -192,7 +186,7 @@ static void collect_args(char **argv, int argc, char **args, size_t *args_alloc)
                *args = zbx_malloc(*args, *args_alloc);
        }
 
-       for (i = 0; i < argc; i++)
+       while (NULL != argv[i])
                zbx_snprintf_alloc(args, args_alloc, &args_offset, "%s ", argv[i]);
 
        if (0 != args_offset)
Comment by Mark [ 2022 Sep 06 ]

Hey Vladislavs,

with your diff, zabbix_agentd dies after <1 min independently if proc.num[,,,ifstated] is activated or not

26504:20220906:090645.548 Starting Zabbix Agent [lynx]. Zabbix 6.2.2 (revision 35455866073).
 26504:20220906:090645.548 **** Enabled features ****
 26504:20220906:090645.549 IPv6 support:          YES
 26504:20220906:090645.549 TLS support:           YES
 26504:20220906:090645.549 **************************
 26504:20220906:090645.549 using configuration file: /etc/zabbix/zabbix_agentd.conf
 26504:20220906:090645.549 In zbx_load_modules()
 26504:20220906:090645.549 End of zbx_load_modules():SUCCEED
 26504:20220906:090645.549 In init_collector_data()
 26504:20220906:090645.549 End of init_collector_data()
 26504:20220906:090645.549 agent #0 started [main process]
 52311:20220906:090645.550 agent #1 started [collector]
 71417:20220906:090645.551 agent #2 started [listener #1]
  8494:20220906:090645.551 agent #4 started [listener #3]
 36918:20220906:090645.551 agent #3 started [listener #2]
 52311:20220906:090645.552 In init_cpu_collector()
  8494:20220906:090645.552 In zbx_tls_init_child()
.....
 75410:20220906:090706.780 End of need_meta_update():FAIL
 75410:20220906:090706.780 In send_buffer() host:'10.0.118.68' port:10051 entries:4/100
 75410:20220906:090706.780 send_buffer() now:1662448026 lastsent:1662448025 now-lastsent:1 BufferSend:5; will not send now
 75410:20220906:090706.780 End of send_buffer():SUCCEED
 75410:20220906:090706.780 End of process_active_checks()
 75410:20220906:090706.780 In get_min_nextcheck()
 75410:20220906:090706.780 End of get_min_nextcheck():1662448027
 75410:20220906:090706.780 zbx_setproctitle() title:'active checks #1 [idle 1 sec]'
 26504:20220906:090706.883 One child process died (PID:0,exitcode/signal:0). Exiting ...
 26504:20220906:090706.884 zbx_on_exit() called with ret:-1
 26504:20220906:090706.886 In zbx_unload_modules()
 26504:20220906:090706.886 End of zbx_unload_modules()
 26504:20220906:090706.886 Zabbix Agent stopped. Zabbix 6.2.2 (revision 35455866073). 
Comment by Vladislavs Sokurenko [ 2022 Sep 06 ]

It was draft of what should be deleted, patch will be provided soon it will be nice if you will be able to test it.

Comment by Juris Lambda (Inactive) [ 2022 Sep 06 ]

Hello, Uturn!

Good news, I've managed to fully reproduce the behaviour.

It does appear vso was right and there is some form of race between the two sysctl(2) calls, where in the first one we ask for the list of proc args (which gets written to our local argument vector), and the second asks explicitly for their count. We don't do this on other platforms.

I haven't spoken to tech@ yet, but I'm wondering if it's legal to expect data integrity across two sysctl(2) calls.

I replaced the second call with setting the argc value from my internal counter, and the issue stopped manifesting.

IMHO, we should just drop the extra sysctl(2) call and (1) determine the argv length in proc_argv ourselves, to maintain the interface, or (2) make callers iterate the returned argv themselves.

Comment by Juris Lambda (Inactive) [ 2022 Sep 09 ]

Hey, Uturn!

I cooked up a quick patch-src_libs_zbxsysinfo_openbsd_proc_c-6.0.5-fix.patch that you can use, based on 6.0.5, which also applies cleanly to 6.2.0 and current master.

Comment by Juris Lambda (Inactive) [ 2022 Sep 09 ]

Fixed in development branch feature/ZBX-21373-6.0 and available for review.

Comment by Mark [ 2022 Sep 09 ]

Hey Juris.

I'm running with your diff on 6.2.2 now and will report how it goes. Thanks.

UPDATE: no issues after 4 hours

UPDATE: no issues after 24 hours

UPDATE: everything's fine after 3 days

Comment by dimir [ 2022 Dec 12 ]

Successfully tested.

Comment by Juris Lambda (Inactive) [ 2022 Dec 13 ]

Fixed in versions:

Generated at Sun Jun 29 06:16:17 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.