Steps to reproduce:
Unknown. Run the proxy and wait; occasionally it will crash. Across ~40 proxies with similar hardware and OS configurations, there have been 3 crashes in roughly a month since the most recent OS update (about 40 proxy-months / 3 crashes ≈ 13 proxy-months per crash), so the expected time between crashes on any individual proxy appears to be on the order of a year.
In other words, this is a rare concurrency bug.
Result:
The proxy will occasionally crash with, in our case:
|59823:20251010:014335.479 [Z3005] query failed: [0] database disk image is malformed [delete from proxy_history where id<128175613 and (write_clock<1760049815 or (id<=128175519))]|
|zabbix_proxy [59823]: [file:'db.c',line:1616] lock failed: [2] No such file or directory|
|52824:20251010:014335.489 One child process died (PID:59823,exitcode/signal:1). Exiting ...|
|zabbix_proxy [52824]: Error waiting for process with PID 59823: [10] No child processes|
|52824:20251010:014335.510 Zabbix Proxy stopped. Zabbix 7.0.6 (revision c1d7a081969). Fatal error 'mutex 0x83e9f9000 own 0x188e0 is on list 0x373bf40121a8 0x0' at line 151 in file /var/jenkins/workspace/pfSense-CE-snapshots-2_8_1-main/sources/FreeBSD-src-RELENG_2_8_1/lib/libthr/thread/thr_mutex.c (errno = 2)|
The code in which this crash occurs is reproduced below:
static void
mutex_assert_not_owned(struct pthread *curthread __unused,
    struct pthread_mutex *m __unused)
{
#if defined(_PTHREADS_INVARIANTS)
    if (__predict_false(m->m_qe.tqe_prev != NULL ||
        m->m_qe.tqe_next != NULL))
        PANIC("mutex %p own %#x is on list %p %p",
            m, m->m_lock.m_owner, m->m_qe.tqe_prev, m->m_qe.tqe_next);
    if (__predict_false(is_robust_mutex(m) &&
        (m->m_lock.m_rb_lnk != 0 || m->m_rb_prev != NULL ||
        (is_pshared_mutex(m) && curthread->robust_list ==
        (uintptr_t)&m->m_lock) ||
        (!is_pshared_mutex(m) && curthread->priv_robust_list ==
        (uintptr_t)&m->m_lock))))
        PANIC(
            "mutex %p own %#x is on robust linkage %p %p head %p phead %p",
            m, m->m_lock.m_owner, (void *)m->m_lock.m_rb_lnk,
            m->m_rb_prev, (void *)curthread->robust_list,
            (void *)curthread->priv_robust_list);
#endif
}
Note that this is from pfSense version 2.7.2, not 2.8.0, because Netgate has not yet published that source code. It may have been modified since, though the line numbers do match, so I do not assume it has.
I have no dump file because, when it does crash, I get only this in the kernel log:
pid 52824 (zabbix_proxy), jid 0, uid 122: exited on signal 6 (no core dump - bad address)
This is a low-level enough problem that it might be either a hardware or an OS issue. As for the hardware explanation (or a one-off incident such as a power spike), the crash has now happened on multiple machines at disparate times, so this is less likely.
I checked the SMART state of one of them and all checks pass.
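Inspecting it further, the first PANIC condition fires when either m->m_qe.tqe_prev or m->m_qe.tqe_next is non-NULL. The panic message prints them in that order as "0x373bf40121a8 0x0", so it appears that
m->m_qe.tqe_prev
inside a mutex in FreeBSD's thread library (libthr, which runs in userland rather than in the kernel) is somehow still set, while the library expects both link pointers to be NULL at that point, i.e. it expects the mutex not to be linked on any thread's list of owned mutexes.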
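To make the invariant behind that assertion concrete, here is a minimal sketch using the tail-queue macros from <sys/queue.h>. Going by the quoted assertion, the check fires when either link pointer is non-NULL. The names demo_mutex, demo_on_list, demo_init_link, demo_enqueue and demo_dequeue are hypothetical and only model the queue linkage; this is not the libthr implementation. It shows that an entry sitting at the tail of a queue has a non-NULL tqe_prev and a NULL tqe_next, which is the same shape as the "0x373bf40121a8 0x0" pair in the panic message.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct pthread_mutex; only the linkage is modeled. */
struct demo_mutex {
    TAILQ_ENTRY(demo_mutex) m_qe;
};

TAILQ_HEAD(demo_mutex_queue, demo_mutex);

/* Same test as the first PANIC condition: any non-NULL link means "on a list". */
static int demo_on_list(const struct demo_mutex *m)
{
    assert(m != NULL);
    return m->m_qe.tqe_prev != NULL || m->m_qe.tqe_next != NULL;
}

/* Clear the linkage so that the "not on a list" check passes. */
static void demo_init_link(struct demo_mutex *m)
{
    m->m_qe.tqe_prev = NULL;
    m->m_qe.tqe_next = NULL;
}

/* Link the mutex at the tail of an owner's queue. */
static void demo_enqueue(struct demo_mutex_queue *q, struct demo_mutex *m)
{
    TAILQ_INSERT_TAIL(q, m, m_qe);
}

/* Unlink the mutex and reset its pointers by hand (an assumption of this
 * sketch: TAILQ_REMOVE alone may leave stale pointer values behind). */
static void demo_dequeue(struct demo_mutex_queue *q, struct demo_mutex *m)
{
    TAILQ_REMOVE(q, m, m_qe);
    demo_init_link(m);
}
```

Under this model, anything that skips the dequeue step (or overwrites the structure) leaves the entry looking as if it were still on a list, which is the state the assertion reports.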
One way to cause this issue is to call pthread_mutex_destroy() on a mutex that is still owned by a process.
From https://pubs.opengroup.org/onlinepubs/007904875/functions/pthread_mutex_destroy.html : attempting to destroy a locked mutex results in undefined behavior. On BSD, this undefined behavior takes the form of an assertion that panics and crashes the offending program.
Of course, where and how Zabbix proxy is trying to destroy a locked mutex is not yet known (and not possible for me to trace/find out without the crash dump) ...
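For anyone auditing mutex teardown paths for this pattern, a defensive check could look roughly like the sketch below. The helper checked_mutex_destroy is hypothetical (it is not part of Zabbix or libthr) and assumes a plain, non-recursive mutex. It refuses to destroy a mutex that cannot be acquired right now, instead of walking into undefined behavior.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper: destroy a mutex only if it can be acquired right now.
 * Returns 0 on success, EBUSY if the mutex is currently locked, or another
 * error number reported by the pthread calls.
 */
static int checked_mutex_destroy(pthread_mutex_t *m)
{
    int rc;

    assert(m != NULL);

    /* A locked mutex makes trylock fail with EBUSY instead of blocking. */
    rc = pthread_mutex_trylock(m);
    if (rc != 0)
        return rc;

    /* We hold the lock now; release it before destroying. */
    rc = pthread_mutex_unlock(m);
    if (rc != 0)
        return rc;

    return pthread_mutex_destroy(m);
}
```

This is a diagnostic aid rather than a fix: another thread could still lock the mutex between the unlock and the destroy, so the real remedy is to make sure no other user of the mutex exists by the time it is torn down.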
A FreeBSD commit that might be related to the problem is https://github.com/pfsense/FreeBSD-src/commit/b370ef156ab9d88450e9bc0440df522aec88cc44 ; commit b370ef1
Expected:
No such mutex-related crashes occur.