[#ZBX-25287] Zabbix Server Crash

Comment by haven gu [ 2024 Sep 25 ]

1. The Zabbix crash log

3273963:20240924:234149.119 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0x7]. Crashing ...
3273963:20240924:234149.119 ====== Fatal information: ======
3273963:20240924:234149.119 Program counter: 0x56240fb7e950
3273963:20240924:234149.120 === Registers: ===
3273963:20240924:234149.120 r8      =        562411c13 =          23123270675 =          23123270675
3273963:20240924:234149.120 r9      =     562411b5201a =       94712915894298 =       94712915894298
3273963:20240924:234149.120 r10     =         14fef808 =            352253960 =            352253960
3273963:20240924:234149.120 r11     =                7 =                    7 =                    7
3273963:20240924:234149.120 r12     =                0 =                    0 =                    0
3273963:20240924:234149.120 r13     =                2 =                    2 =                    2
3273963:20240924:234149.121 r14     =     7fff3e005790 =      140734233597840 =      140734233597840
3273963:20240924:234149.121 r15     =     7fff3e005170 =      140734233596272 =      140734233596272
3273963:20240924:234149.122 rdi     =     7fff3e005020 =      140734233595936 =      140734233595936
3273963:20240924:234149.122 rsi     =                1 =                    1 =                    1
3273963:20240924:234149.122 rbp     =     7fff3e035cb0 =      140734233795760 =      140734233795760
3273963:20240924:234149.122 rbx     =               81 =                  129 =                  129
3273963:20240924:234149.122 rdx     =                0 =                    0 =                    0
3273963:20240924:234149.122 rax     =                7 =                    7 =                    7
3273963:20240924:234149.122 rcx     =                0 =                    0 =                    0
3273963:20240924:234149.123 rsp     =     7fff3e0050f0 =      140734233596144 =      140734233596144
3273963:20240924:234149.123 rip     =     56240fb7e950 =       94712882522448 =       94712882522448
3273963:20240924:234149.123 efl     =            10246 =                66118 =                66118
3273963:20240924:234149.123 csgsfs  =   2b000000000033 =    12103423998558259 =    12103423998558259
3273963:20240924:234149.123 err     =                4 =                    4 =                    4
3273963:20240924:234149.123 trapno  =                e =                   14 =                   14
3273963:20240924:234149.125 oldmask =                0 =                    0 =                    0
3273963:20240924:234149.128 cr2     =                7 =                    7 =                    7
3273963:20240924:234149.130 === Backtrace: ===
3181872:20240924:234149.144 One child process died (PID:3273963,exitcode/signal:11). Exiting ...
3181872:20240924:234149.144 PROCESS EXIT: 3273963
3273553:20240924:234149.145 HA manager has been paused
3274634:20240924:234149.574 cannot write to IPC socket: Broken pipe
3274635:20240924:234149.576 cannot write to IPC socket: Broken pipe
zabbix_server [3181872]: Error waiting for process with PID 3273963: [10] No child processes
3273553:20240924:234149.637 HA manager has been stopped
3181872:20240924:234149.698 syncing history data...
3181872:20240924:234150.095 syncing history data... 100.000000%
3181872:20240924:234150.095 syncing history data done
3181872:20240924:234150.095 syncing trend data...
3181872:20240924:234230.983 syncing trend data done
3181872:20240924:234231.175 Zabbix Server stopped. Zabbix 6.0.17 (revision c81d82859a8).
3945667:20240924:234241.275 Starting Zabbix Server. Zabbix 6.0.17 (revision c81d82859a8).
3945667:20240924:234241.275 ****** Enabled features ******
3945667:20240924:234241.275 SNMP monitoring:           YES
3945667:20240924:234241.275 IPMI monitoring:           YES
3945667:20240924:234241.276 Web monitoring:            YES
3945667:20240924:234241.276 VMware monitoring:         YES
3945667:20240924:234241.276 SMTP authentication:       YES
3945667:20240924:234241.276 ODBC:                      YES
3945667:20240924:234241.276 SSH support:               YES
3945667:20240924:234241.276 IPv6 support:              YES
3945667:20240924:234241.276 TLS support:               YES
3945667:20240924:234241.277 ******************************
3945667:20240924:234241.277 using configuration file: /etc/zabbix/zabbix_server.conf
3945667:20240924:234241.412 current database version (mandatory/optional): 06000000/06000020
3945667:20240924:234241.413 required mandatory version: 06000000
3945668:20240924:234241.541 starting HA manager
3945668:20240924:234241.575 HA manager started in standby mode
3945667:20240924:234241.576 "pdczbxap01p" node started in "standby" mode
3945667:20240925:004241.551 "pdczbxap01p" node is working in "standby" mode
3945668:20240925:011000.543 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
 [begin;]
3945667:20240925:014241.543 "pdczbxap01p" node is working in "standby" mode
3945667:20240925:024241.543 "pdczbxap01p" node is working in "standby" mode
3945667:20240925:034241.543 "pdczbxap01p" node is working in "standby" mode
3945667:20240925:044241.543 "pdczbxap01p" node is working in "standby" mode

When the history poller is heavily used, I see some slow queries in the DB. Some queries took 949.829426 sec to complete.

2300192:20240925:041306.587 slow query: 4.001480 sec, "select auditid from auditlog where clock<1695672782 order by auditid limit 5000"
2300192:20240925:041309.651 slow query: 3.054834 sec, "select auditid from auditlog where clock<1695672782 order by auditid limit 5000"
2300192:20240925:051313.419 slow query: 3.244738 sec, "select auditid from auditlog where clock<1695676389 order by auditid limit 5000"
2300213:20240925:060025.508 slow query: 950.053324 sec, "select distinct r.druleid,r.iprange,r.name,c.dcheckid,r.proxy_hostid,r.delay from drules r left join dchecks c on c.druleid=r.druleid and c.uniq=1 where r.status=0 and r.nextcheck<=1727214275 and mod(r.druleid,30)=9"
2300192:20240925:061344.638 slow query: 3.824655 sec, "select auditid from auditlog where clock<1695680020 order by auditid limit 5000"
2300192:20240925:081400.171 slow query: 4.105848 sec, "select auditid from auditlog where clock<1695687235 order by auditid limit 5000"
2300192:20240925:081403.241 slow query: 3.045755 sec, "select auditid from auditlog where clock<1695687235 order by auditid limit 5000"
2300213:20240925:081701.603 slow query: 949.829426 sec, "select distinct r.druleid,r.iprange,r.name,c.dcheckid,r.proxy_hostid,r.delay from drules r left join dchecks c on c.druleid=r.druleid and c.uniq=1 where r.status=0 and r.nextcheck<=1727222471 and mod(r.druleid,30)=9"
2300192:20240925:091407.467 slow query: 3.836412 sec, "select auditid from auditlog where clock<1695690843 order by auditid limit 5000"
2300192:20240925:101414.783 slow query: 3.889972 sec, "select auditid from auditlog where clock<1695694450 order by auditid limit 5000"
2300192:20240925:101417.841 slow query: 3.033084 sec, "select auditid from auditlog where clock<1695694450 order by auditid limit 5000"
2300210:20240925:111706.851 slow query: 949.166381 sec, "begin;"
2300212:20240925:115005.219 slow query: 951.956315 sec, "select distinct r.druleid,r.iprange,r.name,c.dcheckid,r.proxy_hostid,r.delay from drules r left join dchecks c on c.druleid=r.druleid and c.uniq=1 where r.status=0 and r.nextcheck<=1727235253 and mod(r.druleid,30)=8"
2300212:20240925:130628.643 slow query: 950.042201 sec, "select distinct r.druleid,r.iprange,r.name,c.dcheckid,r.proxy_hostid,r.delay from drules r left join dchecks c on c.druleid=r.druleid and c.uniq=1 where r.status=0 and r.nextcheck<=1727239838 and mod(r.druleid,30)=8"
2300211:20240925:134458.787 slow query: 951.056993 sec, "begin;"
2300192:20240925:141452.274 slow query: 3.945263 sec, "select auditid from auditlog where clock<1695708888 order by auditid limit 5000"
2300210:20240925:143347.428 slow query: 949.943952 sec, "begin;"
2300246:20240925:145406.230 slow query: 30.970810 sec, "select clock,ns,value from history_uint where itemid=2729021 and clock>1726642415 and clock<=1727245185"
2300192:20240925:161522.296 slow query: 3.374543 sec, "select auditid from auditlog where clock<1695716115 order by auditid limit 5000"
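To see at a glance which processes hit the very long waits, the slow-query lines above can be summarized per PID. The following is only a minimal sketch: the regular expression is derived from the lines shown here, and the helper names `parse_slow_queries` and `worst_per_pid` are hypothetical, not part of Zabbix.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2300213:20240925:060025.508 slow query: 950.053324 sec, "select ..."
# (pattern inferred from the log excerpt above, not from Zabbix sources)
SLOW_QUERY = re.compile(
    r'^(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) '
    r'slow query: (?P<secs>\d+\.\d+) sec, "(?P<sql>.*)"$'
)

def parse_slow_queries(lines):
    """Return (pid, seconds, sql) tuples for every slow-query line."""
    found = []
    for line in lines:
        m = SLOW_QUERY.match(line.strip())
        if m:
            found.append((int(m["pid"]), float(m["secs"]), m["sql"]))
    return found

def worst_per_pid(entries):
    """Map each PID to its longest recorded query duration in seconds."""
    worst = defaultdict(float)
    for pid, secs, _sql in entries:
        worst[pid] = max(worst[pid], secs)
    return dict(worst)

# Three lines copied from the excerpt above.
sample = [
    '2300192:20240925:041306.587 slow query: 4.001480 sec, '
    '"select auditid from auditlog where clock<1695672782 order by auditid limit 5000"',
    '2300210:20240925:111706.851 slow query: 949.166381 sec, "begin;"',
    '2300210:20240925:143347.428 slow query: 949.943952 sec, "begin;"',
]
entries = parse_slow_queries(sample)
print(worst_per_pid(entries))
```

Run against the full log, this makes it easy to separate the 3 to 4 second auditlog queries of PID 2300192 from the queries that cluster around 950 seconds on other PIDs.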

Comment by Alexey Pustovalov [ 2024 Sep 26 ]

You use 6.0.17 Zabbix server:

3181872:20240924:234231.175 Zabbix Server stopped. Zabbix 6.0.17 (revision c81d82859a8).
3945667:20240924:234241.275 Starting Zabbix Server. Zabbix 6.0.17 (revision c81d82859a8).

Please update Zabbix server to the latest 6.0 version.

Comment by haven gu [ 2024 Sep 26 ]

dotneft

Thanks for the prompt reply.

Is this a known issue, and will an upgrade fix it?

Comment by Alexey Pustovalov [ 2024 Sep 26 ]

You did not provide full logs for the crash, but anyway, it is better to update Zabbix server to avoid already fixed crashes.

Comment by haven gu [ 2024 Sep 26 ]

This is all the crash log output I can get. Is there a setting I need to change to get full crash logs?

Comment by Alexey Pustovalov [ 2024 Sep 26 ]

Please attach the Zabbix server log file, not just part of it.

Comment by haven gu [ 2024 Sep 26 ]

The full log is large, so I have uploaded one hour of the log without any filtering.

Comment by Alexey Pustovalov [ 2024 Sep 26 ]

Please share:
1. cat zabbix_server.log | grep 3273963: | tail -n100
2. what exact version of OS?
3. Did you compile Zabbix server or use our packages?

Comment by haven gu [ 2024 Sep 26 ]

OS: Red Hat 8.7, kernel 4.18.0-425.19.2.el8_7.x86_64
The RPM packages come from Zabbix; I did not compile them.

zabbix-web-deps-6.0.17-release1.el8.noarch

zabbix-selinux-policy-6.0.17-release1.el8.x86_64
zabbix-get-6.0.17-release1.el8.x86_64
zabbix-web-6.0.17-release1.el8.noarch
zabbix-web-pgsql-6.0.17-release1.el8.noarch
zabbix-nginx-conf-6.0.17-release1.el8.noarch
zabbix-sql-scripts-6.0.17-release1.el8.noarch
zabbix-agent2-plugin-postgresql-6.0.17-release1.el8.x86_64
zabbix-agent2-6.0.17-release1.el8.x86_64
zabbix-server-pgsql-6.0.17-release1.el8.x86_64
zabbix-agent2-plugin-mongodb-6.0.17-release1.el8.x86_64
zabbix-release-6.0-4.el8.noarch

 # cat /var/log/zabbix/zabbix_server.log | grep 3273963: | tail -n100
3273963:20240924:084705.769 SNMP agent item "TempSensorReading[242]" on host "PDCESX32P" failed: first network error, wait for 15 seconds
3273963:20240924:100705.446 SNMP agent item "net.if.in[ifHCInOctets.8]" on host "SLP-HCM-SW01" failed: first network error, wait for 15 seconds
3273963:20240924:101131.705 SNMP agent item "huawei-server[syslogReceiverTest.2]" on host "PDCFRC02P" failed: first network error, wait for 15 seconds
3273963:20240924:102852.893 SNMP agent item "net.if.out.discards[419]" on host "GLP-CN0010h11-15" failed: first network error, wait for 15 seconds
3273963:20240924:113147.234 [ AWS RDS ] ERROR: Error: cannot get URL: Couldn't connect to server.
3273963:20240924:123310.312 Failed to execute command "/usr/lib/zabbix/externalscripts/checkdisk.sh 'Pr0l0gis' '10.191.251.241' '1.3.6.1.4.1.674.10892.5.5.1.20.130.4.1.4'": Timeout while executing a shell script.
3273963:20240924:123728.909 [ AWS RDS ] ERROR: Error: cannot get URL: Timeout was reached.
3273963:20240924:131024.046 SNMP agent item "TempSensorStatus[30]" on host "PDCESX34P" failed: first network error, wait for 15 seconds
3273963:20240924:140101.031 SNMP agent item "sensor.temp.value[cpqHeTemperatureCelsius."I/O.0.41"]" on host "PDCORA24T" failed: first network error, wait for 15 seconds
3273963:20240924:151028.387 SNMP agent item "TempSensorReading[41]" on host "PDCESX34P" failed: first network error, wait for 15 seconds
3273963:20240924:160129.276 SNMP agent item "TempSensorReading[38]" on host "PDCESX34P" failed: first network error, wait for 15 seconds
3273963:20240924:161611.744 SNMP agent item "MemStatus[12]" on host "PDCEVK8S05P" failed: first network error, wait for 15 seconds
3273963:20240924:163838.914 SNMP agent item "PSU_powertype[4]" on host "Dell-MX7000-2" failed: first network error, wait for 15 seconds
3273963:20240924:170611.948 SNMP agent item "net.if.speed[ifHighSpeed.634]" on host "10.128.1.8" failed: first network error, wait for 15 seconds
3273963:20240924:180842.698 SNMP agent item "MemStatus[1]" on host "PDCEVK8S02P" failed: first network error, wait for 15 seconds
3273963:20240924:191710.135 SNMP agent item "TempSensorReading[28]" on host "PDCESX36P" failed: first network error, wait for 15 seconds
3273963:20240924:194326.093 SNMP agent item "net.if.status[ifOperStatus.117]" on host "GLPSGSGSW002" failed: first network error, wait for 15 seconds
3273963:20240924:210312.224 SNMP agent item "TempSensorReading[6]" on host "PDCESX31P" failed: first network error, wait for 15 seconds
3273963:20240924:234149.119 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0x7]. Crashing ...
3273963:20240924:234149.119 ====== Fatal information: ======
3273963:20240924:234149.119 Program counter: 0x56240fb7e950
3273963:20240924:234149.120 === Registers: ===
3273963:20240924:234149.120 r8      =        562411c13 =          23123270675 =          23123270675
3273963:20240924:234149.120 r9      =     562411b5201a =       94712915894298 =       94712915894298
3273963:20240924:234149.120 r10     =         14fef808 =            352253960 =            352253960
3273963:20240924:234149.120 r11     =                7 =                    7 =                    7
3273963:20240924:234149.120 r12     =                0 =                    0 =                    0
3273963:20240924:234149.120 r13     =                2 =                    2 =                    2
3273963:20240924:234149.121 r14     =     7fff3e005790 =      140734233597840 =      140734233597840
3273963:20240924:234149.121 r15     =     7fff3e005170 =      140734233596272 =      140734233596272
3273963:20240924:234149.122 rdi     =     7fff3e005020 =      140734233595936 =      140734233595936
3273963:20240924:234149.122 rsi     =                1 =                    1 =                    1
3273963:20240924:234149.122 rbp     =     7fff3e035cb0 =      140734233795760 =      140734233795760
3273963:20240924:234149.122 rbx     =               81 =                  129 =                  129
3273963:20240924:234149.122 rdx     =                0 =                    0 =                    0
3273963:20240924:234149.122 rax     =                7 =                    7 =                    7
3273963:20240924:234149.122 rcx     =                0 =                    0 =                    0
3273963:20240924:234149.123 rsp     =     7fff3e0050f0 =      140734233596144 =      140734233596144
3273963:20240924:234149.123 rip     =     56240fb7e950 =       94712882522448 =       94712882522448
3273963:20240924:234149.123 efl     =            10246 =                66118 =                66118
3273963:20240924:234149.123 csgsfs  =   2b000000000033 =    12103423998558259 =    12103423998558259
3273963:20240924:234149.123 err     =                4 =                    4 =                    4
3273963:20240924:234149.123 trapno  =                e =                   14 =                   14
3273963:20240924:234149.125 oldmask =                0 =                    0 =                    0
3273963:20240924:234149.128 cr2     =                7 =                    7 =                    7
3273963:20240924:234149.130 === Backtrace: ===
zabbix_server [3181872]: Error waiting for process with PID 3273963: [10] No child processes
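
The register dump above can be read a little further even though the backtrace is empty. On x86-64, `trapno` 14 is the page-fault vector, `cr2` holds the faulting address, and `err` is the page-fault error code. The sketch below decodes that code using the architectural bit definitions; the helper name `decode_page_fault` is hypothetical and only meant for illustration.

```python
# x86 page-fault error code bits (architectural definition).
# Each entry maps bit number -> (meaning when clear, meaning when set).
PF_BITS = {
    0: ("page not present", "protection violation"),
    1: ("read access", "write access"),
    2: ("kernel mode", "user mode"),
    3: (None, "reserved bit set in page table"),
    4: (None, "instruction fetch"),
}

def decode_page_fault(err):
    """Return a list of human-readable facts encoded in the error code."""
    facts = []
    for bit, (when_clear, when_set) in PF_BITS.items():
        desc = when_set if err & (1 << bit) else when_clear
        if desc:
            facts.append(desc)
    return facts

# err = 4 is the value reported in the dump above.
print(decode_page_fault(0x4))
# ['page not present', 'read access', 'user mode']
```

With `err` = 4 and `cr2` = 7, the process attempted a user-mode read of the unmapped address 0x7, which matches `reason:1` (SEGV_MAPERR) and `refaddr:0x7` in the first line of the dump. That is consistent with dereferencing an invalid, very small pointer value, but it does not by itself identify the faulting function.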

Comment by Tomasz Grzechulski [ 2024 Sep 26 ]

Hello Haven,

Considering the logs provided, could you also provide logs from 24.09, starting from at least 21:30?

Also, you are using RHEL 8.7, which is quite old, as is your Zabbix version. Is there a reason for that?

Would you consider upgrading both?

Thank you,

Regards,

Tomasz

Comment by haven gu [ 2024 Sep 27 ]

zabbix02.log

Yes, I will consider upgrading Zabbix to the latest 6.0.x version. In general, I only consider upgrading when I encounter a security risk or a bug, and this installation has been running very stably for over a year.

Because Zabbix runs in high availability mode, a crash of the master node fails over to the standby node and has little impact. But from 00:00 onwards the history syncer suddenly became 100% busy, and during that time data could not be written to the database. This led to a large number of "agent unreachable" alerts. At the same time a few SQL queries were slow, reaching 900 s, yet when I checked the database its load was not very high. I am confused about two things:
1. Why was history sync suddenly triggered?
2. NVPS in my environment often reaches 4000, and the Zabbix docs suggest that the default of 4 is enough up to that level. Do I need to adjust this value?
StartDBSyncers
The number of pre-forked instances of history syncers.
Note: Be careful when changing this value, increasing it may do more harm than good. Roughly, the default value should be enough to handle up to 4000 NVPS.

Comment by Tomasz Grzechulski [ 2024 Sep 27 ]

Hello Haven,

Thank you for the logs provided.

I have one question. By default, after Zabbix is installed, the host "Zabbix Server" has two templates: "Zabbix server health" and "Linux by Zabbix Agent".

If you have those templates linked to your Zabbix Server host, could you please also provide graphs of compute resources such as CPU, memory, swap, etc.?

The Zabbix crash is possibly related to a lack of memory, but to confirm that we would need to look at graphs of the metrics I mentioned above.

History sync started running shortly after the crash of the process with PID 3273963:

3273963:20240924:234149.119 Got signal [signal:11(SIGSEGV),reason:1,refaddr:0x7]. Crashing ...
3273553:20240924:234149.145 HA manager has been paused
3274634:20240924:234149.574 cannot write to IPC socket: Broken pipe
3274635:20240924:234149.576 cannot write to IPC socket: Broken pipe
zabbix_server [3181872]: Error waiting for process with PID 3273963: [10] No child processes
3273553:20240924:234149.637 HA manager has been stopped
3181872:20240924:234149.698 syncing history data...
3181872:20240924:234150.095 syncing history data... 100.000000%
3181872:20240924:234150.095 syncing history data done
3181872:20240924:234150.095 syncing trend data...
3181872:20240924:234230.983 syncing trend data done
3181872:20240924:234231.175 Zabbix Server stopped. Zabbix 6.0.17 (revision c81d82859a8).
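
The timing of each shutdown step can be read directly from the `PID:YYYYMMDD:HHMMSS.mmm` prefix of these lines. As a small sketch (the helper `parse_zbx_time` is hypothetical, and it only handles lines that start with this prefix, not lines such as `zabbix_server [3181872]: ...`):

```python
from datetime import datetime

def parse_zbx_time(line):
    """Parse the 'PID:YYYYMMDD:HHMMSS.mmm' prefix of a Zabbix log line."""
    pid, date, rest = line.split(":", 2)
    clock = rest.split(" ", 1)[0]  # e.g. '234150.095'
    return int(pid), datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f")

# Two lines copied from the excerpt above.
start = "3181872:20240924:234150.095 syncing trend data..."
end = "3181872:20240924:234230.983 syncing trend data done"

_, t_start = parse_zbx_time(start)
_, t_end = parse_zbx_time(end)
elapsed = (t_end - t_start).total_seconds()
print(f"trend sync took {elapsed:.3f} s")
```

For the excerpt above this gives roughly 40.9 seconds for the trend sync, compared with about 0.4 seconds for the history sync that preceded it.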

If possible, please share graphs of the metrics I mentioned above.

Regarding agents reporting "unreachable": we would need a better understanding of your whole Zabbix installation (including the DB engine used).

Thank you,

Regards,

Tomasz

Comment by haven gu [ 2024 Sep 29 ]

The database cluster versions are:

pgpool 4.4.2

postgresql 15.2

timescaledb 2.10.3

Comment by Tomasz Grzechulski [ 2024 Sep 30 ]

Hello Haven,

Thank you for the attached graphs. Sorry to bother you again, I did not mention this earlier: if possible, please share graphs covering at least the past month.

Since it is already the end of September, you could, for example, share graphs from the start to the end of September.

Thank you,

Regards,

Tomasz

Comment by haven gu [ 2024 Oct 12 ]

Sorry for the late reply. The graphs for the last 30 days are below.

Comment by haven gu [ 2024 Dec 20 ]

tgrzechulski This issue has occurred again. Based on these logs, is it possible to analyze what the problem is? Is it a known problem?

[ZBX-25779] Zabbix Server crash - ZABBIX SUPPORT

[ZBX-25287] Zabbix Server Crash Created: 2024 Sep 25 Updated: 2024 Dec 20 Resolved: 2024 Oct 08
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	None
Affects Version/s:	6.0.32
Fix Version/s:	None
