[ZBX-23941] Issue with TLS PSK connection from server to passive agents Created: 2024 Jan 14  Updated: 2025 May 15  Resolved: 2024 Feb 23

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G), Server (S)
Affects Version/s: 7.0.0alpha9
Fix Version/s: 7.0.0beta2, 7.0 (plan)

Type: Problem report Priority: Trivial
Reporter: Andrii Malyi Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Zabbix 7.0.0.a9 on RHEL8.9, DB: PG15 with TSDB 2.13
Latest agent version 6.4


Attachments: Text File agent-debug.txt    
Issue Links:
Duplicate
Team: Team A
Sprint: Sprint candidates, S24-W6/7
Story Points: 0.25

 Description   

We have strange floating issue with TLS PSK connectivity between server and agent 
It is new installation of zabbix 7.0.0.a9 on RHEL8.9, DB: PG15 with TSDB 2.13
Agents version - latest 6.4 (6.4.9,6.4.10, we also tried 6.0.25 LTS)
Passive agent configuration is used

TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=******-PSK-IDENTITY-*****
TLSPSKFile=/etc/zabbix/zabbix_agentd.psk

Time to time, in zabbix server log we wound following errors:

  2443:20240114:050602.443 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.443 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.444 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init
  2443:20240114:050602.445 SSL_shutdown() with 10.0.9.29 set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shut
down while in init

(10.0.9.29 - one of the zabbix agents)

Here are the corresponding records on the agent side

7528:20240114:051025.852 failed to accept an incoming connection: from 10.xx.xx.xx: unspecified certificate verification error: TLS handshake set result code to 5:
6800:20240114:063734.937 failed to accept an incoming connection: from 10.xx.xx.xx: unspecified certificate verification error: TLS handshake set result code to 5:
 152:20240114:063734.943 failed to accept an incoming connection: from 10.xx.xx.xx: unspecified certificate verification error: TLS handshake set result code to 5:

10.xx.xx.xx - ip of zabbix server

This issue appears on Linux hosts sometimes but most often this occurs on Windows servers.
In the attachment small part of agent's debug log.

This problem is floating, errors can be recorded in the log from several minutes to tens of minutes, the data seems to be collected, but given that I saw several crashes in the server log,  I can not say for sure whether this is related to the server crashes or no.
Additional thing that is no clear, why does the error refer to certificate verification when all TLS parameters configured to use PSK.



 Comments   
Comment by Vladislavs Sokurenko [ 2024 Jan 15 ]

Thank you for your report, there could be more information if debug log attached for the time when the problem has occurred, also it is possible that it is fixed in this commit:
https://git.zabbix.com/projects/ZBX/repos/zabbix/commits/964f51c0555e22404c1df7c6a0d1ef592a6cc84e

Comment by Andrii Malyi [ 2024 Jan 17 ]

I greatly appreciate your support and the quick response to my ticket. Thank you!
Am I correct in understanding that the changes you provided in the commit will be included in the next release (alpha10)? 
Also, I had the opportunity to catch the moment when this error occurred, and I managed to gather debug logs from both the agent and the server simultaneously.
Do you see any value in sending debug-logs to you for additional analysis?
Thank you again.

Comment by Vladislavs Sokurenko [ 2024 Jan 17 ]

It will be included in alpha10, please upgrade when available and let us know if issue still persists.

Comment by Andrii Malyi [ 2024 Jan 17 ]

Thank you!

Comment by Andrii Malyi [ 2024 Feb 01 ]

Hello team!

Unfortunately, I have upgraded to beta1 but issue still there.

Could you please provide me with an email address to which I can send you the debug logs that I have collected?

Comment by Vladislavs Sokurenko [ 2024 Feb 02 ]

It appears that issue is occurring when there is timeout during TLS handshake and we should suppress warning in that case, are you using Zabbix agent or Zabbix agent2, is it possible that they don't have enough listeners or capacity to handle all requests ?

20240116:171718.284 In agent_task_process() step 'connect' event:4 itemid:50439
573098:20240116:171718.285 agent_task_process() step 'tls' event:4 itemid:50439
...

573098:20240116:171722.209 In agent_task_process() step 'tls' event:1 itemid:50439
573098:20240116:171722.209 SSL_shutdown() with x.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutdown:shutdown while in init
573098:20240116:171722.209 In process_async_result() key:'perf_counter_en["\PhysicalDisk(10)\Disk Reads/sec",60]' host:'<host>' addr:'<addr>'
573098:20240116:171722.209 finished processing itemid:50439 processing:201
573098:20240116:171722.209 End of process_async_result():TIMEOUT_ERROR
573098:20240116:171722.209 End of async_event():ZBX_ASYNC_TASK_STOP

Please check how many pollers there was before the upgrade and set MaxConcurrentChecksPerPoller to the same size or increase StartAgents count so that they can process all checks and not timeout

Comment by Andrii Malyi [ 2024 Feb 02 ]

I'd hazard a guess that it is not related to poller parametres if only their default values are not enough to start.

zabbix_server.conf:

#StartPollers=5

#StartAgentPollers=1

#StartHTTPAgentPollers=1

#StartSNMPPollers=1

#MaxConcurrentChecksPerPoller=1000

#StartIPMIPollers=0

#StartPollersUnreachable=1

#StartHistoryPollers=5

#StartHTTPPollers=1

#StartJavaPollers=0

#StartProxyPollers=1

#StartODBCPollers=1

On the other hand, agents have default settings as well, so they start with 3 listeners.

Starting Zabbix Agent [******01]. Zabbix 6.4.10 (revision 4da16fb82f5).
  8992:20240114:023908.799 **** Enabled features ****
  8992:20240114:023908.803 IPv6 support:          YES
  8992:20240114:023908.807 TLS support:           YES
  8992:20240114:023908.811 **************************
  8992:20240114:023908.815 using configuration file: C:\Program Files\Zabbix Agent\zabbix_agentd.conf
  8992:20240114:023912.299 agent #0 started [main process]
   152:20240114:023912.307 agent #2 started listener #1
  7528:20240114:023912.311 agent #4 started listener #3
  1692:20240114:023912.316 agent #1 started [collector]
  6800:20240114:023912.323 agent #3 started listener #2

one of the zabbix_agent.confs'

#StartAgents=3

 

As I mentioned, it is a fresh installation with up to 20 agents. Currently, we have 2 Win2022, 4 Win2016, ~5 AIX, and ~5 RHEL8 servers. 
I've tried different agents - 5.4, 6.0.25LTS, and most often 6.4.9 - 6.4.10. Only one installation of agent2 we have (on the zabbix server).
Only common operating system monitoring is in place, so we are trying to collect 150-300 metrics from each host. 
Standard Linux/Windows/AIX Zabbix agent templates are used.

Nothing unusual...

Most often this issue appears on windows, but on linux host, I also find TLS errors

3227939:20240119:193024.585 Zabbix Agent stopped. Zabbix 6.4.10 (revision 4da16fb82f5).
  1394:20240119:193137.832 Starting Zabbix Agent [***linuxhost***]. Zabbix 6.4.10 (revision 4da16fb82f5).
  1394:20240119:193137.841 **** Enabled features ****
  1394:20240119:193137.841 IPv6 support:          YES
  1394:20240119:193137.841 TLS support:           YES
  1394:20240119:193137.841 **************************
  1394:20240119:193137.841 using configuration file: /etc/zabbix/zabbix_agentd.conf
  1394:20240119:193137.842 agent #0 started [main process]
  1399:20240119:193137.843 agent #2 started listener #1
  1398:20240119:193137.854 agent #1 started [collector]
  1400:20240119:193137.858 agent #3 started listener #2
  1401:20240119:193137.861 agent #4 started listener #3
  1400:20240119:234829.532 failed to accept an incoming connection: from 10.****: unencrypted connections are not allowed
  1401:20240120:002554.768 failed to accept an incoming connection: from 10.****: unspecified certificate verification error: TLS handshake set result code to 1: file ssl/record/
rec_layer_s3.c line 1497: error:140940CD:SSL routines:ssl3_read_bytes:invalid alert: TLS write fatal alert "unexpected_message"
  1400:20240120:013700.899 failed to accept an incoming connection: from 10.****: unencrypted connections are not allowed
  1399:20240121:011059.290 failed to accept an incoming connection: from 10.****: unencrypted connections are not allowed
  1400:20240121:011117.286 failed to accept an incoming connection: from 10.****: unencrypted connections are not allowed

Comment by Vladislavs Sokurenko [ 2024 Feb 02 ]

It appears on windows because some checks don't return data for long periods of time, thus they don't allow other checks to be processed

Comment by Vladislavs Sokurenko [ 2024 Feb 08 ]

(1) It should be documented that it's recommended to increase StartAgents if there is either high queue or network errors during passive checks as this may indicate that checks are slow and Zabbix agent is too busy.

Comment by Andrii Malyi [ 2024 Feb 08 ]

Hi team, just for information.
I've increased number of pollers on server side:
StartPollers=10
StartAgentPollers=5
StartPollersUnreachable=3
increased timeouts on server and agent side to 8
increased number of agent listeners StartAgents=8 (on windows host where the problem occurs most often, and updated zbx agent to the latest ver. 6.4.11 on this host)
But unfortunately it does not help.

I'm unable to increase the parameter MaxConcurrentChecksPerPoller as it has its max value  = 1000 by default.

Comment by Vladislavs Sokurenko [ 2024 Feb 09 ]

Please increase StartAgents on Zabbix agent

Comment by Andrii Malyi [ 2024 Feb 09 ]

 
It has already been done.

On the one of the Windows hosts, parameter StartAgents is set to 8, but it did not help, errors continue to occur. Should the StartAgents' parameter be increased more?
 
 

Comment by Vladislavs Sokurenko [ 2024 Feb 09 ]

Yes, 8 might not be enough to process 300 metrics in time

Comment by Andrii Malyi [ 2024 Feb 09 ]

wow, increased to 15

Comment by Andrii Malyi [ 2024 Feb 09 ]

agent log:

18104:20240209:102253.193 Zabbix Agent received stop request.
 18104:20240209:102254.223 Zabbix Agent stopped. Zabbix 6.4.11 (revision fb6439a26e3).
 37140:20240209:102255.278 Starting Zabbix Agent [TS-RDHO-06]. Zabbix 6.4.11 (revision fb6439a26e3).
 37140:20240209:102255.298 **** Enabled features ****
 37140:20240209:102255.322 IPv6 support:          YES
 37140:20240209:102255.347 TLS support:           YES
 37140:20240209:102255.373 **************************
 37140:20240209:102255.395 using configuration file: C:\Program Files\Zabbix Agent\zabbix_agentd.conf
 37140:20240209:102259.533 agent #0 started [main process]
 13804:20240209:102259.552 agent #1 started [collector]
 32084:20240209:102259.569 agent #2 started listener #1
 22640:20240209:102259.596 agent #11 started listener #10
  1432:20240209:102259.615 agent #4 started listener #3
 32716:20240209:102259.639 agent #5 started listener #4
 15788:20240209:102259.661 agent #6 started listener #5
 11896:20240209:102259.683 agent #7 started listener #6
    32:20240209:102259.700 agent #8 started listener #7
   184:20240209:102259.722 agent #9 started listener #8
 33672:20240209:102259.742 agent #10 started listener #9
 17172:20240209:102259.757 agent #3 started listener #2
 23304:20240209:102259.773 agent #12 started listener #11
 35036:20240209:102259.794 agent #13 started listener #12
 39052:20240209:102259.811 agent #14 started listener #13
 35640:20240209:102259.832 agent #15 started listener #14
 26096:20240209:102259.858 agent #16 started listener #15
 33672:20240209:102324.748 failed to accept an incoming connection: from 10.хх.хх.хх: unspecified certificate verification error: TLS handshake set result code to 5:
 35036:20240209:102324.774 failed to accept an incoming connection: from 10.хх.хх.хх: unspecified certificate verification error: TLS handshake set result code to 5:
    32:20240209:102324.804 failed to accept an incoming connection: from 10.хх.хх.хх unspecified certificate verification error: TLS handshake set result code to 5:
 35640:20240209:102325.752 failed to accept an incoming connection: from 10.хх.хх.хх: unspecified certificate verification error: TLS handshake set result code to 5:
  1432:20240209:102325.773 failed to accept an incoming connection: from 110.хх.хх.хх: unspecified certificate verification error: TLS handshake set result code to 5:
 33672:20240209:102325.794 failed to accept an incoming connection: from 110.хх.хх.хх: unspecified certificate verification error: TLS handshake set result code to 5:

//10.хх.хх.хх- IP of zabbix server

Server log:

1283925:20240209:102258.236 Zabbix agent item "perf_counter_en["\PhysicalDisk(3)\Avg. Disk Write Queue Length",60]" on host "ts-rdho-06"
failed: first network error, wait for 15 seconds
1283925:20240209:102314.256 resuming Zabbix agent checks on host "ts-rdho-06": connection restored
1283920:20240209:102323.238 Zabbix agent item "perf_counter_en["\PhysicalDisk(2)\Disk Writes/sec",60]" on host "ts-rdho-06" failed: first
 network error, wait for 15 seconds
1283920:20240209:102323.242 SSL_shutdown() with 10.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutd
own:shutdown while in init
1283920:20240209:102323.242 SSL_shutdown() with 10.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutd
own:shutdown while in init
1283920:20240209:102323.242 SSL_shutdown() with 10.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutd
own:shutdown while in init
1283920:20240209:102323.242 SSL_shutdown() with 10.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutd
own:shutdown while in init
1283920:20240209:102323.243 SSL_shutdown() with 10.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutd
own:shutdown while in init
1283920:20240209:102323.243 SSL_shutdown() with 10.x.x.x set result code to 1: file ssl/ssl_lib.c line 2094: error:140E0197:SSL routines:SSL_shutd
own:shutdown while in init

10.x.x.x - IP of ts-rdho-06

We just tried the option with StartAgent=25, but result the same, issue persist. Reverted to StartAgent=15

Comment by Vladislavs Sokurenko [ 2024 Feb 19 ]

Fixed in:

Generated at Sat Jun 14 17:44:48 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.