[ZBX-15316] Windows zabbix agent: port 10050 is in listening state after stopping the service Created: 2018 Dec 13  Updated: 2024 Apr 10  Resolved: 2019 Feb 08

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: None
Fix Version/s: 3.0.26rc1, 4.0.5rc1, 4.2.0beta1, 4.2 (plan)

Type: Problem report Priority: Minor
Reporter: Sergey Lisikh Assignee: Alex Kalimulin
Resolution: Fixed Votes: 1
Labels: agent
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 2008 R2 x64


Attachments: JPEG File 2018.12.19-16.18.37.jpg     JPEG File 2018.12.19-16.42.30.jpg     Text File comms_changes_patch.txt     PNG File image-2018-12-13-17-32-44-968.png     PNG File image-2018-12-19-17-30-42-731.png     File netstat.log     PNG File proclist.png     File zabbix_agent.log    
Issue Links:
Causes
causes ZBX-15728 listener failed: WSASocket() for [[-]... Closed
Duplicate
Team: Team A
Team: Team A
Sprint: Sprint 47, Dec 2018, Sprint 48, Jan 2019, Sprint 49 (Feb 2019)
Story Points: 2

 Description   

We have several hundreds of windows hosts and on some of them after trying to stop/start "zabbix agent" port 10050 is not released and there are a lot of "time_wait" connections, so service coudn't be started until we reboot the machine. Via TcpView it is possible to close "time_wait" connections but not that which is in listening state and it shows <non-existent> process.

 

 

C:\Windows\system32>sc query "zabbix agent"
SERVICE_NAME: zabbix agent
 TYPE : 10 WIN32_OWN_PROCESS
 STATE : 1 STOPPED
 WIN32_EXIT_CODE : 1067 (0x42b)
 SERVICE_EXIT_CODE : 0 (0x0)
 CHECKPOINT : 0x0
 WAIT_HINT : 0x0
C:\Windows\system32>netstat -ano | findstr :10050
 TCP 0.0.0.0:10050 0.0.0.0:0 LISTENING 2728
 TCP [::]:10050 [::]:0 LISTENING 2728
C:\Windows\system32>taskkill /f /pid 2728
ERROR: The process "2728" not found.
c:\Program Files\Zabbix>zabbix_agentd.exe -V
zabbix_agentd Win64 (service) (Zabbix) 3.4.6
Revision 76819 15 January 2018, compilation time: Jan 15 2018 10:43:42

 

Steps to reproduce:

  1. sc stop "zabbix agent"
  2. sc start "zabbix agent"

Result:
Service is in stopped state. No zabbix agent process, but port 10050 is listening.
Expected:
Service is successfully restarted



 Comments   
Comment by Arturs Lontons [ 2018 Dec 13 ]

Hello and thank you for reporting the issue.
Could you provide the Zabbix agent log with Debug level - 4?

I don't think that the time_wait connections are the cause of this behavior, since they can stay open by design.

Comment by Sergey Lisikh [ 2018 Dec 13 ]

Hello. I'll try to reproduce this problem with higher debug lvl but it could be tricky. We've got about 40 of 600 agent instances on productive being stopped after last service restarting

Here is another screenshot showing almost 2 hundreds of CLOSE_WAIT (of course, not "TIME_WAIT") connections related to process 2568 which no longer exists

Comment by Arturs Lontons [ 2018 Dec 17 ]

Hi,
Does this issue happen on the same hosts or are the hosts random each time? Does the issue happen after every agent restart or randomly? Are there any specific user parameters or scripts on the hosts encountering the issue? How about specific firewall/antivirus configuration?

we would still require at least the agent logs to further troubleshoot the issue.

Comment by Sergey Lisikh [ 2018 Dec 17 ]

Hey!

It seems to happen on random hosts each time and not after every agent restart. I've scheduled a task on one of the last "broken" hosts for testing purposes. It restarts service with given interval - agent works fine up to now.

All hosts have the same configuration: working windows firewall with opened tcp/10050 and centrally controlled kaspersky security 10 (checked its events, didn't find any anomaly)

Trying catching the issue I cannot increase debug level on all of the hosts, it requires one more restart and we'll get a lot of unworking agents again, but I'm going to do it with a few of them and create restarting tasks with different intervals.

Comment by Vladislavs Sokurenko [ 2018 Dec 18 ]

Looks like listener thread child process was not terminated for some reason that should be investigated, does issue persists after upgrade to 3.4.15 ?

Comment by Sergey Lisikh [ 2018 Dec 19 ]

I tried to detect child processess last time, taking pid that was showing in "netstat -ano | findstr 10050" as a parent process:

wmic process where (ParentProcessId=2568) get Caption,ProcessId

It showed nothing.

 

Any chance issue will gone after upgrade to 3.4.15?

To do this, we need to restart a lot of agents for 2 times to check if issue still exists, and the first restart will definitely fail on some of them. It's very critical for business especially now on the eve of holidays. Upgrade could be possible after NY holidays.

Comment by Vladislavs Sokurenko [ 2018 Dec 19 ]

I don't think that upgrade will solve the issue. Does connecting to this port make agent stop ? You can use "net.tcp.service" key to connect.

Comment by Sergey Lisikh [ 2018 Dec 19 ]

I didn't try net.tcp.service, but we have "agent.ping" item (it makes port connection, isn't it?) and trigger that reported "agent is unavailable". I tried to connect via telnet - it is possible. But all of this didn't  lead to port closing, it had been in listen state untill Windows restarted.

Unfortunately, debug level was 3 and log just says that agent stopped and couldn't start again:

 2772:20181030:231554.837 using configuration file: C:\Program Files\Zabbix\zabbix_agentd.win.conf
 2772:20181030:231554.888 agent #0 started [main process]
 2784:20181030:231554.889 agent #1 started [collector]
 2788:20181030:231554.889 agent #2 started [listener #1]
 2792:20181030:231554.889 agent #3 started [listener #2]
 2796:20181030:231554.889 agent #4 started [listener #3]
 2732:20181212:182802.847 Zabbix Agent stopped. Zabbix 3.4.6 (revision 76819).
 9632:20181212:182803.008 Starting Zabbix Agent [Windows host]. Zabbix 3.4.6 (revision 76819).
 9632:20181212:182803.008 **** Enabled features ****
 9632:20181212:182803.008 IPv6 support: YES
 9632:20181212:182803.009 TLS support: NO
 9632:20181212:182803.009 **************************
 9632:20181212:182803.009 using configuration file: C:\Program Files\Zabbix\zabbix_agentd.win.conf
 9632:20181212:182803.011 listener failed: bind() for [[-]:10050] failed: [0x00002740] Only one usage of each socket address (protocol/network address/port) is normally permitted.
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ]

Did you have such message in the log ?
Zabbix Agent stopped.

Comment by Sergey Lisikh [ 2018 Dec 19 ]

Yes, I had. Take a look on my previous comment.

Comment by Vladislavs Sokurenko [ 2018 Dec 19 ]

I am sorry and in that comment which process owns listening socket after termination ? is it 2772 ?

Comment by Sergey Lisikh [ 2018 Dec 19 ]

 Got your thought! I've just caught this issue on of the hosts. Let's recup all information on live situation.

Here is new screenshot: service is stopped, process owning socket is different and cannot be investigated in proc list

Also, I misled you about telnet. Actually it is possible to make connection to port, and when I do this, netstat shows ESTABLISHED connection with the same PID 1324.

But no useful info from agent:

Comment by Vladislavs Sokurenko [ 2018 Dec 19 ]

Can you please also show output of task manager for this process ? It might be a good idea if zabbix would close this socket prior to exiting.

Also did process exit after telnet or still remains ? It would mean that closing socket when zabbix server is stopped can solve the issue.

Comment by Vladislavs Sokurenko [ 2018 Dec 19 ]

This might be related to ZBX-15027 so if issue is reproducible then I kindly ask you to upgrade to 3.4.14 after new year and see if it helps. And if trying to reproduce should also try 3.4.6 because 3.4.14 might behave differently.

Comment by Sergey Lisikh [ 2018 Dec 19 ]

Update!!!

I was going to make tasklist output and found working process "smartctl.exe...". I know that it is called via system.run[] item.

After killing smartctl via task manager port was closed!

 

 

Comment by Vladislavs Sokurenko [ 2018 Dec 19 ]

So current suspicion is that stopping zabbix agent while there is program in system.run can cause such behavior.
Need to create script that simply sleeps for ever, set Timeout to 30 seconds and try to reproduce the issue by stopping Zabbix agent at this moment.

Comment by Sergey Lisikh [ 2018 Dec 21 ]

Hi again,

I managed to reproduce the problem with debug lvl 4 on one of the servers.

Looked more attentively at the processes, which didn't let the port close. It was interesting to discover that the start time of these processess strongly differs from the time when service stopped. Also, they were suspended - it seems to be happened because of execution timeout.

For instance,

4968:20181220:165652.241 Requested [uHDD["/dev/sda",Temperature_Celsius]]

4968:20181220:165721.432 Sending back [ZBX_NOTSUPPORTED: Timeout while executing a shell script.]

2800:20181220:194704.428 Zabbix Agent stopped. Zabbix 3.4.6 (revision 76819)

Here is the full agent log, proclist screenshot with start time of suspended processess and netstat output. "smartctl.exe" is calling from Userparams and "rec_status.exe" is calling via system.run

zabbix_agent.log netstat.log 

vso unfortunately it's not clear why those processes are not stopped when timed out since Zabbix agent should terminate those.

Comment by Glebs Ivanovskis [ 2019 Jan 29 ]

Two typos in the comment: stronlystrongly.

Comment by Alex Kalimulin [ 2019 Feb 05 ]

Two typos in the comment: stronly→strongly.

Fixed, thank you!

Comment by Alex Kalimulin [ 2019 Feb 05 ]

Fixed in:

  • pre-3.0.26rc1 by r89186
  • pre-4.0.5rc1 by r89467
  • pre-4.2.0alpha4 by r89178
Generated at Sat Apr 20 13:30:14 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.