[ZBX-15316] Windows zabbix agent: port 10050 is in listening state after stopping the service Created: 2018 Dec 13 Updated: 2024 Apr 10 Resolved: 2019 Feb 08 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | None |
Fix Version/s: | 3.0.26rc1, 4.0.5rc1, 4.2.0beta1, 4.2 (plan) |
Type: | Problem report | Priority: | Minor |
Reporter: | Sergey Lisikh | Assignee: | Alex Kalimulin |
Resolution: | Fixed | Votes: | 1 |
Labels: | agent | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Windows 2008 R2 x64 |
Attachments: |
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
||||||||||||
Issue Links: |
|
||||||||||||
Team: | |||||||||||||
Sprint: | Sprint 47, Dec 2018, Sprint 48, Jan 2019, Sprint 49 (Feb 2019) | ||||||||||||
Story Points: | 2 |
Description |
We have several hundreds of windows hosts and on some of them after trying to stop/start "zabbix agent" port 10050 is not released and there are a lot of "time_wait" connections, so service coudn't be started until we reboot the machine. Via TcpView it is possible to close "time_wait" connections but not that which is in listening state and it shows <non-existent> process.
C:\Windows\system32>sc query "zabbix agent" SERVICE_NAME: zabbix agent TYPE : 10 WIN32_OWN_PROCESS STATE : 1 STOPPED WIN32_EXIT_CODE : 1067 (0x42b) SERVICE_EXIT_CODE : 0 (0x0) CHECKPOINT : 0x0 WAIT_HINT : 0x0 C:\Windows\system32>netstat -ano | findstr :10050 TCP 0.0.0.0:10050 0.0.0.0:0 LISTENING 2728 TCP [::]:10050 [::]:0 LISTENING 2728 C:\Windows\system32>taskkill /f /pid 2728 ERROR: The process "2728" not found. c:\Program Files\Zabbix>zabbix_agentd.exe -V zabbix_agentd Win64 (service) (Zabbix) 3.4.6 Revision 76819 15 January 2018, compilation time: Jan 15 2018 10:43:42
Steps to reproduce:
Result: |
Comments |
Comment by Arturs Lontons [ 2018 Dec 13 ] |
Hello and thank you for reporting the issue. I don't think that the time_wait connections are the cause of this behavior, since they can stay open by design. |
Comment by Sergey Lisikh [ 2018 Dec 13 ] |
Hello. I'll try to reproduce this problem with higher debug lvl but it could be tricky. We've got about 40 of 600 agent instances on productive being stopped after last service restarting Here is another screenshot showing almost 2 hundreds of CLOSE_WAIT (of course, not "TIME_WAIT") connections related to process 2568 which no longer exists |
Comment by Arturs Lontons [ 2018 Dec 17 ] |
Hi, we would still require at least the agent logs to further troubleshoot the issue. |
Comment by Sergey Lisikh [ 2018 Dec 17 ] |
Hey! It seems to happen on random hosts each time and not after every agent restart. I've scheduled a task on one of the last "broken" hosts for testing purposes. It restarts service with given interval - agent works fine up to now. All hosts have the same configuration: working windows firewall with opened tcp/10050 and centrally controlled kaspersky security 10 (checked its events, didn't find any anomaly) Trying catching the issue I cannot increase debug level on all of the hosts, it requires one more restart and we'll get a lot of unworking agents again, but I'm going to do it with a few of them and create restarting tasks with different intervals. |
Comment by Vladislavs Sokurenko [ 2018 Dec 18 ] |
Looks like listener thread child process was not terminated for some reason that should be investigated, does issue persists after upgrade to 3.4.15 ? |
Comment by Sergey Lisikh [ 2018 Dec 19 ] |
I tried to detect child processess last time, taking pid that was showing in "netstat -ano | findstr 10050" as a parent process: wmic process where (ParentProcessId=2568) get Caption,ProcessId It showed nothing.
Any chance issue will gone after upgrade to 3.4.15? To do this, we need to restart a lot of agents for 2 times to check if issue still exists, and the first restart will definitely fail on some of them. It's very critical for business especially now on the eve of holidays. Upgrade could be possible after NY holidays. |
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ] |
I don't think that upgrade will solve the issue. Does connecting to this port make agent stop ? You can use "net.tcp.service" key to connect. |
Comment by Sergey Lisikh [ 2018 Dec 19 ] |
I didn't try net.tcp.service, but we have "agent.ping" item (it makes port connection, isn't it?) and trigger that reported "agent is unavailable". I tried to connect via telnet - it is possible. But all of this didn't lead to port closing, it had been in listen state untill Windows restarted. Unfortunately, debug level was 3 and log just says that agent stopped and couldn't start again:
2772:20181030:231554.837 using configuration file: C:\Program Files\Zabbix\zabbix_agentd.win.conf
2772:20181030:231554.888 agent #0 started [main process]
2784:20181030:231554.889 agent #1 started [collector]
2788:20181030:231554.889 agent #2 started [listener #1]
2792:20181030:231554.889 agent #3 started [listener #2]
2796:20181030:231554.889 agent #4 started [listener #3]
2732:20181212:182802.847 Zabbix Agent stopped. Zabbix 3.4.6 (revision 76819).
9632:20181212:182803.008 Starting Zabbix Agent [Windows host]. Zabbix 3.4.6 (revision 76819).
9632:20181212:182803.008 **** Enabled features ****
9632:20181212:182803.008 IPv6 support: YES
9632:20181212:182803.009 TLS support: NO
9632:20181212:182803.009 **************************
9632:20181212:182803.009 using configuration file: C:\Program Files\Zabbix\zabbix_agentd.win.conf
9632:20181212:182803.011 listener failed: bind() for [[-]:10050] failed: [0x00002740] Only one usage of each socket address (protocol/network address/port) is normally permitted.
|
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ] |
Did you have such message in the log ? |
Comment by Sergey Lisikh [ 2018 Dec 19 ] |
Yes, I had. Take a look on my previous comment. |
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ] |
I am sorry and in that comment which process owns listening socket after termination ? is it 2772 ? |
Comment by Sergey Lisikh [ 2018 Dec 19 ] |
Got your thought! I've just caught this issue on of the hosts. Let's recup all information on live situation. Here is new screenshot: service is stopped, process owning socket is different and cannot be investigated in proc list Also, I misled you about telnet. Actually it is possible to make connection to port, and when I do this, netstat shows ESTABLISHED connection with the same PID 1324. But no useful info from agent: |
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ] |
Can you please also show output of task manager for this process ? It might be a good idea if zabbix would close this socket prior to exiting. Also did process exit after telnet or still remains ? It would mean that closing socket when zabbix server is stopped can solve the issue. |
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ] |
This might be related to |
Comment by Sergey Lisikh [ 2018 Dec 19 ] |
Update!!! I was going to make tasklist output and found working process "smartctl.exe...". I know that it is called via system.run[] item. After killing smartctl via task manager port was closed!
|
Comment by Vladislavs Sokurenko [ 2018 Dec 19 ] |
So current suspicion is that stopping zabbix agent while there is program in system.run can cause such behavior. |
Comment by Sergey Lisikh [ 2018 Dec 21 ] |
Hi again, I managed to reproduce the problem with debug lvl 4 on one of the servers. Looked more attentively at the processes, which didn't let the port close. It was interesting to discover that the start time of these processess strongly differs from the time when service stopped. Also, they were suspended - it seems to be happened because of execution timeout. For instance, 4968:20181220:165652.241 Requested [uHDD["/dev/sda",Temperature_Celsius]] 4968:20181220:165721.432 Sending back [ZBX_NOTSUPPORTED: Timeout while executing a shell script.] 2800:20181220:194704.428 Zabbix Agent stopped. Zabbix 3.4.6 (revision 76819) Here is the full agent log, proclist screenshot with start time of suspended processess and netstat output. "smartctl.exe" is calling from Userparams and "rec_status.exe" is calling via system.run vso unfortunately it's not clear why those processes are not stopped when timed out since Zabbix agent should terminate those. |
Comment by Glebs Ivanovskis [ 2019 Jan 29 ] |
Comment by Alex Kalimulin [ 2019 Feb 05 ] |
Fixed, thank you! |
Comment by Alex Kalimulin [ 2019 Feb 05 ] |
Fixed in:
|