[ZBX-9251] Too many open files / can't identify protocol Created: 2015 Jan 26  Updated: 2017 May 30  Resolved: 2015 Mar 24

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.4.3
Fix Version/s: 2.0.15rc1, 2.2.10rc1, 2.4.5rc1, 2.5.0

Type: Incident report Priority: Blocker
Reporter: Domi Barton Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: agent, filedescriptor
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian 7.7


Issue Links:
Duplicate

 Description   

My Zabbix notifications went wild, because all my items started flapping between "Not Supported" and "Normal".

After pinpointing the problem, I saw that there were too many open file descriptors on all my Zabbix agents:

open file descriptors
root@moros:~# lsof | grep zabbix | wc -l
3184
root@moros:~# lsof | grep zabbix | more
...
zabbix_ag  1946       zabbix   68u     sock                0,7      0t0     603651 can't identify protocol
zabbix_ag  1946       zabbix   69u     sock                0,7      0t0     614876 can't identify protocol
zabbix_ag  1946       zabbix   70u     sock                0,7      0t0     619853 can't identify protocol
zabbix_ag  1946       zabbix   71u     sock                0,7      0t0     631084 can't identify protocol
zabbix_ag  1946       zabbix   72u     sock                0,7      0t0     642314 can't identify protocol
...
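A per-process tally of these leaked sockets can be sketched with a short awk filter. To keep the sketch self-contained it runs against three lines saved from the lsof output above (`/tmp/lsof_sample.txt` is an illustrative path); on a live box you would pipe `lsof | grep zabbix` in directly:

```shell
# Tally "can't identify protocol" sockets per PID (field 2 of lsof output).
# The regex uses "." for the apostrophe so the script fits in single quotes.
cat <<'EOF' > /tmp/lsof_sample.txt
zabbix_ag  1946       zabbix   68u     sock                0,7      0t0     603651 can't identify protocol
zabbix_ag  1946       zabbix   69u     sock                0,7      0t0     614876 can't identify protocol
zabbix_ag  1946       zabbix   70u     sock                0,7      0t0     619853 can't identify protocol
EOF
leaks=$(awk '/can.t identify protocol/ {n[$2]++} END {for (p in n) print p, n[p]}' /tmp/lsof_sample.txt)
echo "$leaks"   # prints: 1946 3
```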

I'm using *Zabbix Agent 2.4.3-1* on Debian 7.7:

root@moros:~# lsb_release -ds
Debian GNU/Linux 7.7 (wheezy)

root@moros:~# dpkg -l zabbix-agent
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                               Version                        Architecture                   Description
+++-==================================================-==============================-==============================-==========================================================================================================
ii  zabbix-agent                                       1:2.4.3-1+wheezy               amd64                          network monitoring solution - agent

I'm using a *Zabbix MySQL Proxy version 2.4.3-1* as well on Debian 7.7.
The *Server is version 2.4.3-1* and is also running on Debian 7.7.

This never happened before, in the older versions.



 Comments   
Comment by Evgeny Molchanov [ 2015 Feb 09 ]

Same problem: the open file limit was reached and zabbix-agent stopped.

Ubuntu Server 12.04.5 LTS

dpkg -l zabbix-agent
ii zabbix-agent 1:2.4.3-1+precise network monitoring solution - agent

lsof | grep identify | grep zabb
zabbix_ag 1961 zabbix 7u sock 0,7 0t0 41843682 can't identify protocol
zabbix_ag 1961 zabbix 8u sock 0,7 0t0 41843683 can't identify protocol
zabbix_ag 1961 zabbix 9u sock 0,7 0t0 41852033 can't identify protocol
zabbix_ag 1961 zabbix 10u sock 0,7 0t0 41843702 can't identify protocol
zabbix_ag 1961 zabbix 11u sock 0,7 0t0 41843703 can't identify protocol
zabbix_ag 1961 zabbix 12u sock 0,7 0t0 41851650 can't identify protocol
zabbix_ag 1961 zabbix 13u sock 0,7 0t0 41851775 can't identify protocol
zabbix_ag 1961 zabbix 14u sock 0,7 0t0 41852280 can't identify protocol
zabbix_ag 1961 zabbix 15u sock 0,7 0t0 41852293 can't identify protocol
zabbix_ag 1961 zabbix 16u sock 0,7 0t0 41852296 can't identify protocol
zabbix_ag 1961 zabbix 17u sock 0,7 0t0 41851824 can't identify protocol

Comment by Domi Barton [ 2015 Feb 09 ]

Hey Zabbix guys,

is it possible that the sockets are not being closed properly?

Cheers
Domi

Comment by Bill James [ 2015 Feb 09 ]

We are seeing the same issue on Centos 5, 6, and 7.
zabbix-agent-2.4.3-1.el6.x86_64

Number of open files keeps increasing over time.

[root@puppet test manifests]# for i in `ps -ef|grep zabbix|awk '{print $2}'`; do echo -n "$i: "; lsof -p $i|wc -l; done
3085: 44
3087: 44
3088: 17835
3089: 18008
3090: 17970
3091: 44
3092: 45
13679: 0
[root@puppet test manifests]# ps -ef|grep zabbix
zabbix 3085 1 0 Jan09 ? 00:00:00 zabbix_agentd -c /etc/zabbix/zabbix_agentd.conf
zabbix 3087 3085 0 Jan09 ? 00:09:34 zabbix_agentd: collector [idle 1 sec]
zabbix 3088 3085 0 Jan09 ? 00:00:50 zabbix_agentd: listener #1 [waiting for connection]
zabbix 3089 3085 0 Jan09 ? 00:00:51 zabbix_agentd: listener #2 [waiting for connection]
zabbix 3090 3085 0 Jan09 ? 00:00:50 zabbix_agentd: listener #3 [waiting for connection]
zabbix 3091 3085 0 Jan09 ? 00:17:59 zabbix_agentd: active checks #1 [idle 1 sec]
zabbix 3092 3085 0 Jan09 ? 00:01:39 zabbix_agentd: active checks #2 [idle 1 sec]
root 13717 26564 0 11:56 pts/0 00:00:00 grep zabbix
[root@puppet test manifests]#

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 23 ]

Hi! I would appreciate it if you could answer the following questions to help us identify the problem:

  • are you using any of the web.* or net.* items?
  • do you have multiple network interfaces?
Comment by Domi Barton [ 2015 Feb 23 ]

Hi

Usually we only have one network interface on the servers, with one exception. But the problem occurs on all servers, even those with only one NIC.

No, we don't use any web.* items, but we're using the auto-discovered net.if.* items on all of our servers.

Cheers
Domi

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 23 ]

Hi, Domi. Thank you for the prompt response.

Are you using multiple Zabbix Servers and/or Zabbix Proxies?

Comment by Evgeny Molchanov [ 2015 Feb 23 ]

Hi, we don't use any web.* items.
I left only the default Linux template, but the problem with the number of open files still remains.
The host is monitored by a Zabbix proxy in active mode.
The agent also works in active mode.

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 23 ]

Evgeny, thanks for your reply.

Do you have multiple Zabbix Servers?

Comment by Domi Barton [ 2015 Feb 23 ]

We've 1 Zabbix server and 1 Zabbix proxy.
Most of the hosts are monitored by the Zabbix proxy.

Comment by Evgeny Molchanov [ 2015 Feb 23 ]

We've 1 Zabbix server and 2 Zabbix proxies.

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ]

What error do you get when an item becomes not supported?

In your Zabbix agent configuration file, what IP address(es) are specified in the Server and ServerActive fields? Do you have the IP of the proxy, or of both the proxy and the server?

Comment by Domi Barton [ 2015 Feb 24 ]

I get the notification because I've defined a global trigger with the following condition:

Event type = Item in "not supported" state

The Server directive in the agent config is defined via IP address, ServerActive isn't defined at all.
The Server directive in the proxy config is defined via hostname.

Fixing that to the hostname right now!

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ]

Domi, in the Zabbix Frontend, what error is displayed for the item that becomes not supported? This error might be found also in the agent's log file.

Comment by Domi Barton [ 2015 Feb 24 ]

Sorry my logfiles are already rotated and I don't have a backup of them :/

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ]

Domi, just to confirm, Server directive in the agent config has an IP address of Zabbix Proxy, right?

Comment by Domi Barton [ 2015 Feb 24 ]

Exactly, but I just changed the IP address to the hostname:

root@chaos:~# grep ^Server /etc/zabbix/zabbix_agentd.conf
Server=monitoring-proxy.confirm.ch
ServerActive=monitoring-proxy.confirm.ch
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ]

OK, is the number of open files still increasing now? Does it happen regularly or occasionally?

Comment by Domi Barton [ 2015 Feb 24 ]

Can't say yet, because I need to watch it for a few days.
What I see since the restart is:

root@apollo:~# lsof | grep '^zabbix_ag .* protocol$'
zabbix_ag  9411           zabbix    9u     sock                0,7       0t0    9478465 can't identify protocol
zabbix_ag  9412           zabbix    9u     sock                0,7       0t0    9467007 can't identify protocol
zabbix_ag  9412           zabbix   10u     sock                0,7       0t0    9493068 can't identify protocol
zabbix_ag  9413           zabbix    9u     sock                0,7       0t0    9472233 can't identify protocol
zabbix_ag  9413           zabbix   10u     sock                0,7       0t0    9486571 can't identify protocol

I'll check it again in a few hours, to see if there are more than 5 open sockets with the error. OK?

I think it happens regularly, but it needs a few days, because the default limits are quite high (at least for our use case).

zabbix@apollo:/$ ulimit -aH
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3878
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3878
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
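Whether a process is actually approaching that 4096 open-files cap can be checked without lsof: on Linux, /proc/<pid>/fd holds one entry per open descriptor, and /proc/<pid>/limits states the effective soft limit. A minimal sketch (the helper name `fd_usage` is made up for illustration):

```shell
# Report a process's current FD count against its "Max open files" soft
# limit, reading /proc directly (Linux-specific; helper name is illustrative).
fd_usage() {
    pid=$1
    nfds=$(ls "/proc/$pid/fd" | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$nfds/$limit"
}
fd_usage $$    # e.g. something like 4/1024 for the current shell
```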
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ]

Thank you, keep me updated if you get any news.

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ]

Domi, could you please run the strace command against the process that causes the problem? That can give us more information about the opened connections.

Example: strace -p PID -s 2048 -f

Additionally, a tcpdump capture file would be useful.

If there is no sensitive information (IP addresses, etc.), please attach the output of strace and tcpdump to this issue.

Comment by Domi Barton [ 2015 Feb 24 ]

Increasing...

root@apollo:~# lsof | grep '^zabbix_ag .* protocol$'
zabbix_ag  9411           zabbix    9u     sock                0,7       0t0    9478465 can't identify protocol
zabbix_ag  9411           zabbix   10u     sock                0,7       0t0    9954053 can't identify protocol
zabbix_ag  9411           zabbix   11u     sock                0,7       0t0    9967950 can't identify protocol
zabbix_ag  9411           zabbix   12u     sock                0,7       0t0    9985879 can't identify protocol
zabbix_ag  9411           zabbix   13u     sock                0,7       0t0    9998726 can't identify protocol
zabbix_ag  9411           zabbix   14u     sock                0,7       0t0   10032923 can't identify protocol
zabbix_ag  9411           zabbix   15u     sock                0,7       0t0   10039702 can't identify protocol
zabbix_ag  9411           zabbix   16u     sock                0,7       0t0   10046318 can't identify protocol
zabbix_ag  9411           zabbix   17u     sock                0,7       0t0   10057943 can't identify protocol
zabbix_ag  9412           zabbix    9u     sock                0,7       0t0    9467007 can't identify protocol
zabbix_ag  9412           zabbix   10u     sock                0,7       0t0    9493068 can't identify protocol
zabbix_ag  9412           zabbix   11u     sock                0,7       0t0    9961674 can't identify protocol
zabbix_ag  9412           zabbix   12u     sock                0,7       0t0    9974215 can't identify protocol
zabbix_ag  9412           zabbix   13u     sock                0,7       0t0    9979441 can't identify protocol
zabbix_ag  9412           zabbix   14u     sock                0,7       0t0   10013694 can't identify protocol
zabbix_ag  9412           zabbix   15u     sock                0,7       0t0   10020707 can't identify protocol
zabbix_ag  9412           zabbix   16u     sock                0,7       0t0   10052669 can't identify protocol
zabbix_ag  9412           zabbix   17u     sock                0,7       0t0   10064311 can't identify protocol
zabbix_ag  9412           zabbix   18u     sock                0,7       0t0   10076997 can't identify protocol
zabbix_ag  9413           zabbix    9u     sock                0,7       0t0    9472233 can't identify protocol
zabbix_ag  9413           zabbix   10u     sock                0,7       0t0    9486571 can't identify protocol
zabbix_ag  9413           zabbix   11u     sock                0,7       0t0    9992374 can't identify protocol
zabbix_ag  9413           zabbix   12u     sock                0,7       0t0   10006026 can't identify protocol
zabbix_ag  9413           zabbix   13u     sock                0,7       0t0   10027431 can't identify protocol
zabbix_ag  9413           zabbix   14u     sock                0,7       0t0   10070554 can't identify protocol

No, I can't give you the tcpdump and/or the strace.

Is this thing open source, and can I have a look at the source code?
There must be an exception or something like that where the socket is not closed successfully (just a guess).

If you can't reproduce it in your lab, I might be able to set up a new machine and deploy the default config on it via Ansible for your testing.

Cheers
Domi

Comment by Domi Barton [ 2015 Feb 24 ]

lsof:

zabbix_ag  9411           zabbix    9u     sock                0,7       0t0    9478465 can't identify protocol
zabbix_ag  9411           zabbix   10u     sock                0,7       0t0    9954053 can't identify protocol
zabbix_ag  9411           zabbix   11u     sock                0,7       0t0    9967950 can't identify protocol
zabbix_ag  9411           zabbix   12u     sock                0,7       0t0    9985879 can't identify protocol
zabbix_ag  9411           zabbix   13u     sock                0,7       0t0    9998726 can't identify protocol
zabbix_ag  9411           zabbix   14u     sock                0,7       0t0   10032923 can't identify protocol
zabbix_ag  9411           zabbix   15u     sock                0,7       0t0   10039702 can't identify protocol
zabbix_ag  9411           zabbix   16u     sock                0,7       0t0   10046318 can't identify protocol
zabbix_ag  9411           zabbix   17u     sock                0,7       0t0   10057943 can't identify protocol
zabbix_ag  9411           zabbix   18u     sock                0,7       0t0   10199702 can't identify protocol
zabbix_ag  9412           zabbix    9u     sock                0,7       0t0    9467007 can't identify protocol
zabbix_ag  9412           zabbix   10u     sock                0,7       0t0    9493068 can't identify protocol
zabbix_ag  9412           zabbix   11u     sock                0,7       0t0    9961674 can't identify protocol
zabbix_ag  9412           zabbix   12u     sock                0,7       0t0    9974215 can't identify protocol
zabbix_ag  9412           zabbix   13u     sock                0,7       0t0    9979441 can't identify protocol
zabbix_ag  9412           zabbix   14u     sock                0,7       0t0   10013694 can't identify protocol
zabbix_ag  9412           zabbix   15u     sock                0,7       0t0   10020707 can't identify protocol
zabbix_ag  9412           zabbix   16u     sock                0,7       0t0   10052669 can't identify protocol
zabbix_ag  9412           zabbix   17u     sock                0,7       0t0   10064311 can't identify protocol
zabbix_ag  9412           zabbix   18u     sock                0,7       0t0   10076997 can't identify protocol
zabbix_ag  9413           zabbix    9u     sock                0,7       0t0    9472233 can't identify protocol
zabbix_ag  9413           zabbix   10u     sock                0,7       0t0    9486571 can't identify protocol
zabbix_ag  9413           zabbix   11u     sock                0,7       0t0    9992374 can't identify protocol
zabbix_ag  9413           zabbix   12u     sock                0,7       0t0   10006026 can't identify protocol
zabbix_ag  9413           zabbix   13u     sock                0,7       0t0   10027431 can't identify protocol
zabbix_ag  9413           zabbix   14u     sock                0,7       0t0   10070554 can't identify protocol

The file descriptors are still open and link to sockets:

root@apollo:~# ls -la /proc/9411/fd
total 0
dr-x------ 2 root   root    0 Feb 24 15:51 .
dr-xr-xr-x 8 zabbix zabbix  0 Feb 24 15:51 ..
lr-x------ 1 root   root   64 Feb 24 15:51 0 -> /dev/null
l-wx------ 1 root   root   64 Feb 24 15:51 1 -> /var/log/zabbix/zabbix_agentd.log
lrwx------ 1 root   root   64 Feb 24 15:51 10 -> socket:[9954053]
lrwx------ 1 root   root   64 Feb 24 15:51 11 -> socket:[9967950]
lrwx------ 1 root   root   64 Feb 24 15:51 12 -> socket:[9985879]
lrwx------ 1 root   root   64 Feb 24 15:51 13 -> socket:[9998726]
lrwx------ 1 root   root   64 Feb 24 15:51 14 -> socket:[10032923]
lrwx------ 1 root   root   64 Feb 24 15:51 15 -> socket:[10039702]
lrwx------ 1 root   root   64 Feb 24 15:51 16 -> socket:[10046318]
lrwx------ 1 root   root   64 Feb 24 15:51 17 -> socket:[10057943]
lrwx------ 1 root   root   64 Feb 24 15:51 18 -> socket:[10199702]
l-wx------ 1 root   root   64 Feb 24 15:51 2 -> /var/log/zabbix/zabbix_agentd.log
l-wx------ 1 root   root   64 Feb 24 15:51 3 -> /run/zabbix/zabbix_agentd.pid
lr-x------ 1 root   root   64 Feb 24 15:51 4 -> pipe:[9460899]
lr-x------ 1 root   root   64 Feb 24 15:51 5 -> pipe:[9460932]
lrwx------ 1 root   root   64 Feb 24 15:51 6 -> socket:[9460955]
lrwx------ 1 root   root   64 Feb 24 15:51 7 -> socket:[9460956]
lrwx------ 1 root   root   64 Feb 24 15:51 9 -> socket:[9478465]

These must be orphan sockets, because, for example, for FD 9 (inode 9478465) there is no active TCP, UDP and/or UNIX socket/connection:

root@apollo:~# grep 9478465 /proc/net/{tcp,udp,unix}
root@apollo:~#
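The check above can be generalized: pull every socket inode a PID holds out of its fd listing, then flag inodes that appear in no /proc/net table. The sketch below extracts the inodes from two lines saved from the listing above (paths are illustrative); the live-system orphan check is shown as a comment since its output depends on the box:

```shell
# Extract socket inodes from a saved `ls -l /proc/PID/fd` listing.
cat <<'EOF' > /tmp/fd_sample.txt
lrwx------ 1 root   root   64 Feb 24 15:51 9 -> socket:[9478465]
lrwx------ 1 root   root   64 Feb 24 15:51 10 -> socket:[9954053]
EOF
inodes=$(sed -n 's/.*socket:\[\([0-9][0-9]*\)\].*/\1/p' /tmp/fd_sample.txt)
echo "$inodes"   # 9478465 and 9954053, one per line
# On a live system, an inode with no /proc/net entry is an orphan:
# for i in $inodes; do
#     grep -q "$i" /proc/net/tcp /proc/net/udp /proc/net/unix 2>/dev/null \
#         || echo "orphan socket inode: $i"
# done
```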

I also see a lot of TIME_WAITs:

root@apollo:~# netstat -anee | grep 10050.*WAIT
tcp        0      0 10.10.90.1:10050        10.10.90.100:35228      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34804      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34808      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34972      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34996      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35209      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35233      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34951      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35009      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34824      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35195      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34910      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35181      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35125      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35097      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35131      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35108      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35043      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34801      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35156      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34978      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35267      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35013      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35208      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34897      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34964      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35166      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35023      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34859      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35207      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35087      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35117      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35265      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34981      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35037      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35194      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34846      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35080      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35066      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34943      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35235      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34841      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35215      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34919      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34872      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35162      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34793      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35078      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34934      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34986      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34821      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35173      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34994      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35032      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35244      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35179      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34817      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35057      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34828      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35029      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35003      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35227      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35112      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34792      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:34957      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35257      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35187      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35141      TIME_WAIT   0          0          
tcp        0      0 10.10.90.1:10050        10.10.90.100:35052      TIME_WAIT   0          0

This means that the connection was closed by the Zabbix agent and the TCP stack is waiting for packets which still belong to this connection. Of course, this is default TCP/IP behaviour! Is the agent/server reusing the connection for several checks, or is it opening a new connection for each check?

Regardless of the TIME_WAIT entries, I don't think our problem has anything to do with them; I just noticed them, and they might not even be a problem (probably just works as designed in Zabbix).

I still think there is something wrong in the code, and if I had to guess, I would say a socket is opened but at some point is not closed properly (you see that a lot when an exception is thrown after the socket was opened and isn't caught properly).

Just my two cents...
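For what it's worth, the TIME_WAIT tally can be reproduced with a short awk filter over netstat output. To keep it self-contained, the sketch runs against two rows saved from the listing above (the file path is illustrative); on a live box you would pipe `netstat -an` in instead:

```shell
# Count TIME_WAIT entries whose local address is the agent port 10050.
# Fields: $4 = local address, $6 = TCP state.
cat <<'EOF' > /tmp/netstat_sample.txt
tcp        0      0 10.10.90.1:10050        10.10.90.100:35228      TIME_WAIT   0          0
tcp        0      0 10.10.90.1:10050        10.10.90.100:34804      TIME_WAIT   0          0
EOF
tw=$(awk '$4 ~ /:10050$/ && $6 == "TIME_WAIT" {n++} END {print n+0}' /tmp/netstat_sample.txt)
echo "$tw"   # 2 for this sample
```

Note that TIME_WAIT sockets consume a connection-table slot but no file descriptor, which is consistent with the observation that they are unrelated to the FD leak.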

Comment by Domi Barton [ 2015 Feb 25 ]

Guys, if you have a closer look at it, you can see that:

  • there is an open file descriptor (see /proc/<pid>/fd)
  • apparently it's a socket (/proc/<pid>/fd/<fd> is a link)
  • there is no open connection for this file descriptor (no entry in /proc/net/tcp)

Which means:

  • you're opening a socket (e.g. fsockopen()) somewhere in your application without ever connecting (which results in an open FD, but no connection / SYN packet seen by the kernel) or closing the FD (e.g. close())
  • you've opened a socket (e.g. fsockopen()), tried to connect, and it didn't work, but the FD wasn't closed (e.g. close())
  • you've opened a socket (e.g. fsockopen()), connected successfully and disconnected at some point (you or the remote side), without closing the socket afterwards (e.g. close())

So from my point of view, you're opening a socket without closing it.
Long story short: the FD for the socket is still open, regardless of the connection (which was never made, or was disconnected at some point).

This is just my point of view as a developer.

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 25 ]

Domi, thank you for the information and your comments. I'm looking at this issue now and will get back to you as soon as possible.

Comment by dimir [ 2015 Feb 25 ]

I don't have this problem (hitting the limit of open files), but here's how you could probably get more information. You'd need LogLevel=4. Here's a one-liner for any Zabbix daemon (adjust the logdir and pattern variables if needed):

$ logdir=/var/log/zabbix; pattern=zabbix_; while true; do sudo lsof | egrep "$pattern.*can't identify" | awk '{print $2}' | while read pid; do grep " $pid:" $logdir/$pattern*.log; done; sleep 1; done

I see the following output; however, the sockets seem to get closed immediately after the error is issued:

/tmp/zabbix_server.log: 28380:20150224:230757.405 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230842.417 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230843.418 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230843.482 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230914.550 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28415:20150224:221326.539 server #227 started [proxy poller #46]
/tmp/zabbix_server.log: 28415:20150224:230810.489 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28415:20150224:230811.570 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:221326.520 server #182 started [proxy poller #1]
/tmp/zabbix_server.log: 28370:20150224:225624.995 Error while receiving answer from proxy [Proxy1] [ZBX_TCP_READ() failed: [104] Connection reset by peer]
/tmp/zabbix_server.log: 28370:20150224:225704.001 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225704.065 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225705.066 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225705.130 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225706.131 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
...

$ lsof -u zabbix | wc -l
957
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 26 ]

Please also check what is returned by:

sysctl -a | grep time_wait
Comment by Domi Barton [ 2015 Feb 27 ]

120s - the default on Debian:

root@apollo:~# sysctl -a | grep time_wait
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 27 ]

Domi, ok, that's good!

Do you have an item monitoring a MAC address?

Please try to run the script posted earlier to get the error messages printed in the log file.

Comment by Domi Barton [ 2015 Feb 27 ]

The one-liner is grepping way too much information, because I'm running the agent at debug level 4 now.
But besides the amount of data, I don't have any "error" or "unable" strings in my (currently huge) logfile :/

What do you mean by "an item monitoring mac address"?

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 27 ]

Item system.hw.macaddr

Could you upload at least a small portion of the script output?

Comment by Domi Barton [ 2015 Feb 27 ]

Item system.hw.macaddr

yes sir, I use that one.

Could you upload at least a small portion of the script output?

No sir, because it won't help you.
If you like, I can upload the whole logfile; just PM me please.

Comment by Igors Homjakovs (Inactive) [ 2015 Feb 27 ]

I've found out that after processing that item (system.hw.macaddr) the socket is left open. I'm going to fix that shortly. However, if I understood correctly, in your case the number of open FDs is growing very fast. This means that there might be another problem causing it.

There is one more thing I want to clarify. Do you have any active items, like log[ ] or logrt[ ]?

No need to send the log file yet.

Comment by Igors Homjakovs (Inactive) [ 2015 Mar 02 ]

Fixed in svn://svn.zabbix.com/branches/dev/ZBX-9251

Comment by Andris Zeila [ 2015 Mar 03 ]

Successfully tested

Comment by Igors Homjakovs (Inactive) [ 2015 Mar 09 ]

Available in 2.4.5rc1 r52605 and 2.5.0 (trunk) r52607.

Comment by MATSUDA Daiki [ 2015 Mar 10 ]

I saw the fix in svn and think it is needed in the older versions 2.2, 2.0 and 1.8.

Comment by Igors Homjakovs (Inactive) [ 2015 Mar 10 ]

This fix will be also added to versions 2.0 and 2.2 soon.

Comment by Alexander Vladishev [ 2015 Mar 24 ]

Available in:

  • pre-2.0.15 r52857
  • pre-2.2.10 r52858
  • pre-2.4.5 r52605
  • pre-2.5.0 (trunk) r52607
Generated at Wed Apr 24 22:49:51 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.