[ZBX-9251] Too many open files / can't identify protocol Created: 2015 Jan 26 Updated: 2017 May 30 Resolved: 2015 Mar 24 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 2.4.3 |
Fix Version/s: | 2.0.15rc1, 2.2.10rc1, 2.4.5rc1, 2.5.0 |
Type: | Incident report | Priority: | Blocker |
Reporter: | Domi Barton | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 1 |
Labels: | agent, filedescriptor | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Debian 7.7 |
Issue Links: |
|
Description |
My Zabbix notifications went wild because all my items started flapping between "Not Supported" and "Normal". After pinpointing the problem, I saw that there are too many open file descriptors on all my Zabbix agents:
root@moros:~# lsof | grep zabbix | wc -l
3184
root@moros:~# lsof | grep zabbix | more
...
zabbix_ag 1946 zabbix 68u sock 0,7 0t0 603651 can't identify protocol
zabbix_ag 1946 zabbix 69u sock 0,7 0t0 614876 can't identify protocol
zabbix_ag 1946 zabbix 70u sock 0,7 0t0 619853 can't identify protocol
zabbix_ag 1946 zabbix 71u sock 0,7 0t0 631084 can't identify protocol
zabbix_ag 1946 zabbix 72u sock 0,7 0t0 642314 can't identify protocol
...
I'm using *Zabbix Agent 2.4.3-1* on Debian 7.7:
root@moros:~# lsb_release -ds
Debian GNU/Linux 7.7 (wheezy)
root@moros:~# dpkg -l zabbix-agent
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==================================================-==============================-==============================-==========================================================================================================
ii zabbix-agent 1:2.4.3-1+wheezy amd64 network monitoring solution - agent
I'm also using a *Zabbix MySQL Proxy version 2.4.3-1* on Debian 7.7. This never happened in the older versions. |
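A quick way to see which agent processes hold the leaked descriptors is to group the "can't identify protocol" rows by PID. The sketch below is one possible approach and not part of the original report; `count_orphans` is a hypothetical helper name, and it assumes the lsof column layout shown above (COMMAND in field 1, PID in field 2).

```shell
#!/bin/sh
# Summarise lsof text read from stdin: count "can't identify protocol"
# rows per COMMAND/PID pair. Assumes COMMAND is field 1 and PID is field 2.
count_orphans() {
    awk '/can.t identify protocol/ { n[$1 " " $2]++ }
         END { for (k in n) print k, n[k] }' | sort -k2,2n
}

# Example on a live host (usually needs root to see other users' descriptors):
#   lsof | grep zabbix | count_orphans
```

Each output row has the form `COMMAND PID COUNT`, sorted by PID.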
Comments |
Comment by Evgeny Molchanov [ 2015 Feb 09 ] |
Same problem: the open file limit was exhausted and zabbix-agent stopped. Ubuntu Server 12.04.5 LTS
dpkg -l zabbix-agent
lsof | grep identify | grep zabb |
Comment by Domi Barton [ 2015 Feb 09 ] |
Hey Zabbix guys, is it possible that the sockets will not be closed properly? Cheers |
Comment by Bill James [ 2015 Feb 09 ] |
We are seeing the same issue on CentOS 5, 6, and 7. The number of open files keeps increasing over time.
[root@puppet test manifests]# for i in `ps -ef|grep zabbix|awk '{print $2}'`; do echo -n "$i: ";lsof -p $i|wc -l; done |
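The loop above counts lsof rows, which also include memory-mapped files and the lsof header line. If only file descriptors matter, reading /proc/<pid>/fd directly is a lighter alternative. A minimal sketch, assuming Linux with /proc mounted; `fd_count` is a hypothetical helper name:

```shell
#!/bin/sh
# Print "PID: N" for each PID argument, where N is the number of entries
# in /proc/PID/fd (one entry per open file descriptor).
# PIDs without a readable /proc entry are skipped silently.
fd_count() {
    for pid in "$@"; do
        [ -d "/proc/$pid/fd" ] || continue
        n=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
        echo "$pid: $((n))"
    done
}

# Example (run as root or as the zabbix user so the fd directory is readable):
#   fd_count $(pgrep zabbix)
```

Running it periodically and comparing the numbers shows whether a particular process keeps growing.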
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 23 ] |
Hi! I would appreciate it if you could answer the following questions to help us identify the problem:
|
Comment by Domi Barton [ 2015 Feb 23 ] |
Hi. Usually we only have one network interface on the servers, with one exception. But the problem does occur on all servers, even if they have only one NIC. No, we don't use any web.* items, but we're using the auto-discovered net.if.* items on all of our servers. Cheers |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 23 ] |
Hi, Domi. Thank you for the prompt response. Are you using multiple Zabbix Servers and/or a Zabbix Proxy? |
Comment by Evgeny Molchanov [ 2015 Feb 23 ] |
Hi, we don't use any web.* items. |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 23 ] |
Evgeny, thank you for your reply. Do you have multiple Zabbix Servers? |
Comment by Domi Barton [ 2015 Feb 23 ] |
We've 1 Zabbix server and 1 Zabbix proxy. |
Comment by Evgeny Molchanov [ 2015 Feb 23 ] |
We've 1 Zabbix server and 2 Zabbix proxies. |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ] |
What error do you get when an item becomes not supported? In your Zabbix agent configuration file, which IP address(es) are specified in the Server and ServerActive fields? Do you have the IP of the proxy, or of both the proxy and the server? |
Comment by Domi Barton [ 2015 Feb 24 ] |
I get the notification because I've defined a global trigger with the following condition:
Event type = Item in "not supported" state
The Server directive in the agent config is defined via IP address, ServerActive isn't defined at all. Fixing that to the hostname right now! |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ] |
Domi, in the Zabbix Frontend, what error is displayed for the item that becomes not supported? This error might be found also in the agent's log file. |
Comment by Domi Barton [ 2015 Feb 24 ] |
Sorry my logfiles are already rotated and I don't have a backup of them :/ |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ] |
Domi, just to confirm, the Server directive in the agent config has the IP address of the Zabbix Proxy, right? |
Comment by Domi Barton [ 2015 Feb 24 ] |
Exactly, but I just changed the IP address to the hostname:
root@chaos:~# grep ^Server /etc/zabbix/zabbix_agentd.conf
Server=monitoring-proxy.confirm.ch
ServerActive=monitoring-proxy.confirm.ch |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ] |
OK, is the number of open files increasing now? Does it happen regularly or occasionally? |
Comment by Domi Barton [ 2015 Feb 24 ] |
I can't say yet, because I need to watch it for a few days.
root@apollo:~# lsof | grep '^zabbix_ag .* protocol$'
zabbix_ag 9411 zabbix 9u sock 0,7 0t0 9478465 can't identify protocol
zabbix_ag 9412 zabbix 9u sock 0,7 0t0 9467007 can't identify protocol
zabbix_ag 9412 zabbix 10u sock 0,7 0t0 9493068 can't identify protocol
zabbix_ag 9413 zabbix 9u sock 0,7 0t0 9472233 can't identify protocol
zabbix_ag 9413 zabbix 10u sock 0,7 0t0 9486571 can't identify protocol
I'll check it again in a few hours to see whether there are more than 5 open sockets with the error. OK? I think it happens regularly, but it takes a few days, because the default limits are quite high (at least for our use case).
zabbix@apollo:/$ ulimit -aH
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3878
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 3878
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited |
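Note that `ulimit -aH` prints the hard limits of the current shell; a running daemon is held to its own soft limit, which Linux exposes in /proc/<pid>/limits. A small sketch, assuming that file's usual "Max open files <soft> <hard> files" row; `soft_nofile` and `fd_headroom` are hypothetical names:

```shell
#!/bin/sh
# Extract the soft "Max open files" value from /proc/<pid>/limits text on stdin.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Report how many descriptors a process uses next to its soft limit.
# Prints used=0 and soft_limit=unknown if the process cannot be inspected.
fd_headroom() {
    pid=$1
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    soft=$(cat "/proc/$pid/limits" 2>/dev/null | soft_nofile)
    echo "pid=$pid used=$((used)) soft_limit=${soft:-unknown}"
}

# Example: fd_headroom 9411
```

Comparing `used` with `soft_limit` over a few hours gives a rough idea of how long it takes before items start failing.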
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ] |
Thank you, keep me updated if you get any news. |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 24 ] |
Domi, could you please run the strace command against the process that causes the problem? That can give us more information about open connections. Example:
strace -p PID -s 2048 -f
Additionally, a tcpdump dump file would be useful. If it contains no sensitive information (IP addresses, etc.), please attach the output of strace and tcpdump to this issue. |
Comment by Domi Barton [ 2015 Feb 24 ] |
Increasing...
root@apollo:~# lsof | grep '^zabbix_ag .* protocol$'
zabbix_ag 9411 zabbix 9u sock 0,7 0t0 9478465 can't identify protocol
zabbix_ag 9411 zabbix 10u sock 0,7 0t0 9954053 can't identify protocol
zabbix_ag 9411 zabbix 11u sock 0,7 0t0 9967950 can't identify protocol
zabbix_ag 9411 zabbix 12u sock 0,7 0t0 9985879 can't identify protocol
zabbix_ag 9411 zabbix 13u sock 0,7 0t0 9998726 can't identify protocol
zabbix_ag 9411 zabbix 14u sock 0,7 0t0 10032923 can't identify protocol
zabbix_ag 9411 zabbix 15u sock 0,7 0t0 10039702 can't identify protocol
zabbix_ag 9411 zabbix 16u sock 0,7 0t0 10046318 can't identify protocol
zabbix_ag 9411 zabbix 17u sock 0,7 0t0 10057943 can't identify protocol
zabbix_ag 9412 zabbix 9u sock 0,7 0t0 9467007 can't identify protocol
zabbix_ag 9412 zabbix 10u sock 0,7 0t0 9493068 can't identify protocol
zabbix_ag 9412 zabbix 11u sock 0,7 0t0 9961674 can't identify protocol
zabbix_ag 9412 zabbix 12u sock 0,7 0t0 9974215 can't identify protocol
zabbix_ag 9412 zabbix 13u sock 0,7 0t0 9979441 can't identify protocol
zabbix_ag 9412 zabbix 14u sock 0,7 0t0 10013694 can't identify protocol
zabbix_ag 9412 zabbix 15u sock 0,7 0t0 10020707 can't identify protocol
zabbix_ag 9412 zabbix 16u sock 0,7 0t0 10052669 can't identify protocol
zabbix_ag 9412 zabbix 17u sock 0,7 0t0 10064311 can't identify protocol
zabbix_ag 9412 zabbix 18u sock 0,7 0t0 10076997 can't identify protocol
zabbix_ag 9413 zabbix 9u sock 0,7 0t0 9472233 can't identify protocol
zabbix_ag 9413 zabbix 10u sock 0,7 0t0 9486571 can't identify protocol
zabbix_ag 9413 zabbix 11u sock 0,7 0t0 9992374 can't identify protocol
zabbix_ag 9413 zabbix 12u sock 0,7 0t0 10006026 can't identify protocol
zabbix_ag 9413 zabbix 13u sock 0,7 0t0 10027431 can't identify protocol
zabbix_ag 9413 zabbix 14u sock 0,7 0t0 10070554 can't identify protocol
No, I can't give you the tcpdump and/or the strace. Is this thing open source, and can I have a look at the source code? If you can't reproduce it in your lab, I might be able to set up a new machine and deploy the default config on it via Ansible for your testing. Cheers |
Comment by Domi Barton [ 2015 Feb 24 ] |
lsof:
zabbix_ag 9411 zabbix 9u sock 0,7 0t0 9478465 can't identify protocol
zabbix_ag 9411 zabbix 10u sock 0,7 0t0 9954053 can't identify protocol
zabbix_ag 9411 zabbix 11u sock 0,7 0t0 9967950 can't identify protocol
zabbix_ag 9411 zabbix 12u sock 0,7 0t0 9985879 can't identify protocol
zabbix_ag 9411 zabbix 13u sock 0,7 0t0 9998726 can't identify protocol
zabbix_ag 9411 zabbix 14u sock 0,7 0t0 10032923 can't identify protocol
zabbix_ag 9411 zabbix 15u sock 0,7 0t0 10039702 can't identify protocol
zabbix_ag 9411 zabbix 16u sock 0,7 0t0 10046318 can't identify protocol
zabbix_ag 9411 zabbix 17u sock 0,7 0t0 10057943 can't identify protocol
zabbix_ag 9411 zabbix 18u sock 0,7 0t0 10199702 can't identify protocol
zabbix_ag 9412 zabbix 9u sock 0,7 0t0 9467007 can't identify protocol
zabbix_ag 9412 zabbix 10u sock 0,7 0t0 9493068 can't identify protocol
zabbix_ag 9412 zabbix 11u sock 0,7 0t0 9961674 can't identify protocol
zabbix_ag 9412 zabbix 12u sock 0,7 0t0 9974215 can't identify protocol
zabbix_ag 9412 zabbix 13u sock 0,7 0t0 9979441 can't identify protocol
zabbix_ag 9412 zabbix 14u sock 0,7 0t0 10013694 can't identify protocol
zabbix_ag 9412 zabbix 15u sock 0,7 0t0 10020707 can't identify protocol
zabbix_ag 9412 zabbix 16u sock 0,7 0t0 10052669 can't identify protocol
zabbix_ag 9412 zabbix 17u sock 0,7 0t0 10064311 can't identify protocol
zabbix_ag 9412 zabbix 18u sock 0,7 0t0 10076997 can't identify protocol
zabbix_ag 9413 zabbix 9u sock 0,7 0t0 9472233 can't identify protocol
zabbix_ag 9413 zabbix 10u sock 0,7 0t0 9486571 can't identify protocol
zabbix_ag 9413 zabbix 11u sock 0,7 0t0 9992374 can't identify protocol
zabbix_ag 9413 zabbix 12u sock 0,7 0t0 10006026 can't identify protocol
zabbix_ag 9413 zabbix 13u sock 0,7 0t0 10027431 can't identify protocol
zabbix_ag 9413 zabbix 14u sock 0,7 0t0 10070554 can't identify protocol
File descriptors are still open and linking to a socket:
root@apollo:~# ls -la /proc/9411/fd
total 0
dr-x------ 2 root root 0 Feb 24 15:51 .
dr-xr-xr-x 8 zabbix zabbix 0 Feb 24 15:51 ..
lr-x------ 1 root root 64 Feb 24 15:51 0 -> /dev/null
l-wx------ 1 root root 64 Feb 24 15:51 1 -> /var/log/zabbix/zabbix_agentd.log
lrwx------ 1 root root 64 Feb 24 15:51 10 -> socket:[9954053]
lrwx------ 1 root root 64 Feb 24 15:51 11 -> socket:[9967950]
lrwx------ 1 root root 64 Feb 24 15:51 12 -> socket:[9985879]
lrwx------ 1 root root 64 Feb 24 15:51 13 -> socket:[9998726]
lrwx------ 1 root root 64 Feb 24 15:51 14 -> socket:[10032923]
lrwx------ 1 root root 64 Feb 24 15:51 15 -> socket:[10039702]
lrwx------ 1 root root 64 Feb 24 15:51 16 -> socket:[10046318]
lrwx------ 1 root root 64 Feb 24 15:51 17 -> socket:[10057943]
lrwx------ 1 root root 64 Feb 24 15:51 18 -> socket:[10199702]
l-wx------ 1 root root 64 Feb 24 15:51 2 -> /var/log/zabbix/zabbix_agentd.log
l-wx------ 1 root root 64 Feb 24 15:51 3 -> /run/zabbix/zabbix_agentd.pid
lr-x------ 1 root root 64 Feb 24 15:51 4 -> pipe:[9460899]
lr-x------ 1 root root 64 Feb 24 15:51 5 -> pipe:[9460932]
lrwx------ 1 root root 64 Feb 24 15:51 6 -> socket:[9460955]
lrwx------ 1 root root 64 Feb 24 15:51 7 -> socket:[9460956]
lrwx------ 1 root root 64 Feb 24 15:51 9 -> socket:[9478465]
Must be some orphan sockets, because for example for FD 9 (inode# 9478465) there is no active tcp, udp and/or unix socket/connection:
root@apollo:~# grep 9478465 /proc/net/{tcp,udp,unix}
root@apollo:~#
I see a lot of TIME_WAITs:
root@apollo:~# netstat -anee | grep 10050.*WAIT
tcp 0 0 10.10.90.1:10050 10.10.90.100:35228 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34804 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34808 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34972 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34996 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35209 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35233 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34951 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35009 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34824 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35195 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34910 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35181 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35125 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35097 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35131 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35108 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35043 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34801 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35156 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34978 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35267 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35013 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35208 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34897 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34964 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35166 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35023 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34859 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35207 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35087 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35117 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35265 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34981 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35037 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35194 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34846 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35080 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35066 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34943 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35235 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34841 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35215 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34919 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34872 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35162 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34793 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35078 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34934 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34986 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34821 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35173 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34994 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35032 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35244 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35179 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34817 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35057 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34828 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35029 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35003 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35227 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35112 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34792 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:34957 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35257 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35187 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35141 TIME_WAIT 0 0
tcp 0 0 10.10.90.1:10050 10.10.90.100:35052 TIME_WAIT 0 0
This means that the connection was closed by the Zabbix agent and the TCP stack is still waiting for any late packets that belong to this connection. Of course, this is default TCP/IP behaviour! Is the agent / server reusing the connection for several checks, or is it opening a new connection for each check? Regardless of the TIME_WAIT entries, I don't think our problem has anything to do with them; I just noticed them, and they might not even be a problem (it probably just works as designed in Zabbix).
I still think there is something wrong in the code, and if I had to guess, I would say a socket is opened but at some point it isn't closed properly (you see that a lot when an exception is thrown after the socket was opened and it isn't caught properly). Just my two cents... |
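The manual check above (take a socket inode from /proc/<pid>/fd and look for it in /proc/net) can be looped over every socket of a process. This is only a sketch under a few assumptions: Linux procfs, whole-field matching instead of the substring grep used above, and hypothetical helper names (`socket_inodes`, `inode_listed`, `orphan_sockets`). Sockets of families that are not listed in the tcp/udp/unix tables (netlink, for example) would be reported as well, so the result is a hint rather than proof of a leak.

```shell
#!/bin/sh
# Print the socket inodes referenced by the symlinks in an fd directory.
socket_inodes() {
    ls -l "$1" 2>/dev/null | sed -n 's/.*socket:\[\([0-9]*\)\].*/\1/p'
}

# Succeed if the inode appears as a whole field in any of the given table
# files. Missing files are ignored.
inode_listed() {
    want=$1; shift
    cat "$@" 2>/dev/null |
        awk -v i="$want" '{ for (f = 1; f <= NF; f++) if ($f == i) { found = 1; exit } }
                          END { exit !found }'
}

# Print every socket inode of PID that is absent from the kernel tables.
orphan_sockets() {
    for inode in $(socket_inodes "/proc/$1/fd"); do
        inode_listed "$inode" /proc/net/tcp /proc/net/tcp6 \
            /proc/net/udp /proc/net/udp6 /proc/net/unix || echo "$inode"
    done
}

# Example (as root): orphan_sockets 9411
```

A list that grows between two runs points at descriptors that were never closed.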
Comment by Domi Barton [ 2015 Feb 25 ] |
Guys, if you take a closer look at it, you can see that:
Which means:
So from my point of view, you're opening a socket without closing it. This is just my point of view as a developer |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 25 ] |
Domi, thank you for the information and your comments. I'm looking at this issue now and will get back to you as soon as possible. |
Comment by dimir [ 2015 Feb 25 ] |
I don't have this problem (hitting the limit of open files), but here's how you could probably get more information. You'd need LogLevel=4. Here's a one-liner for any Zabbix daemon (adjust the logdir and pattern variables if needed):
$ logdir=/var/log/zabbix; pattern=zabbix_; while true; do sudo lsof | egrep "$pattern.*can't identify" | awk '{print $2}' | while read pid; do grep " $pid:" $logdir/$pattern*.log; done; sleep 1; done
I see the following output; however, the sockets seem to get closed immediately after the error is issued:
/tmp/zabbix_server.log: 28380:20150224:230757.405 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230842.417 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230843.418 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230843.482 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28380:20150224:230914.550 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28415:20150224:221326.539 server #227 started [proxy poller #46]
/tmp/zabbix_server.log: 28415:20150224:230810.489 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28415:20150224:230811.570 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:221326.520 server #182 started [proxy poller #1]
/tmp/zabbix_server.log: 28370:20150224:225624.995 Error while receiving answer from proxy [Proxy1] [ZBX_TCP_READ() failed: [104] Connection reset by peer]
/tmp/zabbix_server.log: 28370:20150224:225704.001 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225704.065 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225705.066 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225705.130 Unable to connect to the proxy [Proxy1] [192.168.1.7]:10051 [cannot connect to [[192.168.1.7]:10051]: [111] Connection refused]
/tmp/zabbix_server.log: 28370:20150224:225706.131 Unable to connect to the proxy [Proxy2] [192.168.1.6]:10051 [cannot connect to [[192.168.1.6]:10051]: [111] Connection refused]
...
$ lsof -u zabbix | wc -l
957 |
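At LogLevel=4 the loop above prints every log line of every matching PID on each pass, which can be a lot of output. One possible variant deduplicates the PIDs first and shows only the last few log lines per PID. A sketch only: `orphan_pids` and `last_log_lines` are hypothetical helpers, and the log prefix is assumed to be the space-padded "PID:YYYYMMDD:HHMMSS.mmm" form visible in the output above.

```shell
#!/bin/sh
# Unique PIDs (field 2) holding "can't identify protocol" sockets.
# Reads lsof text from stdin.
orphan_pids() {
    awk '/can.t identify protocol/ { print $2 }' | sort -un
}

# Print only the last N lines logged by PID in the given log files.
last_log_lines() {
    pid=$1; n=$2; shift 2
    grep -h "^ *$pid:" "$@" 2>/dev/null | tail -n "$n"
}

# Example (as root; the log path is the one used in the one-liner above):
#   lsof | grep zabbix_ | orphan_pids | while read -r p; do
#       last_log_lines "$p" 5 /var/log/zabbix/zabbix_*.log
#   done
```

Limiting the output to a handful of lines per process keeps the result small enough to paste into a report.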
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 26 ] |
Please also check what is returned by: sysctl -a | grep time_wait |
Comment by Domi Barton [ 2015 Feb 27 ] |
120s - the default on Debian:
root@apollo:~# sysctl -a | grep time_wait
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120 |
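The two sysctls above are netfilter connection-tracking timeouts. To see how many agent connections sit in each TCP state at a given moment, the netstat output can be tallied per state. A sketch, assuming the `netstat -ant` column order (local address in field 4, state in field 6); `tcp_state_counts` is a hypothetical name:

```shell
#!/bin/sh
# Count connections per TCP state for one local port.
# Reads `netstat -ant` style text from stdin.
tcp_state_counts() {
    awk -v p=":$1" '$1 ~ /^tcp/ && $4 ~ (p "$") { n[$6]++ }
                    END { for (s in n) print s, n[s] }' | sort
}

# Example: netstat -ant | tcp_state_counts 10050
```

A steady TIME_WAIT count is normal for short-lived check connections; it is the descriptor count of the agent processes that should stay flat.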
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 27 ] |
Domi, OK, that's good! Do you have an item monitoring the MAC address? Please try to run the script posted earlier to get the error messages printed in the log file. |
Comment by Domi Barton [ 2015 Feb 27 ] |
The one-liner is grepping way too much information, because I'm running the agent at debug level 4 now. What do you mean by "an item monitoring mac address"? |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 27 ] |
The item system.hw.macaddr. Could you upload at least a small portion of the script output? |
Comment by Domi Barton [ 2015 Feb 27 ] |
yes sir, I use that one.
no sir, because it won't help you |
Comment by Igors Homjakovs (Inactive) [ 2015 Feb 27 ] |
I've found that after processing that item (system.hw.macaddr) the socket is left open. I'm going to fix that shortly. However, if I understood correctly, in your case the number of open FDs is growing very fast. This means that there might be another problem causing that. There is one more thing I want to clarify. Do you have any active items, like log[ ] or logrt[ ]? No need to send the log file yet. |
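One way to confirm that querying a single item leaves sockets behind is to count the agent's socket descriptors, request the item a number of times, and count again. A hedged sketch: `socket_fds` and `leak_probe` are hypothetical helpers, the address 127.0.0.1 in the example is an assumption, and the example relies on the `zabbix_get` utility being installed and on the agent's Server setting allowing the querying host.

```shell
#!/bin/sh
# Total number of socket descriptors held by the given PIDs.
socket_fds() {
    total=0
    for pid in "$@"; do
        n=$(ls -l "/proc/$pid/fd" 2>/dev/null | grep -c 'socket:\[' || true)
        total=$((total + n))
    done
    echo "$total"
}

# leak_probe "PIDS" RUNS COMMAND...
# Run COMMAND RUNS times and report the socket count before and after.
leak_probe() {
    pids=$1; runs=$2; shift 2
    before=$(socket_fds $pids)   # unquoted on purpose: split into PIDs
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || true
        i=$((i + 1))
    done
    after=$(socket_fds $pids)
    echo "before=$before after=$after leaked=$((after - before))"
}

# Example (as root, summing over all agent processes):
#   leak_probe "$(pgrep zabbix_agentd)" 20 zabbix_get -s 127.0.0.1 -k system.hw.macaddr
```

Summing over all agent processes matters because any of the listener processes may serve a given request.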
Comment by Igors Homjakovs (Inactive) [ 2015 Mar 02 ] |
Fixed in svn://svn.zabbix.com/branches/dev/ZBX-9251 |
Comment by Andris Zeila [ 2015 Mar 03 ] |
Successfully tested |
Comment by Igors Homjakovs (Inactive) [ 2015 Mar 09 ] |
Available in 2.4.5rc1 r52605 and 2.5.0 (trunk) r52607. |
Comment by MATSUDA Daiki [ 2015 Mar 10 ] |
I saw the fix on svn and think it is also needed in the older versions 2.2, 2.0 and 1.8. |
Comment by Igors Homjakovs (Inactive) [ 2015 Mar 10 ] |
This fix will also be added to versions 2.0 and 2.2 soon. |
Comment by Alexander Vladishev [ 2015 Mar 24 ] |
Available in:
|