[ZBX-8545] Low performance for key net.tcp.service[] which reads /proc/net/tcp file Created: 2014 Jul 30  Updated: 2017 May 30  Resolved: 2014 Nov 26

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.0.12, 2.2.4, 2.2.5, 2.3.2
Fix Version/s: 2.5.0

Type: Incident report Priority: Critical
Reporter: jaseywang Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: performance, tcp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

server:
zabbix-server-2.0.5-1.el6.x86_64

$ zabbix_agentd --version
Zabbix Agent (daemon) v2.0.5 (revision 33558) (12 February 2013)
Compilation time: Feb 14 2013 10:58:53



 Description   

After running Zabbix 2.0.5 in production for more than a year, we have found some critical issues. Here is one:
The built-in key net.tcp.service[] needs to call the NET_TCP_LISTEN function located in src/libs/zbxsysinfo/linux/net.c, and this function needs to read the /proc/net/tcp file.
When the number of TCP connections is large, such as 100k or more, which is quite normal on our production servers, performance is poor: it takes tens or even hundreds of seconds to get the correct result. This causes agent timeouts, so the proxy or server receives no data during that period, which triggers false alerts.
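
To illustrate why the procfs approach scales poorly, here is a minimal sketch (not the actual Zabbix code) of scanning a /proc/net/tcp-style file line by line for a listening port. The function names and the SKETCH_TCP_LISTEN constant are made up for illustration, and only the IPv4 file layout is assumed.

```c
#include <stdio.h>

/* Hypothetical name: LISTEN state as printed (in hex) in the "st" column. */
#define SKETCH_TCP_LISTEN	0x0A

/* Returns 1 if `line` describes a socket bound to `port` in LISTEN state, else 0. */
int	sketch_line_is_listening_on(const char *line, unsigned short port)
{
	unsigned int	local_port, state;

	/* layout: "sl: local_ip:local_port remote_ip:remote_port st ..." (hex fields) */
	if (2 != sscanf(line, " %*d: %*x:%x %*x:%*x %x", &local_port, &state))
		return 0;	/* header line or unexpected format */

	return local_port == port && SKETCH_TCP_LISTEN == state;
}

/* Scans the file line by line; returns 1 if found, 0 if not, -1 if the file cannot be opened. */
int	sketch_file_has_listener(const char *path, unsigned short port)
{
	char	line[512];
	FILE	*f;
	int	found = 0;

	if (NULL == (f = fopen(path, "r")))
		return -1;

	while (0 == found && NULL != fgets(line, sizeof(line), f))
		found = sketch_line_is_listening_on(line, port);

	fclose(f);

	return found;
}
```

In the worst case every line is read and parsed, so the work grows with the total number of sockets, not with the number of listening ones.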

As a workaround, we now use the ss command to get the correct data quickly. ss doesn't need to read that file and returns its results very fast.

The impact depends on the services running on your server. Those without many connections have nothing to worry about, but for those with a very large number of connections, like us, it is a critical issue because it regularly sends false alerts.

At the moment, the latest stable version, 2.2.5, hasn't fixed this, and we hope you can fix it ASAP.

ref:
http://stackoverflow.com/questions/11763376/difference-between-netstat-and-ss-in-linux



 Comments   
Comment by Juris Miščenko (Inactive) [ 2014 Aug 21 ]

The use of the netlink interface has been implemented at svn://svn.zabbix.com/branches/dev/ZBX-8545.

This change applies only to 1) systems using a Linux kernel starting from version 2.6.14 and 2) the net.tcp.listen item.

The netlink subprotocol required for this type of diagnostic (NETLINK_INET_DIAG) was added only in the 2.6.14 kernel and unfortunately, there's no clear way of retrieving UDP protocol socket information. Even the previously mentioned ss(1) utility from the iproute2 package resorts to reading from /proc/net/udp when it comes to UDP sockets.
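
As a rough sketch of what such a request can look like (this is not the code from the development branch), the following fills in an inet_diag dump request restricted to sockets in the LISTEN state. The struct and function names prefixed with sketch_ and the SKETCH_TCP_LISTEN constant are assumptions made for illustration. Opening the socket with socket(AF_NETLINK, SOCK_RAW, NETLINK_INET_DIAG), sending the request and parsing the replies are left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/inet_diag.h>

/* Hypothetical name: numeric value of the kernel's TCP LISTEN state. */
#define SKETCH_TCP_LISTEN	10

/* Netlink header immediately followed by the inet_diag request payload. */
struct sketch_diag_request
{
	struct nlmsghdr		nlh;
	struct inet_diag_req	req;
};

/* Prepares a dump request for all TCP sockets of `family` in LISTEN state. */
void	sketch_build_listen_request(struct sketch_diag_request *r, unsigned char family, unsigned int seq)
{
	memset(r, 0, sizeof(*r));

	r->nlh.nlmsg_len = sizeof(*r);
	r->nlh.nlmsg_type = TCPDIAG_GETSOCK;
	r->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ROOT | NLM_F_MATCH;	/* dump all matching sockets */
	r->nlh.nlmsg_seq = seq;

	r->req.idiag_family = family;
	r->req.idiag_states = 1 << SKETCH_TCP_LISTEN;	/* bitmask of states to report */
}
```

Because the state filter is applied by the kernel, only listening sockets are returned, regardless of how many established connections exist.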

The main issue in implementing anything netlink related is the severe lack of documentation regarding requests and responses, although documentation on the delivery mechanism and its operation is abundant.

Also, an issue we're currently facing is determining the effectiveness of this change, as detailed execution time advantages might only become apparent on systems with a very large number of connections. If anyone has a system where the previous implementation of the net.tcp.listen item was obviously too slow, it would be nice to hear some feedback on performance changes after applying this patch.

Comment by Andris Zeila [ 2014 Sep 08 ]

Successfully tested, please review my code changes in r48800, r48854

Comment by Juris Miščenko (Inactive) [ 2014 Sep 15 ]

Changes merged in 2.5.0 (trunk) at r48983.

Comment by Aleksandrs Saveljevs [ 2014 Sep 15 ]

(1) Compiler gives a warning regarding the new code:

$ make
...
net.c: In function ‘NET_TCP_LISTEN’:
net.c:570:44: warning: unused variable ‘found’ [-Wunused-variable]
  int  ret = SYSINFO_RET_FAIL, n, buffer_alloc = 64 * ZBX_KIBIBYTE, found = 0;
                                            ^
...

jurism RESOLVED.

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Sep 15 ]

(2) The following change is suggested:

$ svn di
Index: src/libs/zbxsysinfo/linux/net.c
===================================================================
--- src/libs/zbxsysinfo/linux/net.c     (revision 48989)
+++ src/libs/zbxsysinfo/linux/net.c     (working copy)
@@ -71,7 +71,7 @@
        NLERR_UNKNOWNMSGTYPE
 };
 
-int    nlerr;
+static int     nlerr;
 
 static int     find_tcp_port_by_state_nl(unsigned short port, int state, int *found)
 {
@@ -593,7 +593,7 @@
        {
                char    *error = NULL;
 
-               switch(nlerr)
+               switch (nlerr)
                {
                        case NLERR_UNKNOWN:
                                error = zbx_strdup(error, "unrecognized netlink error occurred");

jurism RESOLVED.

asaveljevs CLOSED

Comment by richlv [ 2014 Sep 15 ]

(3) docs :

  • whatsnew
  • upgrade notes
  • describe somewhere the mechanism that is used to obtain information

asaveljevs ChangeLog says that "Old method of information retrieval also improved". However, it nowhere says how exactly.

jurism What's new has been updated, Upgrade notes contain a terse description with a backlink to the what's new page containing the details of the change. RESOLVED.

asaveljevs Pages in question are:

The first one looks good, but the following change is proposed for the second:

net.tcp.listen on Linux kernels 2.6.14 and above, when detecting NETLINK capabilities, now tries to obtain socket information on sockets in the LISTEN state from the kernel.

to

net.tcp.listen on Linux kernels 2.6.14 and above, now tries to obtain information on sockets in the LISTEN state from the kernel's NETLINK interface.

Also, richlv's third suggestion was not addressed and it might be useful to add it to the net.tcp.listen[] item at https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/zabbix_agent . It may be done in a way similar to the sensor[] item. REOPENED.

jurism
I disagree. The system doesn't do so unless it actually detects that the kernel provides netlink facilities. They are not guaranteed, and in the event of NOT being able to employ the interface, which isn't impossible, it falls back on reading from the kernel's procfs interface. Why remove this piece of information?

Also, seeing as the code interfacing with the kernel isn't based on a standard or official documentation, we cannot guarantee the correctness and conformance of the code. This warrants internal documentation at best.

wiper I think it would be better not to oversaturate the https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/zabbix_agent with information, but to have separate pages with detailed description. The same goes for sensor item. However that would take a lot of work.

asaveljevs It might be that a couple of sentences like "On Linux 2.6.14 and above information is obtained using the kernel's NETLINK interface, if possible. If not, information is read from /proc/net/tcp." would not be that much of an oversaturation.
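
The "netlink if possible, otherwise procfs" behaviour discussed above can be sketched as follows. This is an illustration only, not the Zabbix implementation: the sketch_ names, the callback type and the two stand-in lookups are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical lookup signature: returns 0 on success and sets *found, -1 if the method is unusable. */
typedef int	(*sketch_lookup_fn)(unsigned short port, int *found);

/* Tries the netlink lookup first and falls back to the procfs lookup if it fails. */
int	sketch_tcp_listen(unsigned short port, sketch_lookup_fn via_netlink, sketch_lookup_fn via_procfs,
		int *found)
{
	*found = 0;

	if (NULL != via_netlink && 0 == via_netlink(port, found))
		return 0;

	*found = 0;	/* discard anything a failed netlink attempt may have written */

	return via_procfs(port, found);
}

/* Stand-in for a kernel or build without usable netlink support. */
int	sketch_netlink_unavailable(unsigned short port, int *found)
{
	(void)port;
	(void)found;

	return -1;
}

/* Stand-in for a procfs scan that finds the port listening. */
int	sketch_procfs_listening(unsigned short port, int *found)
{
	(void)port;
	*found = 1;

	return 0;
}
```

With these stand-ins, sketch_tcp_listen(80, sketch_netlink_unavailable, sketch_procfs_listening, &found) succeeds through the procfs path.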

jurism Added comment about NETLINK to the net.tcp.listen item in 3.0 documentation.

asaveljevs Your changes above also touched upgrade notes. I have fixed a typo at https://www.zabbix.com/documentation/3.0/manual/installation/upgrade_notes_300?&#item_changes . Please review. RESOLVED.

jurism Everything looks fine. CLOSED.

Comment by Juris Miščenko (Inactive) [ 2014 Sep 15 ]

Fixes in code have been committed to trunk at r49015.
Documentation is pending.

jurism Documentation has been updated.

Comment by Aleksandrs Saveljevs [ 2014 Sep 16 ]

Please revert trunk changes and create a proper development branch.

jurism Changes have been reverted. RESOLVED.

asaveljevs The reverting commit was r49052. CLOSED.

Comment by Aleksandrs Saveljevs [ 2014 Sep 16 ]

(4) As sasha mentioned in (2) in ZBX-8367, functions should always set output parameters and not rely on the variable being initialized prior to the call. In this case, function find_tcp_port_by_state_nl() relies on "found" variable being initialized outside of the function.

jurism RESOLVED.

asaveljevs It seems variable "found" could have been left uninitialized in case we get NLMSG_DONE. I have fixed that and removed unnecessary initializations and string allocations in r49081. Please take a look. RESOLVED.
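
A minimal illustration of the convention from (4), using a hypothetical function that is unrelated to the real find_tcp_port_by_state_nl(): the output parameter is assigned before any early return, so the caller never reads an indeterminate value.

```c
#include <stddef.h>

/* Hypothetical helper: reports through *found whether `port` occurs in `ports`. */
/* Returns 0 on success, -1 on invalid input; *found is set on every path.      */
int	sketch_find_port(const unsigned short *ports, size_t count, unsigned short port, int *found)
{
	size_t	i;

	*found = 0;	/* set the output parameter first, before any early return */

	if (NULL == ports)
		return -1;

	for (i = 0; i < count; i++)
	{
		if (ports[i] == port)
		{
			*found = 1;
			break;
		}
	}

	return 0;
}
```

The caller can then declare "int found;" without an initializer and still get a defined value on both the success and the failure paths.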

jurism Changes look fine. CLOSED.

Comment by Aleksandrs Saveljevs [ 2014 Sep 16 ]

(5) I have a Linux kernel above 2.6.14, but by default trunk compiles without Netlink support on my machine. It should be documented what needs to be done to compile with Netlink.

asaveljevs Seems to be an error on my part - did not rerun ./bootstrap.sh. WON'T FIX.

Comment by Juris Miščenko (Inactive) [ 2014 Sep 16 ]

The necessary changes have been committed to the development branch at its original location.

Comment by Juris Miščenko (Inactive) [ 2014 Sep 22 ]

Change merged into 2.5.0 (trunk) at r49187.

Comment by richlv [ 2014 Oct 25 ]

subissues still open : (3)
