[ZBX-8839] Java gateway keeps connections without using any timeout Created: 2014 Sep 30  Updated: 2017 May 30  Resolved: 2015 Mar 23

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Java gateway (J)
Affects Version/s: 2.2.5
Fix Version/s: 2.0.15rc1, 2.2.10rc1, 2.4.5rc1, 2.5.0

Type: Incident report Priority: Blocker
Reporter: Andrei Gushchin (Inactive) Assignee: Unassigned
Resolution: Fixed Votes: 2
Labels: jmx, timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate

 Description   

Probable case:

  • If a Java application exhausts all of its memory, in that state it still accepts connections but does not respond to any requests.
  • All available Java pollers then try to get a value from that Java host and hang in the time_wait state until the Java application starts to respond again or resets the connections because it is rebooted/restarted.

So I think in such cases the Java gateway should use a timeout (we should probably add a new parameter to the settings).



 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Oct 01 ]

This might be an explanation for ZBX-5271.

Comment by Aleksandrs Saveljevs [ 2015 Jan 10 ]

Research shows that there seem to be four ways to approach the problem:

(a) Try using "sun.rmi" properties documented at http://docs.oracle.com/javase/7/docs/technotes/guides/rmi/sunrmiproperties.html . For instance:

$ java -Dsun.rmi.transport.tcp.responseTimeout=1000 Test

However, testing shows that none of the timeout properties on the page have any effect on JMX connection timeout. Still, they might be used for specifying timeouts on an already established connection.
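The command-line flag above can also be expressed programmatically. The following is a minimal sketch, not the gateway's code: the class name RmiTimeouts is hypothetical, and it assumes the property is set before any RMI transport classes are loaded, since to my understanding the JDK reads it only once.

```java
final class RmiTimeouts {
    static final String RESPONSE_TIMEOUT = "sun.rmi.transport.tcp.responseTimeout";

    // Same effect as passing -Dsun.rmi.transport.tcp.responseTimeout=<millis>
    // on the command line, provided it runs before RMI is first used (assumption).
    static void setResponseTimeout(long millis) {
        System.setProperty(RESPONSE_TIMEOUT, Long.toString(millis));
    }
}
```

Calling RmiTimeouts.setResponseTimeout(1000) early in main() would correspond to the "java -D..." invocation shown above.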

(b) Try putting properties into environment for the call to JMXConnectorFactory.connect(url, env):

env.put("jmx.remote.x.request.waiting.timeout", Long.valueOf(1000L));

However, according to https://community.oracle.com/thread/1176791 , this "property only applies to already-established connections, and only with the JMXMP connector (not the RMI connector)". Therefore, this is not a universal solution.

(c) Try replacing RMI socket factory with a custom one, see http://stackoverflow.com/questions/1822695/java-rmi-client-timeout . Considerations regarding this approach are described in http://dev.clojure.org/jira/browse/JMX-5 .

This solution seems to work in the current versions of Java gateway, because we only support URLs of the form "service:jmx:rmi:///jndi/rmi://<conn>:<port>/jmxrmi". However, if we later add support for other URL types in ZBXNEXT-1274, they might not use the default RMI socket factory. Therefore, this solution would probably help in the meantime, but would have to be rewritten later anyway.

(d) Try using a separate thread for making connections:

While this solution does not seem to be trivial, it seems to be used in practice.
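Approach (d) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that ended up in the gateway: the class TimedCall and the method callWithTimeout are hypothetical names, and the blocking call is abstracted as a Callable so that the same helper could wrap JMXConnectorFactory.connect(url, env).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    // Daemon threads, so an abandoned (still blocked) call cannot keep the JVM alive.
    // Note that a cached thread pool is unbounded.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-call");
        t.setDaemon(true);
        return t;
    });

    // Runs the task in a separate thread and waits at most the given time for its result.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Request interruption; a thread blocked in a socket connect may ignore it
            // and stay alive until the OS-level timeout expires.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Unwrap the task's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

A connect attempt could then be written as TimedCall.callWithTimeout(() -> JMXConnectorFactory.connect(url, env), 3, TimeUnit.SECONDS). The caller gets a TimeoutException after 3 seconds, but the worker thread itself may remain blocked until the underlying connect gives up.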

Comment by Aleksandrs Saveljevs [ 2015 Jan 20 ]

Development branch svn://svn.zabbix.com/branches/dev/ZBX-8839 contains a prototype solution. This comment describes its current state, considerations and ways it can be improved.

Let us start with the fact that there are two ways to approach the timeout problem. One would be to set a timeout for the whole JMX-value-getting operation, which includes connecting, querying the objects, etc. This approach has not really been explored. Instead, another approach was chosen: setting a timeout for each network operation. This is similar to our SNMP implementation, where we set "session.timeout" for each attempt, but there can be multiple attempts due to bulk retries, checking cached indices, etc.

The implemented solution introduces two kinds of timeouts: (a) connect operation timeout based on https://weblogs.java.net/blog/emcmanus/archive/2007/05/making_a_jmx_co.html , where each connect operation is done in a separate thread, and (b) read operation timeout based on "sun.rmi.transport.tcp.responseTimeout". This might solve the problem in ZBX-5271, where all Java gateway threads hang on JMX connections: for over two minutes on connect operation and indefinitely on read operation.

However, there is one consideration mentioned on the Java blog:

If you're making a lot of connections to a lot of machines, you might want to think twice about abandoning threads, because you might end up with a lot of them. But in the more typical case where you're just making one connection, this technique may well be for you.

The Zabbix case is the first case from the quote above. Indeed, suppose we set UnavailableDelay to 15 seconds in the server configuration file. The connect operation timeout is over 2 minutes by default, so about 8 connection threads in the gateway will be alive at any given time for each unavailable JMX host. If, say, we are monitoring 100 unavailable JMX hosts, then that will be 800 connection threads, which is not very inspiring.

Continuing the above, another consideration mentioned at http://dev.bizo.com/2014/06/cached-thread-pool-considered-harmlful.html is that Executors.newCachedThreadPool(), which is used in the current implementation, is unbounded. Therefore, a malicious attacker can create quite a number of connection threads in the gateway.

Yet another minor consideration is that currently threads created by Executors.newCachedThreadPool() with our DaemonThreadFactory have names "pool-2-thread-1", "pool-3-thread-1", "pool-4-thread-1", "pool-5-thread-1", etc., as opposed to our main threads "pool-1-thread-1", "pool-1-thread-2", "pool-1-thread-3", "pool-1-thread-4", etc. Once executor implementation is finalized, this should be improved.

Comment by dimir [ 2015 Feb 02 ]

Successfully tested!

Comment by Aleksandrs Saveljevs [ 2015 Mar 04 ]

It is important to address issues described in my comment above. Therefore, reopening to improve the implementation.

Comment by Aleksandrs Saveljevs [ 2015 Mar 05 ]

Commits 52510 and 52514 fix thread names created by our DaemonThreadFactory. Looking at Java source code in JDK installation, Executors.defaultThreadFactory() returns a new factory each time it is called. That is why threads were previously named "pool-2-thread-1", "pool-3-thread-1", "pool-4-thread-1", etc. They should now be named "pool-2-thread-1", "pool-2-thread-2", "pool-2-thread-3", and so on.
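The naming behavior described above can be illustrated with a small sketch. It is not the actual DaemonThreadFactory from the gateway sources: the class name SharedDaemonThreadFactory is hypothetical, and the only point it makes is that holding on to a single delegate from Executors.defaultThreadFactory() keeps the "pool-N" prefix stable across threads.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

final class SharedDaemonThreadFactory implements ThreadFactory {
    // One delegate for the lifetime of this factory. Calling
    // Executors.defaultThreadFactory() for every new thread would instead
    // produce a fresh "pool-N" prefix each time.
    private final ThreadFactory delegate = Executors.defaultThreadFactory();

    @Override
    public Thread newThread(Runnable r) {
        Thread t = delegate.newThread(r);
        t.setDaemon(true); // daemon threads do not prevent JVM shutdown
        return t;
    }
}
```

Two threads created by the same SharedDaemonThreadFactory instance share one pool prefix and differ only in the trailing thread number. The pool number itself depends on how many default factories were created earlier in the JVM.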

This is useful, because we now have two thread pools and we should be able to easily distinguish between their threads: one pool for pollers, instantiated in JavaGateway.java:70, with thread names beginning with "pool-1", and another for connection threads, created in ZabbixJMXConnectorFactory.java:44, with thread names beginning with "pool-2".

Comment by Aleksandrs Saveljevs [ 2015 Mar 05 ]

It should now be tested how heavyweight connection threads in the second pool are. They just wait for a connection to be established and do not consume any CPU.

Note that a malicious attacker cannot create an arbitrary number of threads. He can only create a maximum of around START_POLLERS * ("2 minutes 7 seconds" / TIMEOUT), where "2 minutes 7 seconds" is the hardcoded JMX connection timeout. With a default setting of TIMEOUT="3 seconds", this makes it "2 minutes 7 seconds" / "3 seconds" = 42.33 connection threads per poller. It should be tested whether this is acceptable, or we should impose a different, lower limit.
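The bound above can be spelled out as a small calculation. This is only a worked example of the formula in this comment. The class and method names are hypothetical, and the START_POLLERS value of 25 used in the note below is an illustrative assumption.

```java
final class ConnectionThreadBound {
    // Connection threads one poller can leave behind:
    // hardcoded connect timeout divided by the gateway TIMEOUT.
    static double perPoller(double connectTimeoutSeconds, double gatewayTimeoutSeconds) {
        return connectTimeoutSeconds / gatewayTimeoutSeconds;
    }

    // Approximate upper bound for the whole gateway:
    // START_POLLERS * (connect timeout / TIMEOUT), rounded up.
    static long total(int startPollers, double connectTimeoutSeconds,
                      double gatewayTimeoutSeconds) {
        return (long) Math.ceil(
                startPollers * perPoller(connectTimeoutSeconds, gatewayTimeoutSeconds));
    }
}
```

With the connect timeout of 2 minutes 7 seconds (127 seconds) and TIMEOUT of 3 seconds, perPoller(127, 3) gives about 42.33, and with an assumed START_POLLERS of 25, total(25, 127, 3) gives 1059 connection threads.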

Comment by dimir [ 2015 Mar 11 ]

I've done some tests of Zabbix server working with unreachable JMX interface on a usual workstation (Pentium Dual-Core E5400 2.70 GHz, 4 GB RAM) with

Zabbix server settings:

UnavailableDelay=3
UnreachablePeriod=3
UnreachableDelay=3
StartPollersUnreachable=25

Java gateway settings:

START_POLLERS=25

The TCP connection timeout on my OS is 1 minute 9 seconds. Here are the results I got:

JMX hosts   Number of threads   Memory usage   CPU usage
500         ~550                ~140MB         ~3%
1000        ~550                ~140MB         ~3%
3000        ~550                ~140MB         ~3%

So I conclude that this solution is acceptable.

Comment by Aleksandrs Saveljevs [ 2015 Mar 11 ]

(1) Also in this fix we shall replace legacy, synchronized Vector with non-synchronized ArrayList.

<dimir> Looks great. Please review my small change in r52720.

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Mar 13 ]

New timeout configuration option for Java gateway is available in pre-2.0.15 r52723, pre-2.2.10 r52724, pre-2.4.5 r52725, and pre-2.5.0 (trunk) r52726.

Comment by Aleksandrs Saveljevs [ 2015 Mar 13 ]

Documented at the following locations:

sasha CLOSED

Comment by Oleksii Zagorskyi [ 2015 Dec 03 ]

There was a mistake in the official Zabbix packages and this new parameter was not actually applied.
Reported in ZBX-10132.
