[ZBX-5593] Zabbix Java Gateway needs to be restarted periodically Created: 2012 Sep 18 Updated: 2017 May 30 Resolved: 2014 Feb 17 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Java gateway (J) |
Affects Version/s: | 2.0.2 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Critical |
Reporter: | Trevor McLeod | Assignee: | Unassigned |
Resolution: | Duplicate | Votes: | 8 |
Labels: | javagateway | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Red Hat Enterprise 6.x |
Attachments: |
Issue Links: |
|
Description |
Hello, We are monitoring 16 servers using both the Zabbix agent and JMX. Periodically, our JMX connections to all servers fail and all the JMX indicators turn red in Zabbix (Configuration- We have turned on debugging for the Java Gateway. Unfortunately, it seems that when the log reaches 5 MB in size it automatically rolls over, up to a maximum of three times. Because of the volume of data being logged, by the time I am aware that our JMX connections are down, the relevant log entries have been lost. Is there any way to increase the size of the log and the number of rollovers?
I am wondering if the number of Java pollers needs to be increased, but I can find no rules of thumb regarding this. We were using the default value of 5 but have increased it to 10. However, this doesn't seem to have resolved the problem.
Our Zabbix server and Java Gateway run on the same VM, which runs Red Hat Enterprise 6.x. Any assistance would be appreciated. Trevor |
Comments |
Comment by Trevor McLeod [ 2012 Sep 18 ] |
This may now be resolved. I noticed that the server log had lots of messages like the one below while the gateway log had none. I theorized that there was a problem with the server talking to the gateway, because once a request got through to the gateway, the gateway sent a request to the host in question and retrieved the JMX values without any problems.
I had to update the number of Java pollers in both the settings.sh for the Java Gateway and the zabbix_server.conf for the Zabbix server to 300. Its original value was 5. I kept incrementing this value by 10 and finally by 50 to arrive at a value where messages such as the following stopped being logged by the server:
13312:20120918:103603.988 JMX item [jmx["org.hornetq:module=JMS,type=Queue,name=\"SIS\"",MessagesAdded]] on host [mqprod01.lms.it.ubc.ca] failed: another network error, wait for 15 seconds
I also had to increase the max_connections parameter in the my.cnf configuration file for MySQL from the default of 150 to 300. Whether this removes the need to restart the Java Gateway, only time will tell. I'll leave this issue open for the time being. Trevor |
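A minimal sketch for tallying these server log messages per host, which may help judge whether a poller change actually reduced them. It assumes only the line layout visible in the messages quoted in this report (PID:date:time, then the message); the function name `count_failures` and the `sample` list are illustrative, not part of Zabbix.

```python
import re
from collections import Counter

# Matches server log lines shaped like the ones quoted in this report:
#   PID:YYYYMMDD:HHMMSS.mmm JMX item [...] on host [HOST] failed: another network error, ...
FAILURE = re.compile(
    r"^\d+:\d{8}:\d{6}\.\d{3} JMX item \[.*\] "
    r"on host \[(?P<host>[^\]]+)\] failed: another network error"
)


def count_failures(lines):
    """Return a Counter mapping host name to the number of matching failure lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Sample lines copied from the comments in this report.
sample = [
    r'13312:20120918:103603.988 JMX item [jmx["org.hornetq:module=JMS,type=Queue,name=\"SIS\"",MessagesAdded]] on host [mqprod01.lms.it.ubc.ca] failed: another network error, wait for 15 seconds',
    r'26589:20120920:001227.221 JMX item [jmx["java.lang:type=MemoryPool,name=CMS Old Gen","Usage.max"]] on host [blverf01.lms.it.ubc.ca] failed: another network error, wait for 15 seconds',
    r'26563:20120920:001241.299 resuming JMX checks on host [blverf01.lms.it.ubc.ca]: connection restored',
]

failure_counts = count_failures(sample)
for host, count in failure_counts.most_common():
    print(host, count)
```

To run it against a real log, pass an open file object instead of `sample`; the path to the server log depends on the installation.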
Comment by richlv [ 2012 Sep 19 ] |
Let's close it, then, if it indeed turns out to be a configuration issue. If the problem still happens, you can reopen the issue at any time. |
Comment by Łukasz Jernaś [ 2012 Sep 20 ] |
Not really. We have a problem where all the threads in the Java process get stuck in a waiting state and no new checks are gathered. |
Comment by Trevor McLeod [ 2012 Sep 20 ] |
This problem continues to recur even after tweaking the number of Java pollers. Since my last post I realized that setting the number of Java pollers to 300 was excessive: we gather about 3600 JMX items every 60 seconds. Assuming that the Java server batches the item requests in groups of 10 (on average), that's 360 requests every 60 seconds. Assuming the requests are spread over the 60 seconds (rather than all arriving at 60-second intervals), that's 6 requests per second. So, I reasoned that setting the number of pollers to 6, at a minimum, was necessary. I set it to 15 on the Java Gateway and 10 on the server.
This setup ran pretty well for a day. We did not lose our JMX connections as originally described (JMX connection goes red in the Zabbix web app). However, the Zabbix server log was still full of error messages such as this:
26589:20120920:001227.221 JMX item [jmx["java.lang:type=MemoryPool,name=CMS Old Gen","Usage.max"]] on host [blverf01.lms.it.ubc.ca] failed: another network error, wait for 15 seconds
Eventually we would see messages like this:
26563:20120920:001241.299 resuming JMX checks on host [blverf01.lms.it.ubc.ca]: connection restored
So, some sort of recovery seemed to be happening. We were also getting Zabbix alerts like this from the Zabbix server:
Trigger: Zabbix unreachable poller processes more than 75% busy
Trigger status: PROBLEM
Trigger severity: Average
Trigger URL:
Item values:
1. Zabbix busy unreachable poller processes, in % (Zabbix server:zabbix[process,unreachable poller,avg,busy]): 76.51 %
2. UNKNOWN (UNKNOWN:UNKNOWN): UNKNOWN
3. UNKNOWN (UNKNOWN:UNKNOWN): UNKNOWN
However, eventually, all of our JMX connections in the Zabbix web app went red. In the Zabbix server log we see messages like this:
26563:20120920:061636.856 temporarily disabling JMX checks on host [blverf01.lms.it.ubc.ca]: host unavailable
I took a jstack of the gateway process and can send it to you if you want. It is not like a similar problem reported on this forum where all the threads are blocked. I'd appreciate some assistance as I am now stumped. Trevor |
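The poller estimate in the comment above can be written out as a small calculation. This is only a sketch of that reasoning: the batch size of 10 is the commenter's assumption, and the step from "6 requests per second" to "6 pollers" additionally assumes that each gateway request takes about one second. The function names and the `avg_request_s` parameter are illustrative.

```python
import math


def requests_per_second(items, interval_s, batch_size):
    """Average request rate if items are batched and spread evenly over the interval."""
    requests_per_interval = items / batch_size
    return requests_per_interval / interval_s


def min_pollers(rate_per_s, avg_request_s):
    """Pollers needed to keep up: request rate times average time per request."""
    return math.ceil(rate_per_s * avg_request_s)


# Figures from the comment: 3600 items every 60 seconds, batches of 10.
rate = requests_per_second(items=3600, interval_s=60, batch_size=10)
print(rate)                      # 6.0 requests per second

# Assumption: a request takes about 1 second, which gives the minimum of 6.
print(min_pollers(rate, 1.0))    # 6

# If requests were slower (hypothetical 2.5 seconds each), more pollers are needed.
print(min_pollers(rate, 2.5))    # 15
```

The second function shows why the estimate is sensitive to slow JMX endpoints: the number of busy pollers grows with the time each request takes, not only with the number of items.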
Comment by Trevor McLeod [ 2012 Sep 20 ] |
When the JMX connection goes red in the Zabbix web app, if I hover my mouse over the red indicator, the following message appears: "ZBX_TCP_READ() failed: [4] Interrupted system call". Also, if I enable debug level logging for the Zabbix server, I see the same error message. Trevor |
Comment by Trevor McLeod [ 2012 Sep 21 ] |
This is the Zabbix server log with debugging enabled when the problem was happening. |
Comment by Trevor McLeod [ 2012 Sep 27 ] |
After running without incident since September 21, the same problem, described above, reoccurred twice this morning on September 27th. No additional clues. Assistance would be appreciated. Trevor |
Comment by Trevor McLeod [ 2012 Oct 02 ] |
Some more clues:
First, when this problem occurs, I have been trying to trace a request from the Zabbix Server to the Zabbix Java Gateway and back. What I often see is that the request comes into the gateway, the gateway sends the request to the host being monitored, the metrics are returned from the host to the gateway, BUT the gateway never returns the metrics to the server. This ties in with the timeout errors we are seeing in the server log:
13312:20120918:103603.988 JMX item [jmx["org.hornetq:module=JMS,type=Queue,name=\"SIS\"",MessagesAdded]] on host [mqprod01.lms.it.ubc.ca] failed: another network error, wait for 15 seconds
Second, on one such occasion when I reviewed the gateway log, all the entries were from the same thread, e.g. [pool-1-thread-1]. It was as if none of the other threads were being used (we have 20 configured).
Finally, on another such occasion, there was no threading information at all on any gateway log entries: instead of [pool-x-thread-y], only the word "[main]" appeared.
Hope this helps, |
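A quick way to check which threads appear in a gateway log is to count the bracketed thread names. The sketch below assumes only that each entry carries a name such as [pool-1-thread-1] or [main], as described in the comment above; the sample lines are hypothetical and do not reproduce the gateway's actual log format.

```python
import re
from collections import Counter

# Thread names as described in the comment: [pool-x-thread-y] or [main].
THREAD = re.compile(r"\[(pool-\d+-thread-\d+|main)\]")


def threads_seen(lines):
    """Return a Counter mapping thread name to the number of log entries it wrote."""
    counts = Counter()
    for line in lines:
        match = THREAD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    "2012-10-02 09:15:01 [pool-1-thread-1] DEBUG hypothetical message A",
    "2012-10-02 09:15:02 [pool-1-thread-1] DEBUG hypothetical message B",
    "2012-10-02 09:15:03 [main] DEBUG hypothetical message C",
]

thread_counts = threads_seen(sample)
for name, count in thread_counts.most_common():
    print(name, count)
```

If only one pool thread, or only "main", shows up over a long stretch of the log, that matches the symptom described in this comment.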
Comment by Nikola Ivačič [ 2012 Oct 11 ] |
I have 5 servers with approx. 15 JVMs (Solr, Jetty, JBoss) monitored and I have the same problem. I tried the "more and more pollers" approach but it didn't help. First I set the config vars to ridiculous values just to tackle the problem:
I saw that zabbix_java responds to the server with values, but the server (maybe) times out before that. |
Comment by Maxim Tyukov [ 2012 Nov 28 ] |
+1, I have the same issue. Can anyone from the Zabbix team confirm that it's a configuration issue? |
Comment by Maxim Tyukov [ 2012 Nov 29 ] |
FYI, I have 15 hosts with JBoss. My conf: StartJavaPollers=5 in zabbix_proxy.conf and, in the Java gateway config settings.sh, START_POLLERS=5, |
Comment by David Israel [ 2013 Apr 04 ] |
I think this is caused by a JMX item taking too long during collection. For instance, I saw it with a jmx.discovery item set to collect every 30 seconds. |
Comment by Axel Wienberg [ 2013 Apr 17 ] |
We have the same problem monitoring 100 JMX endpoints (modelled as hosts), most of which are restarted at night (these are development test systems). Regularly, the zabbix java gateway does not recover from the connection loss / restart and keeps logging the message "another network error, wait for 15 seconds" already reported above, and raises the "Zabbix unreachable poller processes more than 75% busy" trigger. |
Comment by Aleksandrs Saveljevs [ 2014 Feb 17 ] |
This issue seems to be the same as |