[#ZBX-4640] another network error retrying to get a value

[ZBX-4640] another network error retrying to get a value Created: 2012 Feb 10 Updated: 2017 May 30 Resolved: 2012 Feb 13
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Server (S)
Affects Version/s:	1.9.7 (beta), 1.9.8 (beta)
Fix Version/s:	None

Type:	Incident report
Priority:	Critical
Reporter:	Anton Ryabchenko
Assignee:	Unassigned
Resolution:	Cannot Reproduce
Votes:
Labels:	None
Remaining Estimate:	Not Specified
Time Spent:	Not Specified
Original Estimate:	Not Specified
Environment:	Debian x32 & x64
Attachments:	connection refused.png

Description

I have problems with the server retrying to get a value.
First found in version 1.9.7 (fresh install); upgrading to 1.9.9 didn't fix it. Tested on 2 servers with lots of clients.
I found some fixed issues about similar errors, but it seems they are not completely fixed, since the upgrade to 1.9.9 doesn't help.

Logs are populated with the following:
17808:20120210:161916.259 resuming Zabbix agent checks on host [lari-casino]: connection restored
17821:20120210:161923.164 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17796:20120210:161925.241 Zabbix agent item [system.swap.size[,pfree]] on host [lari-poker] failed: first network error, wait for 20 seconds
17751:20120210:161929.219 Zabbix agent item [system.cpu.load[,avg15]] on host [lari-casino] failed: first network error, wait for 20 seconds
17812:20120210:161949.182 resuming Zabbix agent checks on host [lari-casino]: connection restored
17782:20120210:161958.749 Zabbix agent item [vm.memory.size[total]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
17782:20120210:162005.730 Zabbix agent item [vfs.fs.size[/,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
17819:20120210:162018.302 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17816:20120210:162025.407 resuming Zabbix agent checks on host [lari-casino]: connection restored
17785:20120210:162102.411 Zabbix agent item [vfs.fs.inode[/home,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
17806:20120210:162122.346 resuming Zabbix agent checks on host [lari-casino]: connection restored
17714:20120210:162124.508 Zabbix agent item [system.cpu.util[,idle,avg1]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
17793:20120210:162126.288 Zabbix agent item [vm.memory.inactive] on host [gw.viaden.com] failed: another network error, wait for 20 seconds
17726:20120210:162140.805 Zabbix agent item [system.cpu.load[,avg1]] on host [lari-casino] failed: first network error, wait for 20 seconds
17805:20120210:162146.459 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
17805:20120210:162200.672 resuming Zabbix agent checks on host [lari-casino]: connection restored

Note that the keys and servers are different.
Tested different UnreachableDelay values (from 5 to 20).
This is not a connectivity issue: multiple zabbix_get runs at the same time showed no errors at all.

The agent log with debug enabled shows no errors; it always sends data back.
tcpdump shows a lot of RST flags from the server. That doesn't look like a proper TCP session end.

I tried disabling checks on a host, waiting until the queue cleared, then starting monitoring again. It doesn't help.
Agent and server restarts sometimes help, sometimes not. The issue occurs randomly and can disappear after some time (usually a few hours), or stay for a long time.
There are no strange spikes on the internal Zabbix monitoring graphs (except housekeeping tasks); network activity and polling are stable.
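
To see which hosts are flapping and how often, the server log lines above can be tallied per host. The following is only a rough sketch in Python, not part of Zabbix: the function name `summarize`, the event labels and the regular expressions are assumptions made for illustration, derived from the log lines quoted in this ticket.

```python
import re
from collections import Counter, defaultdict

# Assumed line layout, taken from the excerpt: "<pid>:<YYYYMMDD>:<HHMMSS.mmm> <message>"
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3}) (.*)$")
# The host name is the bracketed value that follows "on host".
HOST_RE = re.compile(r"on host \[([^\]]+)\]")

# Message fragments seen in this ticket, mapped to made-up event labels.
EVENTS = [
    ("failed: first network error", "first_error"),
    ("failed: another network error", "another_error"),
    ("connection restored", "restored"),
    ("temporarily disabling", "disabled"),
    ("host became available", "enabled"),
]


def summarize(lines):
    """Count network-error related events per host; unrecognized lines are skipped."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        message = match.group(4)
        host = HOST_RE.search(message)
        if not host:
            continue
        for marker, label in EVENTS:
            if marker in message:
                counts[host.group(1)][label] += 1
                break
    return counts


# A few lines copied from the log excerpt in this description.
SAMPLE = r"""
17785:20120210:162102.411 Zabbix agent item [vfs.fs.inode[/home,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
17806:20120210:162122.346 resuming Zabbix agent checks on host [lari-casino]: connection restored
17714:20120210:162124.508 Zabbix agent item [system.cpu.util[,idle,avg1]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
17793:20120210:162126.288 Zabbix agent item [vm.memory.inactive] on host [gw.viaden.com] failed: another network error, wait for 20 seconds
17805:20120210:162146.459 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
"""

if __name__ == "__main__":
    for host, events in sorted(summarize(SAMPLE.splitlines()).items()):
        print(host, dict(events))
```

A host that keeps alternating between "first network error" and "connection restored" without ever being disabled matches the pattern described here.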

Comments

Comment by Alexei Vladishev [ 2012 Feb 11 ]

That's interesting. Could it be related to some Linux kernel limits on the TCP stack? Do you see anything suspicious in syslog or kern.log?

Comment by Oleksii Zagorskyi [ 2012 Feb 11 ]

Anton, let me know which error you see for the host when it's in an error state in the GUI (temporarily disabling - host unavailable). Maybe the error is "Invalid port number[]"?
Are those items inherited from a template?
If yes, for one host try "Delete and clear" on the template and then link it again.

I had very similar host behavior when some of their items did not have interfaceid defined (they had NULL for interfaceid) <- because we are using trunk

Here is the behavior when one agent item has NULL for interfaceid:
44813:20120211:133759.283 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: first network error, wait for 15 seconds
44816:20120211:133814.187 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: another network error, wait for 15 seconds
44816:20120211:133829.215 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: another network error, wait for 15 seconds
44816:20120211:133844.235 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: another network error, wait for 15 seconds
44816:20120211:133859.302 temporarily disabling Zabbix agent checks on host [it5]: host unavailable
44816:20120211:134000.520 enabling Zabbix agent checks on host [it5]: host became available

Try executing this SQL statement to find Zabbix agent items with NULL interfaceid on real hosts:
SELECT DISTINCT h.host,i.itemid,i.name,i.key_,i.interfaceid FROM items i, hosts h WHERE i.type=0 AND i.interfaceid IS NULL AND h.status=0 AND i.hostid=h.hostid;
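
To illustrate what this query selects, here is a small Python sketch that runs the same statement against an in-memory SQLite database. The tables are a toy stand-in containing only the columns the query touches, and the rows are invented; the real Zabbix schema is much larger, and the helper names `build_toy_db` and `find_null_interface_items` are hypothetical.

```python
import sqlite3

# Same statement as above, reflowed for readability.
QUERY = """
SELECT DISTINCT h.host, i.itemid, i.name, i.key_, i.interfaceid
FROM items i, hosts h
WHERE i.type = 0
  AND i.interfaceid IS NULL
  AND h.status = 0
  AND i.hostid = h.hostid
"""


def build_toy_db():
    """Create a minimal, made-up subset of the hosts and items tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hosts (hostid INTEGER PRIMARY KEY, host TEXT, status INTEGER);
        CREATE TABLE items (itemid INTEGER PRIMARY KEY, hostid INTEGER, type INTEGER,
                            name TEXT, key_ TEXT, interfaceid INTEGER);
        -- status 0: a real monitored host; the non-zero status stands in for a template
        INSERT INTO hosts VALUES (1, 'it5', 0);
        INSERT INTO hosts VALUES (2, 'Template_Example', 3);
        -- agent item on a real host with no interface: the problematic case
        INSERT INTO items VALUES (10, 1, 0, 'Free disk space', 'vfs.fs.size[/,pfree]', NULL);
        -- agent item on a real host with an interface: fine
        INSERT INTO items VALUES (11, 1, 0, 'CPU load', 'system.cpu.load[,avg1]', 5);
        -- template item without an interface: excluded by h.status = 0
        INSERT INTO items VALUES (12, 2, 0, 'Free disk space', 'vfs.fs.size[/,pfree]', NULL);
        """
    )
    return conn


def find_null_interface_items(conn):
    """Return agent items on real hosts whose interfaceid is NULL."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in find_null_interface_items(build_toy_db()):
        print(row)
```

Only the item on the real host with a NULL interfaceid is returned; the template item is filtered out by the host status condition.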

Comment by Anton Ryabchenko [ 2012 Feb 13 ]

I found NULL interfaceid on one of the hosts on one of the servers. There were items that had been added directly to the host; then I copied them to the template, but interfaceid is NULL.
Even after 'Unlink and clear' and linking again, interfaceid is NULL, and the errors continue.
But the second server has no items with NULL interfaceid.

All the items I use are inherited from templates.
I see no errors in the general Linux logs and, as I mentioned before, it's definitely not a network/OS/limit issue - I checked these things first.
Sorry, I cannot figure out where I should see the state in the GUI (Monitoring -> Hosts shows the availability icon only)

Comment by Oleksii Zagorskyi [ 2012 Feb 13 ]

> Sorry, I cannot figure out where I should see the state in the GUI (Monitoring -> Hosts shows the availability icon only)
Yes, that's the place. When a host (agent, SNMP, etc.) is in an error state, you can move the mouse over the RED icon and you will see a tooltip with the error description.
Which error do you see?

Note that interfaceid=NULL for items in a template is OK.

Try deleting this item (the copied and re-copied one) and creating it manually again.

Comment by Anton Ryabchenko [ 2012 Feb 13 ]

In the GUI for one server I see the following 2 errors:
"...error (111): connection refused" (see attachment for full error)
"Invalid port number[]"
Both errors appear on the JMX hosts (we use the zapcat agent and port 10052 for monitoring).
We have used zapcat for weeks without any problem, but sometimes we had white spaces in our graphs, I thought it was caused by performance. But it's not.

Trying to reproduce on another server to see an error.

Comment by Oleksii Zagorskyi [ 2012 Feb 13 ]

> We have used zapcat for weeks without any problem, but sometimes we had white spaces in our graphs, I thought it was caused by performance. But it's not.
No, that happened because the hosts (I mean the Zabbix agent type) were periodically disabled because of even a single problematic item (without interfaceid in the DB).

So you have to fix all the problems that generate the error "Invalid port number[]".
Maybe the interface is not empty, but the port is empty?

Yes, I can generate the error "Invalid port number []" when "interface.port" is empty (deleted manually in the DB).

Try this:
mysql> SELECT * FROM interface WHERE port="";
Its result should be empty.

Comment by Anton Ryabchenko [ 2012 Feb 13 ]

I have
SELECT * FROM interface WHERE port="";
Empty set (0.00 sec)

Comment by Oleksii Zagorskyi [ 2012 Feb 13 ]

I have no more ideas, sorry.

Comment by Anton Ryabchenko [ 2012 Feb 13 ]

Wow, it seems the issue has disappeared from one of the servers!
I unlinked the templates with cleanup and linked them back on the problematic hosts.
I have no more NULLs in the DB and no errors in the logs.
Thanks a lot!

P.S. The problems on the other server may indeed be network issues; some hosts are monitored over the Internet from the US to Asia.

Comment by Oleksii Zagorskyi [ 2012 Feb 13 ]

Be careful with trunk next time, but thanks for using it in production

Issue closed.

Generated at Wed Jul 30 03:26:38 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.
