[ZBX-17303] timeout in "net.tcp.service" blocks agent Created: 2020 Feb 11  Updated: 2020 Mar 23  Resolved: 2020 Mar 23

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: None
Affects Version/s: 4.0.16
Fix Version/s: None

Type: Problem report Priority: Trivial
Reporter: Onlyjob Assignee: Aigars Kadikis
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Debian

 Description   

I'm investigating regular gaps (5...10 min. long) in all graphs for a particular VM (passive agent, standard "Template OS Linux" template plus a few custom checks).
For example, "CPU iowait time" receives data for a moment, then there is nothing for about 5 minutes, then some data again, then another 5...10 min. gap, and so on.

Here is what I found repeatedly logged in the Zabbix server log:

```
Zabbix agent item "net.tcp.service[tcp,b2btest.internal,8880]" on host "web31.vm" failed: first network error, wait for 15 seconds
resuming Zabbix agent checks on host "web31.vm": connection restored
```

The interval for the check is 300s, and the problem appears to be that a firewall is dropping connections:

```
$ time telnet b2btest.internal 8880
Trying xx.xx.xxx.xx...
telnet: Unable to connect to remote host: Connection timed out

real 2m11.103s
user 0m0.003s
sys 0m0.000s
```

The problem is that a timeout in "net.tcp.service" makes the Zabbix agent unresponsive, which affects all other checks. The problem is exacerbated when there is more than one timing-out "net.tcp.service" check.
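
A simple way to see the blocking from the server side is to time a trivial passive check while the "net.tcp.service" item is stuck (a sketch; the host name is the one from the log above and the agent is assumed to listen on the default port 10050):

```
# While the pollers are waiting on the firewalled endpoint, this call hangs
# until one of the pre-forked agent processes becomes free again.
$ time zabbix_get -s web31.vm -p 10050 -k agent.ping
```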

I isolated the problem by disabling the problematic "net.tcp.service" item, which instantly restored the graphs to normal.



 Comments   
Comment by Aigars Kadikis [ 2020 Feb 13 ]

If the endpoint is not reachable there is not much to do, but to improve agent performance you can try the following (a config sketch follows the list):

  • Convert the 'net.tcp.service[tcp,b2btest.internal,8880]' check to "Zabbix agent (active)" to free up pollers, and use a big 'Timeout=' in the agent config.
  • Use a small 'Timeout=' in the Zabbix agent, so those failed checks time out faster.
  • Increase 'StartAgents=' in the Zabbix agent config.
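
A minimal sketch of the relevant zabbix_agentd.conf directives (the values and the ServerActive host are illustrative only, not recommendations):

```
# zabbix_agentd.conf -- illustrative values
Timeout=3          # per-check timeout in seconds (default 3); smaller values release pollers sooner
StartAgents=3      # number of pre-forked processes serving passive checks (default 3)
ServerActive=zabbix.example.com   # only needed if the item is converted to an active check
```
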
Comment by Onlyjob [ 2020 Feb 14 ]

Thanks for the suggestions.
The agent's Timeout is set to `22`, so even with 3 pre-forked agents a single timing-out check with a 300s period shouldn't cause such an effect, should it?
Anyway, increasing the number of pre-forked agents to `8` did not fix the problem, so it looks like a bug to me.

Comment by Aigars Kadikis [ 2020 Feb 27 ]

If you are not running long shell commands, I really suggest using a small timeout; the default is 'Timeout=3'.

> The agent's Timeout is set to `22`, so even with 3 pre-forked agents a single timing-out check with a 300s period shouldn't cause such an effect, should it?

If you have 3+ hosts which are not reachable, then all of the agent's capacity will be blocked for 22 seconds at a time during every 5-minute interval.

Doing the math for 3 pre-forked agents, Timeout=22, and hosts checked every 300s: 14+ unreachable hosts is the threshold at which the agent functionality gets blocked. This is by design.
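
One way to read this arithmetic (a sketch; it assumes each stuck check ties up one poller process for its full Timeout):

```
$ echo "scale=1; 300/22" | bc
13.6
# ~14 timed-out checks per 300 s interval keep a single pre-forked agent
# process permanently busy; with StartAgents=3 it takes roughly three times
# that many concurrent timeouts to block every poller at once.
```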

Use a smaller 'Timeout='.

Comment by Onlyjob [ 2020 Feb 28 ]

It is not desirable to use a smaller timeout, as we might introduce a command (UserParameter) that needs a longer one.

For now, just one unreachable host blocks the agent enough to disrupt the regular data flow even with 8 pre-forked agents. Does that not look like a bug to you?

Comment by Aigars Kadikis [ 2020 Mar 02 ]

It looks like a bug; I will try to reproduce it. Please attach a screenshot of the full item list on the host you are checking.

+ zabbix_agentd.conf.

Comment by Aigars Kadikis [ 2020 Mar 23 ]

Closing due to missing details on how to reproduce the issue.

Comment by Onlyjob [ 2020 Mar 23 ]

This is quite unfair and disappointing.

I have provided enough details to reproduce the problem.

Yes, I could not follow up in time to address the request for additional information (which I consider largely irrelevant), but you could at least have tried to reproduce it, couldn't you?

The agent runs with a mostly default config. The host is monitored using the standard Linux_OS template plus 4 custom 'net.tcp.service' checks.

That's pretty much it. There is no need to provide the full list of checks, as I have already reliably isolated the problem to one single 'net.tcp.service' item.

Comment by dimir [ 2020 Mar 23 ]

You could provide what was asked for and re-open the ticket.

Comment by Onlyjob [ 2020 Mar 23 ]

All information needed to reproduce the problem has already been provided, as per the original report and my previous comments.
