[ZBX-10868] Spawns "zabbix agent on <hostname>" if items like net.tcp.port[<IP>,<port>] Created: 2016 Jun 01  Updated: 2017 May 30  Resolved: 2016 Jun 06

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G), Server (S)
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Incident report Priority: Trivial
Reporter: Selivanov Pavel Assignee: Unassigned
Resolution: Won't fix Votes: 0
Labels: items
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: Ubuntu 14.04 Trusty
Packages: zabbix-server-mysql,zabbix-frontend-php, zabbix-agent, mysql-server-5.5



 Description   

Steps to reproduce:

  • Create a template and link it to 10 hosts.
  • Inside the template, create 10 items like net.tcp.port[<ipN>,<portN>]. Important: the TCP connection to <ipN>:<portN> must not get any response; it should end with a connect timeout.
  • Inside the template, create triggers for all items; otherwise the items are not updated.

It may be convenient to use XML export/import to create all these items and triggers.
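Before creating the items, it can help to confirm that a candidate <ipN>:<portN> pair really ends in a connect timeout rather than an immediate refusal. The following is a minimal sketch using only the Python standard library; the function name probe_tcp and the 3-second default are illustrative assumptions and are not part of Zabbix.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify one TCP connect attempt as 'open', 'timeout', 'refused' or 'error'.

    Hypothetical helper for picking reproduction targets; it is not Zabbix code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No response at all within `timeout` seconds: the case this report needs.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset, so the check would return quickly.
        return "refused"
    except OSError:
        # Anything else (no route, name resolution failure, ...).
        return "error"
```

Only pairs for which probe_tcp returns "timeout" are suitable for the reproduction steps above.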

Result:

Multiple events like "Zabbix agent on web13 is unreachable for 2 min" start appearing at random on the 10 hosts the template is linked to. When the created items are disabled, the problem disappears.

Additional information:

Server and agents do not suffer from performance problems: disk I/O and CPU load are low and there is plenty of free memory. The MySQL server is running fine, with no long queries.

Server and agent configs are standard, with default values for StartPollers, StartPollersUnreachable, Timeout, etc.

Increasing innodb_buffer_pool_size on MySQL server didn't help.

Administration/Queue shows 0 in the whole table.

The value of zabbix[wcache,values] on the graph "Zabbix Server Performance" goes down when the items are enabled and the problem starts. It goes back up when the items are disabled. The value of zabbix[queue] on the same graph stays at 0.

After enabling the items, these messages often appear in the server log:

2126:20160601:185801.820 Zabbix agent item "net.tcp.port[1.1.1.1,3136]" on host "web13" failed: first network error, wait for 15 seconds
2126:20160601:185816.870 resuming Zabbix agent checks on host "web13": connection restored
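
To see which hosts and items are involved, the server log can be summarized with a small script. This is a sketch that assumes the log lines look exactly like the two quoted above; the names FAILED, RESUMED and count_flaps are illustrative.

```python
import re
from collections import Counter

# Patterns modelled on the two server log lines quoted in this report.
FAILED = re.compile(
    r'item "(?P<key>[^"]+)" on host "(?P<host>[^"]+)" failed: first network error'
)
RESUMED = re.compile(r'resuming Zabbix agent checks on host "(?P<host>[^"]+)"')


def count_flaps(lines):
    """Count 'first network error' events per (host, item key) and resumes per host."""
    failures = Counter()
    resumes = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            failures[(match["host"], match["key"])] += 1
            continue
        match = RESUMED.search(line)
        if match:
            resumes[match["host"]] += 1
    return failures, resumes


SAMPLE = [
    '2126:20160601:185801.820 Zabbix agent item "net.tcp.port[1.1.1.1,3136]" '
    'on host "web13" failed: first network error, wait for 15 seconds',
    '2126:20160601:185816.870 resuming Zabbix agent checks on host "web13": '
    'connection restored',
]
```

Calling count_flaps(SAMPLE) attributes one failure to ("web13", "net.tcp.port[1.1.1.1,3136]") and one resume to "web13".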



 Comments   
Comment by Selivanov Pavel [ 2016 Jun 01 ]

Messages in agent log:

28245:20160601:190601.699 TCP expect network error: cannot connect to [[1.1.1.1]:3136]: [4] Interrupted system call
28245:20160601:190601.699 Sending back [0]

Comment by richlv [ 2016 Jun 01 ]

this is most likely a support case. note that the reporter also had an open request on serverfault at http://serverfault.com/questions/780012/zabbix-adding-items-make-agents-not-available

Comment by Selivanov Pavel [ 2016 Jun 01 ]

@richlv: Yes, I first tried to find help on the Zabbix forum and serverfault. I have dug into it for some time and now I am sure that this is a bug. I have written explicit steps to reproduce it.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jun 02 ]

In what way is this a bug?

By default, the Timeout values in the server/proxy and agent configuration files are equal (3 seconds). For the agent this timeout means: "if a check takes longer than 3 seconds, report the item as not supported to the server". For the poller it means: "if there is no response from the host within 3 seconds, mark this host as unreachable". Since a process cannot wait for exactly 3 seconds due to technical limitations, sometimes the agent gets bored first, sometimes the poller.

Try to reproduce with Timeout=4 in server configuration file.
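
The race described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: both sides start their clocks at the same moment, and agent_overrun and poller_overrun are hypothetical stand-ins for scheduling and network jitter. It does not reflect the actual Zabbix implementation.

```python
def who_gives_up_first(agent_timeout, server_timeout,
                       agent_overrun=0.0, poller_overrun=0.0):
    """Return 'agent' if the agent's answer arrives before the poller's deadline.

    'agent'  -> the agent gives up on the check first and still replies in time.
    'poller' -> the poller gives up first and treats the host as unreachable.
    """
    reply_ready = agent_timeout + agent_overrun        # when the agent answers
    poller_deadline = server_timeout + poller_overrun  # when the poller stops waiting
    return "agent" if reply_ready < poller_deadline else "poller"


# Equal timeouts: the outcome depends only on the jitter.
print(who_gives_up_first(3, 3, agent_overrun=0.01))                       # poller
print(who_gives_up_first(3, 3, agent_overrun=0.01, poller_overrun=0.02))  # agent
# A server timeout one second larger leaves a safety margin.
print(who_gives_up_first(3, 4, agent_overrun=0.5))                        # agent
```

With equal timeouts the result flips on a few milliseconds of jitter, which is why a larger server-side Timeout is worth trying.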

Comment by Selivanov Pavel [ 2016 Jun 02 ]

@glebs.ivanovskis:
> In what way is this a bug?

The server starts to randomly flip the trigger "zabbix-agent on <hostname> is not available for 2 minutes". The trigger is not about the problematic item itself, but about the whole agent's connectivity.

Comment by Selivanov Pavel [ 2016 Jun 03 ]

> Try to reproduce with Timeout=4 in server configuration file.

Yep, with Timeout=10 in zabbix_server.conf and Timeout=5 in zabbix_agentd.conf the problem disappears. But I insist that this is a real bug:

If:

  • the timeouts on both agent and server are the same (default: Timeout = 3)
  • there is an item net.tcp.port[<IP>,<port>] and a trigger using it
  • the pair [<IP>,<port>] is unreachable: TCP connections to it time out

Then:

"Zabbix-agent on <hostname> is unavailable" ( trigger expression: {agent.ping.nodata(2m)} = 1 ) events start appearing on hosts with this item. It is not the trigger for the specific item that fires, but the trigger for agent availability. This is a bug.

Comment by Aleksandrs Saveljevs [ 2016 Jun 06 ]

Some time ago we had problematic behavior with items that time out. It used to be that a single such item on a host would block other items from being processed: that item would time out, all items would move to unreachable pollers, and unreachable pollers would still try to use that same item every UnreachableDelay (15) and UnavailableDelay (60) seconds for checking the host.

As far as I remember, that problem was addressed in ZBX-4284. After the fix, Zabbix tries different items for checking the host using unreachable pollers. However, in your case, there are 10 items that time out on a host, and if it happens that they are all checked in order, then that takes a lot more than 2 minutes, so the "agent.ping" trigger fires. Consequently, there is no bug.
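
As a rough back-of-the-envelope check of the "more than 2 minutes" estimate, consider a simplified model in which each of the 10 items is tried once in order, each attempt costs the full Timeout (3 seconds), and each failure is followed by a wait of UnreachableDelay (15 seconds). These are modelling assumptions, not a description of the actual scheduler; switching to UnavailableDelay (60 seconds) along the way would only make the sweep longer.

```python
def min_sweep_seconds(n_items, timeout, retry_delay):
    """Lower bound on the time to try every timing-out item once, in order."""
    return n_items * (timeout + retry_delay)


NODATA_WINDOW = 120  # seconds, from the trigger expression agent.ping.nodata(2m)

sweep = min_sweep_seconds(n_items=10, timeout=3, retry_delay=15)
print(sweep)                  # 180
print(sweep > NODATA_WINDOW)  # True
```

Under these assumptions the sweep takes at least 180 seconds, which exceeds the 120-second nodata window of the trigger.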

The issue with the Timeout parameter being set to 3 by default in all configuration files was addressed in ZBXNEXT-2637: since then the server's timeout is 4 seconds by default.

Comment by Aleksandrs Saveljevs [ 2016 Jun 06 ]

It seems like there is nothing to fix, so closing as "Won't fix".

Comment by Selivanov Pavel [ 2016 Jun 06 ]

@asaveljevs: thank you for the reply.

It would be a great idea to make the default server timeout greater than the agent timeout in the stable 3.0 branch. It would solve some mysterious problems for some Zabbix users.

A second thought: it would be great to add an optional <timeout> parameter to agent items like net.tcp.port: net.tcp.port[ IP, port, {timeout} ], and to raise a warning if <timeout> is greater than the agent's timeout. Users want to be able to add as many net.tcp.port items as they need without running into implicit problems.
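
The proposed check could look roughly like the sketch below. The third <timeout> parameter is only the proposal from this comment and does not exist in Zabbix; parse_proposed_key and the default agent_timeout of 3 seconds are hypothetical names and values chosen for illustration.

```python
import re
import warnings

# Matches the proposed (hypothetical) form net.tcp.port[IP, port, timeout].
KEY = re.compile(r"^net\.tcp\.port\[(?P<params>[^\]]*)\]$")


def parse_proposed_key(key, agent_timeout=3.0):
    """Parse the proposed key and warn if its timeout exceeds the agent Timeout."""
    match = KEY.match(key.strip())
    if not match:
        raise ValueError(f"not a net.tcp.port key: {key!r}")
    parts = [p.strip() for p in match["params"].split(",")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 parameters, got {len(parts)}")
    ip, port = parts[0], int(parts[1])
    # The timeout is optional; an absent or empty third parameter means "not set".
    timeout = float(parts[2]) if len(parts) == 3 and parts[2] else None
    if timeout is not None and timeout > agent_timeout:
        warnings.warn(
            f"item timeout {timeout}s exceeds agent Timeout {agent_timeout}s"
        )
    return ip, port, timeout
```

For example, parse_proposed_key("net.tcp.port[1.1.1.1, 3136, 5]") would emit a warning with the default agent timeout of 3 seconds, while a value of 2 would pass silently.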

Comment by Selivanov Pavel [ 2016 Jun 06 ]

For the second thought I created a separate ticket: https://support.zabbix.com/browse/ZBX-10882
