[ZBX-9781] stale NFS stops agent operations Created: 2015 Aug 14  Updated: 2017 May 30  Resolved: 2016 Nov 14

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.4.4
Fix Version/s: 2.0.20rc1, 2.2.16rc1, 3.0.6rc1, 3.2.2rc1, 3.4.0alpha1

Type: Incident report Priority: Blocker
Reporter: Emmanuel Oginni Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: activeagent, delay, nfs, timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Zabbix Server 2.4.4 with active agent items vfs.fs.size or vfs.fs.inode used for NFS filesystems


Issue Links:
Duplicate
is duplicated by ZBX-9750 Agent in active mode hangs at times w... Closed

 Description   

If either of the items vfs.fs.size or vfs.fs.inode is used to monitor NFS-mounted filesystems and the NFS filesystem goes stale or becomes unreachable for any reason (network failure, NFS service failure, etc.), all of the agent's items start reporting no data, even items that have no relation to filesystem monitoring.



 Comments   
Comment by Aleksandrs Saveljevs [ 2015 Aug 14 ]

Related issue: ZBX-9750 (iSCSI).

Comment by richlv [ 2015 Aug 15 ]

in general, that is how hard nfs mounts are supposed to operate - applications wait indefinitely. maybe we can use separate processes for nfs mount checking... but that might be very complicated.
soft mounting might help, although i'm not sure how zabbix agent would react to i/o errors.

Comment by Emmanuel Oginni [ 2015 Aug 17 ]

Thanks for the update. Our observation is that this occurs for both hard- and soft-mounted NFS filesystems, tested on Red Hat Linux versions. We were expecting the item to report no data or a Zabbix "not supported" error, but to be timed out by the agent. Since there is a command timeout setting on the Zabbix agent (3 seconds by default), we expect it to time out the item data retrieval process, but this does not happen. Also, this affected all of the agent's operations and not just the item being retrieved.
Agent operations are only restored when the NFS filesystem becomes reachable again.
We created a custom user parameter that successfully times out the data fetch using a Perl I/O pipe timeout. This is what we expect from the Zabbix items vfs.fs.size and vfs.fs.inode as well. However, this client environment does not want to run such code on the several thousand servers involved in this setup.

Comment by Aleksandrs Saveljevs [ 2016 Jun 08 ]

Just to document a little investigation on the topic. In vfs.fs.size[] and vfs.fs.inode[], we use statfs() and statvfs() calls to get the necessary data. I have tried the following, making the NFS server unreachable by dropping packets using iptables:

  1. Added alarm(1) before these calls. This did not seem to have any effect - the statfs() and statvfs() calls would not be interrupted. Nor would they be interrupted by SIGTERM.
  2. Tried using stat() expecting ESTALE as per http://stackoverflow.com/questions/1643347/is-there-a-good-way-to-detect-a-stale-nfs-mount - stat() hung for me, too.
  3. Tried mounting soft and hard with "intr" as per https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Deployment_Guide/s1-nfs-client-config-options.html . Nothing changed.

Found https://communities.vmware.com/thread/348081?start=0&tstart=0 , which talks about sigar_file_system_ping() - see https://github.com/hyperic/sigar/blob/master/src/sigar.c , which tries to detect the file system type and, if it is NFS, tries to do an RPC ping. Not sure whether it solves the problem if the filesystem goes down right after the ping.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Sep 28 ]

This probably explains why your attempts failed:

The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL can interrupt a pending NFS operation on these kernels, and if specified, this mount option is ignored to provide backwards compatibility with older kernels.

Your link refers to the RHEL 5 manual, which was based on kernel 2.6.18. RHEL 6 uses kernel 2.6.32. Strangely, the intr option is still mentioned in the RHEL 7 manual.

Comment by Andris Zeila [ 2016 Oct 04 ]

Implemented a brute-force approach: the vfs.fs.size and vfs.fs.inode checks are spawned in a separate process/thread, which is killed after the timeout period.

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-9781
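
A minimal sketch of the fork-and-kill idea on Unix, for illustration only. All names here (run_check_forked(), check_fs_pused(), check_hangs(), the CHECK_* codes) are hypothetical, and the sketch waits with poll() on a pipe, which is an assumption and not necessarily how the development branch does it. The Windows thread variant is not covered.

```c
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHECK_OK	0
#define CHECK_FAIL	1
#define CHECK_TIMEOUT	2

/* A check takes a path and stores one number; it returns 0 on success. */
typedef int (*check_func_t)(const char *path, double *value);

/* Example check: percentage of used blocks, obtained with statvfs(). */
int check_fs_pused(const char *path, double *value)
{
	struct statvfs s;

	if (0 != statvfs(path, &s))
		return -1;

	*value = (0 == s.f_blocks ? 0.0 :
			100.0 * (double)(s.f_blocks - s.f_bfree) / (double)s.f_blocks);
	return 0;
}

/* Stand-in for a check stuck on a stale mount: it blocks forever. */
int check_hangs(const char *path, double *value)
{
	(void)path;
	(void)value;

	for (;;)
		pause();
}

/* Run func(path) in a child process; SIGKILL the child after timeout_ms. */
int run_check_forked(check_func_t func, const char *path, int timeout_ms, double *value)
{
	int fds[2], n, ret = CHECK_FAIL;
	struct pollfd pfd;
	pid_t pid;

	assert(NULL != func && NULL != value);

	if (0 != pipe(fds))
		return CHECK_FAIL;

	if (-1 == (pid = fork()))
	{
		close(fds[0]);
		close(fds[1]);
		return CHECK_FAIL;
	}

	if (0 == pid)	/* child: run the check and send the value to the parent */
	{
		double v;

		close(fds[0]);

		if (0 == func(path, &v) && (ssize_t)sizeof(v) == write(fds[1], &v, sizeof(v)))
			_exit(0);

		_exit(1);
	}

	close(fds[1]);	/* parent keeps only the read end */
	pfd.fd = fds[0];
	pfd.events = POLLIN;

	n = poll(&pfd, 1, timeout_ms);

	if (0 >= n)	/* timeout or poll() error: do not wait for the child any longer */
	{
		kill(pid, SIGKILL);
		ret = (0 == n ? CHECK_TIMEOUT : CHECK_FAIL);
	}
	else if ((ssize_t)sizeof(*value) == read(fds[0], value, sizeof(*value)))
		ret = CHECK_OK;

	close(fds[0]);
	waitpid(pid, NULL, 0);	/* reap the child so it does not remain a zombie */

	return ret;
}
```

For example, run_check_forked(check_fs_pused, "/", 3000, &v) would run the check with a 3 second limit. SIGKILL is used because, per the kernel note quoted earlier in this ticket, it is the only signal that interrupts a pending NFS operation on recent kernels.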

Comment by Aleksandrs Saveljevs [ 2016 Oct 04 ]

I wonder if we could try to detect file system type and only spawn a thread/process for NFS? Or maybe let the user specify fork/nofork in the item key?

wiper Any 'inside' checks would mean that we cannot universally execute any agent check in a separate process. Providing a way to mark items (on server or agent) to be executed in a separate process might be more interesting.
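
To illustrate the file system type detection idea (which was not the approach taken), here is a Linux-oriented sketch that looks the mount point up in a mount table such as /proc/mounts instead of calling statfs() on the path. The function names are hypothetical, and treating only the type strings "nfs" and "nfs4" as NFS is an assumption. Reading the mount table should not touch the mounted file system itself.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the mount type string names an NFS variant, 0 otherwise. */
int fstype_is_nfs(const char *fstype)
{
	return 0 == strcmp(fstype, "nfs") || 0 == strcmp(fstype, "nfs4");
}

/*
 * Look up mount_point in a mount table file (e.g. "/proc/mounts" on Linux).
 * Returns 1 if it is mounted as NFS, 0 if it is another type,
 * -1 if the table cannot be read or the mount point is not listed.
 */
int mount_point_is_nfs(const char *table, const char *mount_point)
{
	FILE *f;
	struct mntent *m;
	int ret = -1;

	assert(NULL != table && NULL != mount_point);

	if (NULL == (f = setmntent(table, "r")))
		return -1;

	while (NULL != (m = getmntent(f)))
	{
		/* keep the last match: later mounts shadow earlier ones */
		if (0 == strcmp(m->mnt_dir, mount_point))
			ret = fstype_is_nfs(m->mnt_type);
	}

	endmntent(f);

	return ret;
}
```

A caller could use mount_point_is_nfs("/proc/mounts", "/mnt/data") to decide whether a check needs a separate process; the path is only an example.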

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(1) Considering that alarm_signal_handler() in zabbix_agentd.c does not do anything, I wonder if it could be established in a simpler way using signal() instead of sigaction(). However, it does not hurt to leave sigaction() method as is.

wiper It is 'temporary' solution for 2.0. In 2.2 this should be moved to sighandler.c

asaveljevs OK, CLOSED.

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(2) When merging into 3.0 we may wish to replace alarm() calls with zbx_alarm_on() and zbx_alarm_off(), and check timeout using zbx_timed_out variable.

asaveljevs Similarly, exit(0) should be replaced with exit(EXIT_SUCCESS).

asaveljevs Shall we set a different proctitle for forked processes?

wiper Decided to leave proctitle as it is. Other points fixed, continuing in (10)
RESOLVED

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(3) Related to (2), CONFIG_TIMEOUT < time(NULL) - now does not seem to be the most precise way of checking for timeout, because it works with integral seconds. Maybe we could use zbx_time() here?

wiper RESOLVED in r63385

asaveljevs Hm, with zbx_time() it would be more readable. Let's use zbx_time() here as discussed! REOPENED

wiper RESOLVED in r63462

asaveljevs CLOSED
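
The precision concern in (3) can be shown with a small sketch. The function names are hypothetical, and demo_time() is only a stand-in that assumes a clock returning seconds as a double with a fractional part; it is not the zbx_time() implementation. With whole seconds, a wait from 10.1 s to 13.9 s is seen as 13 - 10 = 3, so a 3 second timeout is not detected even though 3.8 seconds have passed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical sub-second clock: seconds since the epoch as a double. */
double demo_time(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);

	return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Whole-second check in the style quoted in (3): 1 if the timeout has passed. */
int timed_out_integral(time_t start, time_t now, int timeout)
{
	return timeout < now - start;
}

/* Same check with fractional seconds. */
int timed_out_fractional(double start, double now, int timeout)
{
	return (double)timeout < now - start;
}
```

The integral variant can be off by almost a full second in either direction, depending on where within the second the start and end fall.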

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(4) We may potentially wish to use zbx_agent_execute_threaded_metric() for checks other than vfs.fs.size[] and vfs.fs.inode[]. For instance, checks that are dangerous and may crash the forked process. In that case, the return value of -1 from read() may indicate a crashed process (broken pipe) and not necessarily a timeout.

Similarly, we may wish to use this function on the server or proxy side, so it may not be a good idea to refer to agent in the error message: "Timeout while executing agent check.". For the same reason, how about renaming the function to zbx_execute_threaded_metric() and zbx_agent_metric_func_t to zbx_metric_func_t (for consistency with ZBX_METRIC)?

wiper Actually 'agent' stands not for Zabbix agent, but for the agent checks mentioned in documentation (Zabbix, SNMP, IPMI or JMX agents). I agree that it's confusing. So I renamed as suggested and also added crash detection. I'm not sure though if we should give extended crash info (backtrace and stuff).

RESOLVED in r63386

asaveljevs Error messages in the Windows version and function header comments still refer to agent checks. While at it, please review r63449. Also, shall we use zbx_fork() instead of fork()?

wiper Replaced fork() with zbx_fork(), updated comments/error messages, reviewed r63449.
RESOLVED in r63460, r63461

asaveljevs CLOSED
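
On Unix, the crash-versus-timeout distinction raised in (4) can be made from the status returned by waitpid(). The sketch below uses hypothetical names and is not the code from r63386. Note that a child killed by the parent after a timeout also terminates by signal, so a real caller would have to remember whether it sent the kill itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_OK	0
#define CHILD_FAILED	1
#define CHILD_CRASHED	2

/* Classify a waitpid() status: clean exit, non-zero exit, or death by signal. */
int classify_wait_status(int status)
{
	if (WIFSIGNALED(status))
		return CHILD_CRASHED;

	if (WIFEXITED(status) && 0 == WEXITSTATUS(status))
		return CHILD_OK;

	return CHILD_FAILED;
}

/* Demo child bodies: one exits cleanly, one simulates a crash with abort(). */
void demo_exit_ok(void)
{
	_exit(EXIT_SUCCESS);
}

void demo_crash(void)
{
	abort();
}

/* Fork, run body() in the child, and classify how the child ended (-1 on error). */
int run_and_classify(void (*body)(void))
{
	int status;
	pid_t pid;

	assert(NULL != body);

	if (-1 == (pid = fork()))
		return -1;

	if (0 == pid)
	{
		body();
		_exit(EXIT_FAILURE);	/* reached only if body() returns */
	}

	if (pid != waitpid(pid, &status, 0))
		return -1;

	return classify_wait_status(status);
}
```

WTERMSIG(status) could additionally be logged to report which signal ended the child.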

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(5) How about adding some debug logging to zbx_agent_execute_threaded_metric()? It may help debugging user reports later on.

wiper RESOLVED in r63387

asaveljevs Added logging of the forked process/thread identifier in r63477. It is useful for distinguishing that process among other Zabbix processes, otherwise it has the same proctitle, see (2).

wiper Good idea. Thanks.
CLOSED

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(6) Inside zbx_agent_execute_threaded_metric(), there is a mysterious call to close(STDOUT_FILENO). We should either understand what it is doing there or remove it.

wiper second one. Might have been leftover of some testing. Removed.
RESOLVED in r63388

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(7) Inside the Windows version of zbx_agent_execute_threaded_metric(), is it OK that we check for zbx_thread_start() returning 0 instead of ZBX_THREAD_ERROR? Is it OK that we do not CloseHandle(thread) if the thread completes successfully (see zbx_thread_wait() function)?

wiper no, not ok.
RESOLVED in r63389

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ]

(8) Please review my changes in r63361.

wiper Thanks, CLOSED

Comment by Aleksandrs Saveljevs [ 2016 Nov 01 ]

(9) It seems that error message for not supported items is lost during serialization and deserialization process.

wiper RESOLVED in r63485

asaveljevs CLOSED
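
To make the concern in (9) concrete: when a result crosses a pipe between processes, the error text has to be written and read back explicitly, or it is lost. The sketch below uses an invented layout (return code, message length, message bytes) and hypothetical names; it is not the serialization format used by the agent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
	int	ret;	/* outcome code of the check */
	char	*msg;	/* error text, or NULL if there is none */
}
demo_result_t;

/*
 * Layout (hypothetical): ret (int), message length (int, -1 for NULL), message bytes.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t demo_result_serialize(const demo_result_t *r, unsigned char *buf, size_t size)
{
	int len = (NULL == r->msg ? -1 : (int)strlen(r->msg));
	size_t need = 2 * sizeof(int) + (0 < len ? (size_t)len : 0);

	if (need > size)
		return 0;

	memcpy(buf, &r->ret, sizeof(int));
	memcpy(buf + sizeof(int), &len, sizeof(int));

	if (0 < len)
		memcpy(buf + 2 * sizeof(int), r->msg, (size_t)len);

	return need;
}

/* Returns 0 on success, -1 on malformed input. The caller frees r->msg. */
int demo_result_deserialize(const unsigned char *buf, size_t size, demo_result_t *r)
{
	int len;

	if (size < 2 * sizeof(int))
		return -1;

	memcpy(&r->ret, buf, sizeof(int));
	memcpy(&len, buf + sizeof(int), sizeof(int));
	r->msg = NULL;

	if (0 > len)
		return -1 == len ? 0 : -1;	/* -1 means "no message" */

	if (size - 2 * sizeof(int) < (size_t)len)
		return -1;

	if (NULL == (r->msg = malloc((size_t)len + 1)))
		return -1;

	memcpy(r->msg, buf + 2 * sizeof(int), (size_t)len);
	r->msg[len] = '\0';

	return 0;
}
```

A round trip through these two functions preserves both the code and the message, which is the property that (9) reports as missing.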

Comment by Andris Zeila [ 2016 Nov 09 ]

Ported to new development branches:

  • 2.2 - svn://svn.zabbix.com/branches/dev/ZBX-9781_22
  • 3.0 - svn://svn.zabbix.com/branches/dev/ZBX-9781_30

The 2.2 branch looks good. For 3.0 branch (also note that (2) was not fully addressed yet):

(10) Not sure about function zbx_alarm_timed_out(). At a minimum, it should return "int" instead of "unsigned int", because FAIL has a value of -1. In general, if we introduce this function, we should probably make use of it in other places.

wiper RESOLVED in r63656

asaveljevs Please verify my changes in r63659.

wiper Makes sense. CLOSED

asaveljevs The agent does not compile on Windows:

comms.o : error LNK2019: unresolved external symbol _zbx_alarm_timed_out referenced in function _zbx_tcp_write
..\..\..\bin\win32\zabbix_agentd.exe : fatal error LNK1120: 1 unresolved externals

wiper RESOLVED in r63660 and r63661.

asaveljevs CLOSED

Comment by Andris Zeila [ 2016 Nov 11 ]

Released in:

  • pre-2.0.20rc1 r63715
  • pre-2.2.16rc1 r63717
  • pre-3.0.6rc1 r63719
  • pre-3.2.2rc1 r63720
  • pre-3.3.0 r63721

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 28 ]

I don't know if it is important or if we should not worry about that. I'm just clearing my notes, so I'll just leave it here

When child (data gathering) process receives a SIGINT between close() and exit(), parent (listener) process waits indefinitely long in waitpid(). Anyway, whoever interrupted one process could have interrupted any other process...
