[ZBX-9781] stale NFS stops agent operations Created: 2015 Aug 14 Updated: 2017 May 30 Resolved: 2016 Nov 14 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 2.4.4 |
Fix Version/s: | 2.0.20rc1, 2.2.16rc1, 3.0.6rc1, 3.2.2rc1, 3.4.0alpha1 |
Type: | Incident report | Priority: | Blocker |
Reporter: | Emmanuel Oginni | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | activeagent, delay, nfs, timeout | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Zabbix Server 2.4.4 with active agent items vfs.fs.size or vfs.fs.inode used for NFS filesystems |
Issue Links: |
|
Description |
If either of the items vfs.fs.size or vfs.fs.inode is used to monitor NFS-mounted filesystems and the NFS filesystem goes stale or unreachable for any reason (network failure, NFS service failure, etc.), all agent items start reporting no data, even items that have no relation to filesystem monitoring. |
Comments |
Comment by Aleksandrs Saveljevs [ 2015 Aug 14 ] |
Related issue: |
Comment by richlv [ 2015 Aug 15 ] |
in general, that is how hard nfs mounts are supposed to operate - applications wait indefinitely. maybe we can use separate processes for nfs mount checking... but that might be very complicated. |
Comment by Emmanuel Oginni [ 2015 Aug 17 ] |
Thanks for the update. Our observation is that this occurs for both hard- and soft-mounted NFS filesystems, tested on Red Hat Linux versions. We were expecting the item to report no data or a Zabbix "not supported" error after being timed out by the agent. Since there is a command timeout setting on the Zabbix agent (3 seconds by default), we expected it to time out the item data retrieval, but this does not happen. Also, this affected the entire agent's operations and not just the item being retrieved. |
Comment by Aleksandrs Saveljevs [ 2016 Jun 08 ] |
Just to document a little investigation on the topic. In vfs.fs.size[] and vfs.fs.inode[], we use statfs() and statvfs() calls to get the necessary data. I have tried the following, making the NFS server unreachable by dropping packets using iptables:
Found https://communities.vmware.com/thread/348081?start=0&tstart=0 , which talks about sigar_file_system_ping() - see https://github.com/hyperic/sigar/blob/master/src/sigar.c , which tries to detect the file system type and, if it is NFS, tries to do an RPC ping. Not sure whether it solves the problem if the filesystem goes down right after the ping. |
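For readers unfamiliar with the calls mentioned in this comment, here is a minimal sketch of a statvfs()-based size check. It is not the Zabbix implementation; the helper name fs_usage() and its return convention are assumptions made for illustration. The point is that the blocking happens inside statvfs() itself, so a timeout checked after the call returns cannot help.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Hypothetical helper, not Zabbix code: reports total and free bytes for
 * the filesystem containing `path`. Returns 0 on success, -1 on failure.
 * On an unreachable hard-mounted NFS share the statvfs() call itself may
 * not return, which is the hang described in this issue. */
int fs_usage(const char *path, unsigned long long *total, unsigned long long *free_bytes)
{
    struct statvfs s;

    assert(NULL != path && NULL != total && NULL != free_bytes);

    if (0 != statvfs(path, &s))
        return -1;

    /* f_blocks and f_bfree are counted in units of f_frsize */
    *total = (unsigned long long)s.f_blocks * s.f_frsize;
    *free_bytes = (unsigned long long)s.f_bfree * s.f_frsize;

    return 0;
}
```

Calling fs_usage("/", &total, &free_bytes) on a healthy local filesystem returns immediately; on a stale mount the caller would be stuck inside the function.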
Comment by Glebs Ivanovskis (Inactive) [ 2016 Sep 28 ] |
This probably explains why your attempts failed:
Your link refers to RHEL 5 manual which was based on kernel 2.6.18. RHEL 6 uses kernel 2.6.32. Strangely, intr option is still mentioned in RHEL 7 manual. |
Comment by Andris Zeila [ 2016 Oct 04 ] |
Implemented a brute-force approach by spawning vfs.fs.size and vfs.fs.inode checks in a separate process/thread and killing them after the timeout period. Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-9781 |
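The approach described in this comment can be sketched roughly as follows for Unix-like systems. This is an illustration under stated assumptions, not the code from the development branch: the names run_with_timeout(), metric_func_t, metric_fast() and metric_hang() are hypothetical, and the wait uses select() on a pipe, whereas the review comments later in this thread indicate the actual fix relies on alarm().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check signature: writes a short value into out, returns 0 on success. */
typedef int (*metric_func_t)(char *out, size_t out_len);

#define RUN_OK       0
#define RUN_TIMEOUT -1
#define RUN_ERROR   -2

/* Runs func in a forked child and waits at most timeout_sec for its value. */
int run_with_timeout(metric_func_t func, int timeout_sec, char *out, size_t out_len)
{
    int fds[2], rc, ret = RUN_ERROR;
    pid_t pid;
    fd_set rd;
    struct timeval tv;
    ssize_t n;

    assert(0 < timeout_sec && 1 < out_len);

    if (-1 == pipe(fds))
        return RUN_ERROR;

    if (-1 == (pid = fork()))
    {
        close(fds[0]);
        close(fds[1]);
        return RUN_ERROR;
    }

    if (0 == pid)   /* child: run the check, send the value, exit */
    {
        char buf[128] = "";

        close(fds[0]);
        if (0 == func(buf, sizeof(buf)) && '\0' != buf[0])
        {
            if (-1 == write(fds[1], buf, strlen(buf)))
                _exit(1);
        }
        _exit(0);
    }

    close(fds[1]);  /* parent keeps only the read end */

    FD_ZERO(&rd);
    FD_SET(fds[0], &rd);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    rc = select(fds[0] + 1, &rd, NULL, NULL, &tv);

    if (0 == rc)
        ret = RUN_TIMEOUT;  /* nothing arrived in time */
    else if (0 < rc && 0 < (n = read(fds[0], out, out_len - 1)))
    {
        out[n] = '\0';      /* one read is enough for a short value */
        ret = RUN_OK;
    }

    if (RUN_OK != ret)
        kill(pid, SIGKILL); /* the "brute force" part */

    close(fds[0]);
    waitpid(pid, NULL, 0);  /* reap the child to avoid a zombie */

    return ret;
}

/* Demo checks: one returns at once, one never returns (like a stale mount). */
int metric_fast(char *out, size_t out_len) { snprintf(out, out_len, "42"); return 0; }
int metric_hang(char *out, size_t out_len) { (void)out; (void)out_len; for (;;) pause(); return 0; }
```

With these demo checks, run_with_timeout(metric_fast, 2, out, sizeof(out)) yields RUN_OK with "42" in out, while run_with_timeout(metric_hang, 1, out, sizeof(out)) yields RUN_TIMEOUT after about one second. One caveat of this sketch: the final waitpid() assumes the child can actually be killed, so a child stuck in a non-killable kernel wait would still block the parent.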
Comment by Aleksandrs Saveljevs [ 2016 Oct 04 ] |
I wonder if we could try to detect file system type and only spawn a thread/process for NFS? Or maybe let the user specify fork/nofork in the item key? wiper Any 'inside' checks would mean that we cannot universally execute any agent check in a separate process. Providing a way to mark items (on server or agent) to be executed in a separate process might be more interesting. |
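As an illustration of the first question only, file system type detection on Linux could look something like the sketch below. It is an assumption-laden example, not something this issue adopted: is_nfs_type() and mountpoint_is_nfs() are hypothetical names, the lookup is glibc/Linux-specific (getmntent() on /proc/self/mounts), and it matches the mount point exactly rather than resolving an arbitrary path. Reading the mount table should not touch the remote filesystem, which is why it is used here instead of statfs().

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file system type string names an NFS variant. */
int is_nfs_type(const char *type)
{
    assert(NULL != type);

    return 0 == strcmp(type, "nfs") || 0 == strcmp(type, "nfs4");
}

/* Returns 1 if mountpoint is listed as NFS, 0 if it is listed with another
 * type, -1 if it is not listed or the mount table cannot be read. */
int mountpoint_is_nfs(const char *mountpoint)
{
    FILE *f;
    struct mntent *m;
    int ret = -1;

    assert(NULL != mountpoint);

    if (NULL == (f = setmntent("/proc/self/mounts", "r")))
        return -1;

    while (NULL != (m = getmntent(f)))
    {
        /* keep scanning: a later entry for the same directory shadows earlier ones */
        if (0 == strcmp(m->mnt_dir, mountpoint))
            ret = is_nfs_type(m->mnt_type);
    }

    endmntent(f);

    return ret;
}
```

A caller could fork only when mountpoint_is_nfs() returns 1, although, as the reply in this comment points out, special-casing inside individual checks does not generalize to other agent checks.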
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(1) Considering that alarm_signal_handler() in zabbix_agentd.c does not do anything, I wonder if it could be established in a simpler way using signal() instead of sigaction(). However, it does not hurt to leave sigaction() method as is. wiper It is 'temporary' solution for 2.0. In 2.2 this should be moved to sighandler.c asaveljevs OK, CLOSED. |
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(2) When merging into 3.0 we may wish to replace alarm() calls with zbx_alarm_on() and zbx_alarm_off(), and check timeout using zbx_timed_out variable. asaveljevs Similarly, exit(0) should be replaced with EXIT_SUCCESS. asaveljevs Shall we set a different proctitle for forked processes? wiper Decided to leave proctitle as it is. Other points fixed, continuing in (10) asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(3) Related to (2), CONFIG_TIMEOUT < time(NULL) - now does not seem to be the most precise way of checking for timeout, because it works with integral seconds. Maybe we could use zbx_time() here? wiper RESOLVED in r63385 asaveljevs Hm, with zbx_time() it would be more readable. Let's use zbx_time() here as discussed! REOPENED wiper RESOLVED in r63462 asaveljevs CLOSED |
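To illustrate the precision concern in (3): time(NULL) truncates to whole seconds, so a difference of two such values can be off by almost a full second in either direction. A sub-second check might look like the sketch below. The helpers now_seconds() and timed_out() are hypothetical stand-ins and are not zbx_time() or any other Zabbix function; the monotonic clock is this example's choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Current time in seconds as a double, with sub-second resolution.
 * A monotonic clock is used so wall-clock adjustments do not affect it. */
double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);

    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns 1 if more than `timeout` seconds have elapsed since `start`. */
int timed_out(double start, double timeout)
{
    assert(0.0 <= timeout);

    return now_seconds() - start > timeout;
}
```

A caller would record start = now_seconds() before launching the check and poll timed_out(start, 3.0) for a 3-second limit.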
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(4) We may potentially wish to use zbx_agent_execute_threaded_metric() for checks other than vfs.fs.size[] and vfs.fs.inode[]. For instance, checks that are dangerous and may crash the forked process. In that case, the return value of -1 from read() may indicate a crashed process (broken pipe) and not necessarily a timeout. Similarly, we may wish to use this function on the server or proxy side, so it may not be a good idea to refer to agent in the error message: "Timeout while executing agent check.". For the same reason, how about renaming the function to zbx_execute_threaded_metric() and zbx_agent_metric_func_t to zbx_metric_func_t (for consistency with ZBX_METRIC)? wiper Actually 'agent' stands not for Zabbix agent, but for the agent checks mentioned in documentation (Zabbix, SNMP, IPMI or JMX agents). I agree that it's confusing. So I renamed as suggested and also added crash detection. I'm not sure though if we should give extended crash info (backtrace and stuff). RESOLVED in r63386 asaveljevs Error messages in the Windows version and function header comments still refer to agent checks. While at it, please review r63449. Also, shall we use zbx_fork() instead of fork()? wiper Replaced fork() with zbx_fork(), updated comments/error messages, reviewed r63449. asaveljevs CLOSED |
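On the crash-versus-timeout distinction raised in (4), one possible way to tell the cases apart on Unix is to inspect the status reported by waitpid(). The sketch below is only an illustration; how crash detection was actually added in r63386 is not shown in this thread, and all names here (classify_status(), run_and_classify(), body_ok(), body_crash()) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum child_result { CHILD_OK, CHILD_FAILED, CHILD_CRASHED, CHILD_UNKNOWN };

/* Interprets a waitpid() status: normal exit versus death by signal. */
enum child_result classify_status(int status)
{
    if (WIFEXITED(status))
        return 0 == WEXITSTATUS(status) ? CHILD_OK : CHILD_FAILED;

    if (WIFSIGNALED(status))
        return CHILD_CRASHED;   /* e.g. SIGSEGV or SIGABRT */

    return CHILD_UNKNOWN;
}

/* Forks, runs body() in the child, and classifies how the child ended. */
enum child_result run_and_classify(void (*body)(void))
{
    int status;
    pid_t pid;

    assert(NULL != body);

    if (-1 == (pid = fork()))
        return CHILD_UNKNOWN;

    if (0 == pid)
    {
        body();
        _exit(0);
    }

    if (pid != waitpid(pid, &status, 0))
        return CHILD_UNKNOWN;

    return classify_status(status);
}

/* Demo bodies: one returns normally, one terminates via SIGABRT. */
void body_ok(void) { }
void body_crash(void) { abort(); }
```

Note that a child killed by the parent after a timeout would also show up as signalled, so the parent has to remember whether it sent the signal itself.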
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(5) How about adding some debug logging to zbx_agent_execute_threaded_metric()? It may help debugging user reports later on. wiper RESOLVED in r63387 asaveljevs Added logging of the forked process/thread identifier in r63477. It is useful for distinguishing that process among other Zabbix processes, otherwise it has the same proctitle, see (2). wiper Good idea. Thanks. |
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(6) Inside zbx_agent_execute_threaded_metric(), there is a mysterious call to close(STDOUT_FILENO). We should either understand what it is doing there or remove it. wiper second one. Might have been leftover of some testing. Removed. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(7) Inside the Windows version of zbx_agent_execute_threaded_metric(), is it OK that we check for zbx_thread_start() returning 0 instead of ZBX_THREAD_ERROR? Is it OK that we do not CloseHandle(thread) if the thread completes successfully (see zbx_thread_wait() function)? wiper no, not ok. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2016 Oct 26 ] |
(8) Please review my changes in r63361. wiper Thanks, CLOSED |
Comment by Aleksandrs Saveljevs [ 2016 Nov 01 ] |
(9) It seems that error message for not supported items is lost during serialization and deserialization process. wiper RESOLVED in r63485 asaveljevs CLOSED |
Comment by Andris Zeila [ 2016 Nov 09 ] |
Ported to new development branches:
The 2.2 branch looks good. For 3.0 branch (also note that (2) was not fully addressed yet): (10) Not sure about function zbx_alarm_timed_out(). At a minimum, it should return "int" instead of "unsigned int", because FAIL has a value of -1. In general, if we introduce this function, we should probably make use of it in other places. wiper RESOLVED in r63656 asaveljevs Please verify my changes in r63659. wiper Makes sense. CLOSED asaveljevs The agent does not compile on Windows: comms.o : error LNK2019: unresolved external symbol _zbx_alarm_timed_out referenced in function _zbx_tcp_write ..\..\..\bin\win32\zabbix_agentd.exe : fatal error LNK1120: 1 unresolved externals wiper RESOLVED in r63660 and r63661. asaveljevs CLOSED |
Comment by Andris Zeila [ 2016 Nov 11 ] |
Released in:
|
Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 28 ] |
I don't know if it is important or if we should not worry about that. I'm just clearing my notes, so I'll just leave it here. When the child (data gathering) process receives a SIGINT between close() and exit(), the parent (listener) process waits indefinitely in waitpid(). Anyway, whoever interrupted one process could have interrupted any other process... |