[ZBX-8555] "write:Broken pipe" error while executing remote command Created: 2014 Jul 30  Updated: 2017 May 30  Resolved: 2014 Aug 07

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.2, 2.3.3
Fix Version/s: 2.2.9rc1, 2.4.4rc1, 2.5.0

Type: Incident report Priority: Minor
Reporter: Arturs Galapovs (Inactive) Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: logging, remotecommands
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.57-3+deb7u1 x86_64 GNU/Linux


Attachments: PNG File 01_Conditions.png     PNG File 02_Operations.png    

 Description   

Executing custom command as action operation produce "write: Broken pipe" error in server logs.

Reproduction steps:
1. Configure simple Host, Item and Trigger
2. Go to Configuration->Action->Create Action
3. Name action in "Action" tab
4. Add your trigger in "Conditions" tab (see possible variant on 01_Conditions.jpg attached)
5. Configure operation to run simple custom command with output into standard output and run it on server side (not agent). For example simple "date" call. (see possible variant on 02_Operation.jpg attached)
6. Make action to run
7. Observe "write:Broken pipe" error message in zabbix_server log file



 Comments   
Comment by Arturs Galapovs (Inactive) [ 2014 Aug 01 ]

Fix located in svn://svn.zabbix.com/branches/dev/ZBX-8555

Comment by Aleksandrs Saveljevs [ 2014 Aug 04 ]

(1) In https://groups.google.com/d/msg/comp.unix.programmer/NXKriaq6G3g/UMBQg88EWEcJ it says the following:

The latter. You typically shouldn't call waitpid() until the child has closed its end of the pipe; at least, not without the WNOHANG flag.

You'll get away with it provided that that child doesn't write more than pathconf(_PC_PIPE_BUF) bytes to the pipe. If it tries to write any more than that, you'll get a deadlock: the child will block waiting for the reader (i.e. the parent) to consume some data to make space in the pipe, which won't happen because the parent will be blocked on waitpid().

In the proposed solution waitpid() is called before close() and, indeed, this whole process structure hangs when the executed script produces a lot of output.

For instance, I replaced "date" in your example with "for i in `seq 1 100000`; do echo $i; done | tee /tmp/remote-command.txt" and the escalator hangs in zbx_waitpid() for 5 minutes, which is the value of CONFIG_TRAPPER_TIMEOUT. If we do "tail -n 1 /tmp/remote-command.txt" at that moment, we will see that the last number written is 12762.

One solution might be to redirect the child process output streams to /dev/null instead, or use zbx_execute_nowait() if we do not need the output, similarly to how we do in "nowait" case in SYSTEM_RUN(). REOPENED.

arturs.galapovs New implementation in r47912:47915. RESOLVED

asaveljevs It seems to me that the implementation can be improved, so I would like to ask sasha to take a look at it. The general direction seems correct, except that it currently requires zbx_exec() to be followed with exit(), which is not elegant.

The simplest solution with least intrusive changes is to leave the old code as is, but read from the pipe and discard the output if "buffer" is NULL. That might not be most efficient though.

Meanwhile, I have suggested some stylistic fixed in r48068. Note that the branch in its current state will not compile on Windows.

arturs.galapovs windows agent compilation fixed in r48084. Please review these changes as well

asaveljevs Original solution abandonded in favor of a simpler solution. WON'T FIX.

Comment by Andris Zeila [ 2015 Jan 08 ]

After short discussion it was decided to move forward with the simplest solution - always read from the pipe, even if the buffer is NULL (just discard the results in this case).

Comment by Andris Zeila [ 2015 Jan 08 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8555

Comment by Aleksandrs Saveljevs [ 2015 Jan 09 ]

(2) Please review suggestions in r51464 and r51470.

wiper Looks good, CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 09 ]

(3) The problem with current code is that, in zbx_read_from_pipe() function on Windows, "offset" variable is never increased in case "buf" is NULL. Therefore, the PeekNamedPipe() loop can loop indefinitely, as long as there is output to read, and this process is bound neither by size, nor by time. A similar situation is on Unix, where SIGALRM can arrive outside of a call to read().

wiper Windows: I decided to move timeout check to the start of reading loop. While this introduces slight overhead while reading, it will ensure that timeout properly occurs even if there are still data to read. RESOLVED in r51466

wiper Regarding Unix-like systems - we decided to postpone until current timeout system based on alarms are reviewed and reworked.

asaveljevs CLOSED

Comment by Andris Zeila [ 2015 Jan 10 ]

Released in:

  • pre-2.2.9rc1 r51478
  • pre-2.4.4rc1 r51479
  • pre-2.5.0 r51480
Generated at Wed Apr 24 13:16:58 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.