ZABBIX BUGS AND ISSUES
  1. ZABBIX BUGS AND ISSUES
  2. ZBX-8555

"write:Broken pipe" error while executing remote command

    Details

    • Type: Incident report Incident report
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.2, 2.3.3
    • Fix Version/s: 2.2.9rc1, 2.4.4rc1, 2.5.0
    • Component/s: Server (S)
    • Environment:
      Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.57-3+deb7u1 x86_64 GNU/Linux

      Description

      Executing custom command as action operation produce "write: Broken pipe" error in server logs.

      Reproduction steps:
      1. Configure simple Host, Item and Trigger
      2. Go to Configuration->Action->Create Action
      3. Name action in "Action" tab
      4. Add your trigger in "Conditions" tab (see possible variant on 01_Conditions.jpg attached)
      5. Configure operation to run simple custom command with output into standard output and run it on server side (not agent). For example simple "date" call. (see possible variant on 02_Operation.jpg attached)
      6. Make action to run
      7. Observe "write:Broken pipe" error message in zabbix_server log file

        Activity

        Hide
        Arturs Galapovs (Inactive) added a comment -

        Fix located in svn://svn.zabbix.com/branches/dev/ZBX-8555

        Show
        Arturs Galapovs (Inactive) added a comment - Fix located in svn://svn.zabbix.com/branches/dev/ZBX-8555
        Hide
        Aleksandrs Saveljevs added a comment - - edited

        (1) In https://groups.google.com/d/msg/comp.unix.programmer/NXKriaq6G3g/UMBQg88EWEcJ it says the following:

        The latter. You typically shouldn't call waitpid() until the child has closed its end of the pipe; at least, not without the WNOHANG flag.

        You'll get away with it provided that that child doesn't write more than pathconf(_PC_PIPE_BUF) bytes to the pipe. If it tries to write any more than that, you'll get a deadlock: the child will block waiting for the reader (i.e. the parent) to consume some data to make space in the pipe, which won't happen because the parent will be blocked on waitpid().

        In the proposed solution waitpid() is called before close() and, indeed, this whole process structure hangs when the executed script produces a lot of output.

        For instance, I replaced "date" in your example with "for i in `seq 1 100000`; do echo $i; done | tee /tmp/remote-command.txt" and the escalator hangs in zbx_waitpid() for 5 minutes, which is the value of CONFIG_TRAPPER_TIMEOUT. If we do "tail -n 1 /tmp/remote-command.txt" at that moment, we will see that the last number written is 12762.

        One solution might be to redirect the child process output streams to /dev/null instead, or use zbx_execute_nowait() if we do not need the output, similarly to how we do in "nowait" case in SYSTEM_RUN(). REOPENED.

        Arturs Galapovs New implementation in r47912:47915. RESOLVED

        Aleksandrs Saveljevs It seems to me that the implementation can be improved, so I would like to ask Alexander Vladishev to take a look at it. The general direction seems correct, except that it currently requires zbx_exec() to be followed with exit(), which is not elegant.

        The simplest solution with least intrusive changes is to leave the old code as is, but read from the pipe and discard the output if "buffer" is NULL. That might not be most efficient though.

        Meanwhile, I have suggested some stylistic fixed in r48068. Note that the branch in its current state will not compile on Windows.

        Arturs Galapovs windows agent compilation fixed in r48084. Please review these changes as well

        Aleksandrs Saveljevs Original solution abandonded in favor of a simpler solution. WON'T FIX.

        Show
        Aleksandrs Saveljevs added a comment - - edited (1) In https://groups.google.com/d/msg/comp.unix.programmer/NXKriaq6G3g/UMBQg88EWEcJ it says the following: The latter. You typically shouldn't call waitpid() until the child has closed its end of the pipe; at least, not without the WNOHANG flag. You'll get away with it provided that that child doesn't write more than pathconf(_PC_PIPE_BUF) bytes to the pipe. If it tries to write any more than that, you'll get a deadlock: the child will block waiting for the reader (i.e. the parent) to consume some data to make space in the pipe, which won't happen because the parent will be blocked on waitpid(). In the proposed solution waitpid() is called before close() and, indeed, this whole process structure hangs when the executed script produces a lot of output. For instance, I replaced "date" in your example with "for i in `seq 1 100000`; do echo $i; done | tee /tmp/remote-command.txt" and the escalator hangs in zbx_waitpid() for 5 minutes, which is the value of CONFIG_TRAPPER_TIMEOUT. If we do "tail -n 1 /tmp/remote-command.txt" at that moment, we will see that the last number written is 12762. One solution might be to redirect the child process output streams to /dev/null instead, or use zbx_execute_nowait() if we do not need the output, similarly to how we do in "nowait" case in SYSTEM_RUN(). REOPENED. Arturs Galapovs New implementation in r47912:47915. RESOLVED Aleksandrs Saveljevs It seems to me that the implementation can be improved, so I would like to ask Alexander Vladishev to take a look at it. The general direction seems correct, except that it currently requires zbx_exec() to be followed with exit(), which is not elegant. The simplest solution with least intrusive changes is to leave the old code as is, but read from the pipe and discard the output if "buffer" is NULL. That might not be most efficient though. Meanwhile, I have suggested some stylistic fixed in r48068. Note that the branch in its current state will not compile on Windows. Arturs Galapovs windows agent compilation fixed in r48084. Please review these changes as well Aleksandrs Saveljevs Original solution abandonded in favor of a simpler solution. WON'T FIX.
        Hide
        Andris Zeila added a comment -

        After short discussion it was decided to move forward with the simplest solution - always read from the pipe, even if the buffer is NULL (just discard the results in this case).

        Show
        Andris Zeila added a comment - After short discussion it was decided to move forward with the simplest solution - always read from the pipe, even if the buffer is NULL (just discard the results in this case).
        Hide
        Andris Zeila added a comment -

        Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8555

        Show
        Andris Zeila added a comment - Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8555
        Hide
        Aleksandrs Saveljevs added a comment - - edited

        (2) Please review suggestions in r51464 and r51470.

        Andris Zeila Looks good, CLOSED

        Show
        Aleksandrs Saveljevs added a comment - - edited (2) Please review suggestions in r51464 and r51470. Andris Zeila Looks good, CLOSED
        Hide
        Aleksandrs Saveljevs added a comment - - edited

        (3) The problem with current code is that, in zbx_read_from_pipe() function on Windows, "offset" variable is never increased in case "buf" is NULL. Therefore, the PeekNamedPipe() loop can loop indefinitely, as long as there is output to read, and this process is bound neither by size, nor by time. A similar situation is on Unix, where SIGALRM can arrive outside of a call to read().

        Andris Zeila Windows: I decided to move timeout check to the start of reading loop. While this introduces slight overhead while reading, it will ensure that timeout properly occurs even if there are still data to read. RESOLVED in r51466

        Andris Zeila Regarding Unix-like systems - we decided to postpone until current timeout system based on alarms are reviewed and reworked.

        Aleksandrs Saveljevs CLOSED

        Show
        Aleksandrs Saveljevs added a comment - - edited (3) The problem with current code is that, in zbx_read_from_pipe() function on Windows, "offset" variable is never increased in case "buf" is NULL. Therefore, the PeekNamedPipe() loop can loop indefinitely, as long as there is output to read, and this process is bound neither by size, nor by time. A similar situation is on Unix, where SIGALRM can arrive outside of a call to read(). Andris Zeila Windows: I decided to move timeout check to the start of reading loop. While this introduces slight overhead while reading, it will ensure that timeout properly occurs even if there are still data to read. RESOLVED in r51466 Andris Zeila Regarding Unix-like systems - we decided to postpone until current timeout system based on alarms are reviewed and reworked. Aleksandrs Saveljevs CLOSED
        Hide
        Andris Zeila added a comment -

        Released in:

        • pre-2.2.9rc1 r51478
        • pre-2.4.4rc1 r51479
        • pre-2.5.0 r51480
        Show
        Andris Zeila added a comment - Released in: pre-2.2.9rc1 r51478 pre-2.4.4rc1 r51479 pre-2.5.0 r51480

          People

          • Assignee:
            Unassigned
            Reporter:
            Arturs Galapovs (Inactive)
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: