  1. ZABBIX BUGS AND ISSUES
  2. ZBX-9416

escalation stops when action hangs


    • Type: Incident report
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • 2.0.12
    • Server (S)
    • None

      We encountered the problem described below in version 2.0.4.

      We have some actions that run PHP scripts when events occur. One day, many events occurred at the same time, so many actions ran, and one of these PHP scripts hung for some reason.

      The script kept running after the timeout ("TrapperTimeout" in zabbix_server.def) expired, so Zabbix did not run the subsequent scripts.

      We checked execute.c, which runs actions. Below is the part of the zbx_execute function where the parent process (the escalator in this case) starts the child process and waits for it:

      int	zbx_execute(const char *command, char **buffer, char *error, size_t max_error_len, int timeout)
      {
          :
          if (-1 == rc || -1 == zbx_waitpid(pid))    /* (1) */
          {
              if (EINTR == errno)
                  ret = TIMEOUT_ERROR;
              else
                  zbx_snprintf(error, max_error_len, "zbx_waitpid() failed: %s", zbx_strerror(errno));

              /* kill the whole process group, pid must be the leader */
              if (-1 == kill(-pid, SIGTERM))    /* (2) */
                  zabbix_log(LOG_LEVEL_ERR, "failed to kill [%s]: %s", command, zbx_strerror(errno));

              zbx_waitpid(pid);    /* (3) */
          }
          :
      }
      

      In this case, the following sequence would occur:

      i) The escalator started a PHP script and waited for it with zbx_waitpid (1).
      ii) The PHP script hung for some reason, so the timeout expired and SIGALRM was raised. zbx_waitpid (1) returned because it was interrupted by the signal.
      iii) The escalator sent SIGTERM to the PHP script with kill (2). Sending succeeded, but the PHP script did not terminate for some reason.
      iv) The escalator waited for the PHP script again with zbx_waitpid (3). The PHP script was still alive and hanging, so the escalator waited forever at zbx_waitpid (3), and the subsequent scripts were never run.
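
The sequence above can be reproduced outside Zabbix with a small sketch. This is not Zabbix code: child_survives_sigterm() and on_alarm() are hypothetical names, the child ignores SIGTERM to stand in for the script that did not terminate, and kill() targets the single pid rather than the process group. To keep the demonstration finite, step (3) is replaced by a non-blocking probe; a blocking waitpid() at that point would never return.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Empty handler: its only job is to make a blocking waitpid() fail with EINTR. */
static void on_alarm(int sig) { (void)sig; }

/* Hypothetical helper, not part of Zabbix.
 * Returns 1 if the child is still alive after timeout + SIGTERM (the state in
 * which the blocking wait (3) would hang), 0 if it is gone, -1 on setup error. */
int child_survives_sigterm(void)
{
    struct sigaction sa, old_alrm;
    pid_t pid;
    int still_running = 0;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                    /* no SA_RESTART, so waitpid() is interrupted */
    if (-1 == sigaction(SIGALRM, &sa, &old_alrm))
        return -1;

    pid = fork();
    if (-1 == pid)
    {
        sigaction(SIGALRM, &old_alrm, NULL);
        return -1;
    }
    if (0 == pid)
    {
        signal(SIGTERM, SIG_IGN);       /* stand-in for the script that did not terminate */
        for (;;)
            pause();
    }

    alarm(1);                           /* the timeout */
    if (-1 == waitpid(pid, NULL, 0) && EINTR == errno)      /* (1) interrupted by SIGALRM */
    {
        kill(pid, SIGTERM);                                 /* (2) succeeds, child ignores it */
        sleep(1);                                           /* give the child time to react */
        still_running = (0 == waitpid(pid, NULL, WNOHANG)); /* (3) as a non-blocking probe */
    }
    alarm(0);

    if (still_running)
    {
        kill(pid, SIGKILL);             /* clean up the demonstration child */
        waitpid(pid, NULL, 0);
    }
    sigaction(SIGALRM, &old_alrm, NULL);
    return still_running;
}
```

Calling child_survives_sigterm() takes about two seconds. A return value of 1 means the child was still running after SIGTERM, which is the situation described in step iv).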

      The same problem could also occur in any other process that uses zbx_execute.

      So, zbx_waitpid (3) needs to return even if the child process hangs.

      Some possible resolutions:
      1. Set alarm() for a few seconds before zbx_waitpid (3) so that it returns when interrupted by the signal.
      2. Remove zbx_waitpid (3) itself.
         In this case, however, child processes would become "zombies" in the normal case, so other parts of the source code would need to be modified as well.
      3. Send SIGKILL instead of SIGTERM at kill (2).
         However, some processes might not terminate even on SIGKILL.
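
As a rough illustration, resolutions 1 and 3 can be combined: bound each wait with alarm(), and escalate from SIGTERM to SIGKILL only when the first wait times out. The sketch below is not a patch for execute.c. terminate_child(), wait_bounded(), on_alarm() and the grace_sec parameter are hypothetical, plain waitpid() is used instead of zbx_waitpid(), and kill() targets a single pid where the original code targets the process group with -pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Empty handler: lets SIGALRM interrupt waitpid() with EINTR. */
static void on_alarm(int sig) { (void)sig; }

/* Wait for pid for at most `seconds` seconds.
 * Returns 1 if the child was reaped, 0 if the wait was interrupted
 * (treated as a timeout), -1 on any other error. */
static int wait_bounded(pid_t pid, unsigned int seconds)
{
    pid_t rc;
    int saved_errno;

    alarm(seconds);
    rc = waitpid(pid, NULL, 0);
    saved_errno = errno;
    alarm(0);                           /* cancel a still-pending alarm */

    if (rc == pid)
        return 1;
    return (-1 == rc && EINTR == saved_errno) ? 0 : -1;
}

/* Hypothetical replacement for the kill (2) / wait (3) pair.
 * Returns 0 if the child exited after SIGTERM, 1 if SIGKILL was needed,
 * -1 if the child could not be reaped within the grace periods. */
int terminate_child(pid_t pid, unsigned int grace_sec)
{
    struct sigaction sa, old_alrm;
    int rc, ret = -1;

    assert(pid > 0 && grace_sec > 0);   /* alarm(0) would disable the timeout */

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                    /* no SA_RESTART, so waitpid() is interrupted */
    if (-1 == sigaction(SIGALRM, &sa, &old_alrm))
        return -1;

    kill(pid, SIGTERM);
    rc = wait_bounded(pid, grace_sec);  /* resolution 1: bounded wait */
    if (1 == rc)
        ret = 0;
    else if (0 == rc)
    {
        kill(pid, SIGKILL);             /* resolution 3, only as an escalation */
        if (1 == wait_bounded(pid, grace_sec))
            ret = 1;
    }

    sigaction(SIGALRM, &old_alrm, NULL);
    return ret;
}
```

Because the second wait is bounded too, the caller gets control back even if the child does not die on SIGKILL. In that case the function returns -1 and the child remains unreaped, so the caller would still have to decide how to collect it later.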

      Thank you.

            Assignee: Unassigned
            Reporter: Junichi Sakuyama