-
Incident report
-
Resolution: Duplicate
-
Critical
-
None
-
2.0.12
-
None
We encountered the following problem in version 2.0.4.
We have some actions that run PHP scripts when events occur. One day, many events occurred at the same time, so many actions ran, and one of the PHP scripts hung for an unknown reason.
The script kept running after the timeout ("TrapperTimeout" in zabbix_server.def) expired, so Zabbix did not run the subsequent scripts.
We checked execute.c, which runs actions. Below is the part of the zbx_execute function where the parent process (the escalator in this case) runs the child process and waits for it:
int zbx_execute(const char *command, char **buffer, char *error, size_t max_error_len, int timeout)
{
    :
    if (-1 == rc || -1 == zbx_waitpid(pid))    /* (1) */
    {
        if (EINTR == errno)
            ret = TIMEOUT_ERROR;
        else
            zbx_snprintf(error, max_error_len, "zbx_waitpid() failed: %s", zbx_strerror(errno));

        /* kill the whole process group, pid must be the leader */
        if (-1 == kill(-pid, SIGTERM))    /* (2) */
            zabbix_log(LOG_LEVEL_ERR, "failed to kill [%s]: %s", command, zbx_strerror(errno));

        zbx_waitpid(pid);    /* (3) */
    }
In this case, the following steps would occur:
i) The escalator ran a PHP script and waited for it with zbx_waitpid (1).
ii) The PHP script hung for some reason, so the timeout expired and SIGALRM was raised. zbx_waitpid (1) returned because it was interrupted by the signal.
iii) The escalator sent SIGTERM to the PHP script with kill (2). The signal was sent successfully, but the PHP script did not terminate for some reason.
iv) The escalator waited for the PHP script again with zbx_waitpid (3). The PHP script was still hung, so the escalator waited forever at zbx_waitpid (3), and the subsequent scripts were not run.
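The steps above can be illustrated with a small standalone sketch. It is not Zabbix code: the function name child_survives_sigterm is hypothetical, and it assumes the hung script behaves like a process that ignores SIGTERM (the report does not say why the real script failed to terminate). It uses waitpid() with WNOHANG to show that the child is still running after the signal, which is the state in which a blocking wait at (3) would never return.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical demo, not Zabbix code. Returns 1 if a child that ignores
 * SIGTERM is still running after being sent SIGTERM, 0 if it has exited,
 * -1 if fork() failed. */
static int child_survives_sigterm(void)
{
    struct timespec delay = {0, 200 * 1000 * 1000};    /* 200 ms */
    void (*old_handler)(int);
    int status, survived;
    pid_t pid, rc;

    /* Ignore SIGTERM before fork() so the child inherits the setting
     * without a race; the parent restores its own handler afterwards. */
    old_handler = signal(SIGTERM, SIG_IGN);
    pid = fork();

    if (-1 == pid)
    {
        signal(SIGTERM, old_handler);
        return -1;
    }

    if (0 == pid)
    {
        for (;;)
            pause();    /* assumption: stand-in for the hung script */
    }

    signal(SIGTERM, old_handler);

    kill(pid, SIGTERM);         /* corresponds to kill (2) */
    nanosleep(&delay, NULL);    /* give the signal time to be delivered */

    /* With WNOHANG, a return value of 0 means the child is still running.
     * A blocking waitpid() here would wait forever, as at (3). */
    rc = waitpid(pid, &status, WNOHANG);
    survived = (0 == rc);

    kill(pid, SIGKILL);         /* clean up the demo child */
    waitpid(pid, &status, 0);

    return survived;
}
```

Calling child_survives_sigterm() from a test program should return 1 on a typical POSIX system, since an ignored SIGTERM leaves the child running.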
The same problem can occur in any process that uses zbx_execute.
So zbx_waitpid (3) needs to return even if the child process hangs.
Some possible resolutions:
1. Set alarm() for a few seconds before zbx_waitpid (3) so that it returns when interrupted by the signal.
2. Remove zbx_waitpid (3) itself.
In this case, however, child processes would become "zombies" in normal cases, so other source code would need to be modified too.
3. Send SIGKILL at kill (2).
However, some processes might not terminate even on SIGKILL.
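As a rough sketch of how resolutions 1 and 3 could be combined, the code below sends SIGTERM, waits for a bounded grace period, and escalates to SIGKILL only if the child is still alive. It is not the Zabbix fix: the names wait_with_deadline and terminate_child and the grace_sec parameter are hypothetical, and it polls waitpid() with WNOHANG instead of using alarm(), which avoids touching the SIGALRM handling that zbx_execute already relies on. Because both waits are bounded, the caller gets control back even when the child cannot be reaped.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not part of Zabbix: poll for up to about grace_sec
 * seconds. Returns 1 if the child was reaped, 0 if it is still running,
 * -1 on a waitpid() error other than EINTR. */
static int wait_with_deadline(pid_t pid, int grace_sec)
{
    struct timespec step = {0, 100 * 1000 * 1000};    /* 100 ms poll interval */
    int i, status;

    for (i = 0; i <= grace_sec * 10; i++)
    {
        pid_t rc = waitpid(pid, &status, WNOHANG);

        if (rc == pid)
            return 1;
        if (-1 == rc && EINTR != errno)
            return -1;

        nanosleep(&step, NULL);
    }

    return 0;
}

/* Hypothetical replacement for the kill (2) / zbx_waitpid (3) sequence.
 * pid must be a process group leader, as in the original code.
 * Returns 0 if the child exited after SIGTERM, 1 if SIGKILL was needed,
 * -1 if the child could not be reaped within the grace periods. */
static int terminate_child(pid_t pid, int grace_sec)
{
    if (pid <= 1)
        return -1;    /* never signal process group 0 or every process */

    kill(-pid, SIGTERM);
    if (1 == wait_with_deadline(pid, grace_sec))
        return 0;

    kill(-pid, SIGKILL);    /* escalate: SIGKILL cannot be caught or ignored */

    return 1 == wait_with_deadline(pid, grace_sec) ? 1 : -1;
}
```

A return value of -1 covers the case mentioned under resolution 3, where a process does not terminate even on SIGKILL (for example, while blocked in an uninterruptible system call); the caller can log it and move on, at the cost of possibly leaving a zombie until the child finally exits.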
Thank you.
- duplicates
-
ZBX-7084 alert scripts can hang alerter
- Open