Incident report
Resolution: Unresolved
Priority: Trivial
This was moved from (93) of ZBXNEXT-1263.
Something is wrong with changing the log level dynamically at runtime.
...
A GnuTLS proxy initially logs the following every second:

    32419:20151001:104527.821 zbx_tls_accept(): gnutls_handshake() returned: -50 The request is invalid.
    32419:20151001:104527.821 failed to accept an incoming connection: from 127.0.0.1: zbx_tls_accept(): gnutls_handshake() failed: -50 The request is invalid.

After we increase the log level once and decrease it once, it logs only the following:

    343:20151001:104844.013 Got signal [signal:10(SIGUSR1),sender_pid:339,sender_uid:1000,value_int:1538(0x00000602)].
    343:20151001:104844.013 log level has been decreased to 3 (warning)
    343:20151001:104845.002 zbx_tls_accept(): gnutls_handshake() returned: -50 The request is invalid.
    343:20151001:104846.006 zbx_tls_accept(): gnutls_handshake() returned: -50 The request is invalid.
    343:20151001:104847.015 zbx_tls_accept(): gnutls_handshake() returned: -50 The request is invalid.
Server/proxy trapper processes and agent listener processes spend most of their time waiting inside one of the select()/accept()/recv() system calls within the zbx_tcp_accept() function. When we issue a runtime command, there is an almost 100% chance that the signal interrupts one of these calls, which makes the system call fail with errno set to EINTR. Before encryption was introduced in ZBXNEXT-1263, zbx_tcp_accept() could fail only because select(), accept() or recv() failed. We could therefore check errno whenever zbx_tcp_accept() failed and rely on it having just been overwritten by the failing system call. With encryption, zbx_tcp_accept() has new reasons to fail that do not overwrite errno.
So, after we increase or decrease the log level once, errno is left set to EINTR. If no further system call fails, errno stays "stuck" at EINTR, and when zbx_tcp_accept() later fails for an encryption-related reason, the second error message disappears from the log file, because we print it only if errno != EINTR.
This is probably not the only place where an incoming signal can have this effect by interrupting a system call. However, other such places are very hard to discover "by chance", because the probability of interrupting a short call is very small.
The way to completely eliminate such places in Zabbix code is to follow best error handling practices: