[ZBX-11479] incomplete data not detected by zbx_execute_threaded_metric() interface Created: 2016 Nov 14  Updated: 2017 May 30  Resolved: 2017 Jan 02

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.0.19
Fix Version/s: 2.0.21rc1, 2.2.17rc1, 3.0.7rc1, 3.2.3rc1, 3.4.0alpha1

Type: Incident report Priority: Major
Reporter: Sandis Neilands (Inactive) Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File ZBX-11479-reproduce.patch     File main.c     File main_fixed.c    

 Description   

zbx_execute_threaded_metric() doesn't serialize length of the data. This can lead to sending incomplete agent result to server when the child process received a signal (SIGUSR1 for example) during write(2).

The feature was implemented in ZBX-9781 to allow vfs.fs.size, etc. for NFS to be timed out.

Coverity: CID 154194.



 Comments   
Comment by Vladislavs Sokurenko [ 2016 Nov 29 ]

Fixed in development branch: svn://svn.zabbix.com/branches/dev/ZBX-11479

Comment by Vladislavs Sokurenko [ 2016 Nov 29 ]

To reproduce the error, compile attached source file main.c and pkill a.out -SIGUSR1

main_fixed.c reuses function committed to bug to handle interrupt.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 02 ]

(1) Neither return nor out parameter of write_all() is checked in zbx_execute_threaded_metric(). If we don't care about result of write_all() it should be void. If we do it should return SUCCEED/FAIL or SYSINFO_RET_OK/SYSINFO_RET_FAIL.

vso If return of write is to be checked, I think it's best then that we return EXIT_FAILURE if write_all fails and check WEXITSTATUS later on do you agree ?

glebs.ivanovskis Do whatever you think is best.

vso RESOLVED in r64171

glebs.ivanovskis A bit of optimization in r64191. Have a look.

vso CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 02 ]

(2) 0 in return of write() shouldn't cause immediate failure.
vso RESOLVED in r64170

glebs.ivanovskis CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 02 ]

(3) Copying a whole serial_header_t structure ruins the very idea of serialization. We could just dump AGENT_RESULT into pipe, but we don't do that for a reason.
vso First of all we don't need serialization there.
Second of all it does not ruin anything since there was already integer stored, and it means no serialization was respected.

-	memcpy(*data + *data_offset, &agent_ret, sizeof(int));
-	*data_offset += sizeof(int);
+	serial_header_pos = *data + *data_offset;
+	*data_offset += sizeof(serial_header);

vso WONTFIX

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 05 ]

(4) In serialize_agent_result() we know exactly how much memory we need for data, no need for a loop. Since the total length of data is deterministic it would be easier to understand the code if we copied length, return code and value in the same order as they lay down in the serialized form.

vso WONTFIX

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 05 ]

(5) Even though we have a length in message header we still read() until there is nothing to read() in a pipe and check the length only after that. Theoretically, with such approach it should be possible to get away without message length and simply read() until child closes the pipe. Or maybe it would be better for the future (when child process will be reusable) to make a more sophisticated read function which will process the header first and will not read from the pipe more than necessary.
vso RESOLVED in 64294 this issue is fixed by having write in a loop, no need to change anything else. Please see attached patch that sends kill to interrupt write and see that it is handled.

glebs.ivanovskis I like it this way - simple and clear. One more finishing touch needed. Child process needs to check whether write_all() has written all the data without errors and then exit() with appropriate code. Parent process should check WEXITSTATUS(status). REOPENED

vso RESOLVED in r64314

glebs.ivanovskis Should be fine now. Slight improvements in r64327.
CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 09 ]

Successfully tested.

Comment by Vladislavs Sokurenko [ 2016 Dec 09 ]

Fixed in:

  • pre-2.0.21rc1 r64328
  • pre-2.2.17rc1 r64329
  • pre-3.0.7rc1 r64330
  • pre-3.2.3rc1 r64331
  • pre-3.3.0 (trunk) r64332
Comment by Andrey Melnikov [ 2016 Dec 11 ]

Same fix need for fgets() calls in icmppinger and zbx_popen().

Comment by Viktors Tjarve [ 2016 Dec 13 ]

(6) Compilation warning:

sysinfo.c: In function ‘write_all’:
sysinfo.c:1278:8: warning: pointer of type ‘void *’ used in arithmetic [-Wpointer-arith]
    buf += ret;
        ^

void* pointer arithmetic is not allowed in C standard. GCC has an extension that allows to do this.
vso RESOLVED in r64442 of development branch svn://svn.zabbix.com/branches/dev/ZBX-11479

glebs.ivanovskis Good enough, warning disappeared.
CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 19 ]

Dear lynxchaus, where in zbx_popen() do you think the check is needed? Would you create a dedicated issue(s)?

I tend to agree about fgets() in icmpping, but this needs some further investigation, because according to man 7 signal file operations cannot be interrupted:

If a blocked call to one of the following interfaces is interrupted by a signal handler, then the call will be automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise the call will fail with the error EINTR:

  • read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices. A "slow" device is one where the I/O call may block for an indefinite time, for example, a terminal, pipe, or socket. If an I/O call on a slow device has already transferred some data by the time it is interrupted by a signal handler, then the call will return a success status (normally, the number of bytes transferred). Note that a (local) disk is not a slow device according to this definition; I/O operations on disk devices are not interrupted by signals.
  • ...
Comment by Andrey Melnikov [ 2016 Dec 19 ]

pipe not disk device. Sending USR1/USR2 signals to child's process break read() with EINTR, which propagated to fgets() and it return NULL. This break any of running scripts:

  2228:20161219:152115.127 log level has been increased to 3 (warning)
  2175:20161219:152115.127 the signal was redirected to process pid:2229
  2229:20161219:152115.127 log level has been increased to 3 (warning)
  2175:20161219:152115.127 the signal was redirected to process pid:2230
  2230:20161219:152115.127 log level has been increased to 3 (warning)
  2215:20161219:152115.151 fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss)
  2225:20161219:152115.173 item "xx.xx.xx.18:icmpping" became not supported: fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss) 
  2225:20161219:152115.173 item "xx.xx.xx.18:icmppingloss" became not supported: fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss) 
  2225:20161219:152115.173 item "xx.xx.xx.18:icmppingsec" became not supported: fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss) 
  2225:20161219:152115.173 item "xx.xx.xx.58:icmpping" became not supported: fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss) 
  2225:20161219:152115.173 item "xx.xx.xx.58:icmppingloss" became not supported: fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss) 
  2225:20161219:152115.173 item "xx.xx.xx.58:icmppingsec" became not supported: fping failed: xx.xx.xx.19  : [4], 84 bytes, 44.5 ms (44.5 avg, 0% loss) 
  2210:20161219:152117.264 fping failed: xx.xx.xx.1  : [1], 84 bytes, 27.8 ms (27.8 avg, 0% loss) 
  2223:20161219:152117.276 item "xx.xx.xx.9:icmpping" became not supported: fping failed: xx.xx.xx.1  : [1], 84 bytes, 27.8 ms (27.8 avg, 0% loss)   
  2223:20161219:152117.276 item "xx.xx.xx.9:icmppingloss" became not supported: fping failed: xx.xx.xx.1  : [1], 84 bytes, 27.8 ms (27.8 avg, 0% loss)   
  2223:20161219:152117.276 item "xx.xx.xx.9:icmppingsec" became not supported: fping failed: xx.xx.xx.1  : [1], 84 bytes, 27.8 ms (27.8 avg, 0% loss)    
  2175:20161219:152131.347 Got signal [signal:10(SIGUSR1),sender_pid:32312,sender_uid:0,value_int:33025(0x00008101)].
  2175:20161219:152131.347 the signal was redirected to process pid:2178

Real problem not in fgets() and read(). problem in your signal handler. Simple one line patch cure problem for me.

--- a/src/libs/zbxnix/daemon.c	2016-12-08 15:07:42.000000000 +0300
+++ b/src/libs/zbxnix/daemon.c	2016-12-19 15:30:28.053938010 +0300
@@ -268,7 +268,7 @@ static void	set_daemon_signal_handlers(v
 	sig_parent_pid = (int)getpid();
 
 	sigemptyset(&phan.sa_mask);
-	phan.sa_flags = SA_SIGINFO;
+	phan.sa_flags = SA_SIGINFO | SA_RESTART;
 
 	phan.sa_sigaction = user1_signal_handler;
 	sigaction(SIGUSR1, &phan, NULL);
Comment by Glebs Ivanovskis (Inactive) [ 2016 Dec 19 ]

Of course pipe is not a disk device. My apologies, I didn't investigate icmpping.c thoroughly enough. My error illustrates that this issue should be handled without a hurry and in a separate ticket. Please create or find existing one. ZBX-10042 and ZBXNEXT-3233 seem to be distantly related, so you can share your thoughts there. Unfortunately, your one line fix will break all timeouts in Zabbix because they are based on SIGALRM and EINTR.

Comment by Andrey Melnikov [ 2016 Dec 19 ]

Please, split this ticket.

It can't break timeouts - this patch affects only SIGUSR1/SIGPIPE handlers.

Comment by Vladislavs Sokurenko [ 2016 Dec 22 ]

Fixed warning in:

  • pre-2.0.21rc1 r64655
  • pre-2.2.17rc1 r64656
  • pre-3.0.7rc1 r64657
  • pre-3.2.3rc1 r64658
  • pre-3.3.0 (trunk) r64659
Generated at Fri Apr 26 19:54:41 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.