[ZBX-15935] zbx_perform_all_openipmi_ops can enter infinite loop
Created: 2019 Apr 04  Updated: 2024 Apr 10  Resolved: 2019 May 19

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 4.0.6, 4.2.0
Fix Version/s: 4.0.8rc1, 4.2.2rc1, 4.4.0alpha1, 4.4 (plan)

Type: Problem report
Priority: Critical
Reporter: Eric A. Borisch
Assignee: Andrejs Sitals (Inactive)
Resolution: Fixed
Votes: 0
Labels: bug, ipmi
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: OpenIPMI >= 2.0.26

Issue Links:
Causes
caused by ZBX-15578 IPMI times out and fails to read valu... Closed
Team: Team I
Sprint: Sprint 51 (Apr 2019), Sprint 52 (May 2019)
Story Points: 0.5

 Description   

Steps to reproduce:

  1. Run IPMI checks so that zbx_perform_all_openipmi_ops() is entered and perform_one_op() returns before the timeout expires. perform_one_op() updates the remaining timeout in place, so repeated early returns drive it down to {0,0}, at which point the loop just sits and spins.

Result:
100% CPU usage on IPMI thread 

Expected:

Not this: zbx_perform_all_openipmi_ops() should return once the timeout has elapsed instead of spinning.

 

Further discussion:

Once perform_one_op() returns before the timeout, we never break out of the loop: start_time is reset on each cycle, but the elapsed time is still compared against the original timeout rather than the remaining timeout that perform_one_op() has updated.

Since perform_one_op() updates the remaining timeout internally (and leaves it at {0,0} if it timed out), we can skip the start_time tracking completely: loop while (tv.tv_sec + tv.tv_usec > 0) and reset tv to the full timeout at the start of each iteration:

void	zbx_perform_all_openipmi_ops(int timeout)
{
	struct timeval	tv = {1, 0};	/* non-zero so that the loop is entered */

	/* keep going until a call times out, i.e. leaves tv at {0,0} */
	while (tv.tv_sec + tv.tv_usec > 0)
	{
		int	res;

		/* start every call with the full timeout */
		tv.tv_sec = timeout;
		tv.tv_usec = 0;

		res = os_hnd->perform_one_op(os_hnd, &tv);

		/* perform_one_op() returns 0 on success, errno on failure (timeout means success) */
		if (0 != res)
		{
			zabbix_log(LOG_LEVEL_DEBUG, "IPMI error: %s", zbx_strerror(res));
			break;
		}
	}
}
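
To make the failure mode described above concrete, here is a minimal simulation. It is a sketch only: the loop shape is reconstructed from this description rather than copied from the Zabbix sources, and fake_perform_one_op(), sim_clock_usec and count_zero_wait_calls() are hypothetical names. The stub mimics a perform_one_op() that writes the remaining time back into tv; a max_calls cap stands in for the CPU spinning forever.

```c
#include <sys/time.h>

static long	sim_clock_usec = 0;	/* simulated wall clock, in microseconds */

/* Hypothetical stand-in for os_hnd->perform_one_op() with OpenIPMI >= 2.0.26: */
/* an event arrives after event_after_usec and the time spent waiting is       */
/* subtracted from *tv. Always returns 0 (success).                            */
static int	fake_perform_one_op(struct timeval *tv, long event_after_usec)
{
	long	remaining = (long)tv->tv_sec * 1000000L + (long)tv->tv_usec;
	long	waited = (remaining < event_after_usec ? remaining : event_after_usec);

	sim_clock_usec += waited;
	remaining -= waited;
	tv->tv_sec = remaining / 1000000L;
	tv->tv_usec = remaining % 1000000L;

	return 0;
}

/* Loop shape reconstructed from the description (not the actual Zabbix code). */
/* Returns how many calls were made with an already exhausted {0,0} timeout    */
/* before the max_calls safety cap stopped the simulation.                     */
static int	count_zero_wait_calls(int timeout, long event_after_usec, int max_calls)
{
	struct timeval	tv;
	int		calls, zero_wait = 0;

	/* tv is set once and never reset */
	tv.tv_sec = timeout;
	tv.tv_usec = 0;

	for (calls = 0; calls < max_calls; calls++)
	{
		long	start_time = sim_clock_usec;	/* reset on every cycle */

		if (0 == tv.tv_sec && 0 == tv.tv_usec)
			zero_wait++;	/* this call cannot block: busy spin */

		fake_perform_one_op(&tv, event_after_usec);

		/* per-cycle duration is compared against the original timeout */
		if (sim_clock_usec - start_time >= (long)timeout * 1000000L)
			break;
	}

	return zero_wait;
}
```

With a 3 second timeout and an event every second, the first three calls use up the timeout and every later call returns immediately, so count_zero_wait_calls(3, 1000000L, 1000) hits the cap after 997 zero-wait calls. If no event arrives before the timeout (for example event_after_usec of 5000000L), the first call waits the full 3 seconds and the loop exits normally.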


 Comments   
Comment by Andrejs Sitals (Inactive) [ 2019 Apr 04 ]

Thanks for your report, eborisch.

perform_one_op() just passes timeout to sel_select() which is defined in selector.c. It doesn't do anything else with timeout.

sel_select() started updating the timeout in OpenIPMI 2.0.26, which was released on 2018-12-14. It did not modify the timeout in 2.0.25 and older versions.

Comment by Eric A. Borisch [ 2019 Apr 04 ]

Aha; yes I am running 2.0.27, so that explains it. Perhaps just resetting tv after every call, then.

Thanks for digging into this!
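
The "reset tv after every call" idea could look roughly like the simulation below. This is a sketch under stated assumptions, not the committed fix: fake_perform_one_op(), its modifies_tv switch, sim_clock_usec and count_calls_with_reset() are hypothetical names, and the stub only imitates the two sel_select() behaviours described above (timeout left untouched up to 2.0.25, updated from 2.0.26 on). The loop resets tv before each call and measures elapsed time from a single start point, so it does not depend on whether tv is written back.

```c
#include <sys/time.h>

static long	sim_clock_usec = 0;	/* simulated wall clock, in microseconds */

/* Hypothetical stand-in for os_hnd->perform_one_op(). An event arrives after  */
/* event_after_usec (must be > 0 so that the simulated clock advances).        */
/* modifies_tv = 1 imitates OpenIPMI >= 2.0.26 (remaining time written back),  */
/* modifies_tv = 0 imitates OpenIPMI <= 2.0.25 (tv left untouched).            */
static int	fake_perform_one_op(struct timeval *tv, long event_after_usec, int modifies_tv)
{
	long	budget = (long)tv->tv_sec * 1000000L + (long)tv->tv_usec;
	long	waited = (budget < event_after_usec ? budget : event_after_usec);

	sim_clock_usec += waited;

	if (0 != modifies_tv)
	{
		tv->tv_sec = (budget - waited) / 1000000L;
		tv->tv_usec = (budget - waited) % 1000000L;
	}

	return 0;
}

/* Returns the number of perform_one_op() calls made before the overall */
/* timeout (in seconds) has elapsed on the simulated clock.             */
static int	count_calls_with_reset(int timeout, long event_after_usec, int modifies_tv)
{
	long	start_time = sim_clock_usec;	/* taken once, outside the loop */
	int	calls = 0;

	do
	{
		struct timeval	tv;

		/* reset before every call, whatever the previous call did to tv */
		tv.tv_sec = timeout;
		tv.tv_usec = 0;

		fake_perform_one_op(&tv, event_after_usec, modifies_tv);
		calls++;
	}
	while (sim_clock_usec - start_time < (long)timeout * 1000000L);

	return calls;
}
```

With a 3 second timeout and an event every second, both stub modes make exactly three calls and then stop. One trade-off of this shape: because every call gets the full timeout, a call that starts just before the deadline can overrun it, so the total time spent in the loop may approach twice the timeout.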

Comment by Andrejs Sitals (Inactive) [ 2019 Apr 08 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-15935

Comment by Eric A. Borisch [ 2019 Apr 23 ]

Doesn't appear to have made the window for 4.0.7...

Comment by Andrejs Sitals (Inactive) [ 2019 Apr 24 ]

Available in versions:

  • pre-4.0.8rc1 e9a4af8a397
  • pre-4.2.2rc1 c0eb9246d67
  • pre-4.4.0alpha1 60d301f0207

Comment by Eric A. Borisch [ 2019 Apr 26 ]

No surprise, but also encountered on FreeBSD with zabbix_proxy performing IPMI checks – pegged the CPU.

Current FreeBSD ports versions of openipmi and zabbix4 are 2.0.27 and 4.0.7, respectively.
