[ZBX-7862] Zabbix does not use Timeout option for IPMI checks Created: 2014 Feb 24  Updated: 2022 Oct 08  Resolved: 2015 Dec 14

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Incident report Priority: Blocker
Reporter: Alexey Pustovalov Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: ipmi, timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate

 Description   

Zabbix must use Timeout configuration option for IPMI checks.



 Comments   
Comment by Oleksii Zagorskyi [ 2015 Jul 08 ]

ZBXNEXT-1096 would resolve current request in a more general way.

Comment by Sandis Neilands (Inactive) [ 2015 Nov 05 ]

This is a feature request rather than a bug report. Both the documentation and the comment in the configuration file state that Timeout applies only to Zabbix agent, SNMP and external checks: "Specifies how long we wait for agent, SNMP device or external check (in seconds)."

Low-level IPMI timeouts

IPMI over LAN uses an unreliable transport protocol (UDP). Multiple RMCP or RMCP+ request/response exchanges are needed to get a single value, even if there are no IP/UDP-level errors. We use the OpenIPMI library to communicate with the BMC over LAN. The library manages all low-level protocol details, including timeouts and the number of retries. As of the current version (2.0.21), these parameters are hardcoded in an internal OpenIPMI C source file.

lib/ipmi_lan.c

/* Timeout to wait for IPMI responses, in microseconds.  For commands
   with side effects, we wait 5 seconds, not one. */
#define LAN_RSP_TIMEOUT 1000000 
#define LAN_RSP_TIMEOUT_SIDEEFF 5000000

/* # of times to try a message before we fail it. */
#define LAN_RSP_RETRIES 6
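To put these constants in perspective, here is a rough worst-case estimate as a C sketch. It assumes (this is not taken from the OpenIPMI source) that each of the LAN_RSP_RETRIES attempts waits the full response timeout before the next one is sent; the helper worst_case_wait_us() is hypothetical.

```c
#include <assert.h>

/* Values as quoted above from OpenIPMI 2.0.21 lib/ipmi_lan.c (microseconds). */
#define LAN_RSP_TIMEOUT         1000000L
#define LAN_RSP_TIMEOUT_SIDEEFF 5000000L
#define LAN_RSP_RETRIES         6

/*
 * Hypothetical helper. Simplified worst-case model (an assumption, not
 * OpenIPMI's exact retry logic): every attempt waits for the full response
 * timeout and no reply ever arrives, so one message can take
 * retries * timeout before the library gives up with IPMI_TIMEOUT_CC.
 */
static long worst_case_wait_us(long timeout_us, int retries)
{
	assert(timeout_us > 0 && retries > 0);
	return timeout_us * retries;
}
```

Under that model, a single unanswered request takes about 6 s before OpenIPMI gives up (30 s for commands with side effects), and one item value needs several such exchanges.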

Errors

Low-level

If OpenIPMI invokes a Zabbix callback function with an error, Zabbix sets NETWORK_ERROR and closes the connection. Generally this happens when the BMC responds with an unsuccessful completion code (CC) or an RMCP+ error code, or when there is an internal error in the OpenIPMI library (OS error).

In the case of IPMI over LAN, the OpenIPMI library sets IPMI_TIMEOUT_CC (0xC3) if the BMC has not responded in time to six consecutive requests. This CC can also be included in the response from the BMC itself. The library sets IPMI_UNKNOWN_ERR_CC (0xFF) for many different reasons. Unfortunately, these reasons are not logged or otherwise reported back to the library's user.
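A minimal sketch of how a caller might sort the completion codes mentioned above into coarse classes. cc_classify(), cc_class_name(), the enum and the SKETCH_* macro names are hypothetical; the values 0xC3 and 0xFF are the ones OpenIPMI uses for IPMI_TIMEOUT_CC and IPMI_UNKNOWN_ERR_CC, and 0x00 is the IPMI "command completed normally" code.

```c
#include <assert.h>

/* Local copies of the completion codes discussed above, so the sketch
 * builds without OpenIPMI headers (OpenIPMI names the last two
 * IPMI_TIMEOUT_CC and IPMI_UNKNOWN_ERR_CC). */
#define SKETCH_NORMAL_CC      0x00  /* IPMI: command completed normally */
#define SKETCH_TIMEOUT_CC     0xC3
#define SKETCH_UNKNOWN_ERR_CC 0xFF

enum cc_class { CC_OK, CC_TIMEOUT, CC_UNSPECIFIED, CC_OTHER };

/* Hypothetical helper: map a completion code to a coarse class. */
static enum cc_class cc_classify(unsigned char cc)
{
	switch (cc)
	{
		case SKETCH_NORMAL_CC:
			return CC_OK;
		case SKETCH_TIMEOUT_CC:     /* sent by the BMC, or set by OpenIPMI after all retries */
			return CC_TIMEOUT;
		case SKETCH_UNKNOWN_ERR_CC: /* the underlying reason is not reported by OpenIPMI */
			return CC_UNSPECIFIED;
		default:
			return CC_OTHER;
	}
}

/* Hypothetical helper: short text for log messages. */
static const char *cc_class_name(enum cc_class c)
{
	assert(c <= CC_OTHER);
	switch (c)
	{
		case CC_OK:          return "ok";
		case CC_TIMEOUT:     return "timeout";
		case CC_UNSPECIFIED: return "unspecified error";
		default:             return "other error";
	}
}
```

CC_TIMEOUT alone cannot tell whether the BMC sent 0xC3 or OpenIPMI generated it after its retries, and CC_UNSPECIFIED carries no further detail, which matches the limitation described above.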

High-level

For other, high-level IPMI item errors, Zabbix sets the item to NOTSUPPORTED and provides a relevant error message.
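The two error paths above can be summarised as a small decision function. This is a hypothetical sketch: the enum values only mirror the names NETWORK_ERROR and NOTSUPPORTED from the text and are not Zabbix's real macros, and the inputs are simplified to two flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome codes mirroring the names used in the text. */
enum ipmi_outcome { OUTCOME_SUCCEED, OUTCOME_NETWORK_ERROR, OUTCOME_NOTSUPPORTED };

/*
 * lowlevel_err: non-zero if OpenIPMI invoked the callback with an error
 *               (unsuccessful CC, RMCP+ error code, internal OS error).
 * item_err:     non-zero for a high-level item error.
 * close_conn:   set to 1 when the connection should be closed.
 */
static enum ipmi_outcome classify_outcome(int lowlevel_err, int item_err, int *close_conn)
{
	assert(NULL != close_conn);
	*close_conn = 0;

	if (0 != lowlevel_err)
	{
		/* low-level failure: report a network error and drop the connection */
		*close_conn = 1;
		return OUTCOME_NETWORK_ERROR;
	}

	if (0 != item_err)
		return OUTCOME_NOTSUPPORTED;  /* connection stays open */

	return OUTCOME_SUCCEED;
}
```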

Troubleshooting

Troubleshooting OpenIPMI is rather difficult because most IPMI over LAN traffic is encrypted. In a development environment, one can use DEBUG_RAWMSG_ENABLE(), DEBUG_MSG_ERR_ENABLE() and other OpenIPMI macros to enable extra logging from the library.

As for the following code pattern in read_ipmi_sensor(), read_ipmi_control(), set_ipmi_control() and init_ipmi_host()...

src/zabbix_server/poller/checks_ipmi.c

        tv.tv_sec = 10;
        tv.tv_usec = 0;

        while (0 == h->done)
                os_hnd->perform_one_op(os_hnd, &tv);

As far as I can tell, tv is used as the maximum timeout for the select() call inside the OpenIPMI library. The actual select() timeout is the time remaining until the nearest timer expires, if that is shorter. If there is no activity on the read or write sockets and no timer expires during this time, perform_one_op() returns and is invoked again with a maximum timeout of 10 seconds.
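The timeout behaviour described above can be modelled in a few lines of C. This is a simplified sketch of the described semantics, not OpenIPMI code: select_wait_us() is a hypothetical helper, times are plain microsecond counts on one clock, and next_timer_us stands for the absolute expiry time of the nearest pending OpenIPMI timer (NULL when none is pending).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

static long long tv_to_us(const struct timeval *tv)
{
	return (long long)tv->tv_sec * 1000000LL + tv->tv_usec;
}

/*
 * Hypothetical model: the tv passed to perform_one_op() is only an upper
 * bound. select() waits at most until the nearest pending timer expires;
 * with no pending timer it waits for the full tv.
 */
static long long select_wait_us(const struct timeval *max_tv, long long now_us,
		const long long *next_timer_us)
{
	long long max_us, remaining_us;

	assert(NULL != max_tv);
	max_us = tv_to_us(max_tv);

	if (NULL == next_timer_us)
		return max_us;

	remaining_us = *next_timer_us - now_us;
	if (remaining_us < 0)
		remaining_us = 0;  /* timer already expired: do not block */

	return remaining_us < max_us ? remaining_us : max_us;
}
```

If, as assumed here, OpenIPMI keeps a timer of LAN_RSP_TIMEOUT (1 s) pending for each outstanding request, the nearest timer is almost always closer than 10 s, which is consistent with the observation that the value 10 is arbitrary.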

The value 10 is arbitrary; it can be changed without affecting the overall functionality.

Summary

The low-level IPMI protocol timeouts and retry counts are defined in the OpenIPMI library. The library doesn't provide an interface to change these constants, so there is nothing we can do on the Zabbix side.

High-level, per-item check IPMI timeouts

Since multiple requests and responses are needed to get a value for a single item, it might be desirable from the users' point of view to have a total timeout for the whole interaction.

The options for what to do when such a timeout is reached are limited by the design of IPMI and the OpenIPMI library. Currently the only straightforward and safe option is to destroy the domain and start over during the next check; this causes a full scan of the BMC, which can take at least a minute.
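For illustration only (the conclusion that follows is that such a timeout is not currently justified), a total per-item deadline around the perform_one_op() loop could look like this. Everything here is hypothetical: one_op_fn stands in for os_hnd->perform_one_op(), and the fake BMC exists only so the sketch can be exercised without OpenIPMI. A real implementation would measure elapsed time with a clock instead of trusting the step's return value.

```c
#include <assert.h>
#include <stddef.h>

enum wait_result { WAIT_DONE, WAIT_DEADLINE };

/* Hypothetical stand-in for os_hnd->perform_one_op(): runs one event-loop
 * step that may set *done, and returns the time it took in microseconds. */
typedef long long (*one_op_fn)(void *ctx, int *done);

/* Loop like the Zabbix code above, but give up once the total deadline passes. */
static enum wait_result wait_with_deadline(one_op_fn op, void *ctx, int *done,
		long long total_timeout_us)
{
	long long elapsed_us = 0;

	assert(NULL != op && NULL != done && total_timeout_us > 0);

	while (0 == *done)
	{
		if (elapsed_us >= total_timeout_us)
			return WAIT_DEADLINE;  /* caller would destroy the domain and start over */

		elapsed_us += op(ctx, done);
	}

	return WAIT_DONE;
}

/* Simulated BMC: finishes after steps_needed steps of step_us each. */
struct fake_bmc { int steps_needed; long long step_us; };

static long long fake_op(void *ctx, int *done)
{
	struct fake_bmc *b = ctx;

	if (--b->steps_needed <= 0)
		*done = 1;

	return b->step_us;
}
```

When WAIT_DEADLINE is returned, the caller is left with the only safe option mentioned above: destroy the domain and accept a full rescan on the next check.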

In summary: implementing a per-item check IPMI timeout currently has no justification.

IPMI session inactivity timeout

The IPMI specification (section 6.12.15, Session Inactivity Timeouts) requires a session inactivity timeout to be implemented. For LAN, the default timeout is 60 +/- 3 seconds. To keep an inactive session open, the system monitoring software can use the Activate Session command.

It doesn't seem to be possible to enable periodic sending of the Activate Session command with OpenIPMI. If Zabbix performs no IPMI item checks on a particular BMC for longer than the session timeout configured in the BMC, the next IPMI check after the timeout has expired will fail due to low-level timeouts and retries, or will receive an error CC. After that, a new session will be opened and a full rescan of the BMC will be initiated.

In summary: to avoid unnecessary rescans of the BMC, users are advised to set the IPMI item polling interval below the session inactivity timeout configured in the BMC.
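The advice above boils down to a simple inequality. A minimal sketch, assuming the effective interval is the longest gap between consecutive checks of the same BMC and that the BMC uses the default 60 +/- 3 s timeout unless configured otherwise; poll_interval_keeps_session() is a hypothetical helper, and it uses the low end of the tolerance to stay on the safe side.

```c
#include <assert.h>

/*
 * Hypothetical helper: returns 1 if polling a BMC every poll_interval_s
 * seconds keeps the IPMI session from expiring, given a session inactivity
 * timeout of session_timeout_s +/- tolerance_s seconds. Uses the low end
 * of the tolerance as the conservative bound.
 */
static int poll_interval_keeps_session(int poll_interval_s, int session_timeout_s,
		int tolerance_s)
{
	assert(poll_interval_s > 0 && tolerance_s >= 0 && session_timeout_s > tolerance_s);
	return poll_interval_s < session_timeout_s - tolerance_s;
}
```

For the default LAN timeout this means keeping the gap at 56 s or less.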

Comment by Oleksii Zagorskyi [ 2015 Nov 05 ]

What about creating a feature request in the OpenIPMI tracker?
https://sourceforge.net/p/openipmi/feature-requests/

Could this report (filed as a bug) be related?
https://sourceforge.net/p/openipmi/bugs/52/

sandis.neilands Yes, that is the plan. In any case it will take time until they have a new release, and even longer until that release lands in distros. Until then, if you are building from source, you might as well change these lines to suit your use case.

zalex_ua The lines in the patch mention slightly different variables than the ones you mentioned above. Could you check that?

Also, looking at the most recent source release, OpenIPMI-2.0.21.tar.gz dated 2014-01-28, AND taking into account that the existing patch was reported on 2010-04-19, AND that, for example, the version packaged in Debian SID is still 2.0.16, we can only hope that our children will get the feature in libopenipmi and then in Zabbix.
https://packages.debian.org/sid/openipmi

sandis.neilands Yes, the issue you mentioned deals with exactly this problem, but for a different interface type.

sandis.neilands Opened a feature request in OpenIPMI tracker to make the low-level LAN interface timeouts tunable.

Comment by Sandis Neilands (Inactive) [ 2015 Nov 06 ]

We should document the following OpenIPMI behaviours:

  • low-level timeouts;
  • session inactivity timeout handling (e.g. that there are no periodic Activate Session commands towards BMC).

martins-v I tried to add some of the details mentioned here to:

Please review.

Reviewed by sandis.neilands.

Copied to:

CLOSED.

Generated at Tue Apr 23 22:53:42 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.