Ubuntu 18.04, OpenIPMI 2.0.22 from repository, OpenIPMI 2.0.25 from sources
Sprint 49 (Feb 2019), Sprint 50 (Mar 2019)
Steps to reproduce:
- Set DebugLevel=4 in zabbix_server.conf
- Set StartIPMIPollers=1 in zabbix_server.conf
- Create host "IPMI timeouts" with a single IPMI interface
- Create item "Power", set update interval to 30s
- Start monitoring the server's log by running:
tail -f /var/log/zabbix/zabbix_server.log | grep -i -E "(ipmi (poller|manager)|ipmi_lan\.c|network error|restored|power)"
- Run Zabbix server
Check the log file. There is a repeating pattern between the "Connection 0 to the BMC is up" lines: the connection comes up, one value is retrieved, and then network errors follow, with "Received IPMI error: c3" and "Received IPMI error: ff" log entries.
0xC3 is IPMI_TIMEOUT_CC in OpenIPMI. 0xFF is IPMI_UNKNOWN_ERR_CC.
There are also gaps in "Latest data" when these errors occur.
Expected:
- No "Received IPMI error" entries in the log file
- No gaps in collected data
Zabbix calls os_hnd->perform_one_op(os_hnd) in only a few places. If I understand correctly, these calls are what allow OpenIPMI to do its internal processing. If they are not made often enough, OpenIPMI cannot do its work and requests time out.
To fix this, the IPMI poller process should call os_hnd->perform_one_op(os_hnd) while it is waiting for messages from the IPMI manager.
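
The proposed fix could look roughly like the sketch below. It is not the actual Zabbix code: mock_os_handler_t, mock_ipc_t, mock_ipc_try_recv() and poller_wait_for_message() are hypothetical stand-ins so that the example compiles without OpenIPMI or Zabbix. The two-argument form of perform_one_op (handler plus a struct timeval timeout) is how I remember the callback in OpenIPMI's os_handler_t, so treat that as an assumption as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for OpenIPMI's os_handler_t: only the one callback
 * this sketch needs, plus a counter used purely for illustration. */
typedef struct mock_os_handler mock_os_handler_t;
struct mock_os_handler {
    int (*perform_one_op)(mock_os_handler_t *handler, struct timeval *timeout);
    int ops_performed;
};

/* Hypothetical stand-in for the IPC channel from the IPMI manager. */
typedef struct {
    int polls_until_message; /* empty polls left before a message "arrives" */
} mock_ipc_t;

/* Non-blocking check: returns true once a message is available. */
static bool mock_ipc_try_recv(mock_ipc_t *ipc)
{
    if (ipc->polls_until_message > 0) {
        ipc->polls_until_message--;
        return false;
    }
    return true;
}

/* Mock callback: a real handler would service timers and sockets here. */
static int mock_perform_one_op(mock_os_handler_t *handler, struct timeval *timeout)
{
    (void)timeout;
    handler->ops_performed++;
    return 0;
}

/* Wait for a manager message, giving the IPMI library a short time slice
 * after every empty poll instead of blocking on IPC alone.
 * Returns the number of empty polls before the message arrived,
 * or -1 if max_iterations was reached first. */
static int poller_wait_for_message(mock_os_handler_t *os_hnd, mock_ipc_t *ipc,
                                   int max_iterations)
{
    assert(os_hnd != NULL && ipc != NULL);

    for (int i = 0; i < max_iterations; i++) {
        if (mock_ipc_try_recv(ipc))
            return i;

        /* Short timeout so the loop stays responsive to new messages. */
        struct timeval tv = { .tv_sec = 0, .tv_usec = 1000 };
        os_hnd->perform_one_op(os_hnd, &tv);
    }
    return -1;
}
```

The point of the design is that the poller never sits idle on the IPC channel for long: every empty poll is followed by a bounded perform_one_op call, so OpenIPMI keeps getting time to do its internal processing between requests from the manager.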