[ZBX-3243] Network error while retrieving IPMI data Created: 2010 Nov 29  Updated: 2017 May 30  Resolved: 2017 Feb 08

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 1.8.4rc2
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Sergey Syreskin Assignee: Unassigned
Resolution: Duplicate Votes: 7
Labels: ipmi
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Centos 5.5, PostgreSQL 8.4.4, Apache 2.2.3, PHP 5.2.10, Zabbix 1.8.4rc2 (15477);
IBM System x3550 M2 as IPMI client.


Attachments: File ipmi_error_report.tgz    
Issue Links:
Duplicate
duplicates ZBXNEXT-3386 IPMI connection to a single device is... Closed
duplicates ZBX-3188 IPMI host unreachable Closed
is duplicated by ZBX-4270 IPMI insufficient resource Closed

 Description   

Zabbix periodically fails to retrieve some sensor's data via IPMI. The reported error is "End of read_ipmi_sensor():NETWORK_ERROR". This leads to gaps on graphs.
Graphs with gaps, debug log, zabbix server conf, ipmi and host templates are attached in a tgz file.



 Comments   
Comment by Sergey Syreskin [ 2010 Nov 29 ]

The network connection is stable; ping graphs confirm this. The ping interval is 3 seconds and the IPMI polling interval is 300 seconds. The IPMI library version is 2.0.16-7.el5. StartIPMIPollers=2. Three hosts are monitored via IPMI, and some other hosts are monitored via agents, SNMP and simple checks.

Comment by Aleksandrs Saveljevs [ 2010 Nov 29 ]

Seems to be the same problem as in ZBX-3188.

Comment by Sergey Syreskin [ 2010 Nov 29 ]

Aleksandrs Saveljevs, no, this is not the same problem as ZBX-3188: I have no "host unreachable" errors in the Zabbix log. Besides, all sensors work in my setup. The gaps in data occur periodically: after some time the sensor data becomes reachable again, then the network error recurs, then it recovers again. See the graphs attached to my first post.

Comment by richlv [ 2011 Apr 25 ]

Could it be that polling IPMI too often makes it slow, locks it up, or triggers some connection throttling?
How many IPMI items do you have? Do they all have the same interval?

Comment by Sergey Syreskin [ 2011 Apr 26 ]

There are 3 hosts with 125 IPMI items each. Polling interval is set to 300 seconds for each item.
I'm using Zabbix 1.8.5 now and no longer experience this problem.
I can't remember when the problem disappeared; it could have been the Zabbix update or the changes to the IPMI template
that I made some time ago.
The only thing I can say for sure is that I didn't change any settings on the IPMI devices.

Looking at the template in the attached ipmi_error_report.tgz archive, I can see that my current template is
definitely different from the old one. The old template had only 19 items.

Comment by Chris Witte [ 2012 Mar 27 ]

Same errors with Zabbix 1.8.10 / 2.0.0 RC1 / 2.0.0 RC2

Installed a new machine (Debian 6.0.4 x64, virtual machine on VMware ESXi 5) and installed Zabbix 2.0 RC1, compiled with openipmi-2.0.19 (tried an older version as well).

Zabbix 2.0 is monitoring just ONE host with ONE item (directly, no template) for testing, and the errors still appear in zabbix_server.log.
(Interval: 15 sec, no flexible intervals)

I assumed that the BMC was too busy and ran checks with openipmish (two checks per second).
Result: All requests were answered correctly and in time.

Monitored Host: Dell PowerEdge R610 + R710 with iDRAC6 - Ver: 1.80 (also tested with Ver. 1.71)

    ### configure ###
      ./configure --enable-server --enable-agent --with-mysql --enable-ipv6 --with-net-snmp --with-libcurl --with-ssh2 --with-ldap --enable-proxy --with-openipmi --prefix=/opt/zabbix
      ###
    ### zabbix_server.conf ###
      StartPollers=5
      StartIPMIPollers=5 # incremented step-by-step but no changes
      ###
    ### zabbix_server.log - Zabbix 2.0.0 RC1 ###
      13292:20120327:113333.407 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113350.583 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:113406.933 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:113425.415 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113440.949 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:113504.423 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113519.964 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:113555.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113610.980 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:113625.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113640.994 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:113649.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113705.008 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:113718.027 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113733.026 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:113955.094 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114017.064 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114033.070 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114050.561 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:114110.112 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114125.586 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:114134.110 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114149.602 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:114219.124 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114234.618 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:114640.743 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114702.663 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114718.673 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114736.178 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:114755.756 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114810.189 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:114819.756 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114834.205 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:115155.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115217.250 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:115233.257 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:115250.893 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:115310.016 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115325.904 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:115333.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115348.920 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:115418.036 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115433.941 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:115510.801 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115525.959 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:115534.808 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115549.975 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:115940.013 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120002.020 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120018.026 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120035.679 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:120055.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120110.695 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:120119.035 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120134.711 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:120149.035 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120204.725 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:120255.625 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120317.749 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120333.765 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120351.272 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:120404.641 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120419.283 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:121655.018 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121717.398 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:121732.402 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:121749.902 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:121803.030 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121818.918 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:121855.640 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121910.935 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:121919.648 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121934.950 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:122249.007 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:122311.990 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:122327.995 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:122345.767 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:122355.249 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:122410.781 resuming IPMI checks on host [F2-CN-01]: connection restored
      ###
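
The pattern in the log above (a "first network error", sometimes followed by "another network error", then "connection restored") is easier to compare between hosts and test runs when it is tallied. The following is only a rough sketch: the regular expression is derived from the log lines quoted in this report, and the function name `count_network_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 13292:20120327:113333.407 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
LINE_RE = re.compile(
    r"^\s*\d+:\d{8}:\d{6}\.\d{3} "
    r"IPMI item \[(?P<item>[^\]]+)\] on host \[(?P<host>[^\]]+)\] failed: "
    r"(?P<kind>first|another) network error"
)


def count_network_errors(lines):
    """Count IPMI network errors per (host, kind), kind being 'first' or 'another'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("kind"))] += 1
    return counts
```

It can be fed an open log file directly, for example `count_network_errors(open("zabbix_server.log"))`, where the file name is whatever LogFile points to on the server.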

-EDIT: 2012 Mar 28-

1. Converted VM from VMware to VirtualBox (Windows) on another host (win7) in another network segment (to exclude hypervisor, Host-OS, network connectivity from error source)
2. Compiled Zabbix 2.0.0 RC2 and updated the system
3. Added host and templates

Result:

    ### zabbix_server.log - Zabbix 2.0.0 RC2 ###
      1539:20120328:123650.011 Starting Zabbix Server. Zabbix 2.0.0rc2 (revision 26343).
      1539:20120328:123650.011 ****** Enabled features ******
      1539:20120328:123650.011 SNMP monitoring: YES
      1539:20120328:123650.012 IPMI monitoring: YES
      1539:20120328:123650.012 WEB monitoring: YES
      1539:20120328:123650.012 Jabber notifications: NO
      1539:20120328:123650.012 Ez Texting notifications: YES
      1539:20120328:123650.012 ODBC: NO
      1539:20120328:123650.012 SSH2 support: YES
      1539:20120328:123650.012 IPv6 support: YES
      1539:20120328:123650.012 ******************************
      1541:20120328:123650.068 server #2 started db watchdog #1
      1540:20120328:123650.070 server #1 started configuration syncer #1
      1548:20120328:123650.126 server #9 started trapper #1
      1549:20120328:123650.128 server #10 started trapper #2
      1550:20120328:123650.130 server #11 started trapper #3
      1551:20120328:123650.158 server #12 started trapper #4
      1544:20120328:123650.161 server #5 started poller #3
      1542:20120328:123650.163 server #3 started poller #1
      1545:20120328:123650.164 server #6 started poller #4
      1543:20120328:123650.165 server #4 started poller #2
      1546:20120328:123650.167 server #7 started poller #5
      1547:20120328:123650.170 server #8 started unreachable poller #1
      1552:20120328:123650.173 server #13 started trapper #5
      1553:20120328:123650.179 server #14 started icmp pinger #1
      1554:20120328:123650.185 server #15 started alerter #1
      1555:20120328:123650.192 server #16 started housekeeper #1
      1555:20120328:123650.192 executing housekeeper
      1566:20120328:123650.204 server #17 started timer #1
      1567:20120328:123650.206 server #18 started http poller #1
      1569:20120328:123650.215 server #20 started history syncer #1
      1570:20120328:123650.217 server #21 started history syncer #2
      1571:20120328:123650.220 server #22 started history syncer #3
      1572:20120328:123650.223 server #23 started history syncer #4
      1579:20120328:123650.244 server #24 started escalator #1
      1580:20120328:123650.247 server #25 started ipmi poller #1
      1581:20120328:123650.250 server #26 started ipmi poller #2
      1582:20120328:123650.253 server #27 started ipmi poller #3
      1568:20120328:123650.262 server #19 started discoverer #1
      1586:20120328:123650.273 server #29 started ipmi poller #5
      1587:20120328:123650.275 server #30 started proxy poller #1
      1539:20120328:123650.280 server #0 started [main process]
      1585:20120328:123650.284 server #28 started ipmi poller #4
      1592:20120328:123650.289 server #31 started self-monitoring #1
      1555:20120328:123651.371 housekeeper deleted: 10190 records from history and trends, 500 records of deleted items, 0 events, 0 alerts, 0 sessions
      1547:20120328:123655.299 temporarily disabling IPMI checks on host [F2-VH-01]: host unavailable
      1580:20120328:123700.385 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:123712.952 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123712.967 temporarily disabling IPMI checks on host [F2-CN-01]: host unavailable
      1581:20120328:123713.314 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1547:20120328:123715.974 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds
      1547:20120328:123728.994 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123731.006 IPMI item [FAN_MOD_4A_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds
      1586:20120328:123739.330 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1547:20120328:123746.025 temporarily disabling IPMI checks on host [F2-VH-02]: host unavailable
      1547:20120328:123754.040 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123758.285 enabling IPMI checks on host [F2-VH-01]: host became available
      1585:20120328:123809.353 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1547:20120328:123815.477 enabling IPMI checks on host [F2-CN-01]: host became available
      1586:20120328:123816.364 IPMI item [FAN_4_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1581:20120328:123816.364 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123824.499 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123831.521 resuming IPMI checks on host [F2-VH-01]: connection restored
      1547:20120328:123831.529 resuming IPMI checks on host [F2-CN-01]: connection restored
      1585:20120328:123839.381 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1581:20120328:123842.384 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1586:20120328:123846.388 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123854.560 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123857.572 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:123901.582 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:123911.420 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1582:20120328:123912.411 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1582:20120328:123917.457 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123926.620 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123927.629 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:123932.640 resuming IPMI checks on host [F2-VH-01]: connection restored
      1585:20120328:123942.166 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1585:20120328:123946.191 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123957.679 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:124001.691 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:124013.463 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1586:20120328:124013.477 IPMI item [FAN_1_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1586:20120328:124016.486 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124028.718 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124028.726 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:124031.735 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:124039.505 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1582:20120328:124042.515 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1586:20120328:124046.517 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124049.716 enabling IPMI checks on host [F2-VH-02]: host became available
      1547:20120328:124054.727 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124057.741 resuming IPMI checks on host [F2-CN-01]: connection restored
      1582:20120328:124059.529 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:124101.759 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:124109.537 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1580:20120328:124112.541 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1547:20120328:124114.776 resuming IPMI checks on host [F2-VH-02]: connection restored
      1582:20120328:124116.547 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124124.796 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124127.807 resuming IPMI checks on host [F2-CN-01]: connection restored
      1586:20120328:124129.554 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:124131.818 resuming IPMI checks on host [F2-VH-01]: connection restored
      1582:20120328:124141.569 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1586:20120328:124142.568 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1547:20120328:124144.832 resuming IPMI checks on host [F2-VH-02]: connection restored
      1581:20120328:124146.017 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: another network error, wait for 15 seconds
      1581:20120328:124147.024 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124157.854 resuming IPMI checks on host [F2-CN-01]: connection restored
      1581:20120328:124159.046 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:124201.865 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124202.872 resuming IPMI checks on host [F2-VH-01]: connection restored
      1581:20120328:124209.063 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1581:20120328:124212.072 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1547:20120328:124214.886 resuming IPMI checks on host [F2-VH-02]: connection restored
      1581:20120328:124216.085 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      ...
      ####
Comment by Chris Witte [ 2012 Mar 28 ]

It seems that the BMC gets too many requests/connections.
From time to time I get the following message when running ipmitool:

# ipmitool sdr -H <HOSTNAME> -U <USER> -P <PASSWORD> -L USER

Get Session Challenge command failed: Node busy
Error: Unable to establish LAN session
Get Device ID command failed
Unable to open SDR for reading

Does Zabbix use an SDR cache? This could improve performance.

ipmitool offers this parameter:

-S <sdr_cache_file>
              Use local file for remote SDR cache.  Using a local  SDR  cache
              can  drastically increase performance for commands that require
              knowledge of the entire SDR to perform their  function.   Local
              SDR cache from a remote system can be created with the sdr dump
              command.

BMC busy topic: http://old.nabble.com/possible-causes-for-%22ipmi_ctx_open_outofband%3A-BMC-busy%22-td31448014.html
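
For anyone who wants to try the SDR cache from a script, the two ipmitool invocations implied by the man page excerpt above (one `sdr dump` to create the cache, then `-S` to reuse it) could be assembled roughly like this. It is a sketch only: the function name `sdr_cache_commands` and the cache file path are made up, and the connection options simply mirror the command shown earlier in this comment.

```python
def sdr_cache_commands(host, user, password, cache_file, privilege="USER"):
    """Build argv lists for creating and then reusing a local SDR cache.

    Returns (dump_cmd, read_cmd):
      dump_cmd writes the remote SDR to cache_file ("sdr dump"),
      read_cmd lists sensors using that cache ("-S <sdr_cache_file>").
    """
    base = ["ipmitool", "-H", host, "-U", user, "-P", password, "-L", privilege]
    dump_cmd = base + ["sdr", "dump", cache_file]
    read_cmd = base + ["-S", cache_file, "sdr"]
    return dump_cmd, read_cmd
```

Each list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`. The dump only needs to be repeated when the sensor inventory of the BMC changes.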

Comment by Chris Witte [ 2012 Mar 30 ]

Posted this problem on Dell Community:

http://en.community.dell.com/support-forums/servers/f/177/p/19442918/20078853.aspx#20078853

Comment by Sergey Syreskin [ 2012 May 22 ]

My colleague has done some testing on this issue and concluded that the IPMI CPU is unable to handle all those requests. According to him, Zabbix opens a separate connection for each request to an IPMI host, and the IBM System x IMM module cannot keep up. So he wrote a wrapper script that requests all IPMI items from the host at once, stores them in a cache file, and serves them to Zabbix on request.

Comment by Chris Witte [ 2012 May 22 ]

Thanks for your reply. Could you post the wrapper script here?

What about caching the SDR query like ipmitool does when using the parameter -S?

-S <sdr_cache_file> 
              Use local file for remote SDR cache. Using a local SDR cache 
              can drastically increase performance for commands that require 
              knowledge of the entire SDR to perform their function. Local 
              SDR cache from a remote system can be created with the sdr dump 
              command. 

I know that freeipmi automatically creates a cache file of the SDR, but Zabbix uses openipmi.
Zabbix's IPMI engine would surely perform better if it used such caching by default.

Chris

Comment by Sergey Syreskin [ 2012 May 22 ]

The script is rather simple: it just stores values in a local file with a timestamp. When Zabbix requests a value, the script examines the timestamp and either refreshes its cache first or, if the cache is recent enough, serves the data straight from it.
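
The actual script was not posted in this thread, so the following is only a minimal sketch of the timestamp-cache idea described above. Everything in it is an assumption made for illustration: the JSON cache format, the `max_age` threshold, and the `fetch_all` callable, which stands in for whatever reads all sensors from the host in one IPMI session.

```python
import json
import os
import tempfile
import time


def get_cached_value(sensor, cache_path, fetch_all, max_age=60.0, now=time.time):
    """Return one sensor value, refreshing the whole cache file if it is stale.

    fetch_all is a hypothetical callable returning {sensor_name: value}
    for every sensor on the host, ideally from a single IPMI session.
    """
    cache = None
    try:
        with open(cache_path) as fh:
            cache = json.load(fh)
    except (OSError, ValueError):
        cache = None  # missing or corrupt cache file: treat as stale

    fresh = (
        isinstance(cache, dict)
        and "timestamp" in cache
        and "values" in cache
        and now() - cache["timestamp"] <= max_age
    )
    if not fresh:
        cache = {"timestamp": now(), "values": fetch_all()}
        # Write to a temporary file and rename it, so that a concurrent
        # reader never sees a half-written cache.
        directory = os.path.dirname(os.path.abspath(cache_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump(cache, fh)
        os.replace(tmp_path, cache_path)

    return cache["values"].get(sensor)
```

With a wrapper of this kind, the BMC sees one session per `max_age` interval however many items Zabbix requests; the price is that values can be up to `max_age` seconds old.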

Comment by Sergey Syreskin [ 2012 Aug 14 ]

This issue is covered by ZBXNEXT-1210, which is related to ZBXNEXT-98.

Comment by Aaron Smart [ 2012 Aug 24 ]

I'm experiencing the same network errors in the server log as Chris is above (running 2.0.2), trying to connect to a Dell PowerEdge 1950 (BMC) and PowerEdge R210 II (iDRAC 6 Express). Is there some way to make the IPMI poller more accommodating for slow devices?

Comment by dimir [ 2012 Aug 24 ]

There has been recent discussion about fixing this one. We will report as soon as there is more information.

Comment by Falk G. [ 2012 Sep 09 ]

I have the same issue, and it's not related to Dell. I am using Supermicro IPMI to monitor RAM and environment temperature, and I get the same errors:

24006:20120909:171011.011 IPMI item [P2-DIMM2B_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171033.326 IPMI item [P2-DIMM2B_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171048.330 IPMI item [P2-DIMM3B_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171104.539 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:171111.023 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171126.547 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:171135.955 IPMI item [Fan5] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171150.556 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:171559.993 IPMI item [Fan5] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
24007:20120909:171603.995 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171625.608 IPMI item [Fan6] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171641.611 IPMI item [Fan3] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171657.737 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:171718.004 IPMI item [P1-DIMM3A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171733.748 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172329.683 IPMI item [Fan2] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172351.819 IPMI item [Fan2] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:172407.825 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:172424.027 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172429.695 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172444.037 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:172453.725 IPMI item [P1-DIMM2B_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172508.047 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:172511.730 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172526.056 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172559.839 IPMI item [Fan2] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172614.069 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172615.843 IPMI item [P1-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172630.078 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:173101.923 IPMI item [Fan3] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173123.135 IPMI item [Fan3] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173139.139 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173155.263 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:173156.932 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173211.665 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:173347.011 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173409.694 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173425.699 IPMI item [System_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds

23678:20120904:202202.840 Starting Zabbix Server. Zabbix 2.0.2 (revision 29214).
23678:20120904:202202.840 ****** Enabled features ******
23678:20120904:202202.840 SNMP monitoring: YES
23678:20120904:202202.840 IPMI monitoring: YES
23678:20120904:202202.840 WEB monitoring: NO
23678:20120904:202202.840 Jabber notifications: NO
23678:20120904:202202.840 Ez Texting notifications: NO
23678:20120904:202202.840 ODBC: NO
23678:20120904:202202.840 SSH2 support: NO
23678:20120904:202202.840 IPv6 support: NO
23678:20120904:202202.840 ******************************
23680:20120904:202202.900 server #1 started configuration syncer #1
23681:20120904:202202.900 server #2 started db watchdog #1
23682:20120904:202202.901 server #3 started poller #1
23683:20120904:202202.902 server #4 started poller #2
23684:20120904:202202.904 server #5 started poller #3
23685:20120904:202202.905 server #6 started poller #4
23686:20120904:202202.906 server #7 started poller #5
23678:20120904:202202.906 server #0 started [main process]
23704:20120904:202202.906 server #25 started ipmi poller #1
23687:20120904:202202.907 server #8 started unreachable poller #1
23705:20120904:202202.907 server #26 started ipmi poller #2
23706:20120904:202202.907 server #27 started ipmi poller #3
23707:20120904:202202.907 server #28 started proxy poller #1
23708:20120904:202202.908 server #29 started self-monitoring #1
23692:20120904:202202.910 server #13 started trapper #5
23693:20120904:202202.910 server #14 started icmp pinger #1
23698:20120904:202202.911 server #19 started discoverer #1
23697:20120904:202202.911 server #18 started http poller #1
23696:20120904:202202.912 server #17 started timer #1
23695:20120904:202202.912 server #16 started housekeeper #1
23695:20120904:202202.912 executing housekeeper
23694:20120904:202202.912 server #15 started alerter #1
23699:20120904:202202.913 server #20 started history syncer #1
23688:20120904:202202.913 server #9 started trapper #1
23689:20120904:202202.913 server #10 started trapper #2
23690:20120904:202202.913 server #11 started trapper #3
23691:20120904:202202.913 server #12 started trapper #4
23702:20120904:202202.914 server #23 started history syncer #4
23701:20120904:202202.914 server #22 started history syncer #3
23700:20120904:202202.914 server #21 started history syncer #2
23703:20120904:202202.915 server #24 started escalator #1

Comment by Milosz Modrzewski [ 2012 Sep 27 ]

Same problem for me:

Zabbix server v2.0.2 --> Zabbix proxy v2.0.2 (revision 29214) --> Dell Remote Access Controller 5 A01 Firmware Version 1.60 (11.03.03) IP: 172.30.5.96
Zabbix server v2.0.2 --> Zabbix proxy v2.0.2 (revision 29214) --> iLO4 Firmware Version 1.05 ILOCZ22240991 IP: 172.30.5.98

1689:20120927:144113.549 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144123.137 Received configuration data from server. Datalen 7766
1709:20120927:144206.227 IPMI item [ipmi.ambient_temp] on host [172.30.5.96] failed: first network error, wait for 15 seconds
1679:20120927:144223.268 Received configuration data from server. Datalen 7766
1689:20120927:144228.939 resuming IPMI checks on host [172.30.5.96]: connection restored
1679:20120927:144323.726 Received configuration data from server. Datalen 7766
1711:20120927:144416.828 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: first network error, wait for 15 seconds
1679:20120927:144423.899 Received configuration data from server. Datalen 7766
1689:20120927:144432.106 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: another network error, wait for 15 seconds
1689:20120927:144457.405 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144524.008 Received configuration data from server. Datalen 7766
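
To compare how often each host hits this error, the three IPMI messages in logs like the one above can be tallied per host. The following is only a minimal sketch: the regular expression is derived from the log lines quoted in this thread, and the names `LINE_RE`, `tally` and `SAMPLE` are made up for illustration. Other Zabbix versions may word these messages differently.

```python
import re
from collections import Counter

# Matches the IPMI failure and recovery messages quoted in this thread.
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.\d{3} "
    r"(?:IPMI item \[(?P<item>.+?)\] on host \[(?P<fail_host>.+?)\] failed: "
    r"(?P<kind>first|another) network error"
    r"|resuming IPMI checks on host \[(?P<ok_host>.+?)\]: connection restored)"
)


def tally(lines):
    """Return (failures per host, restores per host) as Counters."""
    failures, restores = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # unrelated log line
        if m.group("fail_host"):
            failures[m.group("fail_host")] += 1
        else:
            restores[m.group("ok_host")] += 1
    return failures, restores


# A few lines copied from the proxy log above.
SAMPLE = """\
1689:20120927:144113.549 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144123.137 Received configuration data from server. Datalen 7766
1709:20120927:144206.227 IPMI item [ipmi.ambient_temp] on host [172.30.5.96] failed: first network error, wait for 15 seconds
1689:20120927:144228.939 resuming IPMI checks on host [172.30.5.96]: connection restored
1711:20120927:144416.828 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: first network error, wait for 15 seconds
1689:20120927:144432.106 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: another network error, wait for 15 seconds
1689:20120927:144457.405 resuming IPMI checks on host [172.30.5.98]: connection restored
""".splitlines()

if __name__ == "__main__":
    failures, restores = tally(SAMPLE)
    for host in sorted(failures):
        print(host, "failures:", failures[host], "restores:", restores[host])
```

Feeding a whole server or proxy log through `tally` gives a rough per-host failure count to compare before and after a configuration change.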

Comment by Anton Samets [ 2012 Nov 22 ]

I found a solution for this issue.
In my case, with LO-100 you must set the password size to 16 bytes (not 20). After that, IPMI monitoring starts to work.
So it seems Zabbix doesn't use IPMI 2.0, and I can't find where to set it.

Command output when the password size is set to 20 bytes:

ipmitool  -H 10.145.1.129 -U admin -P admin chassis status
Invalid user name
Error: Unable to establish LAN session
Error sending Chassis Status command
ipmitool -I lanplus -H 10.145.1.129 -U admin -P admin chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : previous
Last Power Event     : 
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Sleep Button Disable : allowed
Diag Button Disable  : allowed
Reset Button Disable : allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : false
Reset Button Disabled: false
Power Button Disabled: false

So, where can we set parameters for ipmi-tools?

Comment by Anton Samets [ 2012 Nov 22 ]

Hm, I found that if you change the authentication algorithm from "none" to "RMCP+", everything works fine.

Comment by Andrej Kacian [ 2013 Mar 14 ]

I had the same problem too (monitoring 5 hosts with around 10 items each): items were intermittently becoming unsupported every minute or so. Based on a suggestion from the forums[1], I changed the number of IPMI pollers to just one. Since then, there has been no problem getting IPMI values at all. This was on Zabbix 2.0.3 at the time, and it still works flawlessly on 2.0.5 with just one IPMI poller.

1. https://www.zabbix.com/forum/showpost.php?s=783bdc9aff7d3ea26999f74f4d223e59&p=118389&postcount=4

Comment by Alexey Pustovalov [ 2014 Feb 24 ]

If an IPMI sensor is located at the end of the sensor table, getting its value can take about 40-50 seconds and can sometimes fail with a network error:

 10673:20140224:182441.012 In get_value() key:'ipmi.cpu[FAN 1]'
 10673:20140224:182441.012 In get_value_ipmi() key:'Zabbix server:ipmi.cpu[FAN 1]'
 10673:20140224:182441.012 In init_ipmi_host() host:'[10.100.52.28]:623'
 10673:20140224:182441.012 In get_ipmi_host() host:'[10.100.52.28]:623'
 10673:20140224:182441.012 End of get_ipmi_host():0x2f9bbf0
 10673:20140224:182441.013 End of init_ipmi_host():0x2f9bbf0
 10673:20140224:182441.013 In get_ipmi_sensor_by_id() sensor:'FAN 1@[10.100.52.28]:623'
 10673:20140224:182441.013 End of get_ipmi_sensor_by_id():0x307fcf8
 10673:20140224:182441.013 In read_ipmi_sensor() sensor:'FAN 1@[10.100.52.28]:623'
 10673:20140224:182448.020 In got_thresh_reading()
 10673:20140224:182448.020 got_thresh_reading() fail: [16777411] Unknown error 16777411
 10673:20140224:182448.020 End of got_thresh_reading():NETWORK_ERROR
 10673:20140224:182448.020 End of read_ipmi_sensor():NETWORK_ERROR
 10673:20140224:182448.020 Item [Zabbix server:ipmi.cpu[FAN 1]] error: error 0x10000c3 while reading threshold sensor
 10673:20140224:182448.020 End of get_value():NETWORK_ERROR
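
The two numbers in the log above are the same value: 16777411 in decimal is 0x10000c3 in hex. The sketch below splits it into an error class and a low byte. It assumes OpenIPMI's convention, as I understand it from its `ipmi_err.h`, that the upper bits select the error class (0x01000000 for IPMI completion codes) and the low byte carries the code. The completion-code names come from the IPMI specification, and only a few are listed. The function and table names are made up for illustration.

```python
# Assumed OpenIPMI layout: upper 24 bits = error class, low byte = code.
ERR_CLASS_MASK = 0xFFFFFF00
ERR_CLASSES = {
    0x00000000: "OS",
    0x01000000: "IPMI",
    0x02000000: "RMCP+",
    0x03000000: "SoL",
}

# A small subset of generic IPMI completion codes (IPMI specification).
COMPLETION_CODES = {
    0xC0: "Node busy",
    0xC3: "Timeout while processing command",
    0xCB: "Requested sensor, data, or record not present",
}


def decode(err):
    """Split an OpenIPMI-style error value into (class, code, description)."""
    err_class = ERR_CLASSES.get(err & ERR_CLASS_MASK, "unknown")
    code = err & 0xFF
    if err_class == "IPMI":
        text = COMPLETION_CODES.get(code, "not in this table")
    else:
        text = "not an IPMI completion code"
    return err_class, code, text


if __name__ == "__main__":
    err_class, code, text = decode(16777411)
    print(hex(16777411), err_class, hex(code), text)
```

Under these assumptions the value decodes to an IPMI completion code of 0xC3, a command timeout. That fits the roughly 7 seconds between the `In read_ipmi_sensor()` and `In got_thresh_reading()` lines.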

Comment by Michael Sphar [ 2014 Apr 14 ]

I have had the same experience: reducing the number of IPMI pollers to just one stopped the frequent Network Error messages and gaps.

I have both a production server and a small test VM server. The production server was polling only a few IPMI items alongside a lot of non-IPMI monitoring, and the test server was polling only a few IPMI items and doing no other monitoring. Both servers were showing frequent Network Errors, and gaps from the items then going unsupported. I reduced the number of IPMI pollers to just one last week and have yet to see a Network Error warning since. Neither server showed all the IPMI pollers as busy.

This is with Zabbix 2.2.2.

Thinking it might have something to do with two different ipmi pollers polling the same device at the same time, I did a simple test where from two different hosts I issued an ipmitool sensor command to the same IPMI device. What I observed is that the resulting output from the IPMI is only sent to one device at a time. The effect I observe is that one ipmitool output starts scrolling while the other is paused for a few seconds, then the other starts scrolling and the first one pauses, and this goes back and forth a few times until both are complete.
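
If concurrent access to one BMC really is the trigger, one way to picture a workaround is to serialize requests per device, so that several pollers can run but never talk to the same IPMI host at once. The sketch below illustrates that idea with one lock per host and a simulated poll. It is not Zabbix code and makes no claim about how the issue was eventually fixed. `HostSerializer`, `run_demo` and the host names are hypothetical.

```python
import threading
import time
from collections import defaultdict


class HostSerializer:
    """Hands out one lock per IPMI host so requests to a host never overlap."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def lock_for(self, host):
        with self._guard:
            return self._locks.setdefault(host, threading.Lock())


def run_demo(hosts=("bmc-a", "bmc-b"), pollers=4, polls_per_poller=5):
    """Run several poller threads; return peak concurrent requests per host."""
    serializer = HostSerializer()
    active = defaultdict(int)
    peak = defaultdict(int)
    stats_guard = threading.Lock()

    def fake_poll(host):
        with serializer.lock_for(host):
            with stats_guard:
                active[host] += 1
                peak[host] = max(peak[host], active[host])
            time.sleep(0.001)  # stands in for the IPMI round trip
            with stats_guard:
                active[host] -= 1

    def poller(n):
        for i in range(polls_per_poller):
            fake_poll(hosts[(n + i) % len(hosts)])

    threads = [threading.Thread(target=poller, args=(n,)) for n in range(pollers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(peak)


if __name__ == "__main__":
    print(run_demo())
```

With the per-host lock in place, the peak concurrency per host stays at one regardless of the number of pollers. Setting the number of IPMI pollers to one achieves the same thing, at the cost of polling every host sequentially.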

Comment by Norbert Wögerbauer [ 2014 Oct 09 ]

Same here with 2.2.6
I observed that I had an item with an invalid sensor id. Seems that if there is a problem with any one item, further processing just breaks.
I disabled all items that did not give a value and the problem disappeared.
Note that this happens even if the sensor is listed and basically available, but simply doesn't provide a value (e.g. I have sensor FAN4 but no fan connected to it)!

Comment by Jeroen van den Berg [ 2015 Jun 27 ]

This issue still exists in 2.4.5, and I can confirm that if you disable unavailable sensors it works without problems.
It looks like the handling of unavailable sensors is incorrect.

Comment by Ilya Kruchinin [ 2015 Aug 12 ]

Disabling unavailable sensors did not help (zabbix_server v2.4.5) in my case - those sensors that had been successfully receiving data still had issues.
However, when I set StartIPMIPollers to 1, the issue disappeared.

Comment by pfoo [ 2015 Dec 13 ]

Similar behaviour on my Supermicro board, monitored using the latest Zabbix server and agent from the Zabbix Debian repository (2.4.7-1+jessie).
I was actually able to fix the issue by lowering the update interval of one IPMI item from 300s to 60s (all other IPMI items were kept at their 300s interval). If I set this IPMI item to 90s, the issue appears again.
Could be some ipmi session handling issue.

Comment by Sascha Plumhoff [ 2016 Jun 14 ]

Same here with DELL PowerEdge R510.

It seems to be a known issue with OpenIPMI, see https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/ipmi :

"IPMI session inactivity timeout for LAN is 60 +/-3 seconds. [...] then the next IPMI check after the timeout expires will time out due to individual message timeouts, retries or receive error."

Reducing the check interval to 45 seconds fixed the problem for me.
The issue naturally appears more frequently in test environments, e.g. when checking only one item on a server.
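
The arithmetic behind this can be written down as a small check. It is a sketch only. It assumes that all IPMI items on a host share one session, so the shortest item interval bounds the idle time between checks, and it takes the 60 +/-3 second figure from the documentation quoted above. The function names and labels are invented for illustration.

```python
SESSION_TIMEOUT_S = 60  # from the quoted documentation
TOLERANCE_S = 3         # the "+/-3 seconds" part


def classify_interval(shortest_interval_s):
    """Compare the shortest check interval with the session inactivity timeout."""
    if shortest_interval_s < SESSION_TIMEOUT_S - TOLERANCE_S:
        return "session kept alive"
    if shortest_interval_s <= SESSION_TIMEOUT_S + TOLERANCE_S:
        return "borderline"
    return "session may expire between checks"


def classify_host(item_intervals_s):
    """Assumes the items share one session: the shortest interval decides."""
    return classify_interval(min(item_intervals_s))


if __name__ == "__main__":
    for intervals in ([300, 45], [300, 60], [300, 90], [300]):
        print(intervals, "->", classify_host(intervals))
```

By this reasoning a single item polled every 45 seconds is enough to keep the session from timing out, while a host with only 300-second items idles past the timeout on every cycle. The 60-second case sits inside the tolerance band, so whether it helps may depend on the device.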

Comment by Alexander Vladishev [ 2017 Feb 08 ]

Already fixed under ZBXNEXT-3386.
