ZABBIX BUGS AND ISSUES

Network error while retrieving IPMI data

Details

  • Type: Bug
  • Status: Open
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: 1.8.4rc2
  • Fix Version/s: None
  • Component/s: Server (S)
  • Labels:
  • Environment:
    CentOS 5.5, PostgreSQL 8.4.4, Apache 2.2.3, PHP 5.2.10, Zabbix 1.8.4rc2 (15477);
    IBM System x3550 M2 as IPMI client.
  • Zabbix ID:
    RTD

Description

Zabbix periodically fails to retrieve data from some sensors via IPMI. The reported error is "End of read_ipmi_sensor():NETWORK_ERROR". This leads to gaps in the graphs.
Graphs with gaps, the debug log, the Zabbix server configuration, and the IPMI and host templates are attached in a tgz file.

Issue Links

Activity

Sergey Sireskin added a comment - - edited

The network connection is stable; this is confirmed by ping graphs. The ping interval is 3 seconds and the IPMI request interval is 300 seconds. The IPMI library version is 2.0.16-7.el5. StartIPMIPollers=2; 3 hosts are monitored via IPMI, and some other hosts are monitored via agents, SNMP and simple checks.

Aleksandrs Saveljevs added a comment -

Seems to be the same problem as in ZBX-3188.

Sergey Sireskin added a comment - - edited

Aleksandrs Saveljevs, no, this is not the same problem as ZBX-3188: I have no "host unreachable" errors in the Zabbix log. Besides, all sensors work in my setup. The gaps in the data occur periodically: after some time the sensor data becomes reachable again, then a network error occurs, then it recovers again. See the graphs attached to my first post.

richlv added a comment -

Could it be that polling IPMI too often makes it slow, locks it up, or triggers some connection throttling?
How many IPMI items do you have? Do they all have the same interval?

Sergey Sireskin added a comment -

There are 3 hosts with 125 IPMI items each. Polling interval is set to 300 seconds for each item.
I'm using Zabbix 1.8.5 now and don't experience this problem any more.
I can't remember when the problem disappeared; it could have been the Zabbix update or the changes
I made to the IPMI template some time ago.
The only thing I can say for sure is that I didn't change any settings on the IPMI devices.

Looking at the template in the attached ipmi_error_report.tgz archive, I can see that my current template is
definitely different from the old one. The old template had only 19 items.
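
For reference, the figures in this comment correspond to a fairly low request rate. A minimal sketch of the arithmetic, assuming every item is polled exactly once per interval and the load is spread evenly; the function name is illustrative and not part of Zabbix, and the 19-item comparison assumes all three hosts used the old template at the same 300-second interval:

```python
def ipmi_check_rate(hosts, items_per_host, interval_s):
    """Average number of IPMI checks per second across all hosts.

    Assumes each item is polled exactly once per interval.
    """
    return hosts * items_per_host / interval_s


# Current setup described in the comment: 3 hosts, 125 items each, 300 s interval.
current = ipmi_check_rate(hosts=3, items_per_host=125, interval_s=300)

# Old template (assumption: same 3 hosts and the same 300 s interval, 19 items each).
old = ipmi_check_rate(hosts=3, items_per_host=19, interval_s=300)

print(f"current template: {current:.2f} checks/s")
print(f"old template:     {old:.2f} checks/s")
```

If these assumptions hold, the gaps were reported under the lighter of the two loads, which would argue against simple overload as the cause; that is an inference from the numbers, not something the thread confirms.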

Chris Witte added a comment - - edited

Same errors with Zabbix 1.8.10 / 2.0.0 RC1 / 2.0.0 RC2

Installed a new machine (Debian 6.0.4 x64, virtual machine on VMware ESXi 5) and installed Zabbix 2.0 RC1, compiled with openipmi-2.0.19 (tried an older version as well).

Zabbix 2.0 is monitoring just ONE host with ONE item (directly, no template) for testing, and the errors still appear in zabbix_server.log.
(Interval: 15 sec, no flexible intervals)

I assumed that the BMC was too busy and ran checks with openipmish (two checks per second).
Result: all requests were answered correctly and in time.
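
A fixed-rate check like the one described above can also be scripted so that failures and latencies are recorded. The following is only a sketch: the thread does not show the actual openipmish commands, so the command to run is left as a caller-supplied argument, and the `probe` function and its parameters are hypothetical names chosen for illustration.

```python
import subprocess
import sys
import time


def probe(command, count, interval_s, timeout_s=5.0):
    """Run `command` `count` times, about `interval_s` apart.

    Returns a list of (ok, seconds) tuples, where ok is True when the
    command exited with status 0 before the timeout.
    """
    results = []
    for i in range(count):
        start = time.monotonic()
        try:
            completed = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            ok = completed.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False
        elapsed = time.monotonic() - start
        results.append((ok, elapsed))
        if i < count - 1:
            # Sleep only for the remainder of the interval.
            time.sleep(max(0.0, interval_s - elapsed))
    return results


if __name__ == "__main__":
    # Placeholder command; substitute the real sensor query for the BMC.
    placeholder = [sys.executable, "-c", "pass"]
    # interval_s=0.5 corresponds to the "two checks per second" in the comment.
    for ok, seconds in probe(placeholder, count=2, interval_s=0.5):
        print("ok" if ok else "FAILED", f"{seconds:.3f}s")
```

Running the same probe from the Zabbix server host while the server logs network errors would show whether the BMC itself is dropping requests at that moment.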

Monitored Host: Dell PowerEdge R610 + R710 with iDRAC6 - Ver: 1.80 (also tested with Ver. 1.71)

    ## configure ##
      ./configure --enable-server --enable-agent --with-mysql --enable-ipv6 --with-net-snmp --with-libcurl --with-ssh2 --with-ldap --enable-proxy --openipmi --prefix=/opt/zabbix
      ###
    ## zabbix_server.conf ##
      StartPollers=5
      StartIPMIPollers=5 # incremented step-by-step but no changes
      ###
    ## zabbix_server.log - Zabbix 2.0.0 RC1 ##
      13292:20120327:113333.407 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113350.583 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:113406.933 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:113425.415 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113440.949 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:113504.423 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113519.964 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:113555.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113610.980 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:113625.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113640.994 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:113649.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113705.008 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:113718.027 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:113733.026 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:113955.094 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114017.064 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114033.070 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114050.561 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:114110.112 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114125.586 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:114134.110 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114149.602 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:114219.124 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114234.618 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:114640.743 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114702.663 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114718.673 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:114736.178 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:114755.756 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114810.189 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:114819.756 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:114834.205 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:115155.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115217.250 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:115233.257 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:115250.893 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:115310.016 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115325.904 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:115333.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115348.920 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:115418.036 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115433.941 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:115510.801 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115525.959 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:115534.808 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:115549.975 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:115940.013 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120002.020 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120018.026 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120035.679 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:120055.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120110.695 resuming IPMI checks on host [F2-CN-01]: connection restored
      13290:20120327:120119.035 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120134.711 resuming IPMI checks on host [F2-CN-01]: connection restored
      13292:20120327:120149.035 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120204.725 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:120255.625 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120317.749 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120333.765 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:120351.272 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:120404.641 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:120419.283 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:121655.018 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121717.398 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:121732.402 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:121749.902 resuming IPMI checks on host [F2-CN-01]: connection restored
      13288:20120327:121803.030 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121818.918 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:121855.640 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121910.935 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:121919.648 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:121934.950 resuming IPMI checks on host [F2-CN-01]: connection restored
      13289:20120327:122249.007 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:122311.990 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:122327.995 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
      13271:20120327:122345.767 resuming IPMI checks on host [F2-CN-01]: connection restored
      13291:20120327:122355.249 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      13271:20120327:122410.781 resuming IPMI checks on host [F2-CN-01]: connection restored
      ###
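
The excerpt above can be summarized by measuring how long each interruption lasted. A minimal sketch, assuming the `PID:YYYYMMDD:HHMMSS.mmm message` layout seen in these lines and treating an interruption as the span from a "first network error" to the next "connection restored" for the same host; that pairing rule is an assumption made for illustration, not documented Zabbix behaviour.

```python
import re
from datetime import datetime

# PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3}) (.*)$")
HOST_RE = re.compile(r"on host \[([^\]]+)\]")

# First five lines of the RC1 excerpt above.
SAMPLE = [
    "13292:20120327:113333.407 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds",
    "13271:20120327:113350.583 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds",
    "13271:20120327:113406.933 resuming IPMI checks on host [F2-CN-01]: connection restored",
    "13288:20120327:113425.415 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds",
    "13271:20120327:113440.949 resuming IPMI checks on host [F2-CN-01]: connection restored",
]


def parse_line(line):
    """Return (pid, timestamp, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    pid, date, clock, message = m.groups()
    timestamp = datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f")
    return int(pid), timestamp, message


def outage_durations(lines):
    """Return (host, seconds) for each first-error to restored span."""
    started = {}  # host -> timestamp of the pending "first network error"
    outages = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _pid, timestamp, message = parsed
        host_match = HOST_RE.search(message)
        if host_match is None:
            continue
        host = host_match.group(1)
        if "first network error" in message:
            started.setdefault(host, timestamp)
        elif "connection restored" in message and host in started:
            delta = timestamp - started.pop(host)
            outages.append((host, delta.total_seconds()))
    return outages


if __name__ == "__main__":
    for host, seconds in outage_durations(SAMPLE):
        print(f"{host}: {seconds:.3f} s")
```

Feeding the whole log file to `outage_durations` gives the distribution of interruption lengths, which makes it easier to see whether recoveries cluster around the 15-second retry delay.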

-EDIT: 2012 Mar 28-

1. Converted the VM from VMware to VirtualBox (Windows) on another host (Win7) in another network segment (to rule out the hypervisor, host OS and network connectivity as the source of the error)
2. Compiled Zabbix 2.0.0 RC2 and updated the system
3. Added hosts and templates

Result:

    ## zabbix_server.log - Zabbix 2.0.0 RC2 ##
      1539:20120328:123650.011 Starting Zabbix Server. Zabbix 2.0.0rc2 (revision 26343).
      1539:20120328:123650.011 ****** Enabled features ******
      1539:20120328:123650.011 SNMP monitoring: YES
      1539:20120328:123650.012 IPMI monitoring: YES
      1539:20120328:123650.012 WEB monitoring: YES
      1539:20120328:123650.012 Jabber notifications: NO
      1539:20120328:123650.012 Ez Texting notifications: YES
      1539:20120328:123650.012 ODBC: NO
      1539:20120328:123650.012 SSH2 support: YES
      1539:20120328:123650.012 IPv6 support: YES
      1539:20120328:123650.012 ******************************
      1541:20120328:123650.068 server #2 started db watchdog #1
      1540:20120328:123650.070 server #1 started configuration syncer #1
      1548:20120328:123650.126 server #9 started trapper #1
      1549:20120328:123650.128 server #10 started trapper #2
      1550:20120328:123650.130 server #11 started trapper #3
      1551:20120328:123650.158 server #12 started trapper #4
      1544:20120328:123650.161 server #5 started poller #3
      1542:20120328:123650.163 server #3 started poller #1
      1545:20120328:123650.164 server #6 started poller #4
      1543:20120328:123650.165 server #4 started poller #2
      1546:20120328:123650.167 server #7 started poller #5
      1547:20120328:123650.170 server #8 started unreachable poller #1
      1552:20120328:123650.173 server #13 started trapper #5
      1553:20120328:123650.179 server #14 started icmp pinger #1
      1554:20120328:123650.185 server #15 started alerter #1
      1555:20120328:123650.192 server #16 started housekeeper #1
      1555:20120328:123650.192 executing housekeeper
      1566:20120328:123650.204 server #17 started timer #1
      1567:20120328:123650.206 server #18 started http poller #1
      1569:20120328:123650.215 server #20 started history syncer #1
      1570:20120328:123650.217 server #21 started history syncer #2
      1571:20120328:123650.220 server #22 started history syncer #3
      1572:20120328:123650.223 server #23 started history syncer #4
      1579:20120328:123650.244 server #24 started escalator #1
      1580:20120328:123650.247 server #25 started ipmi poller #1
      1581:20120328:123650.250 server #26 started ipmi poller #2
      1582:20120328:123650.253 server #27 started ipmi poller #3
      1568:20120328:123650.262 server #19 started discoverer #1
      1586:20120328:123650.273 server #29 started ipmi poller #5
      1587:20120328:123650.275 server #30 started proxy poller #1
      1539:20120328:123650.280 server #0 started [main process]
      1585:20120328:123650.284 server #28 started ipmi poller #4
      1592:20120328:123650.289 server #31 started self-monitoring #1
      1555:20120328:123651.371 housekeeper deleted: 10190 records from history and trends, 500 records of deleted items, 0 events, 0 alerts, 0 sessions
      1547:20120328:123655.299 temporarily disabling IPMI checks on host [F2-VH-01]: host unavailable
      1580:20120328:123700.385 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:123712.952 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123712.967 temporarily disabling IPMI checks on host [F2-CN-01]: host unavailable
      1581:20120328:123713.314 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1547:20120328:123715.974 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds
      1547:20120328:123728.994 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123731.006 IPMI item [FAN_MOD_4A_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds
      1586:20120328:123739.330 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1547:20120328:123746.025 temporarily disabling IPMI checks on host [F2-VH-02]: host unavailable
      1547:20120328:123754.040 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123758.285 enabling IPMI checks on host [F2-VH-01]: host became available
      1585:20120328:123809.353 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1547:20120328:123815.477 enabling IPMI checks on host [F2-CN-01]: host became available
      1586:20120328:123816.364 IPMI item [FAN_4_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1581:20120328:123816.364 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123824.499 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123831.521 resuming IPMI checks on host [F2-VH-01]: connection restored
      1547:20120328:123831.529 resuming IPMI checks on host [F2-CN-01]: connection restored
      1585:20120328:123839.381 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1581:20120328:123842.384 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1586:20120328:123846.388 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123854.560 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123857.572 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:123901.582 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:123911.420 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1582:20120328:123912.411 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1582:20120328:123917.457 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123926.620 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:123927.629 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:123932.640 resuming IPMI checks on host [F2-VH-01]: connection restored
      1585:20120328:123942.166 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1585:20120328:123946.191 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:123957.679 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:124001.691 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:124013.463 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1586:20120328:124013.477 IPMI item [FAN_1_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1586:20120328:124016.486 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124028.718 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124028.726 resuming IPMI checks on host [F2-CN-01]: connection restored
      1547:20120328:124031.735 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:124039.505 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1582:20120328:124042.515 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1586:20120328:124046.517 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124049.716 enabling IPMI checks on host [F2-VH-02]: host became available
      1547:20120328:124054.727 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124057.741 resuming IPMI checks on host [F2-CN-01]: connection restored
      1582:20120328:124059.529 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:124101.759 resuming IPMI checks on host [F2-VH-01]: connection restored
      1586:20120328:124109.537 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1580:20120328:124112.541 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1547:20120328:124114.776 resuming IPMI checks on host [F2-VH-02]: connection restored
      1582:20120328:124116.547 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124124.796 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124127.807 resuming IPMI checks on host [F2-CN-01]: connection restored
      1586:20120328:124129.554 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:124131.818 resuming IPMI checks on host [F2-VH-01]: connection restored
      1582:20120328:124141.569 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1586:20120328:124142.568 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1547:20120328:124144.832 resuming IPMI checks on host [F2-VH-02]: connection restored
      1581:20120328:124146.017 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: another network error, wait for 15 seconds
      1581:20120328:124147.024 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      1547:20120328:124157.854 resuming IPMI checks on host [F2-CN-01]: connection restored
      1581:20120328:124159.046 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
      1547:20120328:124201.865 resuming IPMI checks on host [F2-CN-02]: connection restored
      1547:20120328:124202.872 resuming IPMI checks on host [F2-VH-01]: connection restored
      1581:20120328:124209.063 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
      1581:20120328:124212.072 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
      1547:20120328:124214.886 resuming IPMI checks on host [F2-VH-02]: connection restored
      1581:20120328:124216.085 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
      ...
      ####
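
With several hosts in the log it helps to count events per host and event type. A minimal sketch, assuming only the message wordings that appear in the excerpt above; the regular expressions and the `tally` function are illustrative and would need adjusting for other Zabbix versions.

```python
import re
from collections import Counter

PREFIX = r"^\s*\d+:\d{8}:\d{6}\.\d{3} "

# "IPMI item [X] on host [H] failed: first|another network error, ..."
ERROR_RE = re.compile(
    PREFIX + r"IPMI item \[[^\]]+\] on host \[([^\]]+)\] failed: (first|another) network error"
)
# "temporarily disabling|resuming|enabling IPMI checks on host [H]: ..."
STATE_RE = re.compile(
    PREFIX + r"(temporarily disabling|resuming|enabling) IPMI checks on host \[([^\]]+)\]"
)

# Five consecutive lines from the RC2 excerpt above.
SAMPLE = [
    "1547:20120328:123655.299 temporarily disabling IPMI checks on host [F2-VH-01]: host unavailable",
    "1580:20120328:123700.385 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: first network error, wait for 15 seconds",
    "1547:20120328:123712.952 resuming IPMI checks on host [F2-CN-02]: connection restored",
    "1547:20120328:123712.967 temporarily disabling IPMI checks on host [F2-CN-01]: host unavailable",
    "1581:20120328:123713.314 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds",
]


def tally(lines):
    """Count log events keyed by (host, event kind)."""
    counts = Counter()
    for line in lines:
        error = ERROR_RE.match(line)
        if error:
            host, which = error.groups()
            counts[(host, which + " network error")] += 1
            continue
        state = STATE_RE.match(line)
        if state:
            action, host = state.groups()
            counts[(host, action)] += 1
    return counts


if __name__ == "__main__":
    for (host, kind), n in sorted(tally(SAMPLE).items()):
        print(f"{host:10s} {kind:25s} {n}")
```

A per-host table like this shows quickly whether the errors are spread evenly across all monitored BMCs or concentrated on particular ones.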
server #25 started ipmi poller #1 1581:20120328:123650.250 server #26 started ipmi poller #2 1582:20120328:123650.253 server #27 started ipmi poller #3 1568:20120328:123650.262 server #19 started discoverer #1 1586:20120328:123650.273 server #29 started ipmi poller #5 1587:20120328:123650.275 server #30 started proxy poller #1 1539:20120328:123650.280 server #0 started [main process] 1585:20120328:123650.284 server #28 started ipmi poller #4 1592:20120328:123650.289 server #31 started self-monitoring #1 1555:20120328:123651.371 housekeeper deleted: 10190 records from history and trends, 500 records of deleted items, 0 events, 0 alerts, 0 sessions 1547:20120328:123655.299 temporarily disabling IPMI checks on host [F2-VH-01]: host unavailable 1580:20120328:123700.385 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: first network error, wait for 15 seconds 1547:20120328:123712.952 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:123712.967 temporarily disabling IPMI checks on host [F2-CN-01]: host unavailable 1581:20120328:123713.314 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1547:20120328:123715.974 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds 1547:20120328:123728.994 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:123731.006 IPMI item [FAN_MOD_4A_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds 1586:20120328:123739.330 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1547:20120328:123746.025 temporarily disabling IPMI checks on host [F2-VH-02]: host unavailable 1547:20120328:123754.040 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:123758.285 enabling IPMI checks on host [F2-VH-01]: host became available 1585:20120328:123809.353 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 
seconds 1547:20120328:123815.477 enabling IPMI checks on host [F2-CN-01]: host became available 1586:20120328:123816.364 IPMI item [FAN_4_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1581:20120328:123816.364 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:123824.499 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:123831.521 resuming IPMI checks on host [F2-VH-01]: connection restored 1547:20120328:123831.529 resuming IPMI checks on host [F2-CN-01]: connection restored 1585:20120328:123839.381 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1581:20120328:123842.384 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1586:20120328:123846.388 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:123854.560 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:123857.572 resuming IPMI checks on host [F2-CN-01]: connection restored 1547:20120328:123901.582 resuming IPMI checks on host [F2-VH-01]: connection restored 1586:20120328:123911.420 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1582:20120328:123912.411 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1582:20120328:123917.457 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:123926.620 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:123927.629 resuming IPMI checks on host [F2-CN-01]: connection restored 1547:20120328:123932.640 resuming IPMI checks on host [F2-VH-01]: connection restored 1585:20120328:123942.166 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1585:20120328:123946.191 IPMI item [Ambient_Temp] on host [F2-VH-01] 
failed: first network error, wait for 15 seconds 1547:20120328:123957.679 resuming IPMI checks on host [F2-CN-01]: connection restored 1547:20120328:124001.691 resuming IPMI checks on host [F2-VH-01]: connection restored 1586:20120328:124013.463 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1586:20120328:124013.477 IPMI item [FAN_1_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1586:20120328:124016.486 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:124028.718 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:124028.726 resuming IPMI checks on host [F2-CN-01]: connection restored 1547:20120328:124031.735 resuming IPMI checks on host [F2-VH-01]: connection restored 1586:20120328:124039.505 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1582:20120328:124042.515 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1586:20120328:124046.517 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:124049.716 enabling IPMI checks on host [F2-VH-02]: host became available 1547:20120328:124054.727 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:124057.741 resuming IPMI checks on host [F2-CN-01]: connection restored 1582:20120328:124059.529 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds 1547:20120328:124101.759 resuming IPMI checks on host [F2-VH-01]: connection restored 1586:20120328:124109.537 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1580:20120328:124112.541 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1547:20120328:124114.776 resuming IPMI checks on host [F2-VH-02]: connection restored 1582:20120328:124116.547 IPMI 
item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:124124.796 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:124127.807 resuming IPMI checks on host [F2-CN-01]: connection restored 1586:20120328:124129.554 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds 1547:20120328:124131.818 resuming IPMI checks on host [F2-VH-01]: connection restored 1582:20120328:124141.569 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1586:20120328:124142.568 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1547:20120328:124144.832 resuming IPMI checks on host [F2-VH-02]: connection restored 1581:20120328:124146.017 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: another network error, wait for 15 seconds 1581:20120328:124147.024 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds 1547:20120328:124157.854 resuming IPMI checks on host [F2-CN-01]: connection restored 1581:20120328:124159.046 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds 1547:20120328:124201.865 resuming IPMI checks on host [F2-CN-02]: connection restored 1547:20120328:124202.872 resuming IPMI checks on host [F2-VH-01]: connection restored 1581:20120328:124209.063 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds 1581:20120328:124212.072 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds 1547:20120328:124214.886 resuming IPMI checks on host [F2-VH-02]: connection restored 1581:20120328:124216.085 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds ... ####
Chris Witte added a comment -

It seems that the BMC gets too many requests/connections.
From time to time I get the following message when running ipmitool:

# ipmitool sdr -H <HOSTNAME> -U <USER> -P <PASSWORD> -L USER

Get Session Challenge command failed: Node busy
Error: Unable to establish LAN session
Get Device ID command failed
Unable to open SDR for reading

####

Does Zabbix use an SDR cache? This could improve performance.

ipmitool offers this parameter:

-S <sdr_cache_file>
Use local file for remote SDR cache. Using a local SDR cache
can drastically increase performance for commands that require
knowledge of the entire SDR to perform their function. Local
SDR cache from a remote system can be created with the sdr dump
command.

BMC busy topic: http://old.nabble.com/possible-causes-for-%22ipmi_ctx_open_outofband%3A-BMC-busy%22-td31448014.html
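
The SDR cache idea above can be sketched as two ipmitool invocations: one that dumps the SDR to a local file with `sdr dump`, and one that reuses that file through `-S`. This is only an illustration of the quoted option; the helper names and the cache path are hypothetical, and whether a cache actually relieves a busy BMC is not confirmed in this thread.

```python
from typing import List


def sdr_dump_command(host: str, user: str, password: str, cache_file: str) -> List[str]:
    """Build the ipmitool call that writes the remote SDR to a local cache file."""
    return ["ipmitool", "-H", host, "-U", user, "-P", password,
            "-L", "USER", "sdr", "dump", cache_file]


def sdr_cached_command(host: str, user: str, password: str, cache_file: str) -> List[str]:
    """Build the ipmitool call that reads sensors using the local SDR cache (-S)."""
    return ["ipmitool", "-S", cache_file, "-H", host, "-U", user, "-P", password,
            "-L", "USER", "sdr"]
```

Each list can be passed to `subprocess.run(...)`. The dump only needs to be repeated when the SDR itself changes, for example after a firmware update.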

Chris Witte added a comment -

Posted this problem on Dell Community: http://en.community.dell.com/support-forums/servers/f/177/p/19442918/20078853.aspx#20078853

Sergey Sireskin added a comment -

My colleague has done some testing on this issue, and he came to the conclusion that the IPMI CPU is unable to handle all those requests. As he says, Zabbix opens a separate connection for each request to an IPMI host, and the IBM System x IMM module is unable to handle all of them. So he had to write a wrapper script that requests all IPMI items from the host at once, stores them in a cache file, and returns items to Zabbix when it requests them.

Chris Witte added a comment -

Thanks for your reply. Could you post the wrapper script here?

What about caching the SDR query like ipmitool does when using the parameter -S?

-S <sdr_cache_file>
Use local file for remote SDR cache. Using a local SDR cache
can drastically increase performance for commands that require
knowledge of the entire SDR to perform their function. Local
SDR cache from a remote system can be created with the sdr dump
command.

I know that freeipmi automatically creates a cache file of the SDR. But Zabbix uses openipmi.
Zabbix's IPMI engine would surely perform better if it used the caching option by default.

Chris

Sergey Sireskin added a comment - - edited

The script is rather simple: it just stores values in a local file with a timestamp. Then, when Zabbix requests a value, the script examines the timestamp and either renews its cache first or, if the cache is recent enough, just returns the data from it.
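
The original script was not posted in this thread, so the following is only a minimal sketch of the approach described above, not the actual wrapper. The function name, the JSON cache layout, the 60-second default, and the `fetch_all` callback (which would query all sensors from the BMC in one session) are assumptions made for illustration.

```python
import json
import os
import tempfile
import time
from typing import Callable, Dict, Optional


def get_cached_value(
    sensor: str,
    cache_path: str,
    fetch_all: Callable[[], Dict[str, str]],
    max_age: float = 60.0,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return one sensor value, refreshing the whole cache only when it is stale."""
    now = time.time() if now is None else now

    # Try to load the existing cache; a missing or corrupt file counts as "no cache".
    try:
        with open(cache_path, "r", encoding="utf-8") as fh:
            cache = json.load(fh)
    except (OSError, ValueError):
        cache = None

    stale = (
        not isinstance(cache, dict)
        or "values" not in cache
        or now - cache.get("timestamp", 0) > max_age
    )
    if stale:
        # One bulk request to the BMC instead of one connection per item.
        cache = {"timestamp": now, "values": fetch_all()}
        # Write to a temporary file and rename it, so readers never see a partial cache.
        directory = os.path.dirname(os.path.abspath(cache_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)
        os.replace(tmp_path, cache_path)

    return cache["values"].get(sensor)
```

In practice `fetch_all` might wrap a single ipmitool sensor listing and parse its output. How the script is wired into Zabbix is not stated in the thread.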

Sergey Sireskin added a comment - - edited

This issue is covered by ZBXNEXT-1210, which is related to ZBXNEXT-98.

Aaron Smart added a comment - - edited

I'm experiencing the same network errors in the server log as Chris above (running 2.0.2) when trying to connect to a Dell PowerEdge 1950 (BMC) and a PowerEdge R210 II (iDRAC 6 Express). Is there some way to make the IPMI poller more accommodating of slow devices?

dimir added a comment -

There has been a discussion recently about fixing this one. We will report as soon as there is more information.

Falk G. added a comment -

I have the same issue, and it's not related to Dell. I am using Supermicro IPMI to monitor RAM and environment temperature, and I get the same errors:

24006:20120909:171011.011 IPMI item [P2-DIMM2B_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171033.326 IPMI item [P2-DIMM2B_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171048.330 IPMI item [P2-DIMM3B_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171104.539 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:171111.023 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171126.547 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:171135.955 IPMI item [Fan5] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171150.556 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:171559.993 IPMI item [Fan5] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
24007:20120909:171603.995 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171625.608 IPMI item [Fan6] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171641.611 IPMI item [Fan3] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171657.737 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:171718.004 IPMI item [P1-DIMM3A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171733.748 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172329.683 IPMI item [Fan2] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172351.819 IPMI item [Fan2] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:172407.825 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:172424.027 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172429.695 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172444.037 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:172453.725 IPMI item [P1-DIMM2B_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172508.047 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:172511.730 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172526.056 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172559.839 IPMI item [Fan2] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172614.069 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172615.843 IPMI item [P1-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172630.078 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:173101.923 IPMI item [Fan3] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173123.135 IPMI item [Fan3] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173139.139 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173155.263 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:173156.932 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173211.665 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:173347.011 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173409.694 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173425.699 IPMI item [System_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
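
To compare how often each host fails across logs like the one above, the entries can be tallied per host. A small sketch, assuming the line format shown in this thread (`PID:YYYYMMDD:HHMMSS.mmm message`); the function name and regular expression are illustrative and not part of Zabbix.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the "first network error" / "another network error" entries quoted above.
_NETWORK_ERROR = re.compile(
    r"^\s*\d+:\d{8}:\d{6}\.\d{3} "
    r"IPMI item \[(?P<item>[^\]]+)\] on host \[(?P<host>[^\]]+)\] failed: "
    r"(?P<kind>first|another) network error"
)


def count_network_errors(lines: Iterable[str]) -> Counter:
    """Count IPMI network errors per (host, kind), ignoring all other log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = _NETWORK_ERROR.match(line)
        if match:
            counts[(match.group("host"), match.group("kind"))] += 1
    return counts
```

Feeding it an open `zabbix_server.log` file object yields a quick per-host summary; a high share of "another" entries means the retries are failing as well.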

23678:20120904:202202.840 Starting Zabbix Server. Zabbix 2.0.2 (revision 29214).
23678:20120904:202202.840 ****** Enabled features ******
23678:20120904:202202.840 SNMP monitoring: YES
23678:20120904:202202.840 IPMI monitoring: YES
23678:20120904:202202.840 WEB monitoring: NO
23678:20120904:202202.840 Jabber notifications: NO
23678:20120904:202202.840 Ez Texting notifications: NO
23678:20120904:202202.840 ODBC: NO
23678:20120904:202202.840 SSH2 support: NO
23678:20120904:202202.840 IPv6 support: NO
23678:20120904:202202.840 ******************************
23680:20120904:202202.900 server #1 started configuration syncer #1
23681:20120904:202202.900 server #2 started db watchdog #1
23682:20120904:202202.901 server #3 started poller #1
23683:20120904:202202.902 server #4 started poller #2
23684:20120904:202202.904 server #5 started poller #3
23685:20120904:202202.905 server #6 started poller #4
23686:20120904:202202.906 server #7 started poller #5
23678:20120904:202202.906 server #0 started [main process]
23704:20120904:202202.906 server #25 started ipmi poller #1
23687:20120904:202202.907 server #8 started unreachable poller #1
23705:20120904:202202.907 server #26 started ipmi poller #2
23706:20120904:202202.907 server #27 started ipmi poller #3
23707:20120904:202202.907 server #28 started proxy poller #1
23708:20120904:202202.908 server #29 started self-monitoring #1
23692:20120904:202202.910 server #13 started trapper #5
23693:20120904:202202.910 server #14 started icmp pinger #1
23698:20120904:202202.911 server #19 started discoverer #1
23697:20120904:202202.911 server #18 started http poller #1
23696:20120904:202202.912 server #17 started timer #1
23695:20120904:202202.912 server #16 started housekeeper #1
23695:20120904:202202.912 executing housekeeper
23694:20120904:202202.912 server #15 started alerter #1
23699:20120904:202202.913 server #20 started history syncer #1
23688:20120904:202202.913 server #9 started trapper #1
23689:20120904:202202.913 server #10 started trapper #2
23690:20120904:202202.913 server #11 started trapper #3
23691:20120904:202202.913 server #12 started trapper #4
23702:20120904:202202.914 server #23 started history syncer #4
23701:20120904:202202.914 server #22 started history syncer #3
23700:20120904:202202.914 server #21 started history syncer #2
23703:20120904:202202.915 server #24 started escalator #1

Milosz Modrzewski added a comment -

Same problem for me:

Zabbix server v2.0.2 --> Zabbix proxy v2.0.2 (revision 29214) --> Dell Remote Access Controller 5 A01 Firmware Version 1.60 (11.03.03) IP: 172.30.5.96
Zabbix server v2.0.2 --> Zabbix proxy v2.0.2 (revision 29214) --> iLO4 Firmware Version 1.05 ILOCZ22240991 IP: 172.30.5.98

1689:20120927:144113.549 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144123.137 Received configuration data from server. Datalen 7766
1709:20120927:144206.227 IPMI item [ipmi.ambient_temp] on host [172.30.5.96] failed: first network error, wait for 15 seconds
1679:20120927:144223.268 Received configuration data from server. Datalen 7766
1689:20120927:144228.939 resuming IPMI checks on host [172.30.5.96]: connection restored
1679:20120927:144323.726 Received configuration data from server. Datalen 7766
1711:20120927:144416.828 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: first network error, wait for 15 seconds
1679:20120927:144423.899 Received configuration data from server. Datalen 7766
1689:20120927:144432.106 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: another network error, wait for 15 seconds
1689:20120927:144457.405 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144524.008 Received configuration data from server. Datalen 7766

Anton Samets added a comment - - edited

I know a solution for this issue:
in my case, if you are using LO-100, you must set the password size to 16 bytes (not 20). After that, IPMI monitoring starts to work.
So Zabbix doesn't use IPMI 2.0, and I can't find where to set it.

Command output with the password size set to 20 bytes:

ipmitool  -H 10.145.1.129 -U admin -P admin chassis status
Invalid user name
Error: Unable to establish LAN session
Error sending Chassis Status command
ipmitool -I lanplus -H 10.145.1.129 -U admin -P admin chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : previous
Last Power Event     : 
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Sleep Button Disable : allowed
Diag Button Disable  : allowed
Reset Button Disable : allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : false
Reset Button Disabled: false
Power Button Disabled: false

So, where can we set parameters for ipmi-tools?
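
To repeat this comparison against other BMCs, the two commands can be wrapped in a small probe. This is only a sketch: it assumes `ipmitool` is installed and on the PATH, and it passes `-I lan` explicitly on the assumption that this is the interface the first command above fell back to when `-I` was omitted. The function names are hypothetical.

```python
import subprocess


def ipmitool_argv(host, user, password, interface=None):
    """Build an `ipmitool ... chassis status` command line as an argv list."""
    argv = ["ipmitool"]
    if interface is not None:
        argv += ["-I", interface]
    argv += ["-H", host, "-U", user, "-P", password, "chassis", "status"]
    return argv


def probe(host, user, password, timeout=30):
    """Return {interface: succeeded} for the 'lan' and 'lanplus' interfaces."""
    results = {}
    for interface in ("lan", "lanplus"):
        completed = subprocess.run(
            ipmitool_argv(host, user, password, interface),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # ipmitool exits non-zero when the session cannot be established.
        results[interface] = completed.returncode == 0
    return results


if __name__ == "__main__":
    # Address and credentials taken from the commands above.
    print(probe("10.145.1.129", "admin", "admin"))
```

A result where only `lanplus` succeeds would match the output shown above. `subprocess.run` raises `FileNotFoundError` if `ipmitool` is missing and `TimeoutExpired` if the BMC does not answer in time.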

Anton Samets added a comment -

Hm, I found that if you change the Authentication algorithm from "none" to "RMCP+", everything works fine.

Andrej Kacian added a comment -

I had the same problem too (monitoring 5 hosts with around 10 items each): items were intermittently going unsupported every minute or so. Based on a suggestion from the forums[1], I reduced the number of IPMI pollers to just one. Since then, there have been no problems getting IPMI values at all. This was on Zabbix 2.0.3 at the time, and it still works flawlessly on 2.0.5 with just one IPMI poller.

1. https://www.zabbix.com/forum/showpost.php?s=783bdc9aff7d3ea26999f74f4d223e59&p=118389&postcount=4

Alexey Pustovalov added a comment -

If an IPMI sensor is located at the end of the sensor table, getting its value can take about 40-50 seconds and can sometimes fail with a network error:

 10673:20140224:182441.012 In get_value() key:'ipmi.cpu[FAN 1]'
 10673:20140224:182441.012 In get_value_ipmi() key:'Zabbix server:ipmi.cpu[FAN 1]'
 10673:20140224:182441.012 In init_ipmi_host() host:'[10.100.52.28]:623'
 10673:20140224:182441.012 In get_ipmi_host() host:'[10.100.52.28]:623'
 10673:20140224:182441.012 End of get_ipmi_host():0x2f9bbf0
 10673:20140224:182441.013 End of init_ipmi_host():0x2f9bbf0
 10673:20140224:182441.013 In get_ipmi_sensor_by_id() sensor:'FAN 1@[10.100.52.28]:623'
 10673:20140224:182441.013 End of get_ipmi_sensor_by_id():0x307fcf8
 10673:20140224:182441.013 In read_ipmi_sensor() sensor:'FAN 1@[10.100.52.28]:623'
 10673:20140224:182448.020 In got_thresh_reading()
 10673:20140224:182448.020 got_thresh_reading() fail: [16777411] Unknown error 16777411
 10673:20140224:182448.020 End of got_thresh_reading():NETWORK_ERROR
 10673:20140224:182448.020 End of read_ipmi_sensor():NETWORK_ERROR
 10673:20140224:182448.020 Item [Zabbix server:ipmi.cpu[FAN 1]] error: error 0x10000c3 while reading threshold sensor
 10673:20140224:182448.020 End of get_value():NETWORK_ERROR
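
The log above reports the same failure twice, once in decimal (16777411) and once in hexadecimal (0x10000c3), about 7 seconds after read_ipmi_sensor() was entered (182441.013 to 182448.020). The short sketch below only splits that number into its high and low parts. The mask name is an assumption modelled on the OpenIPMI convention of OR-ing an IPMI completion code with 0x01000000. If that convention applies here, the low byte 0xc3 corresponds to the completion code the IPMI specification defines as "Timeout while processing command".

```python
# Assumed to mirror OpenIPMI's marker for "this value wraps an IPMI completion code".
IPMI_ERR_TOP = 0x01000000


def split_error(code):
    """Split a combined error value into (high class bits, low byte)."""
    return code & 0xFF000000, code & 0xFF


logged_decimal = 16777411   # from "got_thresh_reading() fail: [16777411]"
logged_hex = 0x10000C3      # from "error 0x10000c3 while reading threshold sensor"

top, completion_code = split_error(logged_decimal)
print(logged_decimal == logged_hex)        # True
print(hex(top), hex(completion_code))      # 0x1000000 0xc3
print(top == IPMI_ERR_TOP)                 # True
```

Read this way, the NETWORK_ERROR looks like a BMC-side timeout on the request rather than evidence of packet loss, which would be consistent with the stable ping graphs reported earlier in this issue.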
Michael Sphar added a comment -

I have had the same experience: reducing the number of IPMI pollers to just one stopped the frequent Network Error messages and gaps.

I tested on both my production server and a small test VM server. The production server was polling only a few IPMI items alongside a lot of non-IPMI monitoring; the test server was polling only a few IPMI items and doing no other monitoring. Both servers showed frequent Network Errors, and gaps from the items then going unsupported. I reduced the number of IPMI pollers to just one last week and have yet to see a Network Error warning since. Neither server showed all the IPMI pollers as busy.

This is with Zabbix 2.2.2.

Thinking it might have something to do with two different IPMI pollers polling the same device at the same time, I did a simple test: from two different hosts, I issued an ipmitool sensor command to the same IPMI device. What I observed is that the IPMI device sends its output to only one requesting host at a time. One ipmitool output starts scrolling while the other pauses for a few seconds, then the other starts scrolling and the first one pauses, and this goes back and forth a few times until both are complete.
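
If concurrent requests to one BMC are indeed the trigger, one way to picture the effect of running a single poller is a per-device lock that lets only one reader talk to a given address at a time. The sketch below is purely illustrative and is not how Zabbix schedules its IPMI pollers. The helper names, the placeholder address 10.0.0.1, and the fake sensor read are all hypothetical.

```python
import threading
import time
from collections import defaultdict

_device_locks = defaultdict(threading.Lock)  # one lock per IPMI device address
_registry_lock = threading.Lock()            # guards creation of new entries


def lock_for(device):
    """Return the lock that guards one IPMI device address."""
    with _registry_lock:
        return _device_locks[device]


def poll_serialized(device, read_sensor):
    """Call read_sensor(device) while holding that device's lock."""
    with lock_for(device):
        return read_sensor(device)


def demo():
    """Poll one device from two threads; return the peak number of concurrent reads."""
    active = 0
    peak = 0
    counter_lock = threading.Lock()

    def fake_read(device):
        nonlocal active, peak
        with counter_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # stand-in for a slow BMC response
        with counter_lock:
            active -= 1
        return 42

    threads = [
        threading.Thread(target=poll_serialized, args=("10.0.0.1", fake_read))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


print(demo())  # 1: the two reads never overlap
```

A lock like this only serializes readers inside one process. It would not help in the two-host ipmitool test described above, where the requests come from separate machines.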



Dates

  • Created:
    Updated: