The network connection is stable; this is confirmed by ping graphs. The ping interval is 3 seconds and the IPMI polling interval is 300 seconds. The IPMI library version is 2.0.16-7.el5. StartIPMIPollers=2; 3 hosts are monitored via IPMI, and some other hosts are monitored via agents, SNMP and simple checks.
|
This seems to be the same problem as in ZBX-3188.
|
Aleksandrs Saveljevs, no, this is not the same problem as ZBX-3188: I have no "host unreachable" errors in the Zabbix log. Besides that, all sensors work in my setup. The gaps in data happen periodically: after some time the sensor data becomes reachable again, then after a while a network error occurs, then it becomes OK again. Just see the graphs attached to my first post.
|
Could it be that polling IPMI too often makes it slow, locks it up, or triggers some connection throttling?
How many IPMI items do you have? Do they all have the same interval?
|
There are 3 hosts with 125 IPMI items each. Polling interval is set to 300 seconds for each item.
I'm using Zabbix 1.8.5 now and don't experience this problem any more.
I can't remember when the problem disappeared; it could be the Zabbix update or the changes to the IPMI template
that I made some time ago.
The only thing I can say for sure is that I didn't change any settings on the IPMI devices.
Looking at the template in the attached ipmi_error_report.tgz archive, I can see that my current template is
definitely different from the old one. The old template had only 19 items.
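For scale, the load described above works out as follows. This assumes one IPMI request per item per polling interval and evenly spread polls; real scheduling can bunch requests together, so momentary peaks may be higher.

```python
def ipmi_request_rate(hosts: int, items_per_host: int, interval_s: float) -> tuple:
    """Return (requests/s per BMC, requests/s in total).

    Assumes every item costs exactly one request per interval and that
    polls are spread evenly over the interval."""
    per_host = items_per_host / interval_s
    return per_host, per_host * hosts


per_host, total = ipmi_request_rate(hosts=3, items_per_host=125, interval_s=300)
print(f"{per_host:.3f} req/s per BMC, {total:.2f} req/s in total")
```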
|
Same errors with Zabbix 1.8.10 / 2.0.0 RC1 / 2.0.0 RC2.
I installed a new machine (Debian 6.0.4 x64, virtual machine on VMware ESXi 5) and installed Zabbix 2.0 RC1, compiled with openipmi-2.0.19 (I tried older versions as well).
Zabbix 2.0 is monitoring just ONE host with ONE item (directly, no template) for testing, and the errors still appear in zabbix_server.log.
(Interval: 15 sec, no flexible intervals)
I assumed that the BMC was too busy and ran checks with openipmish (two checks per second).
Result: all requests were answered correctly and on time.
Monitored Host: Dell PowerEdge R610 + R710 with iDRAC6 - Ver: 1.80 (also tested with Ver. 1.71)
-
- configure##
./configure --enable-server --enable-agent --with-mysql --enable-ipv6 --with-net-snmp --with-libcurl --with-ssh2 --with-ldap --enable-proxy --with-openipmi --prefix=/opt/zabbix
###
-
- zabbix_server.conf ##
StartPollers=5
StartIPMIPollers=5 # incremented step-by-step but no changes
###
-
- zabbix_server.log - Zabbix 2.0.0 RC1 ##
13292:20120327:113333.407 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113350.583 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:113406.933 resuming IPMI checks on host [F2-CN-01]: connection restored
13288:20120327:113425.415 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113440.949 resuming IPMI checks on host [F2-CN-01]: connection restored
13288:20120327:113504.423 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113519.964 resuming IPMI checks on host [F2-CN-01]: connection restored
13289:20120327:113555.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113610.980 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:113625.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113640.994 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:113649.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113705.008 resuming IPMI checks on host [F2-CN-01]: connection restored
13289:20120327:113718.027 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:113733.026 resuming IPMI checks on host [F2-CN-01]: connection restored
13288:20120327:113955.094 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114017.064 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:114033.070 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:114050.561 resuming IPMI checks on host [F2-CN-01]: connection restored
13289:20120327:114110.112 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114125.586 resuming IPMI checks on host [F2-CN-01]: connection restored
13288:20120327:114134.110 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114149.602 resuming IPMI checks on host [F2-CN-01]: connection restored
13289:20120327:114219.124 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114234.618 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:114640.743 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114702.663 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:114718.673 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:114736.178 resuming IPMI checks on host [F2-CN-01]: connection restored
13289:20120327:114755.756 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114810.189 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:114819.756 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:114834.205 resuming IPMI checks on host [F2-CN-01]: connection restored
13290:20120327:115155.014 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:115217.250 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:115233.257 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:115250.893 resuming IPMI checks on host [F2-CN-01]: connection restored
13292:20120327:115310.016 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:115325.904 resuming IPMI checks on host [F2-CN-01]: connection restored
13292:20120327:115333.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:115348.920 resuming IPMI checks on host [F2-CN-01]: connection restored
13290:20120327:115418.036 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:115433.941 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:115510.801 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:115525.959 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:115534.808 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:115549.975 resuming IPMI checks on host [F2-CN-01]: connection restored
13292:20120327:115940.013 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:120002.020 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:120018.026 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:120035.679 resuming IPMI checks on host [F2-CN-01]: connection restored
13290:20120327:120055.023 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:120110.695 resuming IPMI checks on host [F2-CN-01]: connection restored
13290:20120327:120119.035 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:120134.711 resuming IPMI checks on host [F2-CN-01]: connection restored
13292:20120327:120149.035 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:120204.725 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:120255.625 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:120317.749 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:120333.765 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:120351.272 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:120404.641 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:120419.283 resuming IPMI checks on host [F2-CN-01]: connection restored
13288:20120327:121655.018 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:121717.398 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:121732.402 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:121749.902 resuming IPMI checks on host [F2-CN-01]: connection restored
13288:20120327:121803.030 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:121818.918 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:121855.640 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:121910.935 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:121919.648 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:121934.950 resuming IPMI checks on host [F2-CN-01]: connection restored
13289:20120327:122249.007 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:122311.990 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:122327.995 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds
13271:20120327:122345.767 resuming IPMI checks on host [F2-CN-01]: connection restored
13291:20120327:122355.249 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds
13271:20120327:122410.781 resuming IPMI checks on host [F2-CN-01]: connection restored
###
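The gaps in logs like the one above can be quantified with a short script. This sketch assumes the `PID:YYYYMMDD:HHMMSS.mmm message` line format shown here. It pairs each "first network error" with the next "connection restored" for the same host and ignores the "host unavailable" and "host became available" lines.

```python
import re
from datetime import datetime

# "PID:YYYYMMDD:HHMMSS.mmm message", as in the zabbix_server.log excerpts
LINE_RE = re.compile(r"^\s*\d+:(\d{8}):(\d{6}\.\d{3}) (.*)$")
HOST_RE = re.compile(r"on host \[([^\]]+)\]")


def ipmi_outages(lines):
    """Yield (host, seconds) from each 'first network error' line to the
    next 'connection restored' line for the same host."""
    started = {}  # host -> timestamp of the first network error
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        date, time_of_day, msg = m.groups()
        host_m = HOST_RE.search(msg)
        if not host_m:
            continue
        host = host_m.group(1)
        ts = datetime.strptime(date + time_of_day, "%Y%m%d%H%M%S.%f")
        if "first network error" in msg:
            started[host] = ts
        elif "connection restored" in msg and host in started:
            yield host, (ts - started.pop(host)).total_seconds()


sample = [
    "13292:20120327:113333.407 IPMI item [System_Level] on host [F2-CN-01] failed: first network error, wait for 15 seconds",
    "13271:20120327:113350.583 IPMI item [System_Level] on host [F2-CN-01] failed: another network error, wait for 15 seconds",
    "13271:20120327:113406.933 resuming IPMI checks on host [F2-CN-01]: connection restored",
]
for host, secs in ipmi_outages(sample):
    print(host, round(secs, 3))  # F2-CN-01 33.526
```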
-EDIT: 2012 Mar 28-
1. Converted the VM from VMware to VirtualBox (Windows) on another host (Win7) in another network segment (to rule out the hypervisor, the host OS and network connectivity as the source of the errors)
2. Compiled Zabbix 2.0.0 RC2 and updated the system
3. Added host and templates
Result:
-
- zabbix_server.log - Zabbix 2.0.0 RC2 ##
1539:20120328:123650.011 Starting Zabbix Server. Zabbix 2.0.0rc2 (revision 26343).
1539:20120328:123650.011 ****** Enabled features ******
1539:20120328:123650.011 SNMP monitoring: YES
1539:20120328:123650.012 IPMI monitoring: YES
1539:20120328:123650.012 WEB monitoring: YES
1539:20120328:123650.012 Jabber notifications: NO
1539:20120328:123650.012 Ez Texting notifications: YES
1539:20120328:123650.012 ODBC: NO
1539:20120328:123650.012 SSH2 support: YES
1539:20120328:123650.012 IPv6 support: YES
1539:20120328:123650.012 ******************************
1541:20120328:123650.068 server #2 started db watchdog #1
1540:20120328:123650.070 server #1 started configuration syncer #1
1548:20120328:123650.126 server #9 started trapper #1
1549:20120328:123650.128 server #10 started trapper #2
1550:20120328:123650.130 server #11 started trapper #3
1551:20120328:123650.158 server #12 started trapper #4
1544:20120328:123650.161 server #5 started poller #3
1542:20120328:123650.163 server #3 started poller #1
1545:20120328:123650.164 server #6 started poller #4
1543:20120328:123650.165 server #4 started poller #2
1546:20120328:123650.167 server #7 started poller #5
1547:20120328:123650.170 server #8 started unreachable poller #1
1552:20120328:123650.173 server #13 started trapper #5
1553:20120328:123650.179 server #14 started icmp pinger #1
1554:20120328:123650.185 server #15 started alerter #1
1555:20120328:123650.192 server #16 started housekeeper #1
1555:20120328:123650.192 executing housekeeper
1566:20120328:123650.204 server #17 started timer #1
1567:20120328:123650.206 server #18 started http poller #1
1569:20120328:123650.215 server #20 started history syncer #1
1570:20120328:123650.217 server #21 started history syncer #2
1571:20120328:123650.220 server #22 started history syncer #3
1572:20120328:123650.223 server #23 started history syncer #4
1579:20120328:123650.244 server #24 started escalator #1
1580:20120328:123650.247 server #25 started ipmi poller #1
1581:20120328:123650.250 server #26 started ipmi poller #2
1582:20120328:123650.253 server #27 started ipmi poller #3
1568:20120328:123650.262 server #19 started discoverer #1
1586:20120328:123650.273 server #29 started ipmi poller #5
1587:20120328:123650.275 server #30 started proxy poller #1
1539:20120328:123650.280 server #0 started [main process]
1585:20120328:123650.284 server #28 started ipmi poller #4
1592:20120328:123650.289 server #31 started self-monitoring #1
1555:20120328:123651.371 housekeeper deleted: 10190 records from history and trends, 500 records of deleted items, 0 events, 0 alerts, 0 sessions
1547:20120328:123655.299 temporarily disabling IPMI checks on host [F2-VH-01]: host unavailable
1580:20120328:123700.385 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: first network error, wait for 15 seconds
1547:20120328:123712.952 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:123712.967 temporarily disabling IPMI checks on host [F2-CN-01]: host unavailable
1581:20120328:123713.314 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1547:20120328:123715.974 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds
1547:20120328:123728.994 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:123731.006 IPMI item [FAN_MOD_4A_RPM] on host [F2-VH-02] failed: another network error, wait for 15 seconds
1586:20120328:123739.330 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1547:20120328:123746.025 temporarily disabling IPMI checks on host [F2-VH-02]: host unavailable
1547:20120328:123754.040 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:123758.285 enabling IPMI checks on host [F2-VH-01]: host became available
1585:20120328:123809.353 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1547:20120328:123815.477 enabling IPMI checks on host [F2-CN-01]: host became available
1586:20120328:123816.364 IPMI item [FAN_4_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1581:20120328:123816.364 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:123824.499 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:123831.521 resuming IPMI checks on host [F2-VH-01]: connection restored
1547:20120328:123831.529 resuming IPMI checks on host [F2-CN-01]: connection restored
1585:20120328:123839.381 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1581:20120328:123842.384 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1586:20120328:123846.388 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:123854.560 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:123857.572 resuming IPMI checks on host [F2-CN-01]: connection restored
1547:20120328:123901.582 resuming IPMI checks on host [F2-VH-01]: connection restored
1586:20120328:123911.420 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1582:20120328:123912.411 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1582:20120328:123917.457 IPMI item [FAN_MOD_1B_RPM] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:123926.620 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:123927.629 resuming IPMI checks on host [F2-CN-01]: connection restored
1547:20120328:123932.640 resuming IPMI checks on host [F2-VH-01]: connection restored
1585:20120328:123942.166 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1585:20120328:123946.191 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:123957.679 resuming IPMI checks on host [F2-CN-01]: connection restored
1547:20120328:124001.691 resuming IPMI checks on host [F2-VH-01]: connection restored
1586:20120328:124013.463 IPMI item [FAN_4_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1586:20120328:124013.477 IPMI item [FAN_1_RPM] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1586:20120328:124016.486 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:124028.718 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:124028.726 resuming IPMI checks on host [F2-CN-01]: connection restored
1547:20120328:124031.735 resuming IPMI checks on host [F2-VH-01]: connection restored
1586:20120328:124039.505 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1582:20120328:124042.515 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1586:20120328:124046.517 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:124049.716 enabling IPMI checks on host [F2-VH-02]: host became available
1547:20120328:124054.727 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:124057.741 resuming IPMI checks on host [F2-CN-01]: connection restored
1582:20120328:124059.529 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
1547:20120328:124101.759 resuming IPMI checks on host [F2-VH-01]: connection restored
1586:20120328:124109.537 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1580:20120328:124112.541 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1547:20120328:124114.776 resuming IPMI checks on host [F2-VH-02]: connection restored
1582:20120328:124116.547 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:124124.796 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:124127.807 resuming IPMI checks on host [F2-CN-01]: connection restored
1586:20120328:124129.554 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
1547:20120328:124131.818 resuming IPMI checks on host [F2-VH-01]: connection restored
1582:20120328:124141.569 IPMI item [FAN_2_RPM] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1586:20120328:124142.568 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1547:20120328:124144.832 resuming IPMI checks on host [F2-VH-02]: connection restored
1581:20120328:124146.017 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: another network error, wait for 15 seconds
1581:20120328:124147.024 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
1547:20120328:124157.854 resuming IPMI checks on host [F2-CN-01]: connection restored
1581:20120328:124159.046 IPMI item [Ambient_Temp] on host [F2-VH-02] failed: first network error, wait for 15 seconds
1547:20120328:124201.865 resuming IPMI checks on host [F2-CN-02]: connection restored
1547:20120328:124202.872 resuming IPMI checks on host [F2-VH-01]: connection restored
1581:20120328:124209.063 IPMI item [Ambient_Temp] on host [F2-CN-02] failed: first network error, wait for 15 seconds
1581:20120328:124212.072 IPMI item [Ambient_Temp] on host [F2-CN-01] failed: first network error, wait for 15 seconds
1547:20120328:124214.886 resuming IPMI checks on host [F2-VH-02]: connection restored
1581:20120328:124216.085 IPMI item [Ambient_Temp] on host [F2-VH-01] failed: first network error, wait for 15 seconds
...
####
|
It seems that the BMC gets too many requests/connections.
From time to time I get the following message when running ipmitool:
Does Zabbix use an SDR cache? This could improve performance.
ipmitool offers this parameter:
BMC busy topic: http://old.nabble.com/possible-causes-for-%22ipmi_ctx_open_outofband%3A-BMC-busy%22-td31448014.html
|
I posted this problem on the Dell Community forum:
http://en.community.dell.com/support-forums/servers/f/177/p/19442918/20078853.aspx#20078853
|
My colleague has done some testing on this issue, and he concluded that the IPMI controller's CPU is unable to handle all those requests. According to him, Zabbix opens a separate connection for each request to an IPMI host, and the IBM System x IMM module cannot handle all of them. So he wrote a wrapper script that requests all IPMI items from the host at once, stores them in a cache file, and hands items to Zabbix when it requests them.
|
Thanks for your reply. Could you post the wrapper script here?
What about caching the SDR query, like ipmitool does when using the -S parameter?
I know that FreeIPMI automatically creates an SDR cache file, but Zabbix uses OpenIPMI.
Zabbix's IPMI engine would surely perform better if it used such caching by default.
Chris
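Here is a minimal sketch of the SDR caching suggested above, driving ipmitool from Python. `sdr dump <file>` saves the SDR once, and `-S <file>` makes later `sensor` calls read the SDR from that file. The host, credentials and cache path are placeholders. This is an external workaround, not something Zabbix's OpenIPMI-based poller does.

```python
import subprocess
from pathlib import Path


def ipmitool_base(host: str, user: str, password: str) -> list:
    # IPMI v2.0 (RMCP+) LAN session. "-P" exposes the password in the
    # process list; ipmitool's "-E" (password from IPMI_PASSWORD) avoids that.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password]


def sdr_dump_cmd(host: str, user: str, password: str, cache: Path) -> list:
    # Download the complete SDR repository once into a local file.
    return ipmitool_base(host, user, password) + ["sdr", "dump", str(cache)]


def sensor_cmd(host: str, user: str, password: str, cache: Path) -> list:
    # Read all sensors, taking SDR metadata from the local cache file
    # instead of re-reading the SDR from the BMC every time.
    return ipmitool_base(host, user, password) + ["-S", str(cache), "sensor"]


def read_sensors(host: str, user: str, password: str, cache: Path) -> str:
    if not cache.exists():
        subprocess.run(sdr_dump_cmd(host, user, password, cache), check=True)
    result = subprocess.run(sensor_cmd(host, user, password, cache),
                            check=True, capture_output=True, text=True)
    return result.stdout
```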
|
The script is rather simple: it just stores values in a local file with a timestamp. When Zabbix requests a value, the script checks the timestamp. If the cache is too old, it renews the cache first; if the cache is recent enough, it returns the data from it directly.
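The caching logic just described could look like this. It is a hypothetical sketch, not the colleague's actual script. `fetch_all` stands for whatever single call reads every sensor from the BMC at once (for example, one `ipmitool sensor` run). The JSON file layout and the TTL are assumptions.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, Dict


def cached_value(item: str, cache_file: Path, ttl_s: float,
                 fetch_all: Callable[[], Dict[str, float]],
                 now: Callable[[], float] = time.time) -> float:
    """Return one sensor value, refreshing the whole cache when it is older than ttl_s."""
    data = None
    if cache_file.exists():
        data = json.loads(cache_file.read_text())
        if now() - data["timestamp"] > ttl_s:
            data = None  # stale: refresh below
    if data is None:
        data = {"timestamp": now(), "values": fetch_all()}  # one BMC session for all items
        tmp = cache_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(data))
        tmp.replace(cache_file)  # rename so readers never see a half-written file
    return data["values"][item]


# Usage with a fake BMC and a fake clock
calls = []

def fake_fetch() -> Dict[str, float]:
    calls.append(1)
    return {"Ambient_Temp": 22.0, "FAN_1_RPM": 3600.0}

clock = [1000.0]
with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "bmc.json"
    a = cached_value("Ambient_Temp", cache, 60, fake_fetch, lambda: clock[0])
    b = cached_value("FAN_1_RPM", cache, 60, fake_fetch, lambda: clock[0])  # from cache
    clock[0] += 61
    c = cached_value("Ambient_Temp", cache, 60, fake_fetch, lambda: clock[0])  # refreshed
print(a, b, c, len(calls))  # 22.0 3600.0 22.0 2
```

Concurrent callers can still refresh the cache at the same moment; a file lock around the refresh would prevent that if it matters.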
|
This issue is covered by ZBXNEXT-1210, which is related to ZBXNEXT-98.
|
I'm seeing the same network errors in the server log as Chris above (running 2.0.2) when connecting to a Dell PowerEdge 1950 (BMC) and a PowerEdge R210 II (iDRAC 6 Express). Is there some way to make the IPMI poller more tolerant of slow devices?
|
There has been a discussion recently about fixing this one. We will report back as soon as there is more information.
|
I have the same issue, and it's not related to Dell. I am using Supermicro IPMI to monitor RAM and environment temperature, and I get the same errors:
24006:20120909:171011.011 IPMI item [P2-DIMM2B_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171033.326 IPMI item [P2-DIMM2B_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171048.330 IPMI item [P2-DIMM3B_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171104.539 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:171111.023 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171126.547 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:171135.955 IPMI item [Fan5] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171150.556 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:171559.993 IPMI item [Fan5] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
24007:20120909:171603.995 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171625.608 IPMI item [Fan6] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171641.611 IPMI item [Fan3] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:171657.737 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:171718.004 IPMI item [P1-DIMM3A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:171733.748 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172329.683 IPMI item [Fan2] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172351.819 IPMI item [Fan2] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:172407.825 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:172424.027 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172429.695 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172444.037 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:172453.725 IPMI item [P1-DIMM2B_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172508.047 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:172511.730 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172526.056 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172559.839 IPMI item [Fan2] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172614.069 resuming IPMI checks on host [Supermicro SC836]: connection restored
24006:20120909:172615.843 IPMI item [P1-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:172630.078 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:173101.923 IPMI item [Fan3] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173123.135 IPMI item [Fan3] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173139.139 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173155.263 resuming IPMI checks on host [Supermicro SC836]: connection restored
24007:20120909:173156.932 IPMI item [P2-DIMM2A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173211.665 resuming IPMI checks on host [Supermicro SC836]: connection restored
24005:20120909:173347.011 IPMI item [P1-DIMM1A_Temp] on host [Supermicro SC836] failed: first network error, wait for 15 seconds
23988:20120909:173409.694 IPMI item [P2-DIMM1A_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23988:20120909:173425.699 IPMI item [System_Temp] on host [Supermicro SC836] failed: another network error, wait for 15 seconds
23678:20120904:202202.840 Starting Zabbix Server. Zabbix 2.0.2 (revision 29214).
23678:20120904:202202.840 ****** Enabled features ******
23678:20120904:202202.840 SNMP monitoring: YES
23678:20120904:202202.840 IPMI monitoring: YES
23678:20120904:202202.840 WEB monitoring: NO
23678:20120904:202202.840 Jabber notifications: NO
23678:20120904:202202.840 Ez Texting notifications: NO
23678:20120904:202202.840 ODBC: NO
23678:20120904:202202.840 SSH2 support: NO
23678:20120904:202202.840 IPv6 support: NO
23678:20120904:202202.840 ******************************
23680:20120904:202202.900 server #1 started configuration syncer #1
23681:20120904:202202.900 server #2 started db watchdog #1
23682:20120904:202202.901 server #3 started poller #1
23683:20120904:202202.902 server #4 started poller #2
23684:20120904:202202.904 server #5 started poller #3
23685:20120904:202202.905 server #6 started poller #4
23686:20120904:202202.906 server #7 started poller #5
23678:20120904:202202.906 server #0 started [main process]
23704:20120904:202202.906 server #25 started ipmi poller #1
23687:20120904:202202.907 server #8 started unreachable poller #1
23705:20120904:202202.907 server #26 started ipmi poller #2
23706:20120904:202202.907 server #27 started ipmi poller #3
23707:20120904:202202.907 server #28 started proxy poller #1
23708:20120904:202202.908 server #29 started self-monitoring #1
23692:20120904:202202.910 server #13 started trapper #5
23693:20120904:202202.910 server #14 started icmp pinger #1
23698:20120904:202202.911 server #19 started discoverer #1
23697:20120904:202202.911 server #18 started http poller #1
23696:20120904:202202.912 server #17 started timer #1
23695:20120904:202202.912 server #16 started housekeeper #1
23695:20120904:202202.912 executing housekeeper
23694:20120904:202202.912 server #15 started alerter #1
23699:20120904:202202.913 server #20 started history syncer #1
23688:20120904:202202.913 server #9 started trapper #1
23689:20120904:202202.913 server #10 started trapper #2
23690:20120904:202202.913 server #11 started trapper #3
23691:20120904:202202.913 server #12 started trapper #4
23702:20120904:202202.914 server #23 started history syncer #4
23701:20120904:202202.914 server #22 started history syncer #3
23700:20120904:202202.914 server #21 started history syncer #2
23703:20120904:202202.915 server #24 started escalator #1
|
Same problem for me:
Zabbix server v2.0.2 --> Zabbix proxy v2.0.2 (revision 29214) --> Dell Remote Access Controller 5 A01 Firmware Version 1.60 (11.03.03) IP: 172.30.5.96
Zabbix server v2.0.2 --> Zabbix proxy v2.0.2 (revision 29214) --> iLO4 Firmware Version 1.05 ILOCZ22240991 IP: 172.30.5.98
1689:20120927:144113.549 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144123.137 Received configuration data from server. Datalen 7766
1709:20120927:144206.227 IPMI item [ipmi.ambient_temp] on host [172.30.5.96] failed: first network error, wait for 15 seconds
1679:20120927:144223.268 Received configuration data from server. Datalen 7766
1689:20120927:144228.939 resuming IPMI checks on host [172.30.5.96]: connection restored
1679:20120927:144323.726 Received configuration data from server. Datalen 7766
1711:20120927:144416.828 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: first network error, wait for 15 seconds
1679:20120927:144423.899 Received configuration data from server. Datalen 7766
1689:20120927:144432.106 IPMI item [ipmi.ambient_temp] on host [172.30.5.98] failed: another network error, wait for 15 seconds
1689:20120927:144457.405 resuming IPMI checks on host [172.30.5.98]: connection restored
1679:20120927:144524.008 Received configuration data from server. Datalen 7766
|
I know the solution for this issue, at least in my case:
if you are using an LO-100, you must set the password size to 16 bytes (not 20). After that, IPMI monitoring starts to work.
So Zabbix doesn't use IPMI 2.0, and I can't find where I can set it.
Output of the commands when the password size is set to 20 bytes:
ipmitool -H 10.145.1.129 -U admin -P admin chassis status
Invalid user name
Error: Unable to establish LAN session
Error sending Chassis Status command
ipmitool -I lanplus -H 10.145.1.129 -U admin -P admin chassis status
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : previous
Last Power Event :
Chassis Intrusion : inactive
Front-Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false
Sleep Button Disable : allowed
Diag Button Disable : allowed
Reset Button Disable : allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : false
Reset Button Disabled: false
Power Button Disabled: false
So where can we set such parameters in Zabbix, the way we can for ipmitool?
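The transcript shows that the same credentials fail over `-I lan` (IPMI v1.5) but work over `-I lanplus` (IPMI v2.0 / RMCP+). A quick way to test a BMC is to try one interface and then the other. The sketch below does that. It assumes that ipmitool exits with a non-zero status when it cannot establish the session. The function `chassis_status` and its injectable `run` parameter are hypothetical; `run` is only there so the fallback can be tested without a real BMC.

```python
import subprocess


def chassis_status(host, user, password, run=subprocess.run):
    """Try IPMI v1.5 ("-I lan") first, then fall back to IPMI v2.0 RMCP+ ("-I lanplus").

    Returns (interface_used, stdout). Assumes ipmitool exits non-zero on session failure.
    """
    for interface in ("lan", "lanplus"):
        argv = ["ipmitool", "-I", interface, "-H", host,
                "-U", user, "-P", password, "chassis", "status"]
        result = run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return interface, result.stdout
    raise RuntimeError(f"neither lan nor lanplus session could be established with {host}")
```

If only `lanplus` succeeds, the BMC user probably needs IPMI v2.0 (RMCP+) on the Zabbix side as well.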
|
Hm, I found that if you change the Authentication algorithm from "none" to "RMCP+", everything works fine.
|
I had the same problem too (monitoring 5 hosts with around 10 items each), and items intermittently became unsupported every minute or so. Based on a suggestion from the forums [1], I changed the number of IPMI pollers to just one. Since then, there has been no problem getting IPMI values at all. This was on Zabbix 2.0.3 at the time, and it still works flawlessly on 2.0.5 with just one IPMI poller.
1. https://www.zabbix.com/forum/showpost.php?s=783bdc9aff7d3ea26999f74f4d223e59&p=118389&postcount=4
|
If an IPMI sensor is located at the end of the sensor table, getting its value can take about 40-50 seconds, and sometimes it fails with a network error:
|
I have had the same experience: reducing the number of IPMI pollers to just one stopped the frequent Network Error messages and gaps.
I have both a production server and a small test VM server. The production server was polling only a few IPMI items alongside a lot of non-IPMI monitoring, and the test server was polling only a few IPMI items and doing no other monitoring. Both servers showed frequent Network Errors and gaps, with the items then going unsupported. Last week I reduced the number of IPMI pollers to just one, and I have not seen a Network Error warning since. Neither server showed all of its IPMI pollers as busy.
This is with Zabbix 2.2.2.
Suspecting that this might have something to do with two different IPMI pollers querying the same device at the same time, I did a simple test: from two different hosts, I issued an ipmitool sensor command to the same IPMI device. The IPMI device appears to send its output to only one requester at a time. One ipmitool output starts scrolling while the other pauses for a few seconds, then the second one starts scrolling and the first pauses, and this goes back and forth a few times until both are complete.
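The one-poller workaround effectively serializes all requests to each BMC. The sketch below shows the same idea with one lock per BMC address, so only one request reaches a given device at a time while different devices can still be polled in parallel. This is an illustration of the reasoning above, not how Zabbix pollers are implemented. The names `bmc_lock` and `poll` are hypothetical, and `query` stands in for whatever actually talks to the BMC.

```python
import threading
from collections import defaultdict

# Hypothetical per-BMC lock table: one lock per IPMI address.
_host_locks = defaultdict(threading.Lock)
_table_lock = threading.Lock()


def bmc_lock(host):
    """Return the lock for this BMC, creating it under a guard so all threads share one Lock."""
    with _table_lock:
        return _host_locks[host]


def poll(host, query):
    """Run query(host) while holding that host's lock, so each BMC sees one request at a time."""
    with bmc_lock(host):
        return query(host)
```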
|
Same here with 2.2.6.
I found that I had an item with an invalid sensor ID. It seems that if there is a problem with any single item, further processing just breaks.
I disabled all items that did not return a value, and the problem disappeared.
Note that this happens even if the sensor is listed and basically available but simply doesn't provide a value (e.g. I have sensor FAN4 but no fan connected to it)!
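To find candidates for disabling, you can list the sensors whose reading column is `na` in `ipmitool sensor` output. The sketch assumes the usual pipe-separated layout, where the first column is the name and the second is the reading. Column contents can differ between BMCs. The function name `sensors_without_value` is hypothetical.

```python
def sensors_without_value(sensor_output):
    """Return sensor names whose reading column is 'na' in `ipmitool sensor` output."""
    missing = []
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or non-table lines
        name, reading = fields[0], fields[1]
        if reading.lower() == "na":
            missing.append(name)
    return missing
```

With a FAN4 header that has no fan attached, FAN4 should show up in the returned list. Any Zabbix item using that sensor is then a candidate for disabling.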
|
This issue still exists in 2.4.5, and I can confirm that if you disable unavailable sensors, it works without problems.
It looks like unavailable sensors are handled incorrectly.
|
Disabling unavailable sensors did not help in my case (zabbix_server v2.4.5): the sensors that had been receiving data successfully still had issues.
However, when I set StartIPMIPollers to 1, the issue disappeared.
|
Similar behaviour on my Supermicro board, monitored using the latest Zabbix server and agent from the Zabbix Debian repository (2.4.7-1+jessie).
I was actually able to fix the issue by lowering the update interval of one IPMI item from 300s to 60s (all other IPMI items were kept at their 300s interval). If I set this IPMI item to 90s, the issue appears again.
It could be an IPMI session handling issue.
|
Same here with DELL PowerEdge R510.
It seems to be a known issue with OpenIPMI, see https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/ipmi :
"IPMI session inactivity timeout for LAN is 60 +/-3 seconds. [...] then the next IPMI check after the timeout expires will time out due to individual message timeouts, retries or receive error."
Reducing the check interval to 45 seconds fixed the problem for me.
Naturally, the issue appears more frequently in testing environments, e.g. when checking only one item on a server.
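The reports above fit the 60 +/- 3 s timeout. On one host, a single item at 60s was enough to keep the 300s items working, 90s was not, and 45s fixed it on another host. The sketch below classifies a host by its shortest item interval. It assumes that any check on a host resets that host's session inactivity timer, which is what the 60s/300s report suggests but is not confirmed here. The function `session_risk` and its labels are hypothetical.

```python
# From the Zabbix documentation quoted above: LAN session inactivity timeout is 60 +/- 3 s.
SESSION_TIMEOUT = 60
TOLERANCE = 3


def session_risk(item_intervals):
    """Classify whether the shortest IPMI item interval (seconds) on a host keeps the session alive.

    Assumes any check on the host resets the host's session inactivity timer.
    """
    shortest = min(item_intervals)
    if shortest < SESSION_TIMEOUT - TOLERANCE:
        return "kept alive"
    if shortest <= SESSION_TIMEOUT + TOLERANCE:
        return "marginal"
    return "expires between checks"
```

For example, `[45]` should give "kept alive", `[60, 300, 300]` should give "marginal" (it worked in the report above but is within the tolerance), and `[90, 300]` should give "expires between checks".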
|
Already fixed under ZBXNEXT-3386.
|
Generated at Sun May 25 08:58:17 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.