-
Incident report
-
Resolution: Won't fix
-
Major
-
None
-
1.8.8
-
RHEL 5.5, x86-64 Linux
I am monitoring the Zabbix sub-processes in a graph, thankfully, or I would never have known what was sticking. I noticed that items which should have been in maintenance were paging, and never even showed the orange link color on the dashboard. Careful inspection, and even re-creating the maintenance period, showed that the problem persisted.
I looked at the subprocess graph and discovered that the timer process had gone into some kind of loop and was pegged at 0% idle (or 100% usage, depending on how you look at it). I tried killing the timer in the hope that Zabbix would spawn a new one, but of course that shut down the whole operation, so I did a restart.
I have included the graph so you can see what happened. The closest events of any interest around 9:17 AM are these:
31516:20111024:091729.430 Item [REDACTED.com:vip.bytesIn_perConn.443] became not supported: Division by zero. Cannot evaluate expression [464/0]
31505:20111024:091733.207 Item [REDACTED.com:vip.bytesOut_perConn.443] became not supported: Division by zero. Cannot evaluate expression [5353/0]
These items are based on SNMP queries against an F5 big iron, but each one does a calculation against another number also harvested from the same device. It seems that Zabbix correctly deflected these and marked the items as not supported, so this may be a mere coincidence. There are no triggers associated with the above items.
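For reference, here is a rough Python sketch of the kind of per-connection calculation involved. This is not how the items are actually defined in Zabbix; the function and argument names are made up for illustration, and it only shows the zero-connection case that produces the log lines above.

```python
def per_connection(total_bytes, connections):
    """Return bytes per connection, or None when there are no connections.

    Hypothetical helper: mirrors the idea of marking the value as
    unavailable instead of dividing by zero.
    """
    if connections == 0:
        # Nothing to divide by; treat the value as unavailable.
        return None
    return total_bytes / connections


if __name__ == "__main__":
    print(per_connection(464, 0))   # the [464/0] case from the log
    print(per_connection(100, 4))
```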
We have never seen this bug before, so I presume it may have something to do with the optimizations included in 1.8.8. We noticed that our timer process averaged around 40% idle with 1.8.5 and now it averages 60% idle. We've definitely seen improvement, but we need our maintenance windows to work.
I'll be adding a trigger to catch this 0% idle condition on the timer process for the time being.
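The check I have in mind looks roughly like the Python sketch below. It is only an illustration of the logic, not a Zabbix trigger expression; the function name, the 1% threshold, and the five-sample window are my own assumptions.

```python
def timer_is_stuck(idle_samples, threshold=1.0, min_samples=5):
    """Return True when the newest min_samples idle readings (in percent)
    are all at or below threshold.

    Hypothetical helper: idle_samples is ordered oldest to newest.
    """
    if len(idle_samples) < min_samples:
        # Not enough history to call it a sustained condition.
        return False
    recent = idle_samples[-min_samples:]
    return all(sample <= threshold for sample in recent)


if __name__ == "__main__":
    # Idle percentage drops to 0 and stays there.
    print(timer_is_stuck([60, 58, 0, 0, 0, 0, 0]))
    # A single recovery inside the window means it is not stuck.
    print(timer_is_stuck([60, 0, 0, 0, 55]))
```

Requiring several consecutive low readings should keep a single busy sample from paging anyone.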