[ZBX-4268] Timer process gets stuck with 100% utilization Created: 2011 Oct 24 Updated: 2017 May 30 Resolved: 2012 Oct 16 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 1.8.8 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Major |
Reporter: | Aaron Mildenstein | Assignee: | Unassigned |
Resolution: | Won't fix | Votes: | 0 |
Labels: | maintenance, timer, triggers | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
RHEL 5.5, x86-64 Linux |
Attachments: |
Description |
I am monitoring the Zabbix sub-processes in a graph, thankfully, or I would never have known what was sticking. I noticed that items which should have been in maintenance were paging, and never even showed the orange link color on the dashboard. Careful inspection, and even re-creating the maintenance period, showed that the problem persisted. I looked at the subprocess graph and discovered that the timer process had gone into some loop and pegged at 0% idle (or 100% usage, depending on how you look at it). I tried killing the timer in hopes Zabbix would spawn a new one, but of course it shut down the whole operation and I did a restart. I have included the graph so you can see what happened.

The closest events of any interest around 9:17 AM are these:

31516:20111024:091729.430 Item [REDACTED.com:vip.bytesIn_perConn.443] became not supported: Division by zero. Cannot evaluate expression [464/0]

These items are based on SNMP queries against an F5 big iron, with a calculation against another number also harvested from it. It seems that Zabbix correctly deflected these and marked the items as not supported, so this may be a mere coincidence. There are no triggers associated with the above items.

We have never seen this bug before, so I presume it may have something to do with the optimizations included in 1.8.8. We noticed that our timer process averaged around 40% idle with 1.8.5 and now it averages 60% idle. We've definitely seen improvement, but we need our maintenance windows to work. I'll be adding a trigger to catch this 0% idle condition on the timer process for the time being. |
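The check behind such a trigger can be sketched as follows. This is only an illustration of the logic, not a Zabbix trigger expression: the function name, the 1% threshold and the five-sample window are all assumptions chosen for the example.

```python
def timer_stuck(idle_samples, threshold=1.0, min_consecutive=5):
    """Return True if the newest samples all sit at or below the idle threshold.

    idle_samples    -- idle percentages, oldest first (hypothetical input)
    threshold       -- idle % treated as "pegged" (assumed value)
    min_consecutive -- how many of the latest samples must be pegged (assumed value)
    """
    if len(idle_samples) < min_consecutive:
        return False  # not enough history to decide
    recent = idle_samples[-min_consecutive:]
    return all(sample <= threshold for sample in recent)
```

Requiring several consecutive samples avoids alerting on a single busy interval, at the cost of a slower reaction.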
Comments |
Comment by Aleksandrs Saveljevs [ 2011 Oct 25 ] |
The next time the timer process is 100% busy, it would probably be useful to strace it to see whether it has just hung at some operation or is doing a loop of some sort. The PID of the timer process can be found in the log file. For instance, in " 15195:20111019:111342.990 server #20 started timer #1" the number "15195" is the required PID. The command to run strace is then something like "strace -tt -T -s 128 -p 15195". |
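The lookup described above can be sketched in a few lines. This assumes the log layout matches the example line quoted in this comment (optionally with square brackets around the process name); the function names are made up for illustration, and the strace flags are the ones given above.

```python
import re

# Assumed layout: "<pid>:<YYYYMMDD>:<HHMMSS.mmm> server #<n> started <name> #<m>"
STARTED_RE = re.compile(
    r"^\s*(\d+):\d{8}:\d{6}\.\d{3} server #\d+ started \[?(.+?) #(\d+)\]?\s*$"
)

def process_pid(log_lines, name="timer"):
    """Return the PID from the most recent 'started <name>' line, or None."""
    pid = None
    for line in log_lines:
        match = STARTED_RE.match(line)
        if match and match.group(2) == name:
            pid = int(match.group(1))  # later lines override earlier ones
    return pid

def strace_command(pid):
    """Build the strace invocation suggested in the comment above."""
    return "strace -tt -T -s 128 -p %d" % pid
```

For the example line, `process_pid` yields 15195 and `strace_command(15195)` yields the command quoted above.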
Comment by richlv [ 2011 Oct 25 ] |
Just a note that even if the startup messages in the log file have been rotated away (with the default setting of a 1 MB log file...), you should be able to strace the busiest Zabbix process. |
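One way to pick the busiest process is sketched below. It assumes the process command contains "zabbix_server" and that the input is the text produced by `ps -eo pid,pcpu,args`; both are assumptions for this example. Note that the CPU figure reported by `ps` is averaged over the lifetime of the process, so a process that only recently started spinning may be easier to spot in `top`.

```python
import subprocess

def busiest_process(ps_output, name="zabbix_server"):
    """Return (pid, cpu_percent) for the busiest matching process, or None."""
    best = None
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, %cpu, full command line
        if len(parts) < 3 or name not in parts[2]:
            continue
        try:
            pid, cpu = int(parts[0]), float(parts[1])
        except ValueError:
            continue                         # ignore rows that do not parse
        if best is None or cpu > best[1]:
            best = (pid, cpu)
    return best

def read_ps():
    """Collect live ps output (Linux procps column names assumed)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `busiest_process(read_ps())` on the server then gives a PID that can be passed to strace.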
Comment by Max M [ 2011 Nov 07 ] |
I work with Aaron. I've attached the strace output for the timer process that was running away. |
Comment by Aleksandrs Saveljevs [ 2011 Nov 08 ] |
Apparently the timer process loops indefinitely when trying to fetch data from Oracle. The database server responds with "no data found" and "fetch out of sequence" errors. |
Comment by Aleksandrs Saveljevs [ 2011 Nov 08 ] |
We had a similar problem with looping when doing a fetch from the database in an earlier issue. In that issue we added processing of a couple of Oracle error codes that we know to cause such looping. Apparently, there might be other codes, too, and ORA-01002 ("fetch out of sequence") might be one of them. We have added ORA-01002 to the list of known error codes in the development branch svn://svn.zabbix.com/branches/dev/ZBX-4268 . Would it be possible for you to run this patched version (or patch the Zabbix 1.8.8 that you are running with the diff from revision 23055) and see whether the problem occurs again? |
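The idea of a list of known error codes can be illustrated with a minimal sketch. This is not the code in the Zabbix source tree: `OraError`, `fetch_all`, the retry limit and the contents of `FATAL_ORA_CODES` beyond ORA-01002 are all hypothetical, chosen only to show how a fetch loop stops instead of spinning.

```python
class OraError(Exception):
    """Hypothetical stand-in for an Oracle error carrying its numeric code."""
    def __init__(self, code):
        super().__init__("ORA-%05d" % code)
        self.code = code

# Codes treated as "stop fetching"; only ORA-01002 is taken from this issue.
FATAL_ORA_CODES = frozenset({1002})

def fetch_all(fetch_row, fatal_codes=FATAL_ORA_CODES, max_retries=3):
    """Drain a cursor; return (rows, fatal_code_or_None).

    fetch_row returns a row, returns None at end of data, or raises OraError.
    """
    rows, retries = [], 0
    while True:
        try:
            row = fetch_row()
        except OraError as err:
            if err.code in fatal_codes:
                return rows, err.code  # known looping error: give up cleanly
            retries += 1
            if retries > max_retries:
                raise                  # unknown error: bounded retry, then fail
            continue
        if row is None:
            return rows, None          # normal end of the result set
        rows.append(row)
```

Without the check against `fatal_codes` (and without the retry bound), a cursor that keeps answering with the same error would keep this loop busy forever, which matches the 100% busy symptom reported here.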
Comment by Aaron Mildenstein [ 2011 Nov 08 ] |
I've recompiled using the version of timer.c from svn revision 23058. It's running for a short time in our certification environment. If all is well, I'll push it to our production Zabbix server soon. |
Comment by Aleksandrs Saveljevs [ 2011 Nov 09 ] |
Oh, no, you do not need the new version of timer.c. You have to patch file src/libs/zbxdb/db.c by applying the diff of "svn di -c 23055 svn://svn.zabbix.com/branches/dev/ZBX-4268". I am attaching it here for convenience: see ora-01002.diff. |
Comment by Aleksandrs Saveljevs [ 2011 Nov 09 ] |
|
Comment by Aaron Mildenstein [ 2011 Nov 09 ] |
Thanks for the patch. I have applied it and am waiting to install it on our production boxes again. |
Comment by Aaron Mildenstein [ 2011 Nov 21 ] |
This issue is not fixed. The timer process still hits 100% busy, but seems to recover. However, I see messages like these right before the timer goes critical:

11942:20111121:114618.494 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690513,1,2,428,1321897578,1)]

I did not have this problem with the unpatched version of 1.8.8. |
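To see how often such failures occur and which process logs them, the messages can be pulled out of the server log with a short script. This is a sketch that assumes every "query failed" line has the same layout as the one quoted above; the function and field names are invented for the example.

```python
import re
from collections import Counter

# Assumed layout: "<pid>:<date>:<time> [Z3005] query failed: [<n>] ORA-NNNNN: <message> [<sql>]"
QUERY_FAILED_RE = re.compile(
    r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3}) \[Z3005\] query failed: "
    r"\[-?\d+\] (ORA-\d{5}): (.*?) \[(.*)\]\s*$"
)

def parse_query_failure(line):
    """Return a dict describing one 'query failed' line, or None if it does not match."""
    match = QUERY_FAILED_RE.match(line)
    if not match:
        return None
    pid, date, time, code, message, sql = match.groups()
    return {"pid": int(pid), "date": date, "time": time,
            "ora_code": code, "message": message, "sql": sql}

def count_failures(log_lines):
    """Count failures per (pid, ORA code) pair."""
    counts = Counter()
    for line in log_lines:
        info = parse_query_failure(line)
        if info:
            counts[(info["pid"], info["ora_code"])] += 1
    return counts
```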
Comment by Aaron Mildenstein [ 2011 Nov 21 ] |
I have also noticed this new message, which I have not previously seen: 12148:20111121:115317.677 unsupported condition type [14] for condition id [0] |
Comment by Aleksandrs Saveljevs [ 2011 Nov 22 ] |
Thanks for the update! |
Comment by Alexei Vladishev [ 2012 Jul 28 ] |
Is there any news regarding this issue? |
Comment by Alexander Vladishev [ 2012 Oct 16 ] |
I'm closing the issue. Please reopen it if the problem persists. |