[ZBX-4268] Timer process gets stuck with 100% utilization Created: 2011 Oct 24  Updated: 2017 May 30  Resolved: 2012 Oct 16

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.8
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Aaron Mildenstein Assignee: Unassigned
Resolution: Won't fix Votes: 0
Labels: maintenance, timer, triggers
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

RHEL 5.5, x86-64 Linux


Attachments: PNG File ZabbixProcesses.png     File ora-01002.diff     File strace.out.gz    

 Description   

I am monitoring the zabbix sub-processes in a graph, thankfully, or I would never know what was sticking. I noticed that items which should have been in maintenance were paging, and never even showed the orange link color on the dashboard. Careful inspection, and even maintenance period re-creation revealed that the problem persisted.

I looked at the subprocess graph and discovered that the timer process had gone into some loop and pegged at 0% idle (or 100% usage, depending on how you look at it). I tried killing the timer in hopes Zabbix would spawn a new one, but of course it shut down the whole operation and I did a restart.

I have included the graph so you can see what happened. The closest events of any interest in that 9:17AM area are these:

31516:20111024:091729.430 Item [REDACTED.com:vip.bytesIn_perConn.443] became not supported: Division by zero. Cannot evaluate expression [464/0]
31505:20111024:091733.207 Item [REDACTED.com:vip.bytesOut_perConn.443] became not supported: Division by zero. Cannot evaluate expression [5353/0]

These items are based on SNMP queries against an F5 big iron, but does a calculation against another number also harvested therefrom. It seems that Zabbix correctly deflected these and marked the items as not supported, so this may be a mere coincidence. There are no triggers associated with the above items.

We have never seen this bug before, so I presume it may have something to do with the optimizations included in 1.8.8. We noticed that our timer process averaged around 40% idle with 1.8.5 and now it averages 60% idle. We've definitely seen improvement, but we need our maintenance windows to work.

I'll be adding a trigger to catch this 0% idle condition on the timer process for the time being.



 Comments   
Comment by Aleksandrs Saveljevs [ 2011 Oct 25 ]

The next time timer process is 100% busy, it would probably be useful to strace it to see whether it has just hanged at some operation or is doing a loop of some sort. The PID of the timer process can be found in the log file. For instance, in " 15195:20111019:111342.990 server #20 started timer #1" number "15195" is the required PID file. The command to run strace is then something like "strace -tt -T -s 128 -p 15195".

Comment by richlv [ 2011 Oct 25 ]

just a note that even if the startup messages in the logfile have been rotated away (with the default setting of 1mb logfile...), you should be able to strace the busiest zabbix process

Comment by Max M [ 2011 Nov 07 ]

I work with Aaron, I've attached the output of strace for the timer process that was running away here.

Comment by Aleksandrs Saveljevs [ 2011 Nov 08 ]

Apparently the timer process loops indefinitely when trying to fetch data from Oracle. The database server responds with "no data found" and "fetch out of sequence" errors.

Comment by Aleksandrs Saveljevs [ 2011 Nov 08 ]

We had a similar problem with looping when doing a fetch from the database in ZBX-3480. The problem occurred when Oracle database server was unavailable for some reason.

In that issue we added processing of a couple of Oracle error codes that we know to cause such looping. Apparently, there might be other codes, too, and ORA-01002 ("fetch out of sequence") might be one of them.

We have added ORA-01002 to the list of known error codes in development branch svn://svn.zabbix.com/branches/dev/ZBX-4268 . Would it be possible for you to run this patched version (or patch Zabbix 1.8.8 that you are running with the diff from revision 23055) and see whether the problem occurs again?

Comment by Aaron Mildenstein [ 2011 Nov 08 ]

I've recompiled using the version of timer.c from svn revision 23058. It's running for a short time in our certification environment. If all is well, I'll push it to our production Zabbix server soon.

Comment by Aleksandrs Saveljevs [ 2011 Nov 09 ]

Oh, no, you do not need the new version of timer.c. You have to patch file src/libs/zbxdb/db.c by applying the diff of "svn di -c 23055 svn://svn.zabbix.com/branches/dev/ZBX-4268". I am attaching it here for convenience: see ora-01002.diff.

Comment by Aleksandrs Saveljevs [ 2011 Nov 09 ]

ZBX-4335 looks similar.

Comment by Aaron Mildenstein [ 2011 Nov 09 ]

Thanks for the patch. I have applied it and am waiting to install it on our production boxes again.

Comment by Aaron Mildenstein [ 2011 Nov 21 ]

This issue is not fixed. The timer process still hits 100% busy, but seems to recover. However, I see messages like these right before the timer goes critical:

11942:20111121:114618.494 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690513,1,2,428,1321897578,1)]
11942:20111121:114618.878 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690514,1,1,428,1321897568,1)]
11849:20111121:114619.265 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690515,0,0,23922,1321897330,0)]
11849:20111121:114620.199 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690516,0,0,23964,1321897330,0)]
11849:20111121:114621.205 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690517,0,0,20974,1321897330,0)]
11942:20111121:114629.149 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690516,1,2,429,1321897589,1)]
11942:20111121:114629.598 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690517,1,1,429,1321897579,1)]
11849:20111121:114629.685 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690518,0,0,15280,1321897330,0)]
11849:20111121:114630.544 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690519,0,0,19457,1321897330,0)]
11849:20111121:114631.397 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690520,0,0,15415,1321897330,0)]
11849:20111121:114637.253 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690519,0,0,25254,1321897330,0)]
11849:20111121:114638.057 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690520,0,0,25255,1321897330,0)]
11849:20111121:114638.950 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690521,0,0,25257,1321897330,0)]
11849:20111121:114639.466 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690522,0,0,29184,1321897330,0)]
11849:20111121:114639.945 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690523,0,0,25258,1321897330,0)]
11849:20111121:114643.093 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690520,0,0,16310,1321897330,0)]
11944:20111121:114643.591 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690521,1,2,613,1321897603,1)]
11944:20111121:114643.987 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690522,1,1,613,1321897593,1)]
11849:20111121:114646.948 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690523,0,0,15493,1321897330,0)]
11849:20111121:114647.858 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690524,0,0,16354,1321897330,0)]
11729:20111121:114648.640 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690524,2,3,190,1321897608,1)]
11944:20111121:114654.291 [Z3005] query failed: [-1] ORA-00001: unique constraint (ZABBIX_DBUSER.SYS_C003455) violated [insert into events (eventid,source,object,objectid,clock,value) values (678690525,1,2,614,1321897614,1)]

I did not have this problem with the unpatched version of 1.8.8.

Comment by Aaron Mildenstein [ 2011 Nov 21 ]

I have also noticed this new message, which I have not previously seen:

12148:20111121:115317.677 unsupported condition type [14] for condition id [0]
12148:20111121:115317.739 unsupported condition type [14] for condition id [0]
12148:20111121:115317.759 unsupported condition type [14] for condition id [0]
12148:20111121:115317.796 unsupported condition type [14] for condition id [0]
12148:20111121:115317.818 unsupported condition type [14] for condition id [0]
12148:20111121:115317.840 unsupported condition type [14] for condition id [0]
12148:20111121:115317.861 unsupported condition type [14] for condition id [0]
12148:20111121:115317.883 unsupported condition type [14] for condition id [0]
12148:20111121:115317.903 unsupported condition type [14] for condition id [0]
12148:20111121:115318.064 unsupported condition type [14] for condition id [0]
12148:20111121:115318.096 unsupported condition type [14] for condition id [0]
12148:20111121:115318.115 unsupported condition type [14] for condition id [0]
12148:20111121:115318.134 unsupported condition type [14] for condition id [0]
12148:20111121:115318.175 unsupported condition type [14] for condition id [0]
12148:20111121:115318.195 unsupported condition type [14] for condition id [0]

Comment by Aleksandrs Saveljevs [ 2011 Nov 22 ]

Thanks for the update!

Comment by Alexei Vladishev [ 2012 Jul 28 ]

Is there any news regarding this issue?

Comment by Alexander Vladishev [ 2012 Oct 16 ]

I'm closing the issue. Please reopen if the problem remained.

Generated at Sun Apr 06 15:28:15 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.