[ZBX-4661] server crash when Oracle database is not available Created: 2012 Feb 15  Updated: 2017 May 30  Resolved: 2016 Nov 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.8
Fix Version/s: 2.0.20rc1, 2.2.16rc1, 3.0.6rc1, 3.2.2rc1, 3.4.0alpha1

Type: Incident report Priority: Blocker
Reporter: Oleksii Zagorskyi Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: crash, oracle, webmonitoring
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

RHEL 5.4 + Oracle11GR2 + Zabbix 1.8.8


Attachments: zabbix_server.log
Issue Links:
Duplicate
is duplicated by ZBX-6726 Zabbix 2.0.6 server crash Closed
is duplicated by ZBX-4644 server can not survive some oracle-do... Closed

 Description   

Early today, when the database went down for its backup, all zabbix_server processes went down too.
The database goes down daily for about 2 minutes to take a snapshot, and today is the first time this has happened.

zabbix_server.log attached
zabbix_server.conf is almost default (vanilla)

Zabbix_server: bb05b
Zabbix DB: bx05d

The problem seems to be in the web checks; see zabbix_server.log and PID 1608.



 Comments   
Comment by Oleksii Zagorskyi [ 2012 Feb 15 ]

ZBX-4644 is very similar but seems to be a different issue.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 04 ]

This crash may still be present in the current trunk. Imagine we lose the connection to the database during one of the "inner" DBselect() calls in process_httptests():

int	process_httptests(int httppoller_num, int now)
{
	...
	result = DBselect(...);

	while (NULL != (row = DBfetch(result)))
	{
		/* very big and complicated loop with more DBselect()'s in it */
	}
	...
	DBfree_result(result);    /* <--- double free of the statement handle */
	...
}

We will DBclose() the connection and attempt to DBconnect() several times. In zbx_db_close() we free all handles. According to the Oracle documentation, when a parent handle is freed its child handles are freed automatically, including the statement handle associated with the "outer" DBselect().
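
The scenario above can be modelled without an Oracle client. The following is only a sketch, not Zabbix or OCI code: mock_handle_t, mock_handle_alloc() and mock_handle_free() are hypothetical names, and handles are table indexes so that a second free can be detected instead of corrupting memory. It assumes the parent/child behaviour described in the comment: freeing a parent releases every child allocated under it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 16

/* Hypothetical handle record; a real OCI handle is an opaque pointer. */
typedef struct {
    int in_use;   /* 1 while the handle is allocated */
    int parent;   /* index of the parent handle, -1 for the root (environment) */
} mock_handle_t;

static mock_handle_t mock_table[MAX_HANDLES];

/* Allocate a handle under 'parent'; returns its index or -1 if the table is full. */
int mock_handle_alloc(int parent)
{
    assert(parent >= -1 && parent < MAX_HANDLES);

    for (int i = 0; i < MAX_HANDLES; i++) {
        if (!mock_table[i].in_use) {
            mock_table[i].in_use = 1;
            mock_table[i].parent = parent;
            return i;
        }
    }
    return -1;
}

/* Free a handle and, implicitly, all of its children.
 * Returns 0 on success, -1 if the handle was already freed
 * (with real pointers this would be a double free). */
int mock_handle_free(int h)
{
    if (h < 0 || h >= MAX_HANDLES || !mock_table[h].in_use)
        return -1;

    for (int i = 0; i < MAX_HANDLES; i++) {
        if (mock_table[i].in_use && mock_table[i].parent == h)
            mock_handle_free(i);   /* children go away with the parent */
    }
    mock_table[h].in_use = 0;
    return 0;
}
```

In this model, allocating an environment handle, then an "outer" and an "inner" statement handle under it, and then freeing the environment leaves the caller of the outer select holding a stale index; its later mock_handle_free() call returns -1, which corresponds to the double free in process_httptests().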

Comment by Oleksii Zagorskyi [ 2016 Oct 13 ]

Another case, possibly related:
there is no indication of the OOM killer in syslog or dmesg.

The Zabbix server log contains many messages like:

 25027:20161012:224729.206 [Z3005] query failed: [-1] ORA-03113: end-of-file on communication channel
Process ID: 12833
Session ID: 801 Serial number: 16322 [update hosts set lastaccess=1476325922 where hostid=13852]
 25027:20161012:224729.206 slow query: 926.648444 sec, "update hosts set lastaccess=1476325922 where hostid=13852"

Here is a copy-paste of the log showing how the server stopped for an unknown reason:

 24719:20161012:224730.444 [Z3005] query failed: [-1] ORA-03113: end-of-file on communication channel
Process ID: 12841
Session ID: 524 Serial number: 57807 [select i.itemid,f.functionid,f.function,f.parameter,t.triggerid from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and i.itemid=f.itemid and f.triggerid=t.triggerid and h.status in (0,1) and t.flags<>2]
 24719:20161012:224730.444 slow query: 931.764048 sec, "select i.itemid,f.functionid,f.function,f.parameter,t.triggerid from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and i.itemid=f.itemid and f.triggerid=t.triggerid and h.status in (0,1) and t.flags<>2"
 24719:20161012:224730.558 [Z3006] fetch failed: [100] OCI_NODATA
 24719:20161012:224730.558 no records in table 'config'
 24716:20161012:224730.600 One child process died (PID:24719,exitcode/signal:9). Exiting ...
 24716:20161012:224734.441 syncing history data...
 24716:20161012:224736.709 syncing history data done
 24716:20161012:224736.709 syncing trends data...
 24716:20161012:224915.770 slow query: 7.964045 sec, "select distinct itemid from trends_uint where <trim>
 24716:20161012:224921.529 syncing trends data done
 24716:20161012:224921.531 Zabbix Server stopped. Zabbix 3.0.3 (revision 60173).

Comment by Vladislavs Sokurenko [ 2016 Nov 02 ]

(1) Incorrect deallocation order: currently the environment handle is not deallocated last, which could potentially cause problems. Note that this happens on every database close, not only when the database is unavailable.

From the documentation:

Terminating the Application

An OCI application should perform the following steps before it terminates:

Delete the user session by calling OCISessionEnd() for each session.

Delete access to the data sources by calling OCIServerDetach() for each source.

Explicitly deallocate all handles by calling OCIHandleFree() for each handle.

Delete the environment handle, which deallocates all other handles associated with it.

vso RESOLVED in r63497:r63502

wiper In the zbx_db_close() function, freeing results before the Oracle handles would seem more logical, although it doesn't change anything.
REOPENED
vso RESOLVED in r63903

wiper CLOSED

Comment by Vladislavs Sokurenko [ 2016 Nov 02 ]

(2) When doing selects, handles are opened for each query, but the order of deallocation is not guaranteed. On a database connection failure the environment handle may be deleted before its child handles.
We must ensure that children are always explicitly freed before the parent, so that no dangling pointer is freed after the resource has already been deallocated automatically along with the parent.
vso RESOLVED in r63497:r63502

wiper As the order of results is not important, it would be better to use zbx_vector_ptr_remove_noorder() instead of zbx_vector_ptr_remove().
I'm also inclined to put the results vector in the zbx_oracle_db_handle_t structure (and maybe rename it to db_results).
REOPENED
vso RESOLVED in r63903

wiper CLOSED
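
The difference between the two removal functions mentioned in (2) can be sketched as follows. This is not the Zabbix implementation: ptr_vector_t and the ptr_vector_*() functions are hypothetical stand-ins, and the sketch assumes that the "noorder" variant fills the vacated slot with the last element, which is the usual way to get constant-time removal when element order does not matter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable pointer vector, standing in for the results vector. */
typedef struct {
    void   **values;
    size_t count;
    size_t capacity;
} ptr_vector_t;

/* Append a pointer, growing the array when needed; returns 0 on success, -1 on allocation failure. */
int ptr_vector_append(ptr_vector_t *v, void *value)
{
    if (v->count == v->capacity) {
        size_t new_capacity = (0 == v->capacity ? 8 : v->capacity * 2);
        void   **tmp = realloc(v->values, new_capacity * sizeof(void *));

        if (NULL == tmp)
            return -1;
        v->values = tmp;
        v->capacity = new_capacity;
    }
    v->values[v->count++] = value;
    return 0;
}

/* O(1) removal: move the last element into the vacated slot. Order is not preserved. */
void ptr_vector_remove_noorder(ptr_vector_t *v, size_t index)
{
    assert(index < v->count);
    v->values[index] = v->values[--v->count];
}

/* O(n) removal: shift the tail down by one slot. Order is preserved. */
void ptr_vector_remove(ptr_vector_t *v, size_t index)
{
    assert(index < v->count);
    for (size_t i = index + 1; i < v->count; i++)
        v->values[i - 1] = v->values[i];
    v->count--;
}
```

For a registry of open results that is only ever walked to free everything on connection close, the order of entries carries no meaning, so the cheaper removal is sufficient.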

Comment by Vladislavs Sokurenko [ 2016 Nov 03 ]

Fixed in:
svn://svn.zabbix.com/branches/dev/ZBX-4661

Note:
No logic was changed in how Oracle selects and fetches data when the database is down. Everything should work the same after the fix, except that the server will no longer crash from freeing a dangling pointer, because we ensure that all handles are freed and set to NULL before the parent is deallocated.

Comment by Andris Zeila [ 2016 Nov 21 ]

(3) When closing the database connection it would be better to free all results (OCI_DBfree_result), not only the OCIStmt handles. This will require setting stmthp to NULL after freeing it in the OCI_DBfree_result() function.

vso RESOLVED in r63903

wiper CLOSED
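
The point of setting stmthp to NULL in (3) is that the free function then becomes safe to call more than once: first from the connection close path, and later from the caller that still holds the result. A minimal sketch of that pattern, using a hypothetical mock_result_t in place of the real result structure and plain malloc()/free() in place of OCI handle allocation:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result type; 'stmt' stands in for the statement handle (stmthp). */
typedef struct {
    void *stmt;
    int  free_calls;   /* how many times the underlying handle was really released */
} mock_result_t;

/* Release the statement handle once; later calls are harmless no-ops. */
void mock_result_free_stmt(mock_result_t *r)
{
    assert(NULL != r);

    if (NULL == r->stmt)
        return;        /* already released, e.g. when the connection was closed */

    free(r->stmt);
    r->stmt = NULL;    /* no dangling pointer is left behind */
    r->free_calls++;
}
```

Without the assignment to NULL, the second call would pass a dangling pointer to free(), which is the crash this issue describes.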

Comment by Andris Zeila [ 2016 Nov 21 ]

(4) Not related to this development, but we could fix it while we are already fixing Oracle-related code: in the zbx_db_fetch() function, column processing can be skipped if OCIStmtFetch2() didn't return OCI_SUCCESS.

vso RESOLVED in r63910

wiper CLOSED
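
The early return suggested in (4) can be sketched as below. This is an illustration only: mock_cursor_t, mock_fetch() and mock_db_fetch() are hypothetical, and the status values mirror the OCI codes seen in this issue (0 for success, 100 for no data, -1 for error) rather than being taken from the OCI headers.

```c
#include <assert.h>
#include <stddef.h>

/* Status codes modelled on OCI_SUCCESS, OCI_NO_DATA and OCI_ERROR. */
typedef enum {
    FETCH_SUCCESS = 0,
    FETCH_NO_DATA = 100,
    FETCH_ERROR   = -1
} fetch_status_t;

/* Hypothetical cursor: a negative rows_left simulates a lost connection. */
typedef struct {
    int rows_left;
    int columns_processed;   /* counts how often column processing actually ran */
} mock_cursor_t;

/* Stand-in for the low-level fetch call. */
static fetch_status_t mock_fetch(mock_cursor_t *c)
{
    if (c->rows_left < 0)
        return FETCH_ERROR;
    if (0 == c->rows_left)
        return FETCH_NO_DATA;
    c->rows_left--;
    return FETCH_SUCCESS;
}

/* Returns 1 when a row is available, 0 otherwise. */
int mock_db_fetch(mock_cursor_t *c)
{
    assert(NULL != c);

    if (FETCH_SUCCESS != mock_fetch(c))
        return 0;             /* no row or error: skip column processing entirely */

    c->columns_processed++;   /* placeholder for per-column work (LOB reads, conversions) */
    return 1;
}
```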

Comment by Andris Zeila [ 2016 Nov 23 ]

Successfully tested

Comment by Vladislavs Sokurenko [ 2016 Nov 23 ]

Fixed conflicts in development branch:
svn://svn.zabbix.com/branches/dev/ZBX-4661-3.0

wiper Looks good

Comment by Vladislavs Sokurenko [ 2016 Nov 25 ]

Fixed in:
2.0.20rc1 r63955
2.2.16rc1 r63956
3.0.6rc1 r64050
3.2.2rc1 r64051
3.3.0 (trunk) r64052
