[ZBX-8949] Possible deadlock on ids table on "housekeeper" row Created: 2014 Oct 24  Updated: 2017 May 30  Resolved: 2015 Jun 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Frontend (F), Server (S)
Affects Version/s: 2.2.7rc1, 2.4.1
Fix Version/s: 2.2.10rc1, 2.4.6rc1, 2.5.0

Type: Incident report Priority: Blocker
Reporter: Alexey Pustovalov Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: deadlock, housekeeper
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate

 Description   

This can happen because the Zabbix frontend and the server can update the same ids table row at the same time:

31628:20141020:094957.796 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [update ids set nextid=nextid+14 where nodeid=0 and table_name='housekeeper' and field_name='housekeeperid']
zabbix_server [31628]: ERROR [file:db.c,line:999] Something impossible has just happened.

 31616:20141020:094930.468 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [update ids set nextid=nextid+14 where nodeid=0 and table_name='housekeeper' and field_name='housekeeperid']
zabbix_server [31616]: ERROR [file:db.c,line:999] Something impossible has just happened.
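As a side note, on MySQL/InnoDB the two transactions involved in the most recent deadlock (their statements and the locks they held or waited for) can be inspected directly on the database side. A minimal sketch using standard MySQL statements; the innodb_print_all_deadlocks option is only available in newer MySQL/MariaDB versions:

 -- print the "LATEST DETECTED DEADLOCK" section, showing both transactions,
 -- the statements they were executing and the locks they held or waited for
 SHOW ENGINE INNODB STATUS;

 -- optionally record every deadlock in the MySQL error log instead of
 -- keeping only the latest one
 SET GLOBAL innodb_print_all_deadlocks = ON;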


 Comments   
Comment by Andris Zeila [ 2015 May 13 ]

During item removal we delete from the screens_items table (and also from profiles) using non-indexed fields in the WHERE clause. With MySQL this results in all table records being locked, which can easily lead to deadlocks.

To avoid this we should first select the corresponding identifiers (screenitemid, profileid) and perform the SQL delete based on those identifiers.
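In SQL terms the proposed change amounts to something like the sketch below (simplified; the real queries are built in the server code and the exact conditions may differ, the identifier values are taken from the logs further down):

 -- before: non-indexed WHERE clause, so InnoDB scans and locks
 -- effectively all rows of screens_items
 DELETE FROM screens_items WHERE resourcetype IN (3,1) AND resourceid=23662;

 -- after: first read the matching primary keys ...
 SELECT screenitemid FROM screens_items WHERE resourcetype IN (3,1) AND resourceid=23662;
 -- ... then delete by primary key, locking only the affected rows
 DELETE FROM screens_items WHERE screenitemid=76;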

Comment by Andris Zeila [ 2015 May 18 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8949

Comment by Andris Zeila [ 2015 Jun 04 ]

Backported fixes to 2.2 branch (svn://svn.zabbix.com/branches/dev/ZBX-8949_2.2)

Comment by dimir [ 2015 Jun 04 ]

Here is the scenario wiper proposed (lld).

process1: deletes item1
- item1 is in graph1
- graph1 is not on any screen
- item1 is on screen1 as a simple graph

process2: deletes item2
- item2 is in graph2
- graph2 is on screen2

What actually happens and in which order:

- process1: delete from screens_items by graphid - NOTHING TO DO
- process1: update ids ("housekeeper") - ids LOCKED
- process2: delete from screens_items by graphid - screens_items LOCKED
- process1: delete from screens_items by itemid - WAIT ON screens_items LOCK
- process2: update ids ("housekeeper") - WAIT ON ids LOCK (deadlock)
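The same interleaving can be reproduced by hand in two MySQL sessions; a sketch using the statements and example identifiers from the logs below:

 -- session 1 (process1)
 BEGIN;
 UPDATE ids SET nextid=nextid+7
   WHERE table_name='housekeeper' AND field_name='housekeeperid';   -- ids row locked

 -- session 2 (process2)
 BEGIN;
 DELETE FROM screens_items WHERE resourcetype=0 AND resourceid=547; -- non-indexed scan locks screens_items

 -- session 1
 DELETE FROM screens_items
   WHERE resourcetype IN (3,1) AND resourceid=23662;                -- waits for session 2

 -- session 2
 UPDATE ids SET nextid=nextid+7
   WHERE table_name='housekeeper' AND field_name='housekeeperid';   -- waits for session 1: InnoDB detects
                                                                    -- the cycle and rolls one transaction back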

To arrange that ordering I added code to remove the items from 2 different poller processes, each removing one item, and also added some sleep() calls to ensure the needed order. This is what could be seen in the server log before the fix:

 26549:20150604:124320.090 server #3 started [poller #1]

 26550:20150604:124320.098 server #4 started [poller #2]
 [sleep 2]

 26549:20150604:124320.093 query [txnlev:1] [begin;]
 26549:20150604:124320.095 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
 [sleep 4]

 26550:20150604:124322.114 query [txnlev:1] [begin;]
 26550:20150604:124322.116 query [txnlev:1] [delete from screens_items where resourcetype=0 and resourceid=547;
 26550:20150604:124322.117 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']

 26549:20150604:124324.096 query [txnlev:1] [delete from screens_items where resourcetype in (3,1) and resourceid=23662;

 26550:20150604:124324.098 query [txnlev:1] [delete from screens_items where resourcetype in (3,1) and resourceid=23663;
 26550:20150604:124324.099 query [txnlev:1] [commit;]

 26549:20150604:124324.137 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [delete from screens_items where resourcetype in (3,1) and resourceid=23662;
 26549:20150604:124324.137 query [delete from screens_items where resourcetype in (3,1) and resourceid=23662;
 26549:20150604:124324.137 query [txnlev:1] [rollback;]

After the fix:

 13969:20150604:150549.489 server #3 started [poller #1]

 13970:20150604:150549.483 server #4 started [poller #2]
 [sleep 2]

 13969:20150604:150549.576 query [txnlev:1] [begin;]
 13969:20150604:150549.599 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
 [sleep 4]

 13970:20150604:150551.575 query [txnlev:1] [begin;]
 13970:20150604:150551.578 query [txnlev:1] [delete from screens_items where screenitemid=77;
 13970:20150604:150551.595 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']

 13969:20150604:150553.600 query [txnlev:1] [delete from screens_items where screenitemid=76;
 13969:20150604:150553.600 query [txnlev:1] [commit;]

 13970:20150604:150553.607 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
 13970:20150604:150553.608 query [txnlev:1] [delete from screens_items where screenitemid=78;
 13970:20150604:150553.608 query [txnlev:1] [commit;]
Comment by dimir [ 2015 Jun 04 ]

Tested. Please review my changes in r53954.

wiper thanks

Comment by Andris Zeila [ 2015 Jun 08 ]

Released in:

  • pre-2.2.10rc1 r53970
  • pre-2.4.6rc1 r53971
  • pre-2.5.0 r53972