[ZBX-8949] Possible deadlock on ids table on "housekeeper" row Created: 2014 Oct 24 Updated: 2017 May 30 Resolved: 2015 Jun 30 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Frontend (F), Server (S) |
Affects Version/s: | 2.2.7rc1, 2.4.1 |
Fix Version/s: | 2.2.10rc1, 2.4.6rc1, 2.5.0 |
Type: | Incident report | Priority: | Blocker |
Reporter: | Alexey Pustovalov | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | deadlock, housekeeper |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Issue Links: |
|
Description |
This can happen because the Zabbix frontend and server can update the same table row at the same time:

31628:20141020:094957.796 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [update ids set nextid=nextid+14 where nodeid=0 and table_name='housekeeper' and field_name='housekeeperid']
zabbix_server [31628]: ERROR [file:db.c,line:999] Something impossible has just happened.
31616:20141020:094930.468 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [update ids set nextid=nextid+14 where nodeid=0 and table_name='housekeeper' and field_name='housekeeperid']
zabbix_server [31616]: ERROR [file:db.c,line:999] Something impossible has just happened. |
Comments |
Comment by Andris Zeila [ 2015 May 13 ] |
During item removal we delete from the screens_items (and also profiles) tables using non-indexed fields in the where clause. With MySQL this results in all table records being locked, which can easily lead to deadlocks. To avoid this, we should first select the corresponding identifiers (screenitemid, profileid) and then perform the SQL delete based on those identifiers. |
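The select-then-delete pattern described in this comment can be sketched as follows. This is only an illustration, not the actual server code (which is written in C): it uses Python with an in-memory SQLite database so that it runs anywhere, the helper name delete_screen_items_by_id is hypothetical, and the sample rows are made up to resemble the values seen in the logs of this issue. SQLite does not reproduce MySQL's locking behaviour; the sketch only shows the shape of the queries.

```python
import sqlite3


def delete_screen_items_by_id(conn, resourcetypes, resourceid):
    """Select matching screenitemid values first, then delete by primary key.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch and are not taken from the Zabbix sources.
    """
    placeholders = ",".join("?" for _ in resourcetypes)
    rows = conn.execute(
        "select screenitemid from screens_items"
        f" where resourcetype in ({placeholders}) and resourceid=?"
        " order by screenitemid",
        (*resourcetypes, resourceid),
    ).fetchall()
    ids = [row[0] for row in rows]
    # Each delete addresses exactly one row through its primary key.
    for screenitemid in ids:
        conn.execute(
            "delete from screens_items where screenitemid=?", (screenitemid,)
        )
    return ids


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table screens_items ("
        "screenitemid integer primary key,"
        " resourcetype integer, resourceid integer)"
    )
    # Illustrative rows only; the id/resource pairing is an assumption.
    conn.executemany(
        "insert into screens_items values (?,?,?)",
        [(76, 1, 23662), (77, 0, 547), (78, 1, 23663)],
    )
    deleted = delete_screen_items_by_id(conn, (3, 1), 23662)
    remaining = [
        row[0]
        for row in conn.execute(
            "select screenitemid from screens_items order by screenitemid"
        )
    ]
    return deleted, remaining
```

Calling demo() deletes only the row that matches the filter and leaves the other two in place. The point of the pattern is that the delete statement itself only ever names primary-key values, so the database can lock individual rows instead of scanning the table.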
Comment by Andris Zeila [ 2015 May 18 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-8949 |
Comment by Andris Zeila [ 2015 Jun 04 ] |
Backported fixes to 2.2 branch (svn://svn.zabbix.com/branches/dev/ZBX-8949_2.2) |
Comment by dimir [ 2015 Jun 04 ] |
Here is the scenario wiper proposed (lld).

process1: deletes item1
- item1 is in graph1
- graph1 is not on any screen
- item1 is in screen1 as simple graph

process2: deletes item2
- item2 is in graph2
- graph2 is in screen2

What actually happens and in which order:
- process1: delete from screens_items by graphid - NOTHING TO DO
- process1: update ids ("housekeeper") - ids LOCKED
- process2: delete from screens_items by graphid - screens_items LOCKED
- process1: delete from screens_items by itemid - WAIT ON screens_items LOCK
- process2: update ids ("housekeeper") - WAIT ON ids LOCK (deadlock)

To reproduce this, I added code that removes items from 2 different poller processes; each removes one item. I also added some sleep() calls to enforce the needed order. This is what the server log showed before the fix:

26549:20150604:124320.090 server #3 started [poller #1]
26550:20150604:124320.098 server #4 started [poller #2]
[sleep 2]
26549:20150604:124320.093 query [txnlev:1] [begin;]
26549:20150604:124320.095 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
[sleep 4]
26550:20150604:124322.114 query [txnlev:1] [begin;]
26550:20150604:124322.116 query [txnlev:1] [delete from screens_items where resourcetype=0 and resourceid=547;
26550:20150604:124322.117 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
26549:20150604:124324.096 query [txnlev:1] [delete from screens_items where resourcetype in (3,1) and resourceid=23662;
26550:20150604:124324.098 query [txnlev:1] [delete from screens_items where resourcetype in (3,1) and resourceid=23663;
26550:20150604:124324.099 query [txnlev:1] [commit;]
26549:20150604:124324.137 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [delete from screens_items where resourcetype in (3,1) and resourceid=23662;
26549:20150604:124324.137 query [txnlev:1] [rollback;]

After the fix:

13969:20150604:150549.489 server #3 started [poller #1]
13970:20150604:150549.483 server #4 started [poller #2]
[sleep 2]
13969:20150604:150549.576 query [txnlev:1] [begin;]
13969:20150604:150549.599 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
[sleep 4]
13970:20150604:150551.575 query [txnlev:1] [begin;]
13970:20150604:150551.578 query [txnlev:1] [delete from screens_items where screenitemid=77;
13970:20150604:150551.595 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
13969:20150604:150553.600 query [txnlev:1] [delete from screens_items where screenitemid=76;
13969:20150604:150553.600 query [txnlev:1] [commit;]
13970:20150604:150553.607 query [txnlev:1] [update ids set nextid=nextid+7 where table_name='housekeeper' and field_name='housekeeperid']
13970:20150604:150553.608 query [txnlev:1] [delete from screens_items where screenitemid=78;
13970:20150604:150553.608 query [txnlev:1] [commit;] |
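The lock ordering in this scenario can be modelled as a small wait-for graph. The following Python sketch is a simplified illustration and not part of Zabbix or MySQL: the function replay, the resource names, and the two step lists are assumptions made for this example. It treats the non-indexed delete as locking the whole screens_items table (as described in the earlier comment) and the primary-key delete as locking a single row.

```python
def replay(steps):
    """Replay lock requests in order and report the first wait-for cycle.

    Each step is (process, resource); the pseudo-resource "commit" releases
    every lock the process holds. Returns the processes forming the cycle,
    or None when every request is eventually granted.
    """
    owner = {}      # resource -> process currently holding its lock
    waits_for = {}  # blocked process -> (holder, wanted resource)
    for process, resource in steps:
        if resource == "commit":
            for held in [r for r, p in owner.items() if p == process]:
                del owner[held]
            # Hand released locks over to processes that were waiting.
            for waiter, (holder, wanted) in list(waits_for.items()):
                if holder == process:
                    new_holder = owner.setdefault(wanted, waiter)
                    if new_holder == waiter:
                        del waits_for[waiter]
                    else:
                        waits_for[waiter] = (new_holder, wanted)
            continue
        holder = owner.setdefault(resource, process)
        if holder == process:
            continue  # lock granted (or already held)
        waits_for[process] = (holder, resource)
        # Follow the chain of waiting processes to see if it loops back.
        chain = [process]
        current = holder
        while current in waits_for and current not in chain:
            chain.append(current)
            current = waits_for[current][0]
        if current == process:
            return chain
    return None


# Before the fix: the non-indexed delete is modelled as a table-wide lock.
BEFORE_FIX = [
    ("process1", "ids:housekeeper"),
    ("process2", "screens_items"),
    ("process1", "screens_items"),
    ("process2", "ids:housekeeper"),
]

# After the fix: deletes by screenitemid are modelled as single-row locks.
AFTER_FIX = [
    ("process1", "ids:housekeeper"),
    ("process2", "screens_items:77"),
    ("process2", "ids:housekeeper"),
    ("process1", "screens_items:76"),
    ("process1", "commit"),
    ("process2", "screens_items:78"),
    ("process2", "commit"),
]
```

With these inputs, replay(BEFORE_FIX) reports a cycle between the two processes, while replay(AFTER_FIX) returns None: process2 still has to wait for the ids row, but process1 no longer needs anything process2 holds, so it can commit and release the lock.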
Comment by dimir [ 2015 Jun 04 ] |
Tested. Please review my changes in r53954. wiper thanks |
Comment by Andris Zeila [ 2015 Jun 08 ] |
Released in:
|