[ZBX-11703] zabbix server may stop after failed query - ERROR: commit without transaction Created: 2017 Jan 13  Updated: 2018 Dec 14  Resolved: 2017 Feb 03

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.0.6
Fix Version/s: 3.0.8rc1, 3.2.4rc1, 3.4.0alpha1

Type: Problem report Priority: Blocker
Reporter: Oleksii Zagorskyi Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: deadlock, stopped
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate

 Description   

This zabbix server has many deadlock messages in log file, here for ~20 hours:

# grep "Lock wait timeout exceeded" zabbix_server_12_01.log 
 24383:20170111:184028.660 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=49297;
 24526:20170111:184459.097 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=49423;
 23921:20170111:184537.807 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=51567;
 24050:20170111:184556.526 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=52995;
 24804:20170111:184930.063 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=48267;
 24149:20170111:185331.951 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=59019;
 24122:20170111:185421.810 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=50982;
 24721:20170111:185533.073 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=63192;
 24288:20170111:185539.271 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=62481;
 24797:20170111:185604.309 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=58739;
 24705:20170111:185650.512 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=64196;
 24444:20170111:185709.946 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=52145;
 23878:20170111:185714.509 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=65274;
 23879:20170111:185917.687 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=61593;
 23756:20170111:185920.685 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=58759;
 24120:20170111:185939.689 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=63613;
 23848:20170111:185955.686 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [111] Connection refused' where hostid=63700;
 24176:20170111:190122.584 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=55279;
 23757:20170111:190142.437 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=65976;
 24099:20170111:190211.150 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=71992;
 24368:20170111:190333.953 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=65452;
 24294:20170111:190340.565 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=56928;
 24381:20170111:190435.956 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [select null from hosts where hostid=60609 for update]
  6794:20170112:102939.140 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=59460;
  6702:20170112:102942.141 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=48658;
  7091:20170112:102953.147 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=64896;
  7072:20170112:103000.273 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=69385;
  6549:20170112:103002.432 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=52009;
  7018:20170112:103003.432 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=74091;
  7052:20170112:103003.432 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=46770;
  7064:20170112:103006.762 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=68565;
  6907:20170112:103010.728 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1,error='' where hostid=85253;
  6770:20170112:103014.729 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=1 where hostid=50438;
  6817:20170112:103016.118 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set available=2,error='Get value from agent failed: cannot connect to [[[XX.XX.XX.XX]]:10050]: [113] No route to host' where hostid=81132;
  6858:20170112:103028.262 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [select null from hosts where hostid=109478 for update]

But critical point is that twice zabbix server stopped because of a rare condition:

 24381:20170111:190435.956 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [select null from hosts where hostid=60609 for update]
 24381:20170111:190435.956 slow query: 1001.288937 sec, "select null from hosts where hostid=60609 for update"
 24381:20170111:190435.978 [Z3005] query failed: [1048] Column 'itemid' cannot be null [insert into items_applications (itemappid,applicationid,itemid) values (11202011,97994,null),(11202012,97994,null);
]
 24381:20170111:190435.978 ERROR: commit without transaction. Please report it to Zabbix Team.
zabbix_server_mm: trapper #358 [processing data]: db.c:818: zbx_db_commit: Assertion `0' failed.
 18495:20170111:190436.241 One child process died (PID:24381,exitcode/signal:6). Exiting ...
 18495:20170111:190443.122 syncing history data...
 18495:20170111:190538.706 syncing history data done
 18495:20170111:190538.706 syncing trends data...
 18495:20170111:190557.797 syncing trends data done
 18495:20170111:190557.797 Zabbix Server stopped. Zabbix 3.0.6rc1 (revision 64124).
  6858:20170112:103028.262 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [select null from hosts where hostid=109478 for update]
  6858:20170112:103028.262 slow query: 1000.388938 sec, "select null from hosts where hostid=109478 for update"
  6858:20170112:103028.275 [Z3005] query failed: [1048] Column 'itemid' cannot be null [insert into items_applications (itemappid,applicationid,itemid) values (11228662,275523,null),(11228663,275523,null);
]
  6858:20170112:103028.275 ERROR: commit without transaction. Please report it to Zabbix Team.
zabbix_server_mm: trapper #384 [processing data]: db.c:818: zbx_db_commit: Assertion `0' failed.
   328:20170112:103028.280 One child process died (PID:6858,exitcode/signal:6). Exiting ...
   328:20170112:103030.391 syncing history data...
   328:20170112:103030.405 syncing history data done
   328:20170112:103030.405 syncing trends data...
   328:20170112:103141.040 syncing trends data done
   328:20170112:103141.040 Zabbix Server stopped. Zabbix 3.0.6rc1 (revision 64124).

This should be fixed that server will not stop.



 Comments   
Comment by Oleksii Zagorskyi [ 2017 Jan 13 ]

The physical server is running a few zabbix server daemon, of different sockets etc, but I think it should not be related to the issue.

Comment by Oleksii Zagorskyi [ 2017 Jan 13 ]

btw, just FYI, there is also one more unexpected error:

   25976:20170112:115501.922 [Z3005] query failed: [1062] Duplicate entry '7464973-1484197200' for key 'PRIMARY' [insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (1775036,1484197200,5,1,1,1),(7260389,1484197200,1,8,8,8),(2259063, .....
Comment by Sergejs Paskevics [ 2017 Jan 23 ]

I think ZBX-11587 can be related.

Comment by Sergejs Paskevics [ 2017 Jan 23 ]

ERROR: commit without transaction fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-11703
Deadlock is fixed in ZBX-11587.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 24 ]

I doubt it was intentional...

r65233 | sergejs | 2017-01-23 15:07:13 +0200 (Mon, 23 Jan 2017) | 1 line
Changed paths:
   A /branches/dev/ZBX-11703/trunk (from /trunk:65232)

.......... [ZBX-11703] created new development branch

Please fix the development branch

Comment by Sergejs Paskevics [ 2017 Jan 24 ]

Development branch is correct now.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 26 ]

(1) When there are no new items discovered and no items need update lld_items_save() will return successfully without locking host in DB. However, there may be applications to create or update and in case someone else deletes unlocked host there will be errors. For LLD triggers and graphs there is one *_save() function that does the locking just before cleanup. For items there are currently 3 *_save() steps and only the first does locking.

s.paskevics RESOLVED in r65332.

glebs.ivanovskis I think these procedures

	lld_item_links_populate(&item_prototypes, lld_rows, &items_index);
	lld_remove_lost_items(&items, lifetime, lastcheck);
	lld_remove_lost_applications(lld_ruleid, &applications, lifetime, lastcheck);

should be skipped if one of preceding steps failed. Makes no sense to remove things by hand, they must had been deleted automatically when parent LLD rule was deleted together with host.

s.paskevics RESOLVED in r65398.

glebs.ivanovskis Brilliant! CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 26 ]

(2) Shouldn't we goto clean; in lld_process_discovery_rule() if one of the following steps rolls back?

	lld_update_items(hostid, lld_ruleid, &lld_rows, &error, lifetime, now);
	lld_update_triggers(hostid, lld_ruleid, &lld_rows, &error);
	lld_update_graphs(hostid, lld_ruleid, &lld_rows, &error);

s.paskevics RESOLVED in r65332.

glebs.ivanovskis Looks good. CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 26 ]

And there may be more scenarios where users will discover LLD imperfections sooner or later.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 31 ]

(3) Please review my changes in r65380. There are mostly const qualifiers, but also a couple of comment fixes and a couple of more transparent checks.

s.paskevics Looks OK. CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 31 ]

(4) Would be nice to have more debug logging when such race conditions happen. It will make it easier to troubleshoot future problems, because, I guess, we won't be able to cover all corner cases within this ticket.

s.paskevics RESOLVED in r65398.

glebs.ivanovskis Thank you! Minor polishing in r65402. CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 31 ]

Please review my changes and make small logging and performance adjustments. Overall, everything looks good to me.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Jan 31 ]

Successfully tested!

Comment by Sergejs Paskevics [ 2017 Feb 01 ]

Fixed in

  • pre-3.0.8rc1 r65411,
  • pre-3.2.4rc1 r65414,
  • pre-3.3.0 (trunk) r65420.
Comment by Glebs Ivanovskis (Inactive) [ 2017 Feb 22 ]

(5) Sorry, I forgot to put one const qualifier in r65380.

In include/proxy.h:

void	lld_process_discovery_rule(zbx_uint64_t lld_ruleid, char *value, const zbx_timespec_t *ts);

In src/libs/zbxdbhigh/lld.c:

void	lld_process_discovery_rule(zbx_uint64_t lld_ruleid, const char *value, const zbx_timespec_t *ts)

glebs.ivanovskis Will be fixed in ZBX-12186.
CLOSED

Generated at Thu Apr 18 06:34:44 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.