[ZBX-12251] New values aren't counted towards trigger functions evaluation Created: 2017 Jun 01 Updated: 2024 Apr 10 Resolved: 2017 Dec 26 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 3.2.7rc1 |
Fix Version/s: | 4.0.0alpha1, 4.0 (plan) |
Type: | Problem report | Priority: | Blocker |
Reporter: | Dmitry Verkhoturov | Assignee: | Vladislavs Sokurenko |
Resolution: | Fixed | Votes: | 8 |
Labels: | backup, deadlock | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Zabbix 3.2.7rc1 (revision 68572) |
Attachments: |
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
||||||||||||||||||||||||||||||||
Issue Links: |
|
||||||||||||||||||||||||||||||||
Team: | |||||||||||||||||||||||||||||||||
Sprint: | Sprint 9, Sprint 10, Sprint 11, Sprint 12, Sprint 13, Sprint 14, Sprint 15, Sprint 16, Sprint 17, Sprint 18, Sprint 19, Sprint 20, Sprint 21, Sprint 23 | ||||||||||||||||||||||||||||||||
Story Points: | 5 |
Description |
Hello, team. |
Comments |
Comment by Dmitry Verkhoturov [ 2017 Jun 01 ] |
Downgrading to Zabbix 3.2.6 (revision 67849) with restart of server to update binaries helped this situation: trigger I've used as example jumped to OK state. Seems like restart affected the problem, not the downgrade. |
Comment by patrik uytterhoeven [ 2017 Jun 05 ] |
i had this similar problem a while a go but in chat no one could confirm similar behaviour. We had a network crash and in zabbix many of the items in problem state would not revert back to ok state even in latest data we would receive new data. I tried to restart server do a cache reload nothing helped. Dont know if it helps but we are on: |
Comment by Dmitry Verkhoturov [ 2017 Jun 06 ] |
I found exactly same behavior in 3.2.5. |
Comment by Kevin Daudt [ 2017 Jun 06 ] |
Noticing something similar in 3.2.3. Manually closing the problem fixes it. Looking at this issue, I do notice that same SNMP items look switched (values for stable OIDs suddenly changing). Note sure if due to monitored host or Zabbix, but the triggers that got into a problem state are not solving anymore even if the data says it should). |
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jun 06 ] |
There was an issue with triggers stuck in PROBLEM in the very infancy of 3.2: |
Comment by Kevin Daudt [ 2017 Jun 06 ] |
The issue is not a lack of events (I see the events, and can manually close them once enabled for this trigger) |
Comment by Dmitry Verkhoturov [ 2017 Jun 07 ] |
I've uploaded txt file with debug info from history syncer on new value which came to problem item. I hope it will shed some light on the problem. |
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jun 08 ] |
Few configuration tips, if I may
This way you'll save some time needed to keep everything in sync. Now back to the serious business. Paskal, which timezone is your Zabbix server situated in? Is it UTC? If it is, then trigger lastchange from this line 82112:20170605:172013.809 cannot find open problem events for triggerid:3414987, lastchange:1496683209 is just 4 seconds before the log message was printed: $ TZ='' date -d @1496683209 Mon Jun 5 17:20:09 UTC 2017 In my opinion, this makes it highly unlikely that housekeeper is involved in this story. This is active check. Is proxy involved in collecting this item or is this host monitored directly by server? But then this line... 28333:20170607:055648.877 In zbx_process_trigger() triggerid:3414987 value:0(0) new_value:3 value:0(0) means TRIGGER_VALUE_OK and TRIGGER_STATE_NORMAL, new_value:3 means TRIGGER_VALUE_NONE. Looks like you are suffering from |
Comment by Dmitry Verkhoturov [ 2017 Jun 08 ] |
Zabbix 3.2.5 (revision 67445) with Re "Which timezone is your Zabbix server situated in?" - it is UTC. |
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jun 08 ] |
Thank you for the answers! Just in case, only Zabbix server version matters, no need to experiment with frontend/agents/proxies. What is very suspicious is that server logs current trigger value as TRIGGER_VALUE_OK, that's why, in server's mind, this trigger does not need resolving. But, as you say, frontend shows this trigger in PROBLEM. And the value of trigger in DB is 1 which is TRIGGER_VALUE_PROBLEM. |
Comment by Dmitry Verkhoturov [ 2017 Jun 10 ] |
New debug info from 3.2.6 attached, old one renamed. |
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jun 11 ] |
Do you still have those logs (unfiltered)? |
Comment by Vladislavs Sokurenko [ 2017 Jun 12 ] |
Could you please be so kind and check if you had similar configuration ? I couldn't reproduce the issue right away, it might need more investigation. What I have tried is zabbix agent active item with key
vfs.file.time["{$FILE}"]
Refresh rate of 5 seconds And following trigger expression, while everything else in trigger is as by default. {test server:vfs.file.time["{$FILE}"].now(0)} - {test server:vfs.file.time["{$FILE}"].last(0)} > 30 Problem occurs when file is not changed for 30 seconds and resolves when I change the file. |
Comment by Dmitry Verkhoturov [ 2017 Jun 12 ] |
vso, I'm terribly sorry - please, ignore 3.2.5 debug log. As glebs.ivanovskis wrote, it's likely affected by |
Comment by Glebs Ivanovskis (Inactive) [ 2017 Jun 12 ] |
Now I don't think that I think our best shot is to provide a patch with additional sanity checks and very specific logging and roll it out on a live installation where issue already takes place. |
Comment by Dmitry Verkhoturov [ 2017 Jun 14 ] |
Around the time problem happened first I've enabled SQL backup on Zabbix MySQL master every 3h, heaviest part of which looked like that: TABLES_TO_IMPORT=($(mysql -e 'show tables from zabbix where Tables_in_zabbix not like "history%" and Tables_in_zabbix not like "trends%" and Tables_in_zabbix != "sessions";' | tail -n +2)) mysqldump --skip-add-drop-table --no-create-info --no-autocommit --skip-disable-keys --skip-add-locks --lock-all-tables --databases zabbix --tables "${TABLES_TO_IMPORT[@]}" Effectively it meant following options for mysqldump: --opt --lock-all-tables --no-autocommit --no-create-info --skip-add-drop-table --skip-add-locks --skip-disable-keys Because of lock-all-tables option in case of long query being in progress database stopped all writes to affected tables and Zabbix writes stopped at all — we restarted MySQL after that couple of times before finding the problem. Now I changed it to run only once a day without a lock, by removing lock-all-tables and adding single-transaction. If that was a real root cause, after my changes and zabbix-server restart problem won't appear at all. UPD: I've didn't got any reports of the problem after fixing backup, maybe that's really was the problem. Previous update seems false positive, trigger fixed itself after a little period. |
Comment by richlv [ 2017 Jun 22 ] |
this might be unrelated (and 2.4 is not supported), but i might be seeing the same symptoms with 2.4.6. data is collected and stored in the database, but triggers on a host are stuck in a problem state. |
Comment by richlv [ 2017 Jun 22 ] |
could this also be the same as |
Comment by Dmitry Verkhoturov [ 2017 Jun 23 ] |
I'm not using proxy, but it seems similar judjing by little amount of data |
Comment by Vladislavs Sokurenko [ 2017 Jun 27 ] |
If you are able to reproduce issue then please try enabling history sincer log only.
zabbix_server -R log_level_increase="history syncer"
Or if you did not delete the log then please search for function zbx_process_trigger() I don't see the end result of it in the log attached unfortunately. Could you please also check if you had messages like this 20762:20170627:185502.383 ignoring query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (67,0,0,13572,1498578886,458560744,0); ] within failed transaction |
Comment by Vladislavs Sokurenko [ 2017 Jun 27 ] |
Reproduced on trunk, database failure during saving of events is not handled correctly. Steps: test server:trap1.last()}=204 3. Make trigger go in problem state ./bin/zabbix_sender -z 127.0.0.1 -p 10051 -s "test server" -k trap1 -o 204
4. Try to resolve by executing command below, but when log says stop database now do restart of database for example sudo systemctl restart mariadb ./bin/zabbix_sender -z 127.0.0.1 -p 10051 -s "test server" -k trap1 -o 0
5. Observe that trigger is in problem state and message show is 20762:20170627:185502.383 ignoring query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (67,0,0,13572,1498578886,458560744,0); ] within failed transaction 6. Try to resolve it again ./bin/zabbix_sender -z 127.0.0.1 -p 10051 -s "test server" -k trap1 -o 0
Expected: Actual: |
Comment by Vladislavs Sokurenko [ 2017 Jun 27 ] |
for anyone experiencing issues similar to this, please apply attached patch query_fail_debug.patch for 3.2 to see if any queries are failed and ignored, it will only increase logging if there are database errors, so this should not slow down server. |
Comment by Dmitry Verkhoturov [ 2017 Jun 27 ] |
Regarding ignoring query - I didn't found any such entries in my log, sorry. Problem happened long before I found it and checked debug log. query_fail_debug.patch - I will apply it and reproduce problem state of DB if you really need me to check it, if you are sure what root cause is found - I'll choose not break my production monitoring once more. vso I see that there was a deadlock in the same place that I can reproduce and then ignoring query and failed transaction, also this confirms your words that by removing lock-all-tables the issue is gone.
28285:20170607:060059.964 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction
Paskal I'm very glad you reproduced the problem, please tell me if any need will be needed for creating a fix on that. vso yes please do: ------------------------ LATEST DETECTED DEADLOCK ------------------------ 2017-06-28 15:18:55 0x7f8c33030700 *** (1) TRANSACTION: TRANSACTION 98361978762, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 8 lock struct(s), heap size 1136, 5 row lock(s), undo log entries 2 MySQL thread id 94485929, OS thread handle 140238525007616, query id 7749298672 mon-aws-fr-1.iponweb.iponweb.net 10.32.1.158 zabbix updating UPDATE profiles SET value_str='screens.php', type=3 WHERE userid=248 AND idx='web.paging.lastpage' *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 1113 page no 244 n bits 256 index PRIMARY of table `zabbix`.`profiles` trx id 98361978762 lock_mode X locks rec but not gap waiting *** (2) TRANSACTION: TRANSACTION 98361978729, ACTIVE 0 sec fetching rows mysql tables in use 1, locked 1 561 lock struct(s), heap size 73936, 87323 row lock(s), undo log entries 6 MySQL thread id 94486037, OS thread handle 140240127985408, query id 7749298324 hostname 10.32.4.4 zabbix updating DELETE FROM profiles WHERE idx='web.favorite.graphids' AND source='itemid' AND value_id='6923163' *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 1113 page no 244 n bits 256 index PRIMARY of table `zabbix`.`profiles` trx id 98361978729 lock_mode X *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 1113 page no 1294 n bits 168 index PRIMARY of table `zabbix`.`profiles` trx id 98361978729 lock_mode X waiting *** WE ROLL BACK TRANSACTION (1) vso thanks, unfortunately this is some other deadlock related to frontend, not the one that could have caused problems in triggers. Paskal They didn't happen to me for a long time, I'll return base to bad condition after building server with your patch and enabling deadlocks logging for MySQL. Paskal To make sure we're on one wave: I'm patching my zabbix-server, enabling innodb_print_all_deadlocks, and waiting for trouble to come. When it comes, I give you data on problem trigger from Zabbix and MySQL log, is it correct? vso do you experience issue after changing your mysqldump query ? if yes, then please provide it so we can find a reason for deadlock, but it's currently clear that zabbix does not handle deadlock correctly and it should be fixed. Paskal I am not experiencing deadlocks which trigger that problem after changing mysqldump query and I'll better leave my system that way. If you know the problem and know how to fix it, I'm happy and don't need anything else. Paskal Vladislav, do you need any additional information from me? Please write explicitly, do I need to try to reproduce the bug and give you logs or it's not necessary and ball on your side. vso Thank you, no additional information is needed. Changing mysqldump not to lock tables should have resolved deadlocks for you. We will see what can be done to handle deadlocks in Zabbix server. |
Comment by Kevin Daudt [ 2017 Jun 28 ] |
We are facing similar issues (seeing "query failed" messages in our zabbix_server.log during backups), but we do not have a solution yet (we are using xtrabackup, not mysqldump). vso do you see deadlock message ? kdaudt Yes: 2106:20170625:230340.113 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction Please next time call SHOW ENGINE INNODB STATUS command so exact cause of deadlock can be known. This is the output: ------------------------ LATEST DETECTED DEADLOCK ------------------------ 2017-06-13 01:00:02 7fbf570c3700 *** (1) TRANSACTION: TRANSACTION 66811503829, ACTIVE 0 sec starting index read mysql tables in use 362, locked 362 LOCK WAIT 780 lock struct(s), heap size 79400, 418 row lock(s) MySQL thread id 100730068, OS thread handle 0x7fc043e38700, query id 21017990923 nlmgmt016 10.60.17.233 zabbix updating delete from trends_uint where itemid=200200000570970 and clock<1497314251 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 6951 page no 8 n bits 328 index `PRIMARY` of table `zabbix`.`trends_uint` /* Partition `p201706070000` */ trx id 66811503829 lock_mode X waiting Record lock, heap no 246 PHYSICAL RECORD: n_fields 8; compact format; info bits 0 0: len 8; hex 0000b614b1eb7b50; asc {P;; 1: len 4; hex d93f2b00; asc ?+ ;; 2: len 6; hex 000f8e46a3f3; asc F ;; 3: len 7; hex a9000040233c87; asc @#< ;; 4: len 4; hex 8000000c; asc ;; 5: len 8; hex 0000000000000000; asc ;; 6: len 8; hex 0000000000000000; asc ;; 7: len 8; hex 0000000000000001; asc ;; Note that the date says 2017-06-13, while the latest log entry in the server log for deadlock is 2017-06-28 |
Comment by Rostislav Palivoda [ 2017 Jul 24 ] |
Discussed with vso - dues to high risk change, will merge to trunk in release 4.0. Fix version changed to 4.0 plan. |
Comment by Vladislavs Sokurenko [ 2017 Aug 09 ] |
Fixed in development branch: |
Comment by Andris Zeila [ 2017 Aug 30 ] |
Successfully tested. Please review & test r71871 - history table upadates were moved to separate transaction because of vso reviewed/tested. |
Comment by Elvar [ 2017 Oct 23 ] |
I am not certain but I think I may be experiencing this same issue in 3.4.1, 3.4.2, and 3.4.3. I am randomly seeing situations with items not recovering despite their recovery conditions being met. Can anyone tell me how I can verify whether or not it is this bug? Or maybe it is related to https://support.zabbix.com/browse/ZBX-12923 which I just opened today. Suggestions? vso out of bonds error makes whole transaction to fail so trigger is updated in cache only, not in database, though fixing wiper I was thinking about it - we probably should not update cache if the transaction faills with unrecoverable error. |
Comment by Andris Zeila [ 2017 Nov 23 ] |
Successfully tested |
Comment by Vladislavs Sokurenko [ 2017 Nov 23 ] |
Fixed in
|
Comment by Andrey Melnikov [ 2017 Dec 15 ] |
r74914 have off-by-one error which allow server store items with history=0 in database. Index: src/libs/zbxdbcache/dbcache.c =================================================================== --- src/libs/zbxdbcache/dbcache.c (revision 75695) +++ src/libs/zbxdbcache/dbcache.c (working copy) @@ -2007,7 +2007,7 @@ continue; } - if (0 == items->history) + if (0 == item->history) h->flags |= ZBX_DC_FLAG_NOHISTORY; if ((ITEM_VALUE_TYPE_FLOAT != item->value_type && ITEM_VALUE_TYPE_UINT64 != item->value_type) || vso yes, thanks allot for reporting this. |
Comment by Naruto [ 2017 Dec 16 ] |
Any plans for a 3.4 fix? It's affected too, I experienced this behavior after a brief database unavailability. vso I am sorry but it's not planned for 3.4 |
Comment by Vladislavs Sokurenko [ 2017 Dec 18 ] |
Fixed in:
Fixed possibility of trigger status getting stuck in PROBLEM or OK state due to transaction error in history synchronization, added deadlock to a list of recoverable errors. |
Comment by Vladislavs Sokurenko [ 2017 Dec 19 ] |
Attached patch for 3.2.6 which is subset of fixes for 4.0 |
Comment by Dmitry Verkhoturov [ 2017 Dec 19 ] |
Paskal will fix be included into 3.2 branch? Otherwise I could publish it in https://github.com/zabbix/zabbix-patches for broader visibility and easier applying. vso I am sorry but 3.2 is no longer supported, feel free to publish. Paskal Thank you very much! Patch included and fixed for latest available version (3.2.10). Paskal Vladislavs, is there any chance you could give us same patch for v3.4? vso unfortunately I don't have the patch, I should make one if needed, but alpha 4.0.0alpha1 is out perhaps you can upgrade to it. |
Comment by Leandro [ 2018 Aug 21 ] |
Hi everyone, I'm facing a similar problem like the one reported here. My installation is:
------------------------
LATEST DETECTED DEADLOCK
------------------------
180802 18:18:10
*** (1) TRANSACTION:
TRANSACTION 1074DEB8, ACTIVE 31 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 2
MySQL thread id 14, OS thread handle 0x7fa52814e700, query id 131804 localhost zabbix update
insert into problem (eventid,source,object,objectid,clock,ns) values (17819200,0,0,13473,1533251858,881253939)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 407298 n bits 1096 index `c_problem_2` of table `zabbix`.`problem` trx id 1074DEB8 lock_mode X insert intention waiting
*** (2) TRANSACTION:
TRANSACTION 1074D1CF, ACTIVE 276 sec fetching rows
mysql tables in use 2, locked 2
775 lock struct(s), heap size 113080, 210408 row lock(s), undo log entries 24189
MySQL thread id 838, OS thread handle 0x7fa407110700, query id 124045 localhost zabbix Sending data
delete from events where clock<1533137125 and events.source=3 and events.object=0 and not exists (select null from problem where events.eventid=problem.eventid or events.eventid=problem.r_eventid)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 407298 n bits 1096 index `c_problem_2` of table `zabbix`.`problem` trx id 1074D1CF lock mode S
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 223560 n bits 312 index `PRIMARY` of table `zabbix`.`events` trx id 1074D1CF lock_mode X waiting
*** WE ROLL BACK TRANSACTION (1)
Could this deadlock be a cause for alarm condition stuck in trigger status, along with the errors I have in zabbix log?
1688:20180819:053122.134 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [insert into problem (eventid,source,object,objectid,clock,ns) values (19331940,3,0,16474,1534674631,176287537),(19331941,3,0,16475,1534674631,176287537),(19331942,3,0,16476,1534674631,176287537),(19331943,3,0,16477,1534674631,176287537),(19331944,3,0,16478,1534674631,176287537),(19331945,3,0,16479,1534674631,176287537),(19331946,3,0,16480,1534674631,176287537),(19331947,3,0,16481,1534674631,176287537),(19331948,3,0,16482,1534674631,176287537),(19331949,3,0,16483,1534674631,176287537),(19331950,3,0,16484,1534674631,176287537),(19331951,3,0,16485,1534674631,176287537),(19331952,3,0,16486,1534674631,176287537),(19331953,3,0,16487,1534674631,176287537),(19331954,3,0,16488,1534674631,176287537);
]
1688:20180819:053122.134 slow query: 50.756534 sec, "insert into problem (eventid,source,object,objectid,clock,ns) values (19331940,3,0,16474,1534674631,176287537),(19331941,3,0,16475,1534674631,176287537),(19331942,3,0,16476,1534674631,176287537),(19331943,3,0,16477,1534674631,176287537),(19331944,3,0,16478,1534674631,176287537),(19331945,3,0,16479,1534674631,176287537),(19331946,3,0,16480,1534674631,176287537),(19331947,3,0,16481,1534674631,176287537),(19331948,3,0,16482,1534674631,176287537),(19331949,3,0,16483,1534674631,176287537),(19331950,3,0,16484,1534674631,176287537),(19331951,3,0,16485,1534674631,176287537),(19331952,3,0,16486,1534674631,176287537),(19331953,3,0,16487,1534674631,176287537),(19331954,3,0,16488,1534674631,176287537);
"
Is it currently another way to prevent this issue to appear, without doing a full reinstallation to pre-4.0.0alpha1? I have configured many of mysql settings suggested to avoid slow queries. They are still there, though. Thanks for your support, L |
Comment by Vladislavs Sokurenko [ 2018 Aug 21 ] |
Hello, it should be solved by |
Comment by Leandro [ 2018 Aug 21 ] |
ok, Is patch applied to pre-3.4.10rc1 r80725 also ported forward to 3.4.12? Will it be in the last stable 3.4 release? Thank you |
Comment by Vladislavs Sokurenko [ 2018 Aug 21 ] |
Yes it is also available forward. |