[ZBXNEXT-4108] Ability to search problems by trigger name (Z4) Created: 2017 Sep 18  Updated: 2024 Apr 10  Resolved: 2017 Dec 18

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: API (A), Frontend (F), Server (S)
Affects Version/s: None
Fix Version/s: 4.0.0alpha1, 4.0 (plan)

Type: Change Request Priority: Trivial
Reporter: Rostislav Palivoda Assignee: Andris Zeila
Resolution: Fixed Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File monitoring-problems-sort-after.png, PNG File monitoring-problems-sort-before.png, PDF File performance-measurements.pdf, PNG File widget-problems-sort-after.png, PNG File widget-problems-sort-before.png, XML File zbx_export_templates(1).xml
Issue Links:
Causes
causes ZBXNEXT-4792 Make {ITEM.LASTVALUE1} useful again i... Closed
causes ZBX-14344 Monitoring -> Problems Filter doesn't... Closed
causes ZBX-14359 Change macro in default action subjec... Closed
causes ZBX-15129 dbupgrade does not replace {TRIGGER.N... Closed
Team: Team A
Sprint: Sprint 17, Sprint 18, Sprint 19, Sprint 20, Sprint 21, Sprint 22, Sprint 23
Story Points: 8

 Description   

Currently, problem and event names are generated on the fly, both in the frontend and on the server side. This causes severe performance issues and makes it impossible to see historical information about problems, especially when a trigger name changes or contains macros. The proposal leads to a better separation of triggers and problems, improves performance (although the problem/events tables will grow larger), and preserves historical problem names.



 Comments   
Comment by Vitaly Zhuravlev [ 2017 Sep 18 ]

As a side effect, this would make it much easier to generate reports using external tools such as JasperReports.

Comment by Miks Kronkalns [ 2017 Oct 05 ]

Frontend and API:

Phase               SP
API changes         1
Frontend changes    1
Testing             1
Total               3
Comment by Andrea Biscuola (Inactive) [ 2017 Oct 16 ]

Resolved in svn://svn.zabbix.com/branches/dev/ZBXNEXT-4108 (server side)

The implementation spans commits r73305 to r73511 and, based on what was implemented, is divided into several "sections": the feature implementation (database patches and server changes), the introduction of the new macro EVENT.NAME, and other minor diffs.
Check the small, separate commits for an easy review.

The server was modified to store the new 'name' field in both the problem and events tables just before flushing a series of events to the database. Some modifications were made for the correct expansion of the new EVENT.NAME macro.
The main server-side changes are:

  • The database upgrade patch, which creates the two new columns and populates existing events and problems with default values.
  • The introduction of the EVENT.NAME macro, which expands either to the trigger name with macros expanded (in trigger-based notifications) or to an error message (for internal notifications). More details are in the specification.

To test:

  • Trigger-based and internal notifications and actions: check that EVENT.NAME expands properly for all types of messages (problem and recovery) as per the spec.
  • Database upgrade patch: check that an existing database can be upgraded properly (that means simulating a version upgrade). Trunk is not expected to work, as other database patches were implemented in the meantime.
  • Try to identify whether the expansion logic for EVENT.NAME is correct in the problem -> resolved flow for a trigger or internal event. Identify what value EVENT.NAME SHOULD have, as opposed to what it actually expands to.
Comment by Miks Kronkalns [ 2017 Oct 18 ]

Currently I use name = 'search query' for performance reasons. See the attached file (performance-measurements.pdf) for how much speed can be lost by changing it to LIKE.
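
The trade-off between an exact match and LIKE can be sketched on a toy table. This is only an illustration using Python's built-in sqlite3 module, not the database engines Zabbix runs on; the table layout, the index name events_name and the sample rows are assumptions made up for the example, and the actual measurements are in the attached PDF.

```python
import sqlite3

# Hypothetical miniature of the events table; only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventid INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE INDEX events_name ON events (name)")  # assumed index, for illustration
conn.executemany(
    "INSERT INTO events (eventid, name) VALUES (?, ?)",
    [
        (1, "Ping loss detected on web01"),
        (2, "Ping loss is too high on web01"),
        (3, "Processor load is too high on db01"),
    ],
)


def plan(sql, params):
    """Return the SQLite query plan detail strings for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


exact_sql = "SELECT eventid FROM events WHERE name = ?"
like_sql = "SELECT eventid FROM events WHERE name LIKE ?"

# An exact match finds only the identical name; LIKE finds every name containing the text.
exact_ids = [r[0] for r in conn.execute(exact_sql, ("Ping loss detected on web01",))]
like_ids = sorted(r[0] for r in conn.execute(like_sql, ("%Ping loss%",)))

# The exact match can seek in the index; a leading-wildcard LIKE has to scan.
exact_plan = plan(exact_sql, ("Ping loss detected on web01",))
like_plan = plan(like_sql, ("%Ping loss%",))

print(exact_ids, exact_plan)
print(like_ids, like_plan)
```

The exact match is cheaper but only finds names typed in full, which is the usability cost of the current choice.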

Frontend RESOLVED in ^/branches/dev/ZBXNEXT-4108 r73616, r73627, r74580

Comment by Miks Kronkalns [ 2017 Oct 18 ]

(1) No translation string changes.

iivs CLOSED

Comment by Rostislav Palivoda [ 2017 Nov 02 ]

Please test server side - wiper

Comment by Andris Zeila [ 2017 Nov 08 ]

(4) [S] Changes to the tables in the database upgrade must be split into separate patches - as small as possible. For example, DBpatch_3050000() should be split into two patches - one for the events table, the other for the problem table.

abs RESOLVED in 74357
Split the tables modification patch in two. One for problem and one for events

wiper CLOSED

Comment by Andris Zeila [ 2017 Nov 08 ]

(5) [S] Missing filter by object type (object=0) in

"update events set name='%s' where objectid=%d and source=%d"
"update problem set name='%s' where objectid=%d and source=%d"

abs RESOLVED in r74364

Added the additional filter by "object" for EVENT_OBJECT_TRIGGER (0)
for the events and problem update queries
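
Why the missing filter matters can be sketched with a toy table. This is a minimal illustration using Python's sqlite3 module, not the upgrade patch itself; EVENT_OBJECT_TRIGGER (0) comes from the comment above, while the other constant values, the table layout and the sample rows are assumptions for the example.

```python
import sqlite3

EVENT_OBJECT_TRIGGER = 0   # value stated in the comment above
EVENT_OBJECT_ITEM = 4      # assumed value, for illustration only
EVENT_SOURCE_INTERNAL = 3  # assumed value, for illustration only

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eventid INTEGER PRIMARY KEY, source INTEGER,"
    " object INTEGER, objectid INTEGER, name TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO events (eventid, source, object, objectid) VALUES (?, ?, ?, ?)",
    [
        (1, EVENT_SOURCE_INTERNAL, EVENT_OBJECT_TRIGGER, 100),  # event about trigger 100
        (2, EVENT_SOURCE_INTERNAL, EVENT_OBJECT_ITEM, 100),     # event about item 100
    ],
)


def update_names(with_object_filter):
    """Run the name update and return how many rows it touched."""
    conn.execute("UPDATE events SET name=''")  # reset between runs
    sql = "UPDATE events SET name=? WHERE objectid=? AND source=?"
    params = ["trigger name", 100, EVENT_SOURCE_INTERNAL]
    if with_object_filter:
        sql += " AND object=?"
        params.append(EVENT_OBJECT_TRIGGER)
    return conn.execute(sql, params).rowcount


rows_without_filter = update_names(with_object_filter=False)
rows_with_filter = update_names(with_object_filter=True)
print(rows_without_filter, rows_with_filter)
```

Without the object filter, the item event that happens to share the same objectid is renamed as well.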

wiper CLOSED

Comment by Andris Zeila [ 2017 Nov 08 ]

(6) [S] As all existing internal trigger events will have the same name they can be updated with one sql statement similarly to internal item events.

abs RESOLVED in r74368

Moved the internal trigger event and problem updates outside of the
loop. Added also the filter by object type that was missing
(EVENT_OBJECT_TRIGGER) for a better filtering.

wiper CLOSED

Comment by Andris Zeila [ 2017 Nov 08 ]

(7) DBbegin_multiple_update() must be used when using multiple update statements in one patch.

abs RESOLVED in r74391

Changed the multiple updates inside the while loops to use
DBbegin_multiple_update() as suggested. Tested with randomly
created events and problems for multiple triggers and for
internal events.

wiper CLOSED

Comment by Andris Zeila [ 2017 Nov 08 ]

(8) It might be better to explicitly set event names to empty strings instead of leaving them NULL and relying on NULL fields being converted to strings somewhere during the insertion process.

abs RESOLVED in r74393

Explicitly pass the empty string ("") to the add_event() calls that must not store
an error message in the database instead of NULL.
Also, documented how the error field of add_event must be used properly in
the comments section of the function description, for allowing other developers
to do the right thing.

wiper Passing empty error string and using it as name would work for internal OK events, though it might be a bit strange. However for discovery and autoregistration events event name would still be NULL. Maybe the simplest way would be using ZBX_NULL2EMPTY_STR() macro for event name in db_insert parameters.
REOPENED

abs RESOLVED in r74402 and r74405

Reverted the previous change to the original behavior in r74402 and use
ZBX_NULL2EMPTY_STR() during the zbx_db_insert_add_values() calls
for the insertion of events and problems.

wiper CLOSED

Comment by Andrea Biscuola (Inactive) [ 2017 Nov 09 ]

(9) All the update patches need to be split into separate chunks.

abs RESOLVED in r74390

All the database upgrades are performed one by one now in different
functions, for every table.
The main driver for this change was a discussion with wiper and also
some testing. While I was at it, it became crystal clear that a user facing
an issue during an upgrade has no easy way to point us to the right
path. Without this, debugging can prove really difficult. So, even though we
were thinking of waiting for sasha's opinion, I decided to commit
it to the branch.

wiper CLOSED

Comment by Andris Zeila [ 2017 Nov 09 ]

(10) DBpatch_3050006, DBpatch_3050007 must escape trigger name before updating events/problem table.

abs RESOLVED in r74400

Escape the descriptions through DBdyn_escape_string(), no size
problems here as all the involved database fields can contain up
to 2048 characters.
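
The need for escaping can be sketched as follows. This is an illustration in Python with sqlite3, not the C code of the patch; escape_sql_string() is a hypothetical stand-in that only doubles single quotes and is not claimed to match what DBdyn_escape_string() does, and the table and trigger name are made up.

```python
import sqlite3


def escape_sql_string(value):
    """Double single quotes, the standard SQL way to embed a quote in a literal."""
    return value.replace("'", "''")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO events VALUES (1, '')")

trigger_name = "Disk '/var' is full"  # hypothetical trigger name containing quotes

# Formatting the raw name into the statement produces broken SQL.
naive_sql = "UPDATE events SET name='%s' WHERE eventid=1" % trigger_name
try:
    conn.execute(naive_sql)
    naive_failed = False
except sqlite3.OperationalError:
    naive_failed = True

# Escaping the name first keeps the statement valid and the value intact.
escaped_sql = "UPDATE events SET name='%s' WHERE eventid=1" % escape_sql_string(trigger_name)
conn.execute(escaped_sql)
stored = conn.execute("SELECT name FROM events WHERE eventid=1").fetchone()[0]

print(naive_failed, stored)
```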

wiper Fixed memory leak. Not related to this issue, but also changed the default internal event names to match style of trigger/item error messages.
Please review r74432

abs Looks OK. CLOSED

Comment by Andris Zeila [ 2017 Nov 10 ]

Server side tested

Comment by Ivo Kurzemnieks [ 2017 Nov 23 ]

(18) [D] API documentation must be updated.

Miks.Kronkalns Updated API examples and object descriptions in:

RESOLVED

iivs CLOSED

Comment by Andrea Biscuola (Inactive) [ 2017 Dec 04 ]

Released in

  • pre-4.0.0alpha1 (trunk) r75329
Comment by Andrea Biscuola (Inactive) [ 2017 Dec 05 ]

vso

Please assign to whoever should verify the fixes

Comment by Andrey Melnikov [ 2017 Dec 05 ]

r75329 broke event descriptions - the problem widget now shows events with unresolved macros.

> select * from events where name like "%{%" order by clock desc limit 15;
+---------+--------+--------+----------+------------+-------+--------------+-----------+---------------------------------------+
| eventid | source | object | objectid | clock      | value | acknowledged | ns        | name                                  |
+---------+--------+--------+----------+------------+-------+--------------+-----------+---------------------------------------+
| 6064644 |      0 |      0 |    29304 | 1512495673 |     0 |            0 | 197491002 | Disc sg1 tempearture {ITEM.LASTVALUE} |
| 6064601 |      0 |      0 |    29304 | 1512493873 |     1 |            0 | 856409455 | Disc sg1 tempearture {ITEM.LASTVALUE} |
| 6064592 |      0 |      0 |    27446 | 1512492959 |     0 |            0 | 629279695 | Ping loss detected on {HOST.NAME}     |
| 6064591 |      0 |      0 |    26791 | 1512492953 |     0 |            0 | 499233859 | Ping loss detected on {HOST.NAME}     |
| 6064590 |      0 |      0 |    26795 | 1512492952 |     0 |            0 | 587624767 | Ping loss detected on {HOST.NAME}     |
| 6064589 |      0 |      0 |    27447 | 1512492899 |     0 |            0 | 573972479 | Ping loss is too high on {HOST.NAME}  |
| 6064588 |      0 |      0 |    26792 | 1512492893 |     0 |            0 | 240886016 | Ping loss is too high on {HOST.NAME}  |
| 6064587 |      0 |      0 |    26796 | 1512492892 |     0 |            0 | 526952880 | Ping loss is too high on {HOST.NAME}  |
| 6064582 |      0 |      0 |    27447 | 1512492599 |     1 |            0 | 861435755 | Ping loss is too high on {HOST.NAME}  |
| 6064583 |      0 |      0 |    27446 | 1512492599 |     1 |            0 | 861435755 | Ping loss detected on {HOST.NAME}     |
| 6064580 |      0 |      0 |    26792 | 1512492593 |     1 |            0 | 457567907 | Ping loss is too high on {HOST.NAME}  |
| 6064581 |      0 |      0 |    26791 | 1512492593 |     1 |            0 | 457567907 | Ping loss detected on {HOST.NAME}     |
| 6064578 |      0 |      0 |    26796 | 1512492592 |     1 |            0 | 836207547 | Ping loss is too high on {HOST.NAME}  |
| 6064579 |      0 |      0 |    26795 | 1512492592 |     1 |            0 | 836207547 | Ping loss detected on {HOST.NAME}     |
| 6064563 |      0 |      0 |    29304 | 1512490873 |     0 |            0 |  38660506 | Disc sg1 tempearture {ITEM.LASTVALUE} |
+---------+--------+--------+----------+------------+-------+--------------+-----------+---------------------------------------+
15 rows in set (0.43 sec)

And how are triggers with {ITEM.LASTVALUE} macros currently supposed to be shown in the web interface?

Comment by Andrea Biscuola (Inactive) [ 2017 Dec 06 ]

lynxchaus

When we started implementing this, there was a discussion on how
to handle the database upgrade. It was decided to store the names
of existing events without macros expanded.
The main reason is that expanding macros during a database upgrade
is not possible at the moment, as Zabbix is not fully up and running
with the needed resources at that point.

Comment by Andrey Melnikov [ 2017 Dec 06 ]

This change removed trigger description/event description expansion, and now all old events in the widget show as 'Disc sg1 tempearture ITEM.LASTVALUE'.
New events are stored with expanded macros ('Disc sg1 tempearture 40'), but this breaks the functionality of ITEM.LASTVALUE - the event always shows the first value (ITEM.VALUE), from when the trigger switched on.
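
The difference being described can be sketched in a few lines. This is a toy model in Python, not Zabbix's macro expansion: expand() is a hypothetical helper, and the values 40 and 55 are made up for the illustration.

```python
import re


def expand(template, values):
    """Replace {MACRO} tokens from a dict; unknown macros are left as they are."""
    return re.sub(
        r"\{([A-Z0-9_.]+)\}",
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )


description = "Disc sg1 temperature {ITEM.LASTVALUE}"

# At the moment the trigger fires the item value is 40; the name is stored once.
stored_name = expand(description, {"ITEM.LASTVALUE": 40})

# Later the item value changes to 55.
rendered_later = expand(description, {"ITEM.LASTVALUE": 55})  # old on-the-fly behavior
unresolved = expand(description, {})  # what a name stored without expansion looks like

print(stored_name)
print(rendered_later)
print(unresolved)
```

A stored name keeps the value from the time of the event, while on-the-fly expansion follows the latest value, which is the behavior change reported here.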

Comment by Andrea Biscuola (Inactive) [ 2017 Dec 06 ]

lynxchaus

Regarding the change in behavior of ITEM.LASTVALUE, I checked it
and indeed it was not completely correct.
It's fixed in the development branch and will be merged into trunk once
verified.

Thanks for pointing it out.

Comment by Andris Zeila [ 2017 Dec 11 ]

(29) [S] The post database upgrade event/problem name update is implemented in svn://svn.zabbix.com/branches/dev/ZBXNEXT-4108_2

It basically supersedes the server fixes in svn://svn.zabbix.com/branches/dev/ZBXNEXT-4108. I will review and port any relevant commits to the new branch shortly.

Some rough performance data - 1m events were converted in 2m 20s. During conversion, 17 MB of shared memory was used to cache historical (uint64) data.

vso CLOSED

Comment by Andris Zeila [ 2017 Dec 12 ]

Released in:

  • pre-4.0.0alpha1 r75720
Comment by Andrey Melnikov [ 2017 Dec 12 ]

In real life, upgrading the tables takes AGES.
For example:

 2525:20171212:175227.373 completed 21% of event name update
  2525:20171212:175227.373 In substitute_simple_macros() data:'Ping loss ({#ITEM.VALUE})', type=16
  2525:20171212:175227.373 End substitute_simple_macros() data:'Ping loss ({#ITEM.VALUE})'
  2525:20171212:175227.373 query [txnlev:1] [select eventid,source,object,objectid,clock,value,acknowledged,ns,name from events where source=0 and object=0 and objectid=26613 order by eventid]
  2525:20171212:175227.387 In substitute_simple_macros() data:'Ping loss ({ITEM.VALUE})', type=16
  2525:20171212:175227.387 In DBitem_value()
  2525:20171212:175227.387 In get_N_itemid() expression:'({TRIGGER.VALUE}=0 and {51863}>33) or ({TRIGGER.VALUE}=1 and {51864}>0)' N_functionid:1
  2525:20171212:175227.387 End of get_N_itemid():SUCCEED
  2525:20171212:175227.387 query [txnlev:1] [select value_type,valuemapid,units from items where itemid=102185]
  2525:20171212:175227.387 In zbx_vc_get_value() itemid:102185 value_type:0 timestamp:1456407361.374212810
  2525:20171212:175227.387 In zbx_history_get_values() itemid:102185 value_type:0 start:1456407360 count:0 end:1513090347
  2525:20171212:175227.387 query [txnlev:1] [select clock,ns,value from history where itemid=102185 and clock>1456407360 and clock<=1513090347]
  2525:20171212:175507.765 End of zbx_history_get_values():SUCCEED values:268761
  2525:20171212:175507.790 In zbx_history_get_values() itemid:102185 value_type:0 start:0 count:1 end:1456407360
  2525:20171212:175507.790 query [txnlev:1] [select clock,ns,value from history where itemid=102185 and clock>0 and clock<=1456407360 order by clock desc limit 1]
  2525:20171212:175507.806 End of zbx_history_get_values():SUCCEED values:0
  2525:20171212:175507.806 In zbx_history_get_values() itemid:102185 value_type:0 start:1513082760 count:0 end:1513082761
  2525:20171212:175507.806 query [txnlev:1] [select clock,ns,value from history where itemid=102185 and clock=1513082761]
  2525:20171212:175507.807 End of zbx_history_get_values():SUCCEED values:1
  2525:20171212:175507.835 End of zbx_vc_get_value():FAIL cache_used:1
  2525:20171212:175507.835 End of DBitem_value():FAIL
  2525:20171212:175507.835 cannot resolve macro '{ITEM.VALUE}'
  2525:20171212:175507.835 End substitute_simple_macros() data:'Ping loss (*UNKNOWN*)'
  2525:20171212:175507.835 In zbx_vc_clean()
  2525:20171212:175507.835 End of zbx_vc_clean()
  2525:20171212:175507.835 In substitute_simple_macros() data:'Ping loss ({#ITEM.VALUE})', type=16
  2525:20171212:175507.835 End substitute_simple_macros() data:'Ping loss ({#ITEM.VALUE})'
MariaDB [zabbix]> select eventid,source,object,objectid,clock,value,acknowledged,ns,name from events where source=0 and object=0 and objectid=26613 order by eventid;
+---------+--------+--------+----------+------------+-------+--------------+-----------+--------------------------+
| eventid | source | object | objectid | clock      | value | acknowledged | ns        | name                     |
+---------+--------+--------+----------+------------+-------+--------------+-----------+--------------------------+
| 5303507 |      0 |      0 |    26613 | 1456407361 |     0 |            0 | 374212810 | Ping loss ({ITEM.VALUE}) |
+---------+--------+--------+----------+------------+-------+--------------+-----------+--------------------------+
1 row in set (0.02 sec)

There is one event in the table, but the server fetches ALL values from the history table (268761) - what for?
The server runs with events housekeeping disabled and data housekeeping enabled (365 days). I think the zbx_history_get_values() logic is totally broken.

vso Thank you for your report. So it was caching 656 days of history to calculate the item value at the time of the event, and it took 3 minutes. This does not look good; the issue also looks similar to ZBX-13152.
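
The figures in this reply can be rechecked from the log excerpt above. The small Python calculation below uses only the clock bounds of the history query and the timestamps of the log lines around it; the variable names are made up for the illustration.

```python
from datetime import datetime

# Bounds of the history query: "clock>1456407360 and clock<=1513090347".
range_start = 1456407360
range_end = 1513090347
cached_days = (range_end - range_start) / 86400  # seconds per day

# Log timestamps (HHMMSS.mmm) just before and just after the history query.
query_started = datetime.strptime("175227.387", "%H%M%S.%f")
query_finished = datetime.strptime("175507.765", "%H%M%S.%f")
query_seconds = (query_finished - query_started).total_seconds()

print("history range: %.1f days" % cached_days)
print("query time: %.3f s" % query_seconds)
```

The range works out to roughly 656 days, and the query took about 2 minutes 40 seconds, in line with the rounded "3 minutes".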

Comment by Andrey Melnikov [ 2017 Dec 12 ]

so it was caching 656 days of history to calculate item value at the time of the event and it took 3 minutes, this does not look good

Standard rotational SATA drives in a RAID-1 set.
The main problem is the too optimistic (read - foolish) assumption in vch_item_cache_value(): if an item is not present in the cache, it attempts to cache it from the requested time up to now. This works in the normal situation, when all requested data is close to time(NULL), and breaks when it is not.
The second problem is the cache itself.
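
The cost of that assumption can be sketched with a toy model. This is not the value cache code and not the hack mentioned below; both functions are hypothetical and only count how many history rows each strategy would have to read for a value requested at a given time.

```python
from bisect import bisect_right


def rows_loaded_range_to_now(history_clocks, requested, now):
    """Rows read when caching everything in the range (requested, now]."""
    return bisect_right(history_clocks, now) - bisect_right(history_clocks, requested)


def rows_loaded_point_lookup(history_clocks, requested):
    """Rows read when fetching only the newest value at or before `requested`."""
    return 1 if bisect_right(history_clocks, requested) > 0 else 0


# Hypothetical history: one value per minute, 1000 values in total.
history_clocks = list(range(0, 1000 * 60, 60))
now = history_clocks[-1]

old_event = history_clocks[10]      # event far in the past
recent_event = history_clocks[-3]   # event close to "now"

old_range_rows = rows_loaded_range_to_now(history_clocks, old_event, now)
recent_range_rows = rows_loaded_range_to_now(history_clocks, recent_event, now)
old_point_rows = rows_loaded_point_lookup(history_clocks, old_event)

print(old_range_rows, recent_range_rows, old_point_rows)
```

For an event close to the current time the range is tiny, but for an old event it covers almost the whole history, which matches the 268761 values fetched in the log above.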

I slightly hacked the value cache, and the upgrade process on the same database took:

  4579:20171212:222245.160 query [txnlev:0] [select taskid from task where type=5 and status=1]
  4579:20171212:222245.179 query [txnlev:1] [begin;]
  4579:20171212:222245.180 starting event name update forced by database upgrade
  4579:20171212:222245.180 query [txnlev:1] [select count(*) from triggers]
  4579:20171212:222245.181 query [txnlev:1] [select triggerid,description,expression,priority,comments,url,recovery_expression,recovery_mode,value from triggers order by triggerid]
  4579:20171212:222245.186 In substitute_simple_macros() data:'Processor load is too high on {HOST.NAME}', type=16
.....
  4579:20171212:222314.374 event name update completed
  4579:20171212:222314.374 query [txnlev:1] [delete from task where taskid=1]
  4579:20171212:222314.602 query [txnlev:1] [commit;]

30 seconds.

Comment by Vladislavs Sokurenko [ 2017 Dec 12 ]

That's great, what did you do ?

Comment by Andris Zeila [ 2017 Dec 13 ]

This work in normal situation, when all requested data near to time(NULL), and break - if not.

Yes, this is a known design flaw. Normally it works okayish, but it can cause problems (mostly wasted memory) with large timeshift ranges in trigger functions.

The cache was used to speed up processing of subsequent events of the same trigger. However, in hindsight such a situation (when one trigger generates enough events to justify value caching) is quite rare, and it would be better to turn the value cache off (which can be done manually in the configuration file for now).

Comment by Andris Zeila [ 2017 Dec 14 ]

Did some 'stress' testing with 1m events, 1m history and a trigger having 2 functions and a description containing the {ITEM.VALUE1} and {ITEM.VALUE2} macros. The event/problem update took ~10 minutes (i7 CPU, SSD).

Comment by Andris Zeila [ 2017 Dec 14 ]

Released in:

  • pre-4.0.0alpha1 r75866

Note that the previous release incorrectly expanded {ITEM.VALUEN} macros for N>1. If somebody has already applied the update and wants to have the event/problem names recalculated, it can be forced with the following steps:

  1. stop the server
  2. manually add a post-initialization task to the database ( insert into task values(1,5,1,0,0,null); )
  3. start the server again
Comment by MATSUDA Daiki [ 2018 Oct 16 ]

The documentation has a typo.

https://www.zabbix.com/documentation/4.0/manual/introduction/whatsnew400#problem_name_generation

Now problem and event names are stored directly in the event and problem tables at the moment when an

The correct wording is 'events and problem tables'.

Miks.Kronkalns Thank you! I have fixed it.

Generated at Thu Apr 18 08:01:24 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.