[ZBXNEXT-3051] Count of actions has a significant impact on event processing Created: 2015 Nov 23  Updated: 2016 Apr 01  Resolved: 2016 Jan 19

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Server (S)
Affects Version/s: 2.2.10
Fix Version/s: 3.0.0beta1

Type: Change Request Priority: Major
Reporter: Marc Assignee: Unassigned
Resolution: Fixed Votes: 14
Labels: actions, conditions, history, performance, synchronization
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by ZBX-9532 VMware connection issues may apparent... Closed

 Description   

The number of actions, or rather of action conditions, can have a significant impact on the performance of history syncers.

Simplified illustration of how events are processed in connection with actions:

for each event do
    for each action do
        for each action condition do
            select_trigger_condition()

This process flow involves quite a few database queries.
While this is unlikely to be an issue in small Zabbix environments, it may become a disaster in larger ones.
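
The nested loop above can be sketched in Python to show how the number of database round trips grows. This is only an illustration: `FakeDb` and `process_events` are hypothetical names, not Zabbix code, and the sketch assumes the worst case in which every condition check costs exactly one query.

```python
# Hypothetical sketch: counts the database round trips made by the nested loop.
# None of these names come from the Zabbix source; the "database" is a counter.

class FakeDb:
    """Stands in for the database and only counts the queries it receives."""

    def __init__(self):
        self.queries = 0

    def select(self, _sql):
        self.queries += 1
        return []


def process_events(db, n_events, actions):
    """actions: one list of condition ids per action."""
    for _event in range(n_events):
        for conditions in actions:
            for condition in conditions:
                # Assumption: one query per condition check
                db.select(f"select ... /* condition {condition} */")
    return db.queries


# ~100 actions with ~1K conditions in total, as in the description
actions = [list(range(10)) for _ in range(100)]
print(process_events(FakeDb(), n_events=4, actions=actions))  # 4000
```

Under these assumptions the query count is the product of events, actions, and conditions per action, so a handful of extra events per second translates into thousands of extra queries per second.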

On a Zabbix installation with:

  • a base line of ~1.8K NVPS,
  • having ~100 actions
  • with ~1K conditions

I was able to bring the system out of service in a few minutes by just adding 4 events per second in addition to the ambient noise.

In contrast, after disabling all actions I was no longer able to affect the service, not even by generating many times more events per second.

How about caching most relevant information needed for action processing in memory of Zabbix server?



 Comments   
Comment by Oleksii Zagorskyi [ 2015 Nov 23 ]

Just fyi ZBX-4357

Comment by Marc [ 2015 Nov 25 ]

Some statistics from a 50-second system call trace of one history syncer affected by this issue:

  • 16,808 database queries in total.
  • 2.1229ms query time on average.
  • 10.974ms maximum query time.
  • 0.185ms minimum query time.
  • 35.6813 seconds total query time.

The query count was derived from the number of sendto(select ...) system calls.
The query time was derived from the duration of the poll() system call that follows each sendto(select ...) system call.
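
As a quick consistency check of the figures above, the total query time should be roughly the query count times the average query time. The function name below is made up for this illustration; the inputs are the numbers quoted in this comment.

```python
def query_time_share(queries, avg_ms, window_s):
    """Return (total query time in seconds, fraction of the trace window)."""
    total_s = queries * avg_ms / 1000.0
    return total_s, total_s / window_s


total_s, share = query_time_share(queries=16_808, avg_ms=2.1229, window_s=50)
print(f"{total_s:.2f} s in queries, {share:.0%} of the trace window")
# 35.68 s in queries, 71% of the trace window
```

The small difference from the reported 35.6813 seconds comes from the average being rounded to four decimals.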

Edit:
For the sake of completeness, here is the summary report of time, calls, and errors for each system call in the very same trace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.48    0.132029           8     17085           poll
  4.65    0.006706           0     17122           sendto
  2.19    0.003155           0     17086           recvfrom
  1.68    0.002430           0     91964           semop
  0.00    0.000000           0        38           write
  0.00    0.000000           0        38           open
  0.00    0.000000           0        38           close
  0.00    0.000000           0        42           stat
  0.00    0.000000           0        38           fstat
  0.00    0.000000           0        38           mmap
  0.00    0.000000           0        38           munmap
  0.00    0.000000           0         6           rt_sigaction
  0.00    0.000000           0        12           rt_sigprocmask
  0.00    0.000000           0         5           nanosleep
  0.00    0.000000           0        11           times
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.144320                143562           total
Comment by Raymond Kuiper [ 2015 Dec 02 ]

Perhaps an 'actions cache' might be a solution?

Comment by Marc [ 2015 Dec 02 ]

To me this does not necessarily have to be a dedicated cache. From my point of view this is configuration data, and the existing configuration cache seems to be an appropriate place for it.

I don't suspect the number of actions to be the major or only issue, but rather the number of conditions and the fact that most condition checks result in SQL queries.
See, for example, the select_trigger_condition() function mentioned in the description.

Ultimately, I think all related information (actions + conditions) that currently leads to SQL queries during history synchronization should be cached.

Comment by Oleksii Zagorskyi [ 2015 Dec 02 ]

Personally, I'm not sure that an internal cache for actions would help much here.
The current config cache is useful because in big installations it holds data that amounts to many gigabytes (in terms of DB tables).

I don't think the tables related to actions are that large, even with ~100 actions etc.
So, IMO, most of the optimization can come from bulk processing of actions and from code optimization.

Comment by Marc [ 2015 Dec 02 ]

zalex_ua, the issue is not related to the payload or size of the configuration data, which is indeed very small.
I see the problem rather in issuing repetitive SQL queries to that extent.

I mean, just do the math:
Here a single SQL query takes 2ms, which I'd consider quite fast for:

  • sending the query over the network to the database
  • letting the query optimizer analyze the query
  • evaluating the result set
  • receiving the result set over the network

Now such a query was issued ~16,000 times in 50 seconds, which adds up to ~32 seconds spent just on SQL queries.
I'd expect querying this kind of information from memory to be an order of magnitude faster than 2ms each.

Comment by Marc [ 2015 Dec 19 ]

Btw, when proposing "[...] caching most relevant information needed for action processing in memory [...]",
I also had the information needed by select_trigger_condition() in mind. This is probably not that obvious from the issue description.

In fact, the database queries made there are the most time-consuming part of the whole chain during history synchronization, or more precisely action processing.

Consider a scenario with only 100 actions and 1000 conditions in total, i.e. 10 conditions per action on average for simplicity. The SQL queries issued for a single event may then be distributed over the event processing chain like this:

  • 1 SQL query to get enabled actions on an event
  • 100 SQL queries to get the conditions of the actions
  • 700 SQL queries to evaluate action conditions
    (assuming 3 non-SQL related conditions per action)
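
The breakdown above can be written out as a small calculation. The function name is hypothetical, and the time estimate assumes the ~2ms per query mentioned earlier in this thread and strictly sequential queries.

```python
def queries_per_event(actions, conditions_per_action, non_sql_conditions_per_action):
    """Estimate SQL queries issued for one event, following the breakdown above."""
    get_actions = 1              # one query to get the enabled actions
    get_conditions = actions     # one query per action to get its conditions
    evaluate = actions * (conditions_per_action - non_sql_conditions_per_action)
    return get_actions + get_conditions + evaluate


per_event = queries_per_event(100, 10, 3)
print(per_event)  # 801

# Assumption: ~2 ms per query, executed one after another
print(f"~{per_event * 2 / 1000:.1f} s of query time per event")  # ~1.6 s
```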

It's definitely not my intention to say that caching only actions and conditions is not worthwhile!
But if there is a chance to also cache the information needed to avoid at least some more database queries during condition evaluation, then this may improve performance significantly further.

Personally, I'd rank the condition types from most to least worth considering for caching as follows:

  1. CONDITION_TYPE_MAINTENANCE
    expected to be part of almost every action
  2. CONDITION_TYPE_HOST_GROUP
    likely the main criterion for differentiating between teams and responsibilities
  3. CONDITION_TYPE_TRIGGER
    possibly a major criterion for pinning dedicated triggers (most likely on template level)
  4. CONDITION_TYPE_HOST_TEMPLATE
  5. CONDITION_TYPE_APPLICATION
  6. CONDITION_TYPE_HOST
  7. CONDITION_TYPE_EVENT_ACKNOWLEDGED
    actually well worth considering, but expected to be rather difficult to implement
Comment by Andris Zeila [ 2016 Jan 07 ]

I created ZBXNEXT-3086 regarding action condition evaluation. This issue will deal only with caching actions and action conditions in the configuration cache for action processing.

Comment by Andris Zeila [ 2016 Jan 11 ]

Specifications at https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-3051

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3051

Comment by Sandis Neilands (Inactive) [ 2016 Jan 15 ]

(1) In the configuration cache for actions we save only the actionid, eventsource, evaltype, formula columns from the actions table. Fetching the rest of the columns is not necessary.

sandis.neilands RESOLVED in r57682.

wiper CLOSED
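
A minimal sketch of what such a cache could look like, keeping only the four fields named in (1). This is not the Zabbix implementation: `CachedAction`, `ActionCache`, `sync`, and `actions_for` are hypothetical names, and filtering by action status is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedAction:
    # Only the fields named in (1) are kept in memory
    actionid: int
    eventsource: int
    evaltype: int
    formula: str


class ActionCache:
    """Hypothetical in-memory cache of actions, grouped by event source."""

    def __init__(self):
        self._by_source = {}

    def sync(self, rows):
        """Rebuild the cache from database rows given as dicts."""
        by_source = {}
        for row in rows:
            action = CachedAction(
                row["actionid"], row["eventsource"], row["evaltype"], row["formula"]
            )
            by_source.setdefault(action.eventsource, []).append(action)
        self._by_source = by_source  # swap in the new snapshot

    def actions_for(self, eventsource):
        """Return cached actions for an event source without touching the database."""
        return list(self._by_source.get(eventsource, []))


cache = ActionCache()
cache.sync([
    {"actionid": 3, "eventsource": 0, "evaltype": 0, "formula": "", "name": "unused"},
    {"actionid": 7, "eventsource": 2, "evaltype": 3, "formula": "A and B", "name": "unused"},
])
print([a.actionid for a in cache.actions_for(0)])  # [3]
```

In a design like this, changes to actions only become visible to event processing after the next call to `sync`.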

Comment by Sandis Neilands (Inactive) [ 2016 Jan 15 ]

(2) When documenting, don't forget to mention the effect of the CacheUpdateFrequency configuration parameter.

wiper CLOSED.

Comment by Sandis Neilands (Inactive) [ 2016 Jan 18 ]

Successfully tested.

  • 350 hosts each with 3 items.
  • Add trigger to the template.
  • 3 actions match all events generated from the trigger (e.g. 350 * events when the trigger fires).

The performance is still limited by DB access elsewhere (see ZBXNEXT-3086).

Comment by Andris Zeila [ 2016 Jan 19 ]

Released in:

  • pre-3.0.0beta1 r57792
Comment by Andris Zeila [ 2016 Jan 19 ]

(3) Documentation:

sasha CLOSED

Comment by richlv [ 2016 Jan 19 ]

should the 'alpha' above be changed to 'beta' now?

Comment by Oleksii Zagorskyi [ 2016 Apr 01 ]

It caused a regression, see ZBX-10608.
