[ZBX-10547] Incorrect SLA calculation Created: 2016 Mar 17  Updated: 2024 Apr 10  Resolved: 2017 Oct 10

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.11, 2.4.7
Fix Version/s: 3.0.12rc1, 3.2.9rc1, 3.4.3rc1, 4.0.0alpha1, 4.0 (plan)

Type: Problem report Priority: Blocker
Reporter: Alexey Pustovalov Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 0
Labels: itservices
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File services.png    
Issue Links:
Duplicate
is duplicated by ZBX-6463 Zabbix IT Services shows incorrect SLA Closed
is duplicated by ZBX-5930 Incorrect SLA calculation Closed
is duplicated by ZBX-11615 Sequential order in SERVICEALARMS tab... Closed
is duplicated by ZBX-12147 SLAs for some ITServices wrong Closed
Team: Team A
Sprint: Sprint 17, Sprint 18
Story Points: 4

 Description   

If a service has several child services with linked triggers, then it is possible to end up with several consecutive PROBLEM records in the service_alarms table. This can happen when these triggers switch to the PROBLEM state within a short time period.



 Comments   
Comment by Aleksandrs Saveljevs [ 2016 Nov 14 ]

Could you please share a bit more information? For instance, how is this service configured? What SLA algorithm does it use? Do you have a reliable way of reproducing the problem? If you did any investigation, is the fact that service_alarms are out of order significant?

Comment by Alexey Pustovalov [ 2016 Nov 14 ]

1. "Problem, if at least one child has a problem" algorithm.
2. Many triggers + many actions = history syncers spend a long time processing data.
3. Yes, these service_alarms are not closed afterwards.

Comment by Aleksandrs Saveljevs [ 2016 Nov 14 ]

Does it happen with multiple history syncers only? That is, is it known whether these alarms were generated by a single history syncer or different ones?

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

According to dotneft, these alarms were generated by different history syncers:

| 2016-03-10 14:22:09     |          41052 |        50 | 1457612529 |     0 |
| 2016-03-10 14:24:08     |          41056 |        50 | 1457612648 |     5 |
| 2016-03-10 14:25:17     |          41062 |        50 | 1457612717 |     0 |
| 2016-03-10 14:26:56     |          41066 |        50 | 1457612816 |     5 |
| 2016-03-10 14:37:04     |          41070 |        50 | 1457613424 |     0 |
| 2016-03-15 09:02:15     |          41305 |        50 | 1458025335 |     5 | <--
| 2016-03-15 09:02:17     |          41314 |        50 | 1458025337 |     5 | <--
| 2016-03-15 09:02:18     |          41296 |        50 | 1458025338 |     5 | <--
| 2016-03-15 09:02:20     |          41288 |        50 | 1458025340 |     5 | <--
| 2016-03-17 11:49:42     |          41397 |        50 | 1458208182 |     0 |
+-------------------------+----------------+-----------+------------+-------+

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

Do these extra alarms cause any practical issues, or do we simply not want to have extra consecutive rows with the same "value" in the database, if possible? Note that we can legitimately have consecutive rows with different values for "value" in case triggers of different severities become PROBLEM.

Comment by Alexey Pustovalov [ 2016 Nov 15 ]

Such records are not closed afterwards, only one of them is, if I remember correctly.

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

Hm, what does it mean to "close a record"? In the example above, doesn't servicealarmid=41397 "close" all rows with 41305, 41314, 41296, and 41288? It should not matter whether they have the same "value" or a different one.

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

Documenting the current conjecture on the cause: we calculate IT services inside DBupdate_itservices(), which is called from process_events(), which is called from DCsync_history(). Inside DCsync_history(), it is called outside of a history cache lock, but inside a transaction. Therefore, multiple history syncers can be updating the same non-leaf IT services simultaneously (but not the same leaf services due to trigger locking) and they will not see each other's changes, depending on transaction visibility settings.
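The conjectured race can be illustrated with a minimal sketch (Python, not the actual Zabbix C code): two history syncers each take a snapshot of the parent service state at transaction start, so neither sees the other's in-flight update, and both append a PROBLEM (value=5) row to service_alarms for the same parent service. The function and variable names are illustrative only.

```python
# Minimal sketch of the conjectured race: each "syncer" recomputes the
# parent service status from the snapshot it saw at transaction start.

PROBLEM, OK = 5, 0

def syncer_updates(snapshot_status, trigger_status, alarms):
    """One syncer's transaction: recompute the parent from its snapshot.

    A row is written whenever the recomputed status differs from the
    status seen at transaction start (`snapshot_status`).
    """
    new_status = PROBLEM if trigger_status == PROBLEM else snapshot_status
    if new_status != snapshot_status:
        alarms.append(new_status)        # INSERT INTO service_alarms ...
    return new_status

alarms = []          # service_alarms rows for the common parent service
committed = OK       # parent status before either transaction commits

# Both syncers start their transactions before either commits, so both
# see the same snapshot (OK) and both insert a PROBLEM row.
syncer_updates(committed, PROBLEM, alarms)   # trigger A fires
syncer_updates(committed, PROBLEM, alarms)   # trigger B fires concurrently

print(alarms)   # [5, 5] -- two consecutive PROBLEM rows, as in the report
```

This matches the table above: several value=5 rows written within seconds of each other by different history syncers.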

Comment by Alexey Pustovalov [ 2016 Nov 15 ]

I think it does not close all of them, because the final report shows incorrect results.

Comment by Aleksandrs Saveljevs [ 2016 Nov 21 ]

sasha had an idea that we can make history syncers see each other's uncommitted changes to IT services, but it does not seem to be possible in PostgreSQL (https://www.postgresql.org/docs/9.3/static/sql-set-transaction.html ):

The SQL standard defines one additional level, READ UNCOMMITTED. In PostgreSQL READ UNCOMMITTED is treated as READ COMMITTED.

In MySQL, there also does not seem to be any guarantee that we will immediately see uncommitted changes (http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html ):

READ UNCOMMITTED

SELECT statements are performed in a nonlocking fashion, but a possible earlier version of a row might be used.

Comment by Aleksandrs Saveljevs [ 2016 Nov 23 ]

It seems that the problem is even simpler. While we could try to protect against parallel updating of IT services, there is still the problem of delayed data (e.g., from proxies). As the current IT service processing goes, the service state from delayed data will still be computed on top of the current service state, also resulting in out-of-order entries.
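A sketch of the delayed-data case (again illustrative Python, not server code): alarms are appended in processing order, so a delayed event, e.g. one arriving late from a proxy, lands after later events, leaving service_alarms out of clock order even with no concurrency at all.

```python
# Current behaviour sketched: service_alarms is append-only in processing
# order, so a delayed event breaks clock ordering.

service_alarms = []   # list of (clock, value) rows

def process_event(clock, value):
    service_alarms.append((clock, value))   # append in arrival order

process_event(1458025335, 5)   # live event
process_event(1458025340, 0)   # live event
process_event(1458025337, 5)   # delayed proxy event arrives last

clocks = [c for c, _ in service_alarms]
print(clocks == sorted(clocks))   # False -- entries are out of clock order
```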

Comment by Aleksandrs Saveljevs [ 2016 Nov 23 ]

As another idea, we could try updating only leaf services and compute non-leaf service states in the frontend at runtime (i.e., during report generation).

This, however, has the downside that if the service configuration changes (e.g., a leaf service gets attached to a different service), then the historical state representation becomes inaccurate. So we could try limiting how the service configuration is allowed to change, or storing service link validity information.

For instance, suppose at time point t=0 we have the following configuration:

A
-- B
-- C
D

Then, at time point t=5 we change it to the following:

A
-- B
D
-- C

The contents of "service_links" table then become:

A - B [0, ...]
A - C [0, 5]
D - C [5, ...]

Old links can then be removed by the housekeeper.

However, the above approach does not version the "services.algorithm" field, for instance, but it may allow us to solve the delayed data problem.
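The versioned-links idea above can be sketched as follows (hypothetical Python; the column names valid_from/valid_to are illustrative, not an actual schema): each link carries a validity interval, and the service tree at any time point t is reconstructed by filtering on that interval.

```python
# Sketch of the proposed versioned "service_links" table from the
# example above: A-B and A-C at t=0, then C re-attached to D at t=5.

INF = float("inf")

# (parent, child, valid_from, valid_to)
service_links = [
    ("A", "B", 0, INF),
    ("A", "C", 0, 5),     # link closed when C was re-attached at t=5
    ("D", "C", 5, INF),
]

def children_at(parent, t):
    """Children linked to `parent` at time point t."""
    return sorted(child for p, child, t0, t1 in service_links
                  if p == parent and t0 <= t < t1)

print(children_at("A", 3))   # ['B', 'C'] -- configuration before the change
print(children_at("A", 7))   # ['B']
print(children_at("D", 7))   # ['C']     -- C now belongs to D
```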

Comment by Aleksandrs Saveljevs [ 2016 Nov 23 ]

Another idea by sasha: do not populate "service_alarms" at all, but compute IT service states based on "events".

Comment by Andris Zeila [ 2017 Oct 06 ]

Successfully tested

Comment by Vladislavs Sokurenko [ 2017 Oct 06 ]

Fixed in:

  • pre-3.0.12rc1 r73261
  • pre-3.2.9rc1 r73263
  • pre-3.4.3rc1 r73265
  • pre-4.0.0alpha1 (trunk) r73266

Comment by Vladislavs Sokurenko [ 2017 Oct 06 ]

Fixed IT service calculation where parallel transactions did not see each other's changes when calculating a common parent service.

While SLA calculation is guarded by a semaphore, different history syncers still would not see each other's changes to the IT service tree, because their transactions run simultaneously. This is fine for unrelated triggers, but causes problems when those triggers have a common parent, as the parent IT service cannot be calculated correctly.

SLA calculation has been moved to a separate transaction, so that while the semaphore is locked there can be only one transaction updating the IT service tree.
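The effect of the fix can be sketched like this (simplified Python, not the actual C server code; the lock stands in for the IT services semaphore): the whole read-recompute-write of the parent state happens inside one critical section, in its own transaction, so the second syncer always sees the first syncer's committed result and writes no duplicate row.

```python
# Sketch of the fixed behaviour: the update runs entirely under the
# semaphore, so concurrent syncers are serialized and see committed state.

import threading

lock = threading.Lock()          # stands in for the IT services semaphore
service_alarms = []
parent = {"status": 0}           # committed parent state (0 = OK)

def update_parent(new_child_status):
    # BEGIN ... COMMIT now happens entirely while the semaphore is held.
    with lock:
        if new_child_status != parent["status"]:
            parent["status"] = new_child_status
            service_alarms.append(new_child_status)

# Two concurrent syncers both report PROBLEM (5) for children of the
# same parent; only the first to take the lock inserts a row.
threads = [threading.Thread(target=update_parent, args=(5,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(service_alarms)   # [5] -- a single PROBLEM row
```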

Generated at Fri Apr 26 02:17:01 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.