[ZBX-10547] Incorrect SLA calculation Created: 2016 Mar 17  Updated: 2024 Apr 10  Resolved: 2017 Oct 10

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.11, 2.4.7
Fix Version/s: 3.0.12rc1, 3.2.9rc1, 3.4.3rc1, 4.0.0alpha1, 4.0 (plan)

Type: Problem report Priority: Blocker
Reporter: Alexey Pustovalov Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 0
Labels: itservices
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File services.png    
Issue Links:
Duplicate
is duplicated by ZBX-6463 Zabbix IT Services shows incorrect SLA Closed
is duplicated by ZBX-5930 Incorrect SLA calculation Closed
is duplicated by ZBX-11615 Sequential order in SERVICEALARMS tab... Closed
is duplicated by ZBX-12147 SLAs for some ITServices wrong Closed
Team: Team A
Sprint: Sprint 17, Sprint 18
Story Points: 4

 Description   

If a service has several child services with linked triggers, then it is possible to end up with several consecutive PROBLEM records in the service_alarms table. This can happen when these triggers switch to the PROBLEM state within a short time period.



 Comments   
Comment by Aleksandrs Saveljevs [ 2016 Nov 14 ]

Could you please share a bit more information? For instance, how is this service configured? What SLA algorithm does it use? Do you have a reliable way of reproducing the problem? If you did any investigation, is the fact that service_alarms are out of order significant?

Comment by Alexey Pustovalov [ 2016 Nov 14 ]

1. "Problem, if at least one child has a problem" algorithm.
2. Many triggers + many actions = history syncers spend a long time processing data.
3. Yes, these service_alarms are not closed afterwards.

Comment by Aleksandrs Saveljevs [ 2016 Nov 14 ]

Does it happen with multiple history syncers only? That is, is it known whether these alarms were generated by a single history syncer or different ones?

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

According to dotneft, these alarms were generated by different history syncers:

| 2016-03-10 14:22:09     |          41052 |        50 | 1457612529 |     0 |
| 2016-03-10 14:24:08     |          41056 |        50 | 1457612648 |     5 |
| 2016-03-10 14:25:17     |          41062 |        50 | 1457612717 |     0 |
| 2016-03-10 14:26:56     |          41066 |        50 | 1457612816 |     5 |
| 2016-03-10 14:37:04     |          41070 |        50 | 1457613424 |     0 |
| 2016-03-15 09:02:15     |          41305 |        50 | 1458025335 |     5 | <--
| 2016-03-15 09:02:17     |          41314 |        50 | 1458025337 |     5 | <--
| 2016-03-15 09:02:18     |          41296 |        50 | 1458025338 |     5 | <--
| 2016-03-15 09:02:20     |          41288 |        50 | 1458025340 |     5 | <--
| 2016-03-17 11:49:42     |          41397 |        50 | 1458208182 |     0 |
+-------------------------+----------------+-----------+------------+-------+

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

Do these extra alarms cause any practical issues, or do we simply not want to have extra consecutive rows with the same "value" in the database, if possible? Note that we can legitimately have consecutive rows with different values for "value" in case triggers of different severities become PROBLEM.

Comment by Alexey Pustovalov [ 2016 Nov 15 ]

Such records are not closed afterwards, only one of them is, if I remember correctly.

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

Hm, what does it mean to "close a record"? In the example above, doesn't servicealarmid=41397 "close" all rows with 41305, 41314, 41296, and 41288? It should not matter whether they have the same "value" or a different one.

Comment by Aleksandrs Saveljevs [ 2016 Nov 15 ]

Documenting the current conjecture on the cause: we calculate IT services inside DBupdate_itservices(), which is called from process_events(), which is called from DCsync_history(). Inside DCsync_history(), it is called outside of a history cache lock, but inside a transaction. Therefore, multiple history syncers can be updating the same non-leaf IT services simultaneously (but not the same leaf services due to trigger locking) and they will not see each other's changes, depending on transaction visibility settings.
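The conjectured race can be illustrated with a minimal sketch (Python, not the actual Zabbix C code): two history syncers each take a snapshot of the parent service state at transaction start, so neither sees the other's in-flight update, and both append a PROBLEM (value=5) row to service_alarms for the same parent service. The function and variable names are illustrative only.

```python
# Minimal sketch of the conjectured race: each "syncer" recomputes the
# parent service status from the snapshot it saw at transaction start.

PROBLEM, OK = 5, 0

def syncer_updates(snapshot_status, trigger_status, alarms):
    """One syncer's transaction: recompute the parent from its snapshot.

    A row is written whenever the recomputed status differs from the
    status seen at transaction start (`snapshot_status`).
    """
    new_status = PROBLEM if trigger_status == PROBLEM else snapshot_status
    if new_status != snapshot_status:
        alarms.append(new_status)        # INSERT INTO service_alarms ...
    return new_status

alarms = []          # service_alarms rows for the common parent service
committed = OK       # parent status before either transaction commits

# Both syncers start their transactions before either commits, so both
# see the same snapshot (OK) and both insert a PROBLEM row.
syncer_updates(committed, PROBLEM, alarms)   # trigger A fires
syncer_updates(committed, PROBLEM, alarms)   # trigger B fires concurrently

print(alarms)   # [5, 5] -- two consecutive PROBLEM rows, as in the report
```

This matches the table above: several value=5 rows written within seconds of each other by different history syncers.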

Comment by Alexey Pustovalov [ 2016 Nov 15 ]

I think it does not close all of them, because the final report shows incorrect results.

Comment by Aleksandrs Saveljevs [ 2016 Nov 21 ]

sasha had an idea that we can make history syncers see each other's uncommitted changes to IT services, but it does not seem to be possible in PostgreSQL (https://www.postgresql.org/docs/9.3/static/sql-set-transaction.html ):

The SQL standard defines one additional level, READ UNCOMMITTED. In PostgreSQL READ UNCOMMITTED is treated as READ COMMITTED.

In MySQL, there also does not seem to be any guarantee that we will immediately see uncommitted changes (http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html ):

READ UNCOMMITTED

SELECT statements are performed in a nonlocking fashion, but a possible earlier version of a row might be used.

Comment by Aleksandrs Saveljevs [ 2016 Nov 23 ]

It seems that the problem is even simpler. While we could try to protect against parallel updating of IT services, there is still the problem of delayed data (e.g., from proxies). As the current IT service processing goes, the service state from delayed data will still be computed on top of the current service state, also resulting in out-of-order entries.
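A sketch of the delayed-data case (again illustrative Python, not server code): alarms are appended in processing order, so a delayed event, e.g. one arriving late from a proxy, lands after later events, leaving service_alarms out of clock order even with no concurrency at all.

```python
# Current behaviour sketched: service_alarms is append-only in processing
# order, so a delayed event breaks clock ordering.

service_alarms = []   # list of (clock, value) rows

def process_event(clock, value):
    service_alarms.append((clock, value))   # append in arrival order

process_event(1458025335, 5)   # live event
process_event(1458025340, 0)   # live event
process_event(1458025337, 5)   # delayed proxy event arrives last

clocks = [c for c, _ in service_alarms]
print(clocks == sorted(clocks))   # False -- entries are out of clock order
```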

Comment by Aleksandrs Saveljevs [ 2016 Nov 23 ]

As another idea, we could try updating only leaf services and compute non-leaf service states in the frontend at runtime (i.e., during report generation).

This, however, has the downside that if the service configuration changes (e.g., a leaf service gets attached to a different service), then the historical state representation becomes inaccurate. So we could try limiting how the service configuration is allowed to change, or storing service link validity information.

For instance, suppose at time point t=0 we have the following configuration:

A
-- B
-- C
D

Then, at time point t=5 we change it to the following:

A
-- B
D
-- C

The contents of "service_links" table then become:

A - B [0, ...]
A - C [0, 5]
D - C [5, ...]

Old links can then be removed by the housekeeper.

However, the above approach does not version the "services.algorithm" field, for instance, but it may allow us to solve the delayed data problem.
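The versioned-links idea above can be sketched as follows (hypothetical Python; the column names valid_from/valid_to are illustrative, not an actual schema): each link carries a validity interval, and the service tree at any time point t is reconstructed by filtering on that interval.

```python
# Sketch of the proposed versioned "service_links" table from the
# example above: A-B and A-C at t=0, then C re-attached to D at t=5.

INF = float("inf")

# (parent, child, valid_from, valid_to)
service_links = [
    ("A", "B", 0, INF),
    ("A", "C", 0, 5),     # link closed when C was re-attached at t=5
    ("D", "C", 5, INF),
]

def children_at(parent, t):
    """Children linked to `parent` at time point t."""
    return sorted(child for p, child, t0, t1 in service_links
                  if p == parent and t0 <= t < t1)

print(children_at("A", 3))   # ['B', 'C'] -- configuration before the change
print(children_at("A", 7))   # ['B']
print(children_at("D", 7))   # ['C']     -- C now belongs to D
```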

Comment by Aleksandrs Saveljevs [ 2016 Nov 23 ]

Another idea by sasha: do not populate "service_alarms" at all, but compute IT service states based on "events".

Comment by Andris Zeila [ 2017 Oct 06 ]

Successfully tested

Comment by Vladislavs Sokurenko [ 2017 Oct 06 ]

Fixed in:

  • pre-3.0.12rc1 r73261
  • pre-3.2.9rc1 r73263
  • pre-3.4.3rc1 r73265
  • pre-4.0.0alpha1 (trunk) r73266

Comment by Vladislavs Sokurenko [ 2017 Oct 06 ]

Fixed IT service calculation where parallel transactions did not see each other's changes when calculating a common parent service.

While SLA calculation is guarded by a semaphore, different history syncers still would not see each other's changes to the IT service tree, because their transactions run simultaneously. This is fine for unrelated triggers, but causes problems when those triggers have a common parent, as the parent IT service cannot be calculated correctly.

SLA calculation has been moved to a separate transaction, so that while the semaphore is locked there can be only one transaction updating the IT service tree.
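The effect of the fix can be sketched like this (simplified Python, not the actual C server code; the lock stands in for the IT services semaphore): the whole read-recompute-write of the parent state happens inside one critical section, in its own transaction, so the second syncer always sees the first syncer's committed result and writes no duplicate row.

```python
# Sketch of the fixed behaviour: the update runs entirely under the
# semaphore, so concurrent syncers are serialized and see committed state.

import threading

lock = threading.Lock()          # stands in for the IT services semaphore
service_alarms = []
parent = {"status": 0}           # committed parent state (0 = OK)

def update_parent(new_child_status):
    # BEGIN ... COMMIT now happens entirely while the semaphore is held.
    with lock:
        if new_child_status != parent["status"]:
            parent["status"] = new_child_status
            service_alarms.append(new_child_status)

# Two concurrent syncers both report PROBLEM (5) for children of the
# same parent; only the first to take the lock inserts a row.
threads = [threading.Thread(target=update_parent, args=(5,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(service_alarms)   # [5] -- a single PROBLEM row
```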

Generated at Fri Apr 26 02:17:01 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.