[ZBX-16543] Service with algorithm "if all children have PROBLEM", and problems with negative duration
Created: 2019 Aug 20 | Updated: 2023 Oct 07

Status: Postponed
Project: ZABBIX BUGS AND ISSUES
Component/s: API (A)
Affects Version/s: 4.2.4
Fix Version/s: None
Type: Problem report
Priority: Trivial
Reporter: João Carvalho
Assignee: Zabbix Development Team
Resolution: Unresolved
Votes: 1
Labels: ITServices, algorithm, negative, problem, services
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Attachments:
Issue Links:
Sprint: Sprint 57 (Oct 2019), Sprint 58 (Nov 2019), Sprint 59 (Dec 2019), Sprint 60 (Jan 2020), Sprint 61 (Feb 2020), Sprint 62 (Mar 2020), Sprint 63 (Apr 2020), Sprint 64 (May 2020), Sprint 65 (Jun 2020), Sprint 66 (Jul 2020), Sprint 67 (Aug 2020)
Story Points: 0.5

Description
A complex service with at least two dependencies reports a false problem if:
When the recovery for the false problem occurs (a recovery arriving before its problem), and all other triggers are in problem, Zabbix registers a recovery and therefore considers that all dependencies were in problem simultaneously. This behavior was observed in three services with two dependencies each. The correct behavior was observed in a service with four dependencies: one dependency reported a real problem, another reported a false one, and the remaining two didn't report any problem, resulting in a correct calculation. Had the remaining two reported problems before the false recovery, an incorrect calculation would have been observed. In all of these cases, the algorithm is "Problem, if all children have problems".
Steps to reproduce (if possible):
Result: Unfortunately, I corrected the database before taking a screenshot. This is what I can show.
Example with two dependencies: a problem was reported with a duration of 38:45 minutes. Simple service 1 reported a real problem at 00:09:54. Simple service 2 reported a false problem, with its recovery at 00:48:49. Expected: the complex service should have ignored the false one.
Suggestion: it appears the complex service calculation queries the dependencies' trigger events. It should instead query the dependencies' service alarms, since that table appears to correctly ignore the triggers' false problems. This seems simpler than making the calculation itself ignore the triggers' false problems.
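To make the reported behavior concrete, here is a minimal, hypothetical Python sketch. The record fields mimic the `clock`/`r_clock` columns of Zabbix's events table, but none of this is Zabbix source: it only illustrates a "Problem, if all children have problems" evaluation that skips negative-duration ("false") events.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only; field names echo Zabbix's events table, not its code.
@dataclass
class TriggerEvent:
    clock: int               # problem start timestamp
    r_clock: Optional[int]   # recovery timestamp, None while unresolved

def has_negative_duration(ev: TriggerEvent) -> bool:
    """A 'false problem': its recovery timestamp precedes its start."""
    return ev.r_clock is not None and ev.r_clock < ev.clock

def all_children_have_problem(children: list, now: int) -> bool:
    """'Problem, if all children have problems', skipping negative-duration events.
    `children` is a list of per-child event lists."""
    def child_in_problem(events) -> bool:
        return any(
            not has_negative_duration(ev)
            and ev.clock <= now
            and (ev.r_clock is None or ev.r_clock > now)
            for ev in events
        )
    return all(child_in_problem(events) for events in children)
```

With one child holding a genuine open problem and the other only a negative-duration event, the sketch evaluates the parent as OK, which is the behavior the report expects.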
Comments |
Comment by Edgar Akhmetshin [ 2019 Aug 21 ] |
Hello João, Thank you for reporting the issue! Checked it and confirmed. Regards, |
Comment by João Carvalho [ 2019 Oct 16 ] |
Hi, can you give us an estimate of when you expect to have this issue fixed? Thank you! Regards, João Carvalho
Comment by Andrejs Tumilovics [ 2019 Oct 17 ] |
I've managed to reproduce the described scenario.
So, per my observations, Zabbix correctly manages service states. |
Comment by João Carvalho [ 2019 Oct 17 ] |
I believe that's the case. I didn't realize that the PROBLEM was raised immediately. In that case, you might have to do a retroactive correction, like I used to do before the events were marked as negative (v3.x.x).
But still, how come the simple service correctly ignores the negative problems? When does this happen? Is it immediate? There should be a way to do the same for the complex service: it should feed itself on the dependencies' positive events. I noticed that the simple services add time to the total time of day (e.g., 1d 0h 1m). Is this part of the solution?
Thank you for the quick response! |
Comment by João Carvalho [ 2019 Oct 17 ] |
Hi there,
Sorry for spamming the thread, but I'm a bit confused! I've only now realized that the Status has changed to 'resolved', while the Resolution is still 'unresolved'.
The Status tooltip says: "A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed."
Are you waiting for my response? A confirmation that the problem was correctly reproduced, or fixed? Do I need to do anything?
Anyway, it's my perception that the issue isn't fixed. If I understood correctly, you're saying the issue isn't in the complex service but in the simple one, or elsewhere. So be it; the issue still exists. If your intention is to open a new thread to fix the root problem and close this one, I have nothing to say. But the issue doesn't seem fixed; I didn't see a solution being presented.
As for the root problem, I still find it suspicious that the simple services handle the negative problems correctly, but the complex ones don't. For this reason, I don't agree with the answer:
First of all, this usually happens with outdated Zabbix agents on Windows hosts (older than 4.2.x). Some of these older agents send future timestamps, regardless of the real system time. Restarting the agent solves this for a time; after a while, it starts happening again.
This has been fixed in the newer agents, and in the trigger events' evaluation. One solution is to update all agents, and we've been doing so, but we still have some old ones. Making sure that the host system time is correctly synchronized is also a good policy.
Unfortunately, this isn't the only situation in which negative events are created. I've seen it happen with ICMP simple checks. I don't know why; latency, I suppose. And we do have some services dependent on ICMP triggers.
It seems to me that the root of the problem lies in the complex services' calculation. If you say that it doesn't need fixing, and you believe you'll find the root of the problem elsewhere, then I think you need to identify the cause before closing this thread. But that's just my opinion.
If this reads to you like a rant, I'm sorry! It's not my intention. I'm just trying to help. Thank you for your time!
Regards, João Carvalho |
Comment by Andrejs Tumilovics [ 2019 Oct 18 ] |
Hi joao.g.carvalho. By the way, similar time synchronization issues were discussed on the forum: Thank you for helping with the issue investigation.
Comment by Andrejs Tumilovics [ 2019 Oct 18 ] |
joao.g.carvalho |
Comment by João Carvalho [ 2019 Oct 18 ] |
@Andrejs Tumilovics It is very much possible. I'd have to check on that. I might do so, if I find the time. Probably not today. |
Comment by Andrejs Tumilovics [ 2019 Oct 18 ] |
Feature proposal
Problem:
Feature proposal:
Drawbacks: Network latency will not be corrected. Either way, it's not corrected for passive checks.
Comment by Andrejs Tumilovics [ 2019 Oct 18 ] |
Time correction between Server, Proxy and agent has been removed in Zabbix version 4.0 (documentation).
Currently we are discussing a new feature which should help with time synchronization. For now, we can only suggest how to identify nodes with out-of-sync time.
Comment by dimir [ 2019 Oct 28 ] |
I think we fully discussed time adjustment in -- I mean, the negative problem duration exactly indicates the problem: the time on a monitored host is out of sync. By doing adjustments we hide this particular issue; who knows what other problems might appear because the user forgot to turn on ntpd. Zabbix is a monitoring utility, and yes, we should support monitoring of out-of-sync times, but we should not try to fix something that is not related to monitoring. We might as well start handling situations like a service reporting data in, e.g., an incorrect value type. If we still decide to do the adjustments, I propose a separate option for that, disabled by default. To me, this is Won't Fix, and voting should be done on ZBXNEXT-3298.
Comment by João Carvalho [ 2019 Oct 28 ] |
Hi there, first of all, thank you for your analysis. Anyway, I think we're getting off track in this thread. The issue isn't the trigger's behavior; for all I know, the issue regarding agent data with wrong timestamps has been correctly addressed and resolved. If either the monitored server's or the proxy's time is out of sync, this shouldn't be hidden: it reveals that NTP is out of sync. By the way, in the case that originated this thread, the negative problem was triggered by an outdated agent on a Windows server; NTP wasn't out of sync. It is a known fact that Zabbix agents prior to version 4.2.x, on Windows servers, would often send data with an incorrect timestamp (usually over 5 minutes ahead of the system time).
Let me redirect the focus of the discussion to the issue for which this thread was created. The simple service correctly identifies a negative problem and therefore ignores it. On complex services, if the algorithm is "if all children have PROBLEM", the negative problem will cause an incorrect evaluation.
What exactly does the simple service do when a negative problem is encountered? Why can't this be replicated for the complex services?
May I suggest a solution? Before creating a problem event on the complex service, check whether the event has a negative duration. If possible, the event should never have been created in the first place.
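That guard can be sketched in a few lines. This is a hypothetical illustration, not Zabbix code; it only assumes the service-event path can see the trigger event's start and recovery timestamps:

```python
from typing import Optional

def should_create_service_problem(clock: int, r_clock: Optional[int]) -> bool:
    """Hypothetical guard from the suggestion above: refuse to open a problem
    event on the complex service when the triggering event's recovery
    timestamp precedes its start (i.e. the event has negative duration)."""
    if r_clock is not None and r_clock < clock:
        return False  # negative duration: the 'problem' never really existed
    return True       # still-open problem, or a normal positive-duration one
```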
Another solution is a retroactive correction. Check these conditions every time a trigger reports a resolution:
Regards, João Carvalho |
Comment by Andrejs Tumilovics [ 2019 Oct 29 ] |
joao.g.carvalho Just to clarify: when you say "correct calculation", you mean SLA calculation, right? But if we're talking about the service status indicator, then a negative-time problem is also shown with trigger severity in Monitoring -> Problems until it's resolved (for a simple service). However, from your description I understood that "false problems" do not change simple service status.
Comment by João Carvalho [ 2019 Oct 29 ] |
atumilovics, yes, I meant SLA calculation. I didn't realize the problem was fixed. Sorry if I prolonged this discussion longer than necessary.
Anyway, in -- I guess we'll have to test it to find out. Could we get a more detailed description of what was done in --
Comment by Andrejs Tumilovics [ 2019 Oct 29 ] |
Also, the documentation was updated according to that change.
Comment by dimir [ 2019 Oct 29 ] |
It must be checked how to accomplish this. Is a schema change required? One idea is to remove the original record that was added to the service_alarms table when the problem was generated. The record in the Zabbix server log file should be a clear one, so people are able to monitor such cases; just an idea: 14698:20191023:190213.269 the alarm of service (serviceid: 1) was ignored because the problem generated by trigger (triggerid: 16064) has negative duration (from 1572364444 till 1572363333)
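The removal-and-log idea can be sketched as follows. This is a hedged illustration: the tuple layout `(serviceid, triggerid, clock)` is an in-memory stand-in for service_alarms rows, not the real schema, and the logger merely mimics the proposed server log line.

```python
import logging

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("zabbix_server_sketch")  # stand-in for the server log

def drop_negative_duration_alarms(alarms, serviceid, triggerid, clock, r_clock):
    """Sketch of the idea above: when a trigger event turns out to have
    negative duration, remove the alarm row it originally produced from the
    (simulated) service_alarms list and log the decision so such cases
    remain visible to operators."""
    if r_clock >= clock:
        return alarms  # normal positive-duration event: keep everything
    log.info(
        "the alarm of service (serviceid: %d) was ignored because the problem "
        "generated by trigger (triggerid: %d) has negative duration (from %d till %d)",
        serviceid, triggerid, clock, r_clock,
    )
    return [row for row in alarms if row != (serviceid, triggerid, clock)]
```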
Comment by João Carvalho [ 2019 Nov 07 ] |
Hi there, I noticed the Status changed to Manual Test Failed. Are you testing a different solution?
Regards, João Carvalho |
Comment by Andrejs Tumilovics [ 2019 Nov 07 ] |
|
Comment by João Carvalho [ 2019 Nov 08 ] |
Hi,
It appears that the issue was fixed. The issue header says the fix was applied to 4.4.2rc1. This release isn't available, so I assume it's a test build and that the fix will be included in the next 4.4 release. We're currently on version 4.2 and were planning to update to 4.4; we were hoping to have this fixed when we update Zabbix. Can you give us an estimate of when the next release will be available? Thank you!
Regards, João Carvalho |
Comment by João Carvalho [ 2019 Nov 08 ] |
Also, are you planning a 4.2 release? Sorry for the spam!
João Carvalho |
Comment by Vjaceslavs Bogdanovs [ 2019 Nov 08 ] |
joao.g.carvalho, 4.2 is not supported anymore. You can find release policy here: https://www.zabbix.com/life_cycle_and_release_policy |
Comment by João Carvalho [ 2019 Nov 08 ] |
Thanks for the feedback! I didn't know about the Life Cycle & Release Policy. So, the next release is due in March? LTS 5.0?
Comment by Vjaceslavs Bogdanovs [ 2019 Nov 13 ] |
joao.g.carvalho you can find our roadmap here: https://www.zabbix.com/roadmap#v5_0
Comment by Miks Kronkalns [ 2020 Mar 24 ] |
Problem description and proposed solution available here: ZBX-16543.pdf Here is an image visually displaying the 4 tables described in the attached PDF. Some comments about the image:
wiper: From the examples, my guess would be that the problem is caused by data not arriving in chronological order. When the server recalculates the parent service status, it uses the current children statuses. If the event timestamp is in the past, this can lead to a wrong parent status. In that case, a possible solution would be to load service state historical data from service_alarms when loading the service tree. Then the correct parent service state at the specified timestamp could be calculated.
palivoda: Why does the data arrive out of chronological order?
wiper: There could be a completely valid case where a parent service has children based on triggers from different proxies, and one proxy was offline for some time. So yes, not good.
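wiper's idea of recomputing the parent from service_alarms history can be sketched as below. The model is hypothetical: each child's history is a time-sorted list of `(clock, value)` rows (value 0 = OK, >0 = problem severity), standing in for that child's service_alarms records.

```python
import bisect

def child_status_at(history, ts):
    """Status of one child at timestamp ts, looked up in its sorted alarm history."""
    clocks = [clock for clock, _ in history]
    i = bisect.bisect_right(clocks, ts) - 1
    return history[i][1] if i >= 0 else 0  # OK before any recorded alarm

def parent_status_at(children_histories, ts):
    """'Problem, if all children have problems', recomputed from history, so a
    late-arriving event is evaluated at its own timestamp rather than at the
    moment it happened to reach the server."""
    statuses = [child_status_at(h, ts) for h in children_histories]
    return max(statuses) if statuses and all(s > 0 for s in statuses) else 0
```

Because the lookup is by timestamp, an event delivered late by an offline proxy lands at the right point in the timeline instead of corrupting the current parent state.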
Comment by Vladislavs Sokurenko [ 2020 Jul 22 ] |
Could you please be so kind and let us know if you are still experiencing the problem, joao.g.carvalho? Thank you!
Comment by João Carvalho [ 2020 Jul 29 ] |
Hi, sorry for the late reply. The conditions for the abnormal behavior are very specific, and they seldom occur. Even though this is rare, we have 12 services on which it can happen.
As a measure to minimise the probability of another event, we've made an effort to update every Zabbix agent to 4.2.1 or above. Outdated agents on Windows machines can collect data with an incorrect timestamp, causing PROBLEMs with negative duration; in turn, these can trigger an incorrect service evaluation. Even so, network latency alone can cause a PROBLEM with negative duration; it happens occasionally, and we keep seeing it. Besides, some older OSes aren't compatible with the newer agent versions, so we won't be able to update every single agent.
To answer your question: as far as we've noticed, we haven't had any such incident again. I guess the required conditions haven't been met, but it can still happen at any time. We're not "experiencing the problem", but we still have the issue.