[ZBX-3469] when snmp host is unavailable, all triggers change to unknown Created: 2011 Jan 27  Updated: 2019 Jun 04  Resolved: 2011 Jul 22

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 1.8.4
Fix Version/s: 1.8.6, 1.9.5 (alpha)

Type: Incident report Priority: Major
Reporter: matthias zeilinger Assignee: dimir
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

solaris, server 1.8.x


Issue Links:
Duplicate

 Description   

Since I started using Zabbix server 1.8.x:
whenever the server enables or disables an SNMP host, all triggers on this host change their state to unknown.
Log entry: 26251:20110127:075101.855 Disabling SNMP host [atvl2uajas017]



 Comments   
Comment by richlv [ 2011 Mar 02 ]

most likely only triggers that do not use time-based functions should change their state when the host becomes unreachable

Comment by dimir [ 2011 Jun 02 ]

This problem is reproducible; however, it is not specific to SNMP but affects any item type. E.g. if you have SNMP and Zabbix agent items for the host and you stop the Zabbix agent, the SNMP item trigger will become UNKNOWN for some time too. This period is short (around a second), but it's still a bug.

Comment by richlv [ 2011 Jun 02 ]

the duration of unknowns probably depends on the number of items being monitored and their intervals. often the unknown state can be observed for 30 seconds or so, even for triggers with the nodata() function

Comment by matthias zeilinger [ 2011 Jun 03 ]

i saw that in zabbix 2.0 the "unknown" trigger state isn't used, so i think this problem is fixed, but could you please test.

if yes, i will wait for the new version.

Comment by dimir [ 2011 Jun 03 ]

In latest 1.8 it's reproducible. The fix is awaiting review and testing.

Comment by dimir [ 2011 Jun 03 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-3469 .

Comment by richlv [ 2011 Jun 05 ]

(2) does the fix also solve the following scenario ?

agent.ping item is used along with some other items and nodata() trigger is created against it. if host becomes unavailable, this trigger goes into an unknown state for a brief period, which we do not expect to happen.

<dimir> Nope, it's not. Will add that fix shortly.

<dimir> RESOLVED in commit r20088

<richlv> what is the new logic ? if we have 3 passive agent items and multiple triggers with different functions (nodata(), last(), last()+nodata(), time()...), which triggers will become unknown ?
<dimir> When host is disabled triggers of the same type as failed item excluding the ones using time functions become UNKNOWN.

<richlv> thanks. to clarify...
1. are time based functions used from timer.c ? is the list duplicated anywhere for this purpose or do we reuse the same list ?
<dimir> Yes, it is exactly that list and we should probably define it in one place.
<richlv> i don't like "should" - does that mean we are currently duplicating this list ?
<dimir> Well, for some reason sasha decided not to define it in one place just now. But I think he would change his mind as soon as we get the clear picture how to handle this.
2. what if a trigger has multiple functions, only one of them being time based ?
<dimir> As discussed with sasha we think the logic should be: do not set UNKNOWN if there is at least one non-timebased.
3. what if a trigger is referencing both snmp and agent items, and only snmp times out ?
<dimir> Same here: do not set UNKNOWN if there is at least one item of different type from the one that failed.

<richlv> otherwise sounds reasonable... so far. see below.

<dimir> This is how we see it, would be nice to know your visions on the matter. Now a few more questions.
1. Let's say we implemented handling of the 3rd case, when we have a trigger referencing snmp and agent items. snmp times out, we do not set to UNKNOWN. Now agent times out and the trigger should get UNKNOWN state. But it doesn't because it's referencing snmp and it's a different type. In order to handle that the check should be even more complicated. Or am I wrong?
2. As this looks rather complicated and hard to implement, would it be better to leave it as it is for now and roll back the time functions?

There must be an easier way to handle all this.
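
To make the concern in question 1 concrete, here is a minimal Python sketch. It is not Zabbix code: the `Item` class, the type labels "snmp" and "agent", the item keys, and both function names are hypothetical and exist only for illustration. It compares the rule "do not set UNKNOWN if the trigger references an item of a different type" with a variant that also looks at whether the other type is still reachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    key: str
    type: str  # hypothetical labels, e.g. "snmp" or "agent"


def naive_rule(trigger_items, failed_type):
    """Set UNKNOWN only if no referenced item has a type other than the failed one."""
    return all(item.type == failed_type for item in trigger_items)


def availability_aware_rule(trigger_items, failed_type, available):
    """Set UNKNOWN unless an item of another type is still reachable.

    `available` maps an item type to True/False for the host in question.
    """
    return not any(
        item.type != failed_type and available.get(item.type, False)
        for item in trigger_items
    )


# A trigger that references one SNMP item and one agent item.
trigger_items = [Item("ifInOctets.1", "snmp"), Item("agent.ping", "agent")]

# SNMP times out first, then the agent: the naive rule never fires.
print(naive_rule(trigger_items, "snmp"), naive_rule(trigger_items, "agent"))

# With availability tracked, only the second failure sets UNKNOWN.
print(availability_aware_rule(trigger_items, "snmp", {"snmp": False, "agent": True}))
print(availability_aware_rule(trigger_items, "agent", {"snmp": False, "agent": False}))
```

Under these assumptions the naive rule returns False for both failures, so the trigger would never reach UNKNOWN even with both data sources down, which is the gap described above.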

<richlv> if a trigger references one item with nodata() and another with avg(60), will it only become unknown when avg() item is missing data for 60 seconds ?
as for referencing agent item + snmp...
what about only checking whether any of the referenced items of the same type is used by a time based function ? more complicated, but not insanely so.

<dimir> Regarding nodata() + avg(60), I guess so. As far as I know currently any trigger referencing UNKNOWN item becomes UNKNOWN (which is handled in nextchecks as I understood). As for the latter case I think that is a good solution, yes.

Should we handle all that in a different ZBX which will fix handling of the unknown status of triggers, or are you comfortable with doing it here?

<dimir> RESOLVED in r20173 . The logic is as follows: set UNKNOWN for triggers that reference item of the same type as failed one which does not reference a timebased function.

<sasha> CLOSED

Comment by Alexander Vladishev [ 2011 Jun 05 ]

Successfully tested!

Comment by dimir [ 2011 Jun 22 ]

Thanks to sasha here is the new logic defined.

For failed item set all affected triggers to UNKNOWN.

Do not set UNKNOWN if any of the following conditions are true:

  • trigger uses at least one time based function
  • trigger contains at least one "active item" (see below)

An item is considered active if all of the following conditions are true:

  • item.status = ITEM_STATUS_ACTIVE
  • item host.status = HOST_STATUS_MONITORED
  • (for zabbix items) item host.available = HOST_AVAILABLE_TRUE
  • (for snmp items) item host.snmp_available = HOST_AVAILABLE_TRUE
  • (for ipmi items) item host.ipmi_available = HOST_AVAILABLE_TRUE
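
The rules above could be sketched roughly as follows. This is an illustrative Python model, not the actual server code: the classes, the type labels, and the `TIME_BASED` set are assumptions (only nodata() and time() are named in this thread, the real list lives in timer.c). The numeric values 0 for the ACTIVE and MONITORED statuses and 1 for HOST_AVAILABLE_TRUE follow the SQL used for testing in this issue. The sketch also assumes that item types without an availability flag only need the first two checks.

```python
from dataclasses import dataclass

ITEM_STATUS_ACTIVE = 0
HOST_STATUS_MONITORED = 0
HOST_AVAILABLE_TRUE = 1
HOST_AVAILABLE_FALSE = 2  # assumed value, anything other than 0 or 1

# Assumption: only the time-based functions named in this thread.
TIME_BASED = frozenset({"nodata", "time"})

# Which host field carries availability for each (hypothetical) type label.
AVAILABILITY_FIELD = {
    "zabbix": "available",
    "snmp": "snmp_available",
    "ipmi": "ipmi_available",
}


@dataclass
class Host:
    status: int = HOST_STATUS_MONITORED
    available: int = HOST_AVAILABLE_TRUE
    snmp_available: int = HOST_AVAILABLE_TRUE
    ipmi_available: int = HOST_AVAILABLE_TRUE


@dataclass
class Item:
    type: str
    host: Host
    status: int = ITEM_STATUS_ACTIVE


@dataclass
class Function:
    name: str
    item: Item


@dataclass
class Trigger:
    functions: list


def is_active(item):
    """An "active item" in the sense of the list above."""
    if item.status != ITEM_STATUS_ACTIVE:
        return False
    if item.host.status != HOST_STATUS_MONITORED:
        return False
    flag = AVAILABILITY_FIELD.get(item.type)
    return flag is None or getattr(item.host, flag) == HOST_AVAILABLE_TRUE


def should_set_unknown(trigger):
    """Apply the two exceptions to a trigger affected by a failed item."""
    if any(f.name in TIME_BASED for f in trigger.functions):
        return False
    if any(is_active(f.item) for f in trigger.functions):
        return False
    return True


host = Host(snmp_available=HOST_AVAILABLE_FALSE)
snmp_item, agent_item = Item("snmp", host), Item("zabbix", host)
print(should_set_unknown(Trigger([Function("last", snmp_item)])))
print(should_set_unknown(Trigger([Function("last", snmp_item), Function("last", agent_item)])))
```

With SNMP unavailable and the agent still reachable, the first trigger would go to UNKNOWN while the mixed trigger would not.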

Comment by richlv [ 2011 Jun 28 ]

what's the current status of this issue ? where is it planned to be merged ?

Comment by dimir [ 2011 Jun 29 ]

Yep, the fix is ready, I just haven't tested it yet. Some customer issues interrupted the work. Will test/commit today.

Comment by dimir [ 2011 Jul 08 ]

Let's try to make the logic more clear:

Set trigger status to UNKNOWN if all are true:

  • trigger's item status ACTIVE
  • trigger's item type same as failed one
  • trigger does not reference time-based function
  • trigger status ENABLED
  • trigger's host same as failed one
  • trigger's host status MONITORED
  • trigger does not reference "active" item

An item is considered "active" if all are true:

  • item status ACTIVE
  • item's host status MONITORED
  • item's trigger references time-based function
    OR
    item is of different type AND its host is available

Comment by dimir [ 2011 Jul 08 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-3469 .

Comment by dimir [ 2011 Jul 08 ]

FYI: Useful SQL statement for testing, to see how trigger status changes (in this case condition is triggerid>12999, see the end of the statement):

select t.description,
       case t.value when 0 then 'OK' when 1 then 'PROBLEM' else 'UNKNOWN' end as value,
       i.description as item,
       h.hostid,
       h.host,
       case h.available when 0 then 'UNKNOWN' when 1 then 'TRUE' else 'FALSE' end as avail,
       case h.snmp_available when 0 then 'UNKNOWN' when 1 then 'TRUE' else 'FALSE' end as snmp_avail,
       case h.ipmi_available when 0 then 'UNKNOWN' when 1 then 'TRUE' else 'FALSE' end as ipmi_avail
  from items i, functions f, triggers t, hosts h
 where i.itemid = f.itemid
   and f.triggerid = t.triggerid
   and i.hostid = h.hostid
   and i.status = 0
   and not i.key_ like 'status'
   and i.type in (0)
   and t.status = 0
   and h.status = 0
   and t.triggerid > 12999;

 

Comment by Alexander Vladishev [ 2011 Jul 19 ]

Successfully tested!

Comment by dimir [ 2011 Jul 22 ]

Fixed in 1.8 r20732:20738, trunk r20751.

Comment by dimir [ 2011 Oct 25 ]

Let's try to make the logic even more clear. Let's say an item MYITEM returns an error. There is a trigger associated with it. We set that trigger status to UNKNOWN if ALL are true:

  • MYITEM status ACTIVE
  • trigger does not reference time-based function
  • trigger status ENABLED
  • trigger and MYITEM reference the same host
  • trigger host status MONITORED
  • trigger does NOT reference an item that has ALL true:
    • item status ACTIVE
    • item host status MONITORED
    • item trigger references time-based function
      OR
      item and MYITEM types differ AND item host status AVAILABLE
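
One possible reading of the conditions above, written as a small Python sketch. This is not the code committed in r20732:20738. All class and function names are hypothetical, availability is modelled as a set of reachable item types per host, `TIME_BASED` contains only the functions named in this thread, and "item trigger references time-based function" is interpreted as "this trigger applies a time-based function to that item".

```python
from dataclasses import dataclass, field

# Assumption: only the time-based functions named in this thread.
TIME_BASED = frozenset({"nodata", "time"})


@dataclass
class Host:
    name: str
    monitored: bool = True
    reachable: set = field(default_factory=set)  # item types currently available


@dataclass
class Item:
    key: str
    type: str
    host: Host
    active: bool = True


@dataclass
class Trigger:
    functions: list  # (function_name, Item) pairs
    enabled: bool = True


def uses_time_function(trigger, item=None):
    """True if the trigger applies a time-based function (optionally to `item`)."""
    return any(
        name in TIME_BASED and (item is None or it is item)
        for name, it in trigger.functions
    )


def keeps_trigger_known(item, trigger, myitem):
    """The nested "item that has ALL true" check from the last bullet."""
    if not (item.active and item.host.monitored):
        return False
    if uses_time_function(trigger, item):
        return True
    return item.type != myitem.type and item.type in item.host.reachable


def set_unknown(trigger, myitem):
    """Decide whether a trigger goes UNKNOWN after MYITEM returned an error."""
    items = [it for _, it in trigger.functions]
    return (
        myitem.active
        and not uses_time_function(trigger)
        and trigger.enabled
        and any(it.host is myitem.host for it in items)
        and myitem.host.monitored
        and not any(keeps_trigger_known(it, trigger, myitem) for it in items)
    )


host = Host("example-host", reachable={"zabbix"})  # SNMP is down, agent is up
snmp_item = Item("ifInOctets.1", "snmp", host)
agent_item = Item("agent.ping", "zabbix", host)
mixed = Trigger([("last", snmp_item), ("last", agent_item)])
print(set_unknown(Trigger([("last", snmp_item)]), snmp_item))
print(set_unknown(mixed, snmp_item))
```

In this model the mixed trigger stays out of UNKNOWN while the agent is reachable, and calling `set_unknown(mixed, agent_item)` after the agent also becomes unreachable returns True.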