ZABBIX FEATURE REQUESTS: ZBXNEXT-444

Improve processing of event logs when the agent starts and stops (optimization of the "lastlogsize" value)


      As promised, this is my third detailed post about how the Zabbix agent works with Windows event logs.
      Here we focus on how the agent behaves when it starts (after launch) and when it exits (stops and restarts).
      I will also propose improvements to how items with the key eventlog[XXXXXX] are handled when they are first added to the system.
      I know that version 2.0 adds a new parameter to the eventlog[] key:
      mode: one of all (default) or skip (skip processing of older data)

      However, this parameter makes the agent handle only those log entries that appeared after the agent started, and I do not want to miss events that occurred while the agent was not running. For me this is essential, so I will not use this option, although it may be very useful for others.

      As in my previous reports ZBXNEXT-443 and ZBX-2651, I focus on how the agent filters messages on the agent side. As an example, take a completely real problem:

      Example No. 1. We must catch rare messages in the Security event log: "Account lockout" events on a domain controller.
      The key for this is:
      eventlog[Security,,"Success Audit",, 644]
      The event flow here is tremendous, up to a million events per day, while account lockout events may occur only, for example, once every few days.
      Note: a colleague of mine has a Security event log 2 GB in size, about 6 million event records.
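      To show why agent-side filtering is expensive here, the following minimal Python sketch models how a key like eventlog[Security,,"Success Audit",,644] selects records. It is a simplified, hypothetical model, not the Zabbix agent's C code: the `EventRecord` fields and the `matches` helper are assumptions, and the severity, source and event ID parameters are treated as regular expressions where an empty parameter matches everything. The point is that every record must be read and tested, even though only one matches.

```python
import re
from dataclasses import dataclass


@dataclass
class EventRecord:
    # Hypothetical, simplified view of one Windows event log record.
    record_number: int
    severity: str
    source: str
    event_id: int
    message: str


def matches(rec: EventRecord, regexp: str = "", severity: str = "",
            source: str = "", event_id: str = "") -> bool:
    """Return True if rec passes every non-empty filter (assumed regexp semantics)."""
    if regexp and not re.search(regexp, rec.message):
        return False
    if severity and not re.search(severity, rec.severity):
        return False
    if source and not re.search(source, rec.source):
        return False
    if event_id and not re.fullmatch(event_id, str(rec.event_id)):
        return False
    return True


# Model of eventlog[Security,,"Success Audit",,644]: one lockout after many logons.
log = [EventRecord(n, "Success Audit", "Security", 540, "Logon")
       for n in range(1, 100_001)]
log.append(EventRecord(100_001, "Success Audit", "Security", 644, "Account lockout"))

tested = 0
hits = []
for rec in log:
    tested += 1  # every record is read and tested, matching or not
    if matches(rec, severity="Success Audit", event_id="644"):
        hits.append(rec.record_number)

print(tested, hits)  # 100001 [100001]
```

      Scaled up to the million events per day from the example, this per-record test is exactly the work the agent repeats whenever it has to re-read the log.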

      Example No. 2. We need to catch the event with the results of the disk check at boot (chkdsk, "Checking file system"):
      eventlog[Application,,, Winlogon,1001]
      This event flow is much smaller than in the Security log, but the events may appear very rarely, for example once a year (yet it is a very important event!).

      What both examples have in common is that the events we need occur among a very large number of other events.
      When the agent finds the latest match, it sends the value to the server, and the server writes the number of the last received event to the "lastlogsize" field of the "items" database table. This number is the event log's internal record number, which is not visible anywhere else. Because the stored value advances only when a matching event is sent, it keeps pointing at the last match while the agent reads past all the non-matching records after it.
      Then a few days pass, several million records are written to the Security log during this time, and we may need to reboot the server or restart the Zabbix agent.
      When the agent starts, it receives the list of active checks and, for the key from Example No. 1, has to process those several million events again, which is very bad. It wastes CPU resources, it takes a long time for the agent to reach the end of the event log, and, most importantly, the agent's behavior becomes completely unpredictable and can affect the operation of other checks. Users have discussed this issue on the forum.

      The same applies to Example No. 2. Take a workstation that boots every day: every morning the agent will process the entire Application event log, which may cover several years. This is bad! And what if there are several similar checks?
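      The restart problem can be modeled in a few lines. This Python sketch is hypothetical and simplified: it assumes consecutive record numbers, that the server hands the agent the stored lastlogsize together with the active checks, and that this stored value advances only when a matching event is sent. The numbers are scaled down from the "several million" in the text so the example runs quickly.

```python
from typing import Callable


def scan_from(lastlogsize: int, newest: int,
              is_match: Callable[[int], bool]) -> tuple[int, int, list[int]]:
    """Read records lastlogsize+1 .. newest (hypothetical consecutive numbering).

    Returns (records_read, new_position, matching_record_numbers).
    """
    found = []
    read = 0
    for n in range(lastlogsize + 1, newest + 1):
        read += 1
        if is_match(n):
            found.append(n)
    return read, newest, found


# Assumed scenario: the last lockout was record 10_000, then 300_000
# unrelated records were written before the agent restarted.
stored_on_server = 10_000   # advances only when a match is sent
newest_record = 310_000

read, position, found = scan_from(stored_on_server, newest_record, lambda n: False)
print(read, position, found)  # 300000 310000 []
```

      Every one of those records is re-read after the restart only because the server's copy of lastlogsize still points at the last match.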

      In short, I propose the following changes to how the agent works:
      When the agent stops, it should send the server a message for each active-check item (or perhaps only for log, logrt and eventlog items) containing the current lastlogsize value. This takes the agent only a few milliseconds before it finally exits.
      Whether this should be a special message that is not written to the event history in the database, or a normal (hard-coded) message that is written to the history, is up to you. I prefer the second variant: it would show when the agent stopped, which might be useful. In fact, for all items reading the same log, the lastlogsize value will be the same and will correspond to the last record in that event log.
      As a result, on the next start the agent will continue from where it stopped, so it will not re-check what it has already checked.
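      A minimal sketch of the proposed stop-time behavior, assuming a hypothetical in-memory table of active checks (`ActiveCheck`) and a hypothetical `send` callback standing in for the agent's real sender; no existing Zabbix API is implied. On shutdown the agent reports the current lastlogsize of each log-type item, so the position stored on the server matches where the agent actually stopped.

```python
from dataclasses import dataclass
from typing import Callable

# Only log-type items carry a meaningful lastlogsize.
LOG_KEYS = ("log[", "logrt[", "eventlog[")


@dataclass
class ActiveCheck:
    key: str
    lastlogsize: int  # current in-memory position of the agent


def flush_positions(checks: list[ActiveCheck],
                    send: Callable[[str, int], None]) -> int:
    """Send lastlogsize for every log-type item; return how many were sent."""
    sent = 0
    for check in checks:
        if check.key.startswith(LOG_KEYS):
            send(check.key, check.lastlogsize)
            sent += 1
    return sent


# Hypothetical usage from the agent's stop path.
checks = [
    ActiveCheck('eventlog[Security,,"Success Audit",, 644]', 4_000_000),
    ActiveCheck("eventlog[Application,,, Winlogon,1001]", 52_310),
    ActiveCheck("system.cpu.load[,avg1]", 0),
]
outbox = []
count = flush_positions(checks, lambda key, pos: outbox.append((key, pos)))
print(count)  # 2
```

      Filtering by key prefix corresponds to the narrower variant in the text (only log, logrt and eventlog items); dropping the filter would report every active check.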

      Another improvement I propose (again, the mode -> skip feature is not for us!):
      When creating an item, let the user specify, and set again at any later moment, that the next time the agent receives the list of active checks it should jump to the end of the log and not send any existing entries.
      Why do I propose this? The forum has many posts from users who are tormented by doubt over whether the agent will start sending an entire huge log.
      I am almost 100% sure that nobody needs this. It is better if, when you first (!) add an item, the entire log is not sent.
      I think this can be done without changing the database schema. For example, when creating a new item, set lastlogsize to a very large value such as 99999999999. The agent, for its part, should handle this case the same way as mode -> skip. That is all; nothing else is needed. All of this can be done in the 1.8 branch.
      It could be done even better: when configuring the item, let the user forcibly set this attribute (lastlogsize = 99999999999) at any time. For example, if the agent has not processed the event log for a long time (the item was disabled for a long time), then the next time the agent refreshes its list of active checks it will not re-check the event log. Is this useful? Of course!
      All of the above probably also applies to the log[] and logrt[] keys.
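      The sentinel idea can be sketched as follows. This is a hypothetical illustration of the proposal, not existing agent behavior: `SKIP_SENTINEL` reuses the example value 99999999999 from the text, and `start_position` is an invented helper that decides after which record the agent starts reading. A stored value at or above the sentinel is treated like mode -> skip: the agent jumps to the newest record and sends nothing older.

```python
SKIP_SENTINEL = 99_999_999_999  # example value from the proposal


def start_position(lastlogsize: int, newest_record: int) -> int:
    """Return the record number after which the agent should start reading.

    Hypothetical: a sentinel lastlogsize is handled like mode -> skip.
    """
    if lastlogsize >= SKIP_SENTINEL:
        return newest_record   # jump to the end: existing entries are not sent
    return lastlogsize         # normal case: resume after the stored position


print(start_position(SKIP_SENTINEL, 4_000_000))  # 4000000
print(start_position(1_000_000, 4_000_000))      # 1000000
```

      Presumably, once the agent has jumped to the end, the next lastlogsize it reports would be a real record number, so the sentinel is consumed after one use and the user can set it again whenever needed.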

      Well, that's all, my friends. This was my third and last message on this topic; the next one will be about triggers for the event log.
      Sorry there are no pictures; next time.
      As always, sorry for my English (the original Russian text is attached).

      Specification: https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-444

            Assignee: Unassigned
            Reporter: zalex_ua (Oleksii Zagorskyi)
            Votes: 6
            Watchers: 9
