[ZBX-2651] Eventlog processing on Agent side. Serious experiments about performance and sequence sending. Created: 2010 Jul 05  Updated: 2017 May 30  Resolved: 2015 Feb 08

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Oleksii Zagorskyi Assignee: Unassigned
Resolution: Won't fix Votes: 5
Labels: eventlog, logmonitoring
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Create_Customized_Alarm_Eventlog.rar, Eventlog processing on Agent side (RU version).txt, One_Item_ID444_or_555.png, Two_Item_ID444+555.png, processor_load_differenses.png

 Description   

I ran a few very careful and rigorous experiments on Windows Eventlog processing.
To some it may seem like madness, but I found it interesting and useful.
I'll describe how I did it in case it is useful to someone, and then give my opinions and suggestions for improvement.
I created a custom eventlog named «Alarm» and filled it with events following a specific pattern.
To create a custom eventlog, add a registry key named after the eventlog under the following path:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Eventlog\Alarm]
"Retention"=dword:00000000
"MaxSize"=dword:07ff0000
The two additional values define the maximum size and the rotation (retention) mode of the eventlog.
The custom eventlog was created to eliminate the influence of other possible factors and to allow further testing in the future.
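The registry fragment above can also be generated programmatically, which is handy when preparing several test logs. The helper `custom_eventlog_reg` below is hypothetical (it is not part of Zabbix or of the attached files). As far as I know, `Retention=0` means "overwrite events as needed", and `MaxSize=0x07ff0000` is 134152192 bytes (about 128 MB), a whole number of 64 KB units, which the Eventlog service is generally documented to require; treat both as assumptions to check against the Microsoft documentation.

```python
# Hypothetical helper: build the text of a .reg file that registers a custom eventlog.
# Values mirror the fragment above: Retention=0, MaxSize=0x07ff0000.

EVENTLOG_ROOT = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Eventlog"


def custom_eventlog_reg(name: str, max_size: int, retention: int = 0) -> str:
    """Return the contents of a .reg file for a custom eventlog called `name`."""
    if max_size % 0x10000:
        # Assumption: the Eventlog service expects MaxSize in 64 KB increments.
        raise ValueError("MaxSize should be a multiple of 64 KB")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{EVENTLOG_ROOT}\\{name}]",
        f'"Retention"=dword:{retention:08x}',  # REG_DWORD, 8 hex digits
        f'"MaxSize"=dword:{max_size:08x}',
        "",
    ]
    return "\r\n".join(lines)  # .reg files conventionally use CRLF line endings


if __name__ == "__main__":
    max_size = 0x07FF0000
    print(custom_eventlog_reg("Alarm", max_size))
    print(f"MaxSize = {max_size} bytes = {max_size / 2**20:.2f} MiB")
```

Writing the text to disk (and choosing the file encoding regedit expects) is deliberately left out of the sketch.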

Then I filled the eventlog using a batch (.bat) file. As a result, the eventlog contains 100200 events in a strict sequence that follows a specific pattern (filling the eventlog took 18 minutes).
This is the structure of the resulting eventlog, showing how it was formed (newest events first):

2010.Jun.28 22:26:20 FireSRC Information 555 100000 range END. Text for ALARM with ID 555
2010.Jun.28 22:26:20 FireSRC Error 333 1000 - in 100000 range. Text for ALARM with ID 333
2010.Jun.28 22:26:20 FireSRC Error 333 999 - in 100000 range. Text for ALARM with ID 333
2010.Jun.28 22:26:20 FireSRC Error 333 998 - in 100000 range. Text for ALARM with ID 333
..................................................................
2010.Jun.28 22:08:16 FireSRC Error 333 2 - in 2000 range. Text for ALARM with ID 333
2010.Jun.28 22:08:16 FireSRC Error 333 1 - in 2000 range. Text for ALARM with ID 333
2010.Jun.28 22:08:16 FireSRC Warning 444 2000 range START. Text for ALARM with ID 444
2010.Jun.28 22:08:16 FireSRC Information 555 1000 range END. Text for ALARM with ID 555
2010.Jun.28 22:08:16 FireSRC Error 333 1000 - in 1000 range. Text for ALARM with ID 333
2010.Jun.28 22:08:15 FireSRC Error 333 999 - in 1000 range. Text for ALARM with ID 333
...........................................................
2010.Jun.28 22:08:02 FireSRC Error 333 2 - in 1000 range. Text for ALARM with ID 333
2010.Jun.28 22:08:02 FireSRC Error 333 1 - in 1000 range. Text for ALARM with ID 333
2010.Jun.28 22:08:02 FireSRC Warning 444 1000 range START. Text for ALARM with ID 444

The basic principle: every block of one thousand ordinary events (EventID 333) is framed by distinctive events (EventID 444 at the start of the range, EventID 555 at the end), which we will filter on the Zabbix agent side.
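The pattern in the listing can be modelled to double-check the totals: 100 ranges (labelled 1000 to 100000), each with one EventID 444, one thousand EventID 333 and one EventID 555, give 100 × 1002 = 100200 events, of which 200 are the distinctive ones. This is only a sketch of the sequence visible above, in write order (oldest first); it does not reproduce the attached bat file, and the names `alarm_sequence`, `RANGES` and `BLOCK_SIZE` are hypothetical.

```python
from collections import Counter

BLOCK_SIZE = 1000  # ordinary EventID 333 events per range
RANGES = 100       # ranges labelled 1000, 2000, ..., 100000


def alarm_sequence(ranges: int = RANGES, block: int = BLOCK_SIZE):
    """Yield (type, event_id, description) tuples in write order, oldest first."""
    for r in range(1, ranges + 1):
        label = r * block
        yield ("Warning", 444, f"{label} range START. Text for ALARM with ID 444")
        for n in range(1, block + 1):
            yield ("Error", 333, f"{n} - in {label} range. Text for ALARM with ID 333")
        yield ("Information", 555, f"{label} range END. Text for ALARM with ID 555")


events = list(alarm_sequence())
counts = Counter(event_id for _, event_id, _ in events)
print(len(events))                # 100200
print(counts[444] + counts[555])  # 200
```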

You can take the bat file and the reg file from the attached archive.
I think this little how-to may be useful to anyone who needs to verify an item key or complex triggers under real-world conditions, by deliberately generating test events in the eventlogs and checking whether Zabbix works correctly.

Then I created several different item keys and ran experiments.
So:
The first measurement: the performance (speed) of reading the eventlog with several items that filter by EventID on the agent side.

Note right away: if DebugLevel=4 is set in the agent config, eventlog processing speed drops catastrophically, so speed must be measured without debug logging!

All agent parameters that may affect performance were left at their defaults, with one exception: MaxLinesPerSecond=1000. This was done to make the difference in agent speed more pronounced.

All items have Update interval (in sec) = 1.

Thus, the first experiment: two items with keys:
eventlog[Alarm,,,,444]
eventlog[Alarm,,,,555]
The agent processed 100200 events and sent 200 events to the server in 1 min 30 sec.
See picture "Two_Item_ID444+555.png".

Second experiment: a single item with key:
eventlog[Alarm,,,,444|555]
The agent processed 100200 events and sent 200 events to the server in 1 min 10 sec.
See picture "One_Item_ID444_or_555.png".

That is significantly faster than the first experiment. The load on this Core2Duo 3.0 GHz processor is also lower (see picture "processor_load_differenses.png").
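To put "significantly faster" into numbers, here is the arithmetic on the two reported timings (90 s and 70 s for the same 100200 events). Nothing is measured here; the script only restates the figures above as throughput and relative time saved.

```python
TOTAL_EVENTS = 100_200  # events in the "Alarm" eventlog

# Elapsed times reported for the two experiments, in seconds.
experiments = {
    "two items: 444 and 555": 90,
    "one item: 444|555": 70,
}

for name, seconds in experiments.items():
    rate = TOTAL_EVENTS / seconds  # log events processed per second
    print(f"{name}: {rate:.0f} events/s")

two, one = experiments.values()
print(f"time saved by the single item: {1 - one / two:.0%}")
```

This prints 1113 events/s for the two items, 1431 events/s for the single item, and a time saving of 22%.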

Another measurement: the order in which the agent builds and sends events.

As you can see in picture "Two_Item_ID444+555.png", the events were built and transmitted to the server in a different order from the one in which they were created on the Windows host. They are built in the order the agent reads them (by two different items) and are transmitted to the server in that order. This is not quite right!

My suggestion: when the agent requests and receives the list of active checks from the server, it should put all items for each eventlog into a separate group, and while parsing the eventlog it should run each new event through all items of that group in a single pass! This would preserve the real order of events on the agent side within each unique log. The idea should also improve performance.
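A simplified simulation of the difference between the current behaviour and the proposal. Each item is modelled only by its EventID filter, using Python's `re.search` as a stand-in for the agent's regular-expression matching; `Item`, `per_item_passes` and `grouped_single_pass` are hypothetical names, not agent code. The real agent reads a limited batch per check, so in practice the reordering is presumably bounded by the batch size rather than spanning the whole log as it does here.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    key: str
    eventid_regex: str  # hypothetical: only the <eventid> filter is modelled


# A small log excerpt in write order: (record number, EventID).
LOG = [(1, 444), (2, 333), (3, 333), (4, 555), (5, 444), (6, 333), (7, 555)]
ITEMS = [Item("eventlog[Alarm,,,,444]", "444"), Item("eventlog[Alarm,,,,555]", "555")]


def per_item_passes(log, items):
    """Each item reads the log on its own; results are sent item by item."""
    sent = []
    for item in items:
        pattern = re.compile(item.eventid_regex)
        sent += [(rec, item.key) for rec, eid in log if pattern.search(str(eid))]
    return sent


def grouped_single_pass(log, items):
    """Proposal: read each record once and run it through every item in the group."""
    patterns = [(item.key, re.compile(item.eventid_regex)) for item in items]
    sent = []
    for rec, eid in log:
        for key, pattern in patterns:
            if pattern.search(str(eid)):
                sent.append((rec, key))
    return sent


print([rec for rec, _ in per_item_passes(LOG, ITEMS)])      # [1, 5, 4, 7]
print([rec for rec, _ in grouped_single_pass(LOG, ITEMS)])  # [1, 4, 5, 7]
```

Both strategies send the same set of events; only the single pass keeps them in log order, and it reads each record once instead of once per item.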

Example of realistic items for a single Windows host:
eventlog[System,,,,26|6009]
eventlog[System,,,Warning,1007|3019|24]
eventlog[System,,,Error]
should be combined into one group.
The only open question is what to do with the item attribute Update interval (in sec). Perhaps it should also be used as a grouping criterion. In that case, the documentation should recommend setting the same Update interval (in sec) when using multiple items for one log.
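A sketch of the grouping rule discussed above, assuming the group key is (log name, update interval). The key parser is deliberately naive (it takes the first parameter and ignores quoting), the update intervals are hypothetical, and the extra `Alarm` item is made up to show a second group; none of this is agent code.

```python
from collections import defaultdict


def log_name(key: str) -> str:
    """Return the first parameter of an eventlog[...] key, i.e. the log name.

    Naive parser: assumes no quoted parameters and no commas inside them.
    """
    inside = key[key.index("[") + 1 : key.rindex("]")]
    return inside.split(",", 1)[0]


def group_items(items):
    """Group (key, update_interval) pairs by (log name, update interval)."""
    groups = defaultdict(list)
    for key, interval in items:
        groups[(log_name(key), interval)].append(key)
    return dict(groups)


items = [
    ("eventlog[System,,,,26|6009]", 1),               # intervals are hypothetical
    ("eventlog[System,,,Warning,1007|3019|24]", 1),
    ("eventlog[System,,,Error]", 1),
    ("eventlog[Alarm,,,,444]", 30),                   # hypothetical item on another log
]

for (log, interval), keys in group_items(items).items():
    print(log, interval, keys)
```

With equal intervals, the three System items land in one group that a single reader could serve; giving one of them a different interval would split it into its own group, which is exactly the open question above.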

I also want to show that in a real environment there can be situations where the issue I described matters.
For example, the Zabbix agent works for a long time without a connection to the server and cannot send events. After the connection is restored, it will process the log and send events at high speed (as in my first experiment).
Another example: a host is set to Not monitored in the Zabbix web interface for a significant amount of time and then switched back to Monitored, etc. If the Zabbix administrator wants to see the list of events for several items, they will not get a reliable eventlog with Windows events in the correct order of occurrence.
In other words, the administrator should not have to worry that events might arrive in the wrong order, even in very rare situations; they should always be sure that events arrive in the correct order!
As for how triggers behave in these circumstances, I'd rather say nothing; that is another story.

Although the experiment was done with the eventlog[] item key, everything said above probably also applies to the log[] and logrt[] keys.
Thank you for your attention.

Sorry for my English (the original Russian text is attached).



 Comments   
Comment by Anthony [ 2010 Jul 06 ]

I second the previous speaker: please consider improving the Windows log monitoring module.
This topic is very relevant. I have no complaints about the other modules, but eventlog[] still needs to be brought to perfection =).

Thank you for your attention.
Zabbix4ever.

Comment by Alexander Vladishev [ 2015 Feb 08 ]

Many improvements to log monitoring were introduced in recent versions of Zabbix. Please try them!
Also, ZBXNEXT-444 will be finalized soon.

I am closing the issue.

Comment by Oleksii Zagorskyi [ 2015 Feb 08 ]

The idea I described here actually belongs more to the ZBXNEXT project.
The idea is: if there are several [event]log* items for the same log file, read the log file once and process it as many times as required.
Of course, this would require grouping items at least by update interval. In recent years we started doing something similar for ICMP and SNMP checks on the Zabbix server side.
This idea will not be resolved by ZBXNEXT-444 or by any other issue.

Meanwhile, I have no objections to closing the current issue.

Generated at Fri Apr 19 12:29:12 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.