[ZBXNEXT-444] Improvement processing of event logs when Agent start and stop (value "lastlogsize" optimization) Created: 2010 Jul 07  Updated: 2015 Dec 11  Resolved: 2015 Feb 20

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Agent (G), Proxy (P), Server (S)
Affects Version/s: 1.8.2
Fix Version/s: 2.5.0

Type: Change Request Priority: Major
Reporter: Oleksii Zagorskyi Assignee: Unassigned
Resolution: Fixed Votes: 6
Labels: eventlog, logmonitoring, windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File Improve processing of event logs when Agent start and stop (value 'lastlogsize' optimization).txt    
Issue Links:
Duplicate
is duplicated by ZBXNEXT-1866 Zabbix agent re-reads the log file fr... Closed
is duplicated by ZBXNEXT-91 latlogsize sending to server Closed
is duplicated by ZBX-8866 logrt[] item becomes unsupported afte... Closed
is duplicated by ZBX-8869 log/logrt items status doesn't change... Closed
is duplicated by ZBX-10153 When specifying logrt skip to item ke... Closed
is duplicated by ZBX-3662 lastlogsize is not sent from server t... Closed
is duplicated by ZBX-9507 log item stuck at "Not Supported" Closed

 Description   

As I promised, this is my third serious post about how the Zabbix agent works with Windows event logs.
Here we will focus on how the agent begins to work (after launch) and exits (stops and restarts).
I will also propose improvements to how items with the key eventlog[XXXXXX] behave when first added to the system.
I know that version 2.0 adds a new feature to the eventlog[] key:
mode - one of all (default), skip (skipping processing of older data)

But this feature simply instructs the agent to handle only those log entries that occur after the agent starts, and I do not want to miss events that occurred while the agent was not running. That matters to me, so I will not use this option, although for others it may be very useful.

As in my previous reports ZBXNEXT-443 and ZBX-2651, I have drawn attention to how the agent filters messages on the agent side. As an example, take an absolutely real problem:

Example № 1. We must catch rare messages from the Security event log - "Account lockout" events on a domain controller.
The key looks like this:
eventlog[Security,,"Success Audit",,644]
There is a tremendous flow of events here - up to a million events per day - while an account lockout event may occur, for example, only once every few days.
Note: a colleague at work has a Security event log of 2 GB - about 6 million event records.

Example № 2. We need to catch the event with the results of the disk check at boot (chkdsk - checking the file system):
eventlog[Application,,,Winlogon,1001]
The flow of events here is much smaller than in the Security log, but the event may appear very rarely - for example, once a year (and it is a very important event!).

What both examples have in common is that the interesting events occur among a very large number of other events.
When the agent finds a match, it sends the message to the server, and the server writes a special number of the last received event into the "lastlogsize" field of the "items" table in the database. This number is an internal record number of the event log that is not visible anywhere in the UI.
Then several days pass, during which several million records are written to the Security log, and we may need to reboot the server or restart the Zabbix agent.
When the agent starts, it receives the list of active checks and will process several million events for the key from Example № 1 - and this is very bad. It wastes CPU resources, takes a long time until the agent reaches the end of the event log, and, most importantly, makes the agent completely unpredictable - it can affect the operation of other checks. By the way, there are forum discussions about this issue.

The same applies to Example № 2. Take a workstation that is booted every day. Every morning the agent would process the entire System event log, covering perhaps several years. That is bad! And what if there are several similar checks?

In short, here is what I propose changing in the agent:
When the agent stops, it should send a message to the server for each item of the active check type (or perhaps only for log, logrt and eventlog items) containing the current lastlogsize value. The agent needs only a few milliseconds for this before finally shutting down.
Whether this will be a special message that is not written to the history tables, or a normal (hard-coded) message that is written to history, is up to you. I actually prefer the second variant - it would show when the agent stopped, which might be useful. In fact, for all items within one log the lastlogsize value will be the same and will correspond to the last event log record.
As a result, on the next run the agent will continue from where it stopped; that is, it will not re-check what it has already checked.
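Schematically, such a shutdown message could reuse the existing active-check protocol and carry only meta information, with no "value" tag. This is purely an illustration of the proposal - the key and numbers are made up, and only the field names mirror the "agent data" messages quoted in the comments of this issue:

```json
{
    "request": "agent data",
    "data": [
        {
            "host": "Zabbix server",
            "key": "eventlog[Security,,\"Success Audit\",,644]",
            "lastlogsize": 6000000,
            "clock": 1278500000,
            "ns": 0
        }
    ],
    "clock": 1278500000,
    "ns": 0
}
```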

Another innovation I propose (again, the future mode -> skip is not for us!):
Allow the user, both when creating an item and at any later moment, to force the agent - the next time it receives the list of active checks - to go to the end of the log and not send any existing entries.
Why I propose this: the forum has many posts from users agonizing over whether the agent will start sending the whole of a huge log.
I am almost 100% sure that nobody needs that. It is better that when an item is first added, the entire log is not sent.
I think this can be done without changing the database schema. For example, when creating a new item, set lastlogsize = 99999999999 - some very large value. The agent, for its part, should handle this case the same way as mode -> skip. That's it; nothing else is needed. All of this could even be done in the 1.8 branch.
It can be done even better - when configuring the item, give the user the chance to forcibly set this attribute (lastlogsize = 99999999999) at any time. For example, if the agent has not handled the event log for a long time (the item has been disabled for a long time), then the next time the agent refreshes the list of active checks it would not re-check the event log. Is that useful? Of course!
All of the above is probably also true for the log[] and logrt[] keys.

Well, that's all, my friends. This was my third and last message; the next one will be about triggers for the event log.
Sorry there are no pictures - next time.
As always, sorry for my English (the original Russian text is attached).

Specification: https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-444



 Comments   
Comment by richlv [ 2010 Jul 07 ]

i'd say there are only two suggestions here:

1. sending of lastlogsize to the server upon agent shutdown (and i'd also suggest periodic sending of lastlogsize even if no new messages for the server have appeared);
2. ability to make mode=skip work only once, and ability to force mode=skip/all easily (i suppose the last one can be done by editing an item, but maybe there's something to improve here).

this is also slightly similar to ZBXNEXT-91

Comment by Oleksii Zagorskyi [ 2010 Jul 20 ]

I noticed that when the item key is changed, the lastlogsize value is set to 0. This makes sense if, for example, the user switches the key from the Application log to the System log.

But if, for example, the key changes from 'eventlog[Application,,,,16|66]' to 'eventlog[Application,,,,16|66|703]', i.e. we "extend the mask" of captured events, there is no need to set 'lastlogsize=0'. Setting 'lastlogsize=0' instructs the agent to re-read the entire log and re-send all events with ID=16, 66 and 703, which, incidentally, we do not need either.

Here the option I asked for in point № 2 (of Richlv's comment) would help very much.
For example, it could be a checkbox "Do not process existing log next time" in the item editing form. This checkbox would manage the lastlogsize value when the item is saved.

Then the documentation could note: if you change the key and do not want the entire log re-sent, check the checkbox "...".
This is useful stuff and should be done as quickly as possible.

zalex_ua as stated below, it's not true starting from 2.0.0, so -> CLOSED.

Comment by Oleksii Zagorskyi [ 2010 Aug 27 ]

I had already forgotten about this, and was again caught by the fact that when an item key is changed, 'lastlogsize' is reset to 0.
The result - a few hundred emails in my inbox.
In this regard, I have an idea: if the key is a log, logrt or eventlog key and it was edited, check whether the user has ticked the checkbox "Do not process existing log next time", and if not, issue a warning asking the user to confirm saving the changes.
If this is not done, even those who constantly work with logs will forget that 'lastlogsize' is reset and, like me, will end up with hundreds of emails.

The warning could read:
The entire log will be re-read. Continue?

zalex_ua as stated below, it's not true starting from 2.0.0, so -> CLOSED!

Comment by Oleksii Zagorskyi [ 2012 Aug 27 ]

Just another issue which is similar to this one in one respect: the agent should report "something" to the server before a correct stop.
ZBXNEXT-1375

Comment by Oleksii Zagorskyi [ 2012 Dec 19 ]

Here is a description of a case that is not very obvious.

Suppose we have a key:
log[/var/log/somelog,unique-text]

The text "unique-text" appears VERY rarely in the log.
Because of specific log rotation, the log file can be missing for, say, 5 minutes.

As a result the item becomes visually not supported in the frontend, but actually, after 10 minutes (refresh unsupported), it continues to REALLY be monitored by the agent.

The item will return to the supported state (visually) only when the agent finds a line with the text "unique-text" in the log. But all this time (hours, days) it will look not supported in the frontend, and it is ABSOLUTELY NOT CLEAR WHAT IS GOING ON.

I have no idea how it could be improved.

Comment by Oleksii Zagorskyi [ 2013 Aug 22 ]

The problem with updating "lastlogsize" after changing an item key is also reported in ZBXNEXT-1866 (linked), and it has a patch.

Comment by Oleksii Zagorskyi [ 2014 Jul 02 ]

Rich once said in IRC, and now I have checked it: starting from frontend 2.0.0 the "lastlogsize" column is no longer reset to 0 when the item key is updated.

Comment by Aleksandrs Saveljevs [ 2014 Oct 23 ]

Specification available at https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-444 .

Comment by dimir [ 2014 Dec 02 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-444 .

Comment by dimir [ 2014 Dec 02 ]

(1) One question. Currently the agent would send a meta information update with every new line in the log file. E.g.

$ echo -e "foo\nbar\nbaz" >> /tmp/my.log

would generate 3 meta information update packets, each with a 4-byte change in lastlogsize, no matter whether the line matched the given pattern or not. Is that the correct behavior, or would we want to send 1 update packet with a 12-byte change in lastlogsize?

zalex_ua "3 meta information update packets" looks redundant to me. It would be better to send "1 update packet with a 12-byte change".
Then we need to select an upper limit. Maybe send it no less often than the MaxLinesPerSecond configuration parameter implies?

<dimir> So I guess that would be something like MetaUpdateLines - the number of lines required to trigger a meta information update packet.

<dimir> We decided to send a meta information update at most once per log check. If a user sets the delay to 10 seconds, I guess he is OK with having one packet every 10 seconds even if there is no match.

zalex, if you feel there is a need for a configuration option to control that process, perhaps you could create a zbxnext?

<richlv> users commonly set (and we suggest they do) the log monitoring item interval to 1 second. it's not a big deal (or at least it does not seem to be currently), but maybe a new zbxnext to send meta info only every 10 or so seconds would be useful

asaveljevs Quoting the specification:

Active agent will send "lastlogsize", "mtime" and "state" updates at the same time as it usually sends historical data.

In Zabbix agent configuration file, this roughly corresponds to BufferSend configuration option.

It has not been explicitly stated in the specification, but the way I imagined how it would work is that at most one meta information update for a single log item would be sent per BufferSend seconds (only in case something has changed). If there are any technical difficulties achieving this behavior, please describe them.

<dimir> RESOLVED in r50989

asaveljevs CLOSED

Comment by dimir [ 2014 Dec 02 ]

OK, I just had a conversation with sasha and we agreed that this is not good. In case of no match, only one meta information update packet should be sent by the agent. Will add this to the spec.

Comment by dimir [ 2014 Dec 04 ]

(2) I noticed one more thing while working on this issue. The agent would not send a state change of a NOTSUPPORTED log item before the size or mtime of the log file changes. E.g. scenario:

  • the log is monitored, lastlogsize is 10
  • log file permissions are set to not allow zabbix agent user read access
  • log item becomes NOTSUPPORTED
  • log file permissions grant read access back to zabbix agent user
  • agent will not send state change until size or mtime of the log file is changed

Should we change the agent behavior to send the log item state change as soon as there is no error any more?

<richlv> that would be very good, but what data would we send if none of those values changes ?

asaveljevs Yes, we should send "state" change, even if "lastlogsize" and "mtime" have not changed. Specification is explicit about it:

If it does not have any data to send, or "lastlogsize" and "mtime" increased further after last data was acquired, or "state" has changed, it sends "lastlogsize", "mtime" and "state" update with "value" JSON tag omitted (note that if "state" tag is omitted, it defaults to "0").

<dimir> RESOLVED in r51075

asaveljevs If server's "Refresh unsupported items" setting is sufficiently low (e.g., 1 second) so that unsupported items are always included in the active check list, then the problem described above by dimir is still observed. REOPENED.

<dimir> RESOLVED in r51717

asaveljevs "goto" was unnecessary, because it did not jump over anything. RESOLVED in r51721.

<dimir> Agreed. CLOSED

Comment by richlv [ 2014 Dec 04 ]

(3) spec says this should resolve ZBXNEXT-91, too - once finalised, that should be explicitly tested and that issue closed

<dimir> Yep, that is already there, in the dev branch. But I'll note it as explicitly resolved a bit later.

<dimir> RESOLVED in r50976

asaveljevs CLOSED

Comment by dimir [ 2014 Dec 05 ]

(4) [F] It seems that when changing a log item key (even e.g. logrt[*] -> log[*]) lastlogsize is not updated in the items table. Was it always like that? Should we consider resetting lastlogsize, mtime and state when the item key is changed?

zalex_ua read all the comments above.
I think we need to leave it as it is now, i.e. not update lastlogsize.

<dimir> Right, I was also thinking that would be the best choice.

CLOSED

Comment by dimir [ 2014 Dec 08 ]

FYI, new feature request, respect "refresh unsupported" by active agent: ZBXNEXT-2633 .

Comment by Aleksandrs Saveljevs [ 2014 Dec 15 ]

(5) Currently, in function proxy_get_history_data(), we always send "lastlogsize" and "mtime". Would it make sense to only send these for log items?

<dimir> Seems like a question for sasha, rather. As for me, I think it's good to avoid sending useless data.

<dimir> sasha agreed on this idea so I'm going to implement it.

<dimir> RESOLVED in r51239

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ]

(6) In src/zabbix_agent/active.c, there is a new function:

static int	metric_ready_to_process(const ZBX_ACTIVE_METRIC *metric)
{
	if (ITEM_STATE_NORMAL != metric->state)
	{
		if (ITEM_STATE_NOTSUPPORTED != metric->state || 0 == metric->refresh_unsupported)
			return FAIL;
	}

	return SUCCEED;
}

If metric->state is neither ITEM_STATE_NORMAL nor ITEM_STATE_NOTSUPPORTED, then what is it?

<dimir> I made it so to allow for possible new states in the future. If those appeared, they would mean "not ready" by default, and we would have a single entry point where to handle them, instead of going throughout the code and adjusting checks there.

RESOLVED

asaveljevs I am not aware of any place where we protect against hypothetical code changes in the future - code should reflect its current state. In this particular case, if we add a new item state (which is currently not being planned), we would still have to go through the code and check every place, including this one. Moreover, the change assumes that the hypothetical item state that we will add will mean "not ready for processing", but it might as well be the other way round.

asaveljevs Perhaps discussable further with sasha. REOPENED.

<dimir> Agreed. RESOLVED in r51372

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ]

(7) In parse_list_of_checks(), we can return from the function without destroying the "added_metrics" vector. In the current implementation this does not leak memory, but from a coding style perspective it is not good.

<dimir> RESOLVED in r51202

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ]

(8) In parse_list_of_checks(), where we remove active checks that were not received, we can erroneously skip some, because after zbx_vector_ptr_remove() the variable "i" should stay the same on the next iteration. Also, if the metric order is not important, consider using zbx_vector_ptr_remove_noorder() for efficiency.

<dimir> RESOLVED in r51204

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ]

(9) The following comment was reformatted, but the line is too long now:

if (NULL != mtime)	/* will be null for "eventlog" and "log" and the value will be 0, only "logrt" matters */
	el->mtime = *mtime;

While fixing that, it would be good to paraphrase it to be more understandable.

<dimir> I agree on rephrasing, but the length is exactly 120 characters, which is not over the limit?

RESOLVED in r51206

asaveljevs Based on r51205:

$ grep will.be.0 src/zabbix_agent/active.c | wc -L
121

asaveljevs Removed the comment completely in r51310. It was not necessary. RESOLVED.

<dimir> CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ]

(10) I wonder why the following change was done in process_eventlog_check() function:

Index: src/zabbix_agent/active.c
===================================================================
--- src/zabbix_agent/active.c	(revision 51138)
+++ src/zabbix_agent/active.c	(working copy)
@@ -1047,7 +1114,6 @@
 		goto out;
 	}
 
-	s_count = 0;
 	p_count = 0;
 	lastlogsize = metric->lastlogsize;
 

This seems to leave "s_count" variable uninitialized.

<dimir> Unintentional changes, don't know how that could happen. RESOLVED in r51205

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ]

(11) Please review stylistic suggestions in r51203.

<dimir> Looks great, thanks! CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 29 ]

(12) In the process_active_checks() function, in case an item becomes not supported, we call process_value() with persistent=0. However, when a meta information update is needed, we call process_value() with persistent=1. Shouldn't persistent be 0 in the latter case, too?

<dimir> Agreed. RESOLVED in r51365

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Dec 29 ]

(13) Zabbix server crashes in the following scenario. A log[] item was created with a non-existing file. Zabbix agent correctly reported it to the server as follows:

{
    "clock": 1419859792,
    "data": [
        {
            "clock": 1419859789,
            "host": "Zabbix server",
            "key": "log[/tmp/logfile.txt]",
            "ns": 311836170,
            "state": 1,
            "value": "Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory"
        }
    ],
    "ns": 312502657,
    "request": "agent data"
}

After some time, the agent sent another report that the item is still not supported:

{
    "clock": 1419859910,
    "data": [
        {
            "clock": 1419859909,
            "host": "Zabbix server",
            "key": "log[/tmp/logfile.txt]",
            "ns": 341259525,
            "state": 1
        }
    ],
    "ns": 341554742,
    "request": "agent data"
}

This, I believe, should not have happened. Also note the absence of the error message.

It is at this point that the server crashed with the following backtrace:

 31039:20141229:153150.341 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...
 ...
 31039:20141229:153150.343 15: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](zbx_strlen_utf8_nchars+0x67) [0x49dde5]
 31039:20141229:153150.343 14: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](zbx_db_strlen_n+0x23) [0x4e3c74]
 31039:20141229:153150.343 13: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data]() [0x470ad7]
 31039:20141229:153150.343 12: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](dc_add_history+0x41) [0x470c4e]
 31039:20141229:153150.343 11: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](process_mass_data+0x6f3) [0x4c7ecd]
 31039:20141229:153150.343 10: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](process_hist_data+0x6e4) [0x4c8dc5]

<dimir> My bad, the second packet shouldn't have been sent.

RESOLVED in r51369

asaveljevs The diff looks like it only changes agent code. Therefore, it is still possible to crash the server by sending bad data. REOPENED.

<dimir> Currently we do not handle the case when the proxy/server receives an unsupported item without an error message. It seems to me such handling shouldn't be introduced within the scope of this issue.

RESOLVED

asaveljevs Before implementing this issue, the server and proxy did not crash when they received an item without an error message. So you plan to introduce a way for any malicious attacker to crash the server within the scope of this issue? No, that will not do. Such messages should be rejected. REOPENED.

<dimir> My bad, you are totally right. RESOLVED in r51405,r51442

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ]

(14) Suppose we are monitoring a log file and fill it as follows:

$ echo -ne 'line 1\nline 2\nline 3\nline 4\nline 5\n' > /tmp/logfile.txt

Zabbix agent then sends the following two messages with BufferSend seconds in between:

  5675:20150108:120757.049 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 1","lastlogsize":7,"clock":1420711677,"ns":49700463}],"clock":1420711677,"ns":49771574}]
  5675:20150108:120802.053 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 2","lastlogsize":14,"clock":1420711677,"ns":50627725},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 3","lastlogsize":21,"clock":1420711677,"ns":50790296},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 4","lastlogsize":28,"clock":1420711677,"ns":50911200},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 5","lastlogsize":35,"clock":1420711677,"ns":51042818}],"clock":1420711682,"ns":53654876}]

It can be seen that the first line is sent in a separate message, while the rest of the lines are sent in the second message. The agent should probably send all lines in one message.

asaveljevs Although that seems to be the behavior in the latest 2.2 and the current trunk as well, so fixing this is optional.

asaveljevs Discussed with sasha: will create a new ZBX to fix in trunk only, if not fixed here.

asaveljevs Moved out to ZBX-9229. WON'T FIX.

Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ]

(15) If a file has 0 size (think ZBXNEXT-91), then the agent sends data without "lastlogsize":

 14710:20150108:124238.062 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","clock":1420713757,"ns":62134475}],"clock":1420713758,"ns":62382604}]

This happens to work, but I do not think it is according to the protocol. The agent should include a "lastlogsize" of 0.

<dimir> RESOLVED in r51500

asaveljevs "lastlogsize" and "mtime" are now sent for every item, but it would probably be better to send them for log items only. REOPENED.

<dimir> Send lastlogsize and mtime only for active items of log type. Proposing to use flags for the elements of the active items buffer. RESOLVED in r51565

asaveljevs Commit in r51617 renames ZBX_FLAG_ELEMENT_PERSISTENT to ZBX_BUFFER_ELEMENT_PERSISTENT and moves these flag constants to a different (more appropriate) header. Please take a look. RESOLVED.

<dimir> I have also renamed the set of flags to ZBX_METRIC_FLAG_*. It seems to sound better along with the new flags that I have added. Also, the type of the metric is now stored in the flags, so we do not need to strncmp() item keys during every check. Please check part of the commit in r51646. RESOLVED.

asaveljevs It is possible to define ZBX_METRIC_FLAG_LOG differently, saving bit space, as is done in src/libs/zbxdbhigh/lld_item.c, for instance. RESOLVED in r51650 and r51653.

<dimir> CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ]

(16) Suppose we are monitoring two items: log[/tmp/logfile.txt] and log[/tmp/logfile.txt,SENDME]. We then create a new file as follows:

$ echo -ne 'line a\n!SENDME!\nline b\n' > logfile.txt

Zabbix agent will then send the following:

 15134:20150108:125915.436 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"!SENDME!","lastlogsize":16,"clock":1420714755,"ns":435783934}],"clock":1420714755,"ns":435903828}]
 15134:20150108:125920.441 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","lastlogsize":23,"clock":1420714755,"ns":436784750},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line a","lastlogsize":7,"clock":1420714755,"ns":437154119},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"!SENDME!","lastlogsize":16,"clock":1420714755,"ns":437278892},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line b","lastlogsize":23,"clock":1420714755,"ns":437403214}],"clock":1420714760,"ns":440947476}]

This seems to be correct. However, in "Latest data" we see only "line a" and "line b" for the first item and "!SENDME!" and an empty value for the second item.

<dimir> RESOLVED in r51520

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ]

(17) Perform the same scenario as 16, wait a second, and then remove "logfile.txt". Zabbix agent will send the following, which is correct:

 15134:20150108:130950.879 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"!SENDME!","lastlogsize":39,"clock":1420715390,"ns":879491221}],"clock":1420715390,"ns":879613275}]
 15134:20150108:130955.885 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","lastlogsize":46,"clock":1420715390,"ns":880508379},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line a","lastlogsize":30,"clock":1420715390,"ns":880879926},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"!SENDME!","lastlogsize":39,"clock":1420715390,"ns":881004172},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line b","lastlogsize":46,"clock":1420715390,"ns":881127530},{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715394,"ns":885074194},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715394,"ns":885253293}],"clock":1420715395,"ns":885555347}]

However, the SENDME item stays supported in the frontend. Moreover, even when the agent sends the following lines later, the SENDME item is still supported in the frontend:

 15134:20150108:131051.899 sending [{"request":"active checks","host":"Zabbix server"}]
 15134:20150108:131056.902 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715456,"ns":902500077}],"clock":1420715456,"ns":902665723}]
 15134:20150108:131101.904 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715456,"ns":903478041}],"clock":1420715461,"ns":904777423}]

<dimir> I tried two times and couldn't reproduce it. Could you please give more details, e.g. how long to wait before removing the file, etc.?

asaveljevs Might have been fixed by (16). CANNOT REPRODUCE.

<dimir> Actually, I tried it before deploying the fix for (16).

Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ]

(18) Please review my suggestions in r51450.

<dimir> CLOSED

Comment by Oleksii Zagorskyi [ 2015 Jan 09 ]

Check ZBX-9195 before closing current issue.

Comment by Aleksandrs Saveljevs [ 2015 Jan 13 ]

(19) Suppose we are monitoring a file using log[/tmp/logfile.txt] and we create this log file using the following command:

$ echo -ne 'line\n' > /tmp/logfile.txt

Agent sends the following:

[{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line","lastlogsize":5,"mtime":0,"clock":1421156171,"ns":932857930}],"clock":1421156172,"ns":933155449}

This is correct. If we now remove the file, then agent sends:

{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":5,"mtime":0,"clock":1421156181,"ns":938771840}],"clock":1421156182,"ns":939035400}

This is correct. However, if we now recreate the file using the same command above, then we get the following:

{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line","lastlogsize":5,"mtime":0,"clock":1421156241,"ns":953114015}],"clock":1421156241,"ns":953163636}
{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","lastlogsize":5,"mtime":0,"clock":1421156241,"ns":953593188}],"clock":1421156246,"ns":956022902}

Two messages are sent, the second one being redundant.

<dimir> Let's send the state along with the list of active checks. RESOLVED in r51568,r51570

asaveljevs The solution is suspicious, because it changes server and agent communication protocol without a specification. This is undesirable.

asaveljevs Also, the following in function add_check() does not look good:

	...
	metric->state = ITEM_STATE_NORMAL;
	metric->refresh_unsupported = 0;
	metric->lastlogsize = lastlogsize;
	metric->mtime = mtime;
	metric->state = state;
	...

asaveljevs As described by dimir, the above solution was implemented to address the following scenario:

  • item is not supported;
  • lastlogsize is zero;
  • agent is stopped;
  • file is recreated with zero size;
  • agent is started;
  • list of active checks is acquired;
  • file is checked, but lastlogsize has not changed and therefore nothing is sent to the server;
  • item remains not supported on the server.

asaveljevs It is suggested that when an item is first received from the server after agent startup, its update is always sent to the server. REOPENED.

<dimir> Introduced a new metric flag, ZBX_METRIC_FLAG_NEW, which ensures that at least a meta information update packet is always sent after agent startup. This fixes the situation where an unsupported log item changes state without changing meta information during a restart. RESOLVED in r51646.

asaveljevs Unfortunately, the problem was not solved: two messages are still sent in the above scenario. REOPENED.

<dimir> Sorry, there was a fault in the logic that determines whether a meta information update packet needs to be sent. RESOLVED in r51673.

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 16 ]

(20) According to the specification, only logrt[] should send "mtime". Currently, all log items send it.

<dimir> RESOLVED in r51674

asaveljevs CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 19 ]

(21) Please review changes in r51682, r51684, r51685, and r51694. In particular, r51684 updates to the latest trunk and resolves conflicts.

<dimir> CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 19 ]

(22) It is possible to crash Zabbix server by sending the following JSON using misc/debug/sender.pl:

{
    "clock": 1421658777,
    "data": [
        {
            "clock": 1421658771,
            "host": "Zabbix server",
            "key": "LOG[\"/tmp/logfile.txt\"]",
            "lastlogsize": 108,
            "ns": 944843829
        }
    ],
    "ns": 948056387,
    "request": "agent data"
}

Here, the item type is set to "Character" and the JSON does not include a "value" tag (as in a meta update).

Crash backtrace:

 25503:20150119:112742.799 === Backtrace: ===
 25503:20150119:112742.800 16: zabbix_server: trapper #2 [processing data](print_fatal_info+0x117) [0x491e88]
 25503:20150119:112742.800 15: zabbix_server: trapper #2 [processing data]() [0x492221]
 25503:20150119:112742.800 14: /lib/x86_64-linux-gnu/libc.so.6(+0x35180) [0x7fc335912180]
 25503:20150119:112742.800 13: zabbix_server: trapper #2 [processing data](zbx_replace_invalid_utf8+0x3b1) [0x49f956]
 25503:20150119:112742.800 12: zabbix_server: trapper #2 [processing data](set_result_type+0x2d8) [0x452525]
 25503:20150119:112742.800 11: zabbix_server: trapper #2 [processing data](process_mass_data+0x7a5) [0x4c9429]
 25503:20150119:112742.800 10: zabbix_server: trapper #2 [processing data](process_hist_data+0x6ea) [0x4ca275]
 25503:20150119:112742.800 9: zabbix_server: trapper #2 [processing data]() [0x43c3f0]
 25503:20150119:112742.800 8: zabbix_server: trapper #2 [processing data]() [0x43d3eb]
 25503:20150119:112742.800 7: zabbix_server: trapper #2 [processing data]() [0x43d9f0]
 25503:20150119:112742.800 6: zabbix_server: trapper #2 [processing data](trapper_thread+0x1a9) [0x43db9b]
 25503:20150119:112742.800 5: zabbix_server: trapper #2 [processing data](zbx_thread_start+0x37) [0x492cf5]
 25503:20150119:112742.800 4: zabbix_server: trapper #2 [processing data](MAIN_ZABBIX_ENTRY+0x587) [0x41a0e1]
 25503:20150119:112742.800 3: zabbix_server: trapper #2 [processing data](daemon_start+0x2ab) [0x49140c]
 25503:20150119:112742.800 2: zabbix_server: trapper #2 [processing data](main+0x268) [0x419b58]
 25503:20150119:112742.800 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc3358feb45]
 25503:20150119:112742.800 0: zabbix_server: trapper #2 [processing data]() [0x413119]

<dimir> RESOLVED in r51703

asaveljevs A better place for the condition seems to be a bit lower, so that more frequent conditions are executed first. RESOLVED in r51704.

<dimir> CLOSED

Comment by Aleksandrs Saveljevs [ 2015 Jan 20 ]

(23) As pointed out by sasha, we have missed one important thing in the specification: we have made it so that the proxy sends "lastlogsize" and "mtime" to the server, but not the other way round. By itself, sending meta fields from proxy to server is only useful if the agent is later switched from the proxy to the server, which is unlikely. The real motivation behind sending meta fields from proxy to server was that, if the proxy database is removed, the proxy should get the correct configuration from the server.

So the general idea is that the server should now send meta fields to the proxy (for log items only?) and the proxy should act similarly to the agent: if it already has configuration data for these items, it should ignore these fields; otherwise, it should create a new item with these meta values.

<dimir> Will send lastlogsize and mtime for all items. RESOLVED in r51743, r51746

sasha Please take a look at my solution in r52136.

<dimir> Looks great! CLOSED

Comment by Oleksii Zagorskyi [ 2015 Jan 29 ]

(24) Yesterday I discovered an edge case (not related to the current development) and want to discuss it.
The agent is 2.4.3, the server is 2.2.7.
We consider an eventlog[] key.

When a host goes into maintenance (without data collection), the server still responds with the list of the agent's active checks.
The agent continues to monitor these active items and tries to send values to the server, but the server rejects them:

  3376:20150129:020223.064 JSON before sending [{"request":"agent data","data":[{"host":"core2","key":"eventlog[Test]","value":"RandomID(8238)","lastlogsize":9,"timestamp":1422489740,"source":"TestSRC2","severity":4,"eventid":70,"clock":1422489742,"ns":182718694}],"clock":1422489743,"ns":204912928}]
  3376:20150129:020223.064 JSON back [{
	"response":"success",
	"info":"processed: 0; failed: 1; total: 1; seconds spent: 0.000098"}]
  3376:20150129:020223.064 In check_response() response:'{
	"response":"success",
	"info":"processed: 0; failed: 1; total: 1; seconds spent: 0.000098"}'

Because of the rejection, the server still responds to the next request for the list of active checks with the previous "lastlogsize":8 for the eventlog key.

  3376:20150129:020247.299 sending [{"request":"active checks","host":"core2"}]
  3376:20150129:020247.299 before read
  3376:20150129:020247.314 got [{
	"response":"success",
	"data":[
		{
			"key":"eventlog[Test]",
			"delay":5,
			"lastlogsize":8,
			"mtime":0},
		{
			"key":"system.uptime",
			"delay":30,
			"lastlogsize":0,
			"mtime":0}]}]

After the maintenance is finished there can be two different cases:

a) if the agent is still running and finds a new eventlog value, it sends it to the server with "lastlogsize":10 and everything is fine;
b) the agent was restarted, and during startup it got "lastlogsize":8 from the server and sent the value which was created during the host maintenance, which is bad, of course.

The question is: will case b) be "fixed" in the current development?
I suppose yes, but I want to be sure.

<dimir> Interesting use case, thanks. So, is the proper solution here to allow meta information updates for hosts in maintenance mode?

zalex_ua A hard question. I don't think we need to change anything in the server code related to maintenance.
I'm not sure how the current development will affect this use case.
If the agent sends a fresh "lastlogsize" to the server periodically, and after the maintenance ends the "lastlogsize" is accepted by the server, that is enough, IMO.

<dimir> As richlv mentioned: what if hundreds of events (in the eventlog) were generated while the host was in maintenance? Do you find it normal that after the maintenance period all these alerts will be triggered? I'd still consider updating meta information during host maintenance. This probably sounds scary, but if you think about it, the behavior is logical.

zalex_ua Of course, accepting the meta information even during maintenance is the most "robust" approach. I don't have any objections to doing it, but it would require server code changes and I'm not sure you would be happy to do that.
Note that eventlog records created during maintenance will not be accepted by the server, so they will not trigger any alerts after maintenance - case a).
For case b): if the active agent reliably sends the meta information to the server periodically, it's enough, IMO.
Note: I'm not sure how it will work with the current development, as all the discussions above are very complex.

<dimir> It's not that complex, actually. We already have a check where we skip incoming data if it is for a host in maintenance with no data collection. We would just add an exception for this case: if it's a log item, we'd drop the value (if it exists) and continue processing the data (in other words, convert a real value packet into a meta information update packet).

zalex_ua OK, feel free to do it that way.
Just out of interest: in this case, what will be returned in the response to the agent for the value - processed or failed?

<dimir> In this case "processed" would be returned. I think that sounds right. It will introduce a small change, though: before, in such a case a user would get "failed"; after this fix, he'd get "processed".

zalex_ua Things become more and more complex, heeh.
OK, go ahead.

<dimir> Discussed it with sasha. The proposed solution is still kind of a "hack". If the agent is down during the maintenance, we'd still collect all the events generated during the maintenance period. Moreover, we cannot depend on the maintenance period, because some log messages do not contain timestamps.

So the conclusion is that we will not fix this issue in this task, as we do not see a simple and clear solution here. The problem of collecting events generated during host maintenance if the agent gets restarted afterwards will still exist. Feel free to open another issue.

CLOSED

Comment by dimir [ 2015 Feb 11 ]

Fixed in pre-2.5.0 (r52202).

Comment by dimir [ 2015 Feb 11 ]

(25) [D] Document new feature (a hint: search for "meta information").

RESOLVED

<richlv> thanks for the protocol docs. some notes :

  • the proxy docs now have several sections marked "TBD" - they are unlikely ever to be filled if left like this. i would suggest doing only
    complete pages, finishing these sections

<dimir> I promise to fill them in later, when we close this issue.

  • in the agent page, ""lastlogsize" changed" example also includes mtime, and it has changed - is that indeed correct ?

<dimir> No, we do not send "mtime" for log items. Thanks, fixed.

RESOLVED

<sasha> the increase in the size of the proxy configuration data should be mentioned in the "Upgrade notes"

REOPENED

<dimir> Correct.

RESOLVED

sasha "This will result in a 1-2 % increase depending on the amount of log items." - this is incorrect. "log" must be removed, because these fields are added to all items.

REOPENED

<dimir> I wanted to specify that for a configuration with no log items it will increase by 1 %, otherwise by 2 %. But I agree this might not be clear; I simplified the explanation.

RESOLVED

<sasha> Thanks a lot! CLOSED

Comment by dimir [ 2015 Mar 09 ]

Memory leak in active agent on receiving list of active checks fixed in r52639. Thanks, andris!

Comment by Oleksii Zagorskyi [ 2015 Mar 11 ]

I've changed one sentence on the whatsnew300 page:
Log file monitoring (*log*, *logrt* and *eventlog*) was improved by keeping log file meta information on Zabbix server.
to
Log file monitoring (*log*, *logrt* and *eventlog*) was improved by sending actual log file meta information in specific cases and keeping it on the Zabbix server side.
I hope it's a bit clearer now.

Comment by Oleksii Zagorskyi [ 2015 Sep 25 ]

I apparently was wrong 5 years ago when I said (Google Translate is not perfect, but the logic is clear):

But this feature is definitely indicates the agent to handle only those log entries that occurred after starting the agent. And I do not want to miss the events that occurred during the period when the Agent is not working.

no one corrected me ... sad.

the documentation is not detailed in this regard, it just says:

skip (skip processing of older data).

sad ...

thankfully, judging by the lines:

	/* existing log[] and eventlog[] data can be skipped */
	metric->skip_old_data = (0 != metric->lastlogsize ? 0 : 1);

the logic has been implemented since the very first commit, r9669, of DEV-444, and it is still in effect.

Comment by Oleksii Zagorskyi [ 2015 Sep 29 ]

ZBX-9918 asks to document details for the "skip" parameter.

Comment by Alexander Vladishev [ 2015 Nov 12 ]

This caused regressions: ZBX-10061, ZBX-10067, ZBX-10100

Generated at Fri Apr 19 14:07:31 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.