[ZBXNEXT-2430] Option to define log monitoring on agent level Created: 2014 Sep 01 Updated: 2016 Jul 27 Resolved: 2016 Jun 27 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Agent (G) |
Affects Version/s: | 2.2.6 |
Fix Version/s: | 3.2.0alpha1 |
Type: | New Feature Request | Priority: | Major |
Reporter: | Marc | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 12 |
Labels: | log, logmonitoring, logrt, newitemkey, patch, performance | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: | ZbxNext2430-UseCaseRateLimit.png ZbxNext2430-ab-2.4.6-2.patch ZbxNext2430-ab-2.4.6.patch ZbxNext2430-b-2.4.6.patch | ||||||||||||||||
Issue Links: |
|
Description |
When monitoring a log file, one often wants to get an event when a certain condition is met. What I have in mind is the possibility of configuring a kind of active log item at the agent level. |
Comments |
Comment by richlv [ 2014 Sep 02 ] |
how would this differ from regexp parsing on the agent side ? |
Comment by Marc [ 2014 Sep 02 ] |
The application may log up to 10 thousand lines per second, so reducing the maximum lines per second is not an option if we want to keep up. Now, when an application like this has a serious problem, it might log ERROR messages at a similar rate. Being able to keep up with this kind of log was one of the key selling points for Zabbix over the legacy home-brew monitoring solution. However, when it comes to log monitoring, the legacy solution beats Zabbix in many ways, since it does the work locally and sends only one event when a certain condition is met. I'm not thinking about reinventing SEC-like functionality and putting it into the agent (though that would actually be really cool). Depending on how this gets implemented, it could allow one item per log file to cover a massive amount of independent log events in a readable way. In the end, it should be possible to cover most logging situations without having to send the data to the server, which would be a great improvement to log monitoring in Zabbix. |
Comment by Marc [ 2014 Sep 02 ] |
Having the possibility to define such agent side logic on server level in a convenient way would be even better. |
Comment by richlv [ 2014 Sep 07 ] |
sounds like the main problem is lots of ERRORs being logged at a high rate. |
Comment by Christian Reizlein [ 2015 May 26 ] |
That would be an awesome feature, both proposed a) and b). |
Comment by Marc [ 2015 Oct 05 ] |
Attached patch ZbxNext2430-b-2.4.6.patch is a quick'n'dirty hack to implement b) counting matches of pattern. The patch introduces an optional seventh parameter to influence the log processing type: log[file,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<proctype>] |
Comment by Marc [ 2015 Oct 07 ] |
Attached patch ZbxNext2430-ab-2.4.6.patch implements a) rate limiting and b) counting matches of a pattern: log[file,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<proctype>] If <maxlines> = 100, then up to 100 messages/lines per second are returned and up to 300 messages/lines per second are skipped each interval. In other words, 1/4 may be sent and 3/4 may be skipped. |
Comment by Marc [ 2015 Oct 07 ] |
Attached patch ZbxNext2430-ab-2.4.6-2.patch supports several rate limits. proctype - possible values:
When using ratelimit16 with <maxlines> = 10, no more than 10 messages/lines per second are sent, but up to 150 messages/lines per second are skipped each interval. This should allow using lower <maxlines> settings while parsing/skipping more log messages than usual. Edit: 144526:20151007:135747.819 Rate limit reached for key 'log[/tmp/zabbix_agentd.log,error,,,,,ratelimit16]'. Skipping data until parsing limit reached or EOF. This way one can identify that messages were likely not considered due to rate limiting. |
Comment by Oleksii Zagorskyi [ 2015 Oct 20 ] |
Regarding user-friendly naming and the possible values for the parameter: ratelimit4, ratelimit8 and ratelimit16 do not look very easy to understand. |
Comment by Andris Mednis [ 2015 Nov 05 ] |
Comments so far mention only enhancements to the log[] item. Should the same apply to logrt[] too? |
Comment by Marc [ 2015 Nov 05 ] |
Yes, of course logrt[] too. I'd even say any log-like item, actually. Personally, though, I could easily live without event log being considered yet. |
Comment by Marc [ 2015 Nov 06 ] |
Find below a calculation for the rate limit use case, made for a real-life scenario. |
Comment by Andris Mednis [ 2015 Nov 09 ] |
Proposals so far include:
If something happens and the log file starts growing very fast, the first lines may contain valuable information about what happened, the last lines may contain current information, and the middle lines are likely less informative (consequences of the root cause from the first lines). |
Comment by Andris Mednis [ 2015 Nov 09 ] |
Another idea: what if we add a parameter which specifies the maximum acceptable time delay of logfile parsing? For example, 30 seconds. The agent would then dynamically tune the number of lines to be skipped to keep up with a high-intensity log file, and the user would not need to worry about which <ratelimitX> to choose for a particular logfile. |
Comment by Marc [ 2015 Nov 13 ] |
I presume that to [...] spend half of the limit [...] on the first lines and <maxlines>/2 on the last lines [...] we would need to write matching lines to a ring buffer of size <maxlines>/2 after exceeding the initial <maxlines>/2 matching lines, since we do not know when, or whether at all, EOF (respectively <maxparse>) will be reached. Assuming that a rate limit will likely persist over multiple consecutive update intervals, I don't see a big benefit: in such cases, the first lines of the next update interval are in fact the last lines of the previous one. For the sake of completeness, the layman approach I had thought of up to now: having two additional parameters, <logic> and <maxparse>. The former allows many different log processing methods, here count and ratelimit. The latter can be useful for the rate limit function as well as for log monitoring in general, as it allows control over the currently hard-coded limitation. That's to say:
In addition to that, when having the logic set to <ratelimit>,
The only downside I see is the increased number of item parameters and their order. E.g. I think <maxlines> and <maxparse> should actually be next to each other. On the other hand, that's "just" cosmetics. So I generally see two subjects here:
Considering that a rate limit is expected to persist across subsequent update intervals, I'd argue it hardly matters whether we skip all lines after a limit, split the limit between the beginning and the end, or send every <n>th line. Ultimately, it cannot be foreseen where the more important information resides. What is important, rather, is having a mechanism to identify that rate limiting took place. It is then up to the user to evaluate whether something important was skipped. Your second approach does sound pretty interesting. Can you elaborate on how it could be implemented (from a logical point of view)? |
Comment by Marc [ 2015 Nov 13 ] |
For the record: I would really appreciate keeping the <logic> option, resp. seeing the count functionality considered as well. |
Comment by Andris Mednis [ 2015 Nov 19 ] |
Proposal for new log[], logrt[] parameters 1. Add two new parameters - <method> and <maxdelay> - to log[] and logrt[] items: log[/path/to/file/file_name,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<method>,<maxdelay>] logrt[/path/to/file/regexp_describing_filename_pattern,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<method>,<maxdelay>] <method> parameter. Type: string. Values:
<maxdelay> parameter is maximum acceptable delay in seconds. Type: float. Values:
2. Currently there are 2 limits on number of logfile lines: ---------------------------------------------------------------------- /* do not flood Zabbix server if file grows too fast */ s_count = rate * metric->refresh; <---- maxlines * refresh_interval <---- maxlines (MaxLinesPerSecond): min = 1, default = 20, max = 1000 /* do not flood local system if file grows too fast */ p_count = 4 * s_count; <---- number of processed lines is 4x send-to-server lines ---------------------------------------------------------------------- These limits are not changed. 3. Loop over logfile records as usual until one of "break" conditions is true:
4. If processing finished due to reaching one of limits (s_count or p_count), then calculate:
5. If 0 < t_del <= maxdelay, then the delay is acceptable; wait for the next check, maybe it can catch up next time. 6. If t_del > maxdelay, then ignore lines by "jumping" over them to position pos = size - v * maxdelay. This is a first draft. Waiting for comments. |
Comment by Andris Mednis [ 2015 Nov 24 ] |
New ideas: 1. Add a <maxdelay> parameter to log[] and logrt[]: log[/path/to/file/file_name,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<maxdelay>] logrt[/path/to/file/regexp_describing_filename_pattern,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<maxdelay>] (see description of <maxdelay> in the previous comment) 2. Implement "count" functionality as two new items: log.count[file,<regexp>,<encoding>,<maxproclines>,<mode>,<maxdelay>] logrt.count[/path/to/file/regexp_describing_filename_pattern,<regexp>,<encoding>,<maxproclines>,<mode>,<maxdelay>] where <maxproclines> is the max number of lines per second to be analyzed. |
Comment by Marc [ 2015 Nov 24 ] |
I was thinking of another option "de-duplication" for quite some time and believe it applies to what you've described for E.g. log messages like these: 921:20151124:174529.693 item "host1:user.param[inflexible]" became not supported: Unsupported item key. 921:20151124:174530.187 item "host1:user.param[inflexible,too]" became not supported: Unsupported item key. 921:20151124:174531.624 item "host2:user.param[inflexible]" became not supported: Unsupported item key. 921:20151124:174532.597 item "host3:user.param[flexible]" became not supported: Not supported by Zabbix Agent. processed in one interval could result in: Item became not supported because: Unsupported item key. Item became not supported because: Not supported by Zabbix Agent. by an item like: log.dedup[file, "^.{27}item \".*\" (became not supported): (.*)",,,, "Item \1 because: \2"] Otherwise we have an |
Comment by richlv [ 2015 Nov 25 ] |
okkuv9xh, thanks for that use case - i was just talking to andris yesterday, trying to think of other needs in this area. |
Comment by Marc [ 2015 Nov 25 ] |
Sorry! I was actually referring to c) mentioned in this comment. c) another use case - using the output feature, extracting a substring (like a number), comparing it on the agent level to a threshold and only sending to the server a rate-limited value of 1, or count of matches, or the raw value. |
Comment by Andris Mednis [ 2016 Apr 25 ] |
okkuv9xh, one more decision you might be interested in commenting on: what to do if log.count[] cannot send its result to the server. The current decision was added to the specs as follows: |
Comment by Aleksandrs Saveljevs [ 2016 May 02 ] |
Andris, we may wish to consider fixing ZBXNEXT-1842 together with this issue. andris ZBXNEXT-1842 seems to be about sending more metadata to server and displaying them in frontend. It was decided to not include ZBXNEXT-1842 into current solution. |
Comment by Andris Mednis [ 2016 Jun 02 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-2430-2 . (The frontend part is not yet implemented, but it is only one help screen modification away. The new parameter 'maxdelay' and the new items log.count[] and logrt.count[] are already usable in the frontend.) |
Comment by vitalijs.cemeris (Inactive) [ 2016 Jun 02 ] |
Added log.count and logrt.count keys to item helper |
Comment by vitalijs.cemeris (Inactive) [ 2016 Jun 02 ] |
(1) Translation strings added:
andris Thanks! Reviewed. gunarspujats CLOSED |
Comment by Andris Mednis [ 2016 Jun 03 ] |
Documented in
martins-v Also: sasha CLOSED |
Comment by Andris Mednis [ 2016 Jun 09 ] |
(2) In function adjust_position_after_jump():
/* Searching forward did not find a newline. Now search backwards until 'min_size'. */
lastlogsize_tmp = lastlogsize_aligned;
seek_pos = lastlogsize_aligned;
Value of 'lastlogsize_tmp' is never read after assignment. andris RESOLVED in r60590. sandis.neilands CLOSED. |
Comment by Sandis Neilands (Inactive) [ 2016 Jun 09 ] |
(3) Preferably we should avoid reusing function names even for static functions. This complicates full text search and code analysis. We have two instances of destroy_logfile_list(). andris RESOLVED in r60592. sandis.neilands CLOSED. |
Comment by Sandis Neilands (Inactive) [ 2016 Jun 09 ] |
(4) In process_log_check() when maxdelay validation fails we print "Invalid seventh parameter". For *.count items it is the sixth parameter. andris Thanks! RESOLVED in r60595. sandis.neilands CLOSED. |
Comment by Sandis Neilands (Inactive) [ 2016 Jun 10 ] |
(5) Setup Item key: log[/tmp/testlog.log,,,1,,,1] . Item update interval: 1 second. Scenario 1. Add a few log lines (ending with a newline) to /tmp/testlog.log. line 1 line 1 line 1 2. Add a log line (without a newline) to /tmp/testlog.log. line 2222222222222 3. Extend the previous line with additional '2's (still without adding a newline). line 222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 Expected Zabbix server gets either the whole 'line 2...' or no part of it. Actual Zabbix server gets the value '2222....' sandis.neilands Furthermore, consider this scenario. Add "line 1" (no newline). Add "234" to "line 1" (still no newline). The server will have the following records (2 is missing). line 1 24 sandis.neilands CLOSED. Turns out that text editors insert a newline implicitly. After retesting with a hex editor the problem was not reproduced. |
Comment by Sandis Neilands (Inactive) [ 2016 Jun 10 ] |
(6) Currently the occurrence of jumps is signalled out-of-band via the agent's log. This could be problematic, since the end user must then correlate the agent's log with the values stored in Zabbix. The user might not even have access to the agent's log. Perhaps we should include in-band signalling as well. For log and logrt, it could be an additional log entry (it should be unique enough for users to be able to set triggers on it). For *.count - if we want it to still be supported in this case - return the negative of the counted lines. That way, the number of lines read is still known, but it also indicates that some data was skipped. For triggers and calculated items, use abs(). sandis.neilands After discussing with andris and alexei it was decided to stay with the current design for now. CLOSED. |
Comment by Marc [ 2016 Jun 10 ] |
sandis.neilands, Edit: Edit: Edit: |
Comment by richlv [ 2016 Jun 10 ] |
putting related but different info in log values hopefully won't be done. on the other hand, one might allow creating a companion item: if the item exists, such info is put in it; if not, it is ignored. |
Comment by Gunars Pujats (Inactive) [ 2016 Jun 14 ] |
(7) [F] Frontend changes 1. Add new parameter - <maxdelay> - to log[] and logrt[] items. vitalijs.cemeris RESOLVED in r60489 gunarspujats CLOSED |
Comment by Gunars Pujats (Inactive) [ 2016 Jun 14 ] |
(8) [F] include/classes/items/CHelpItems.php:366 - missing function closing bracket, therefore creating new log.count[] item frontend produces validation error: Invalid key "log.count[file,<regexp>,<encoding>,<maxproclines>,<mode>,<maxdelay>" for item "log_count" on "rs": unexpected end of key. [items.php:644 → CFrontendApiWrapper->update() → CApiWrapper->__call() → CFrontendApiWrapper->callMethod() → CApiWrapper->callMethod() → CFrontendApiWrapper->callClientMethod() → CLocalApiClient->callMethod() → call_user_func_array() → CItem->update() → CItemGeneral->checkInput() → CApiService::exception() in include/classes/api/services/CItemGeneral.php:325] vitalijs.cemeris RESOLVED in r60629 gunarspujats CLOSED |
Comment by Gunars Pujats (Inactive) [ 2016 Jun 14 ] |
(9) [F] include/classes/items/CHelpItems.php:370 - item key logrt.count[] is missing first parameter file_regexp vitalijs.cemeris RESOLVED in r60629 gunarspujats CLOSED |
Comment by Gunars Pujats (Inactive) [ 2016 Jun 14 ] |
[F] Frontend tested. |
Comment by Andris Mednis [ 2016 Jun 14 ] |
Fixed in version:
|
Comment by Andris Mednis [ 2016 Jun 15 ] |
(10) Coverity detected a memory leak: ** CID 149415: Resource leaks (RESOURCE_LEAK) /src/zabbix_agent/active.c: 1255 in process_log_check() __________________________________________________________ *** CID 149415: Resource leaks (RESOURCE_LEAK) /src/zabbix_agent/active.c: 1255 in process_log_check() 1249 } 1250 out: 1251 zbx_free(encoding_uc); 1252 1253 free_request(&request); 1254 >>> CID 149415: Resource leaks (RESOURCE_LEAK) >>> Variable "logfiles_new" going out of scope leaks the storage it points to. 1255 return ret; 1256 } 1257 1258 static int process_eventlog_check(char *server, unsigned short port, ZBX_ACTIVE_METRIC *metric, 1259 zbx_uint64_t *lastlogsize_sent, char **error) 1260 { andris RESOLVED in new dev branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-2430 r60649 and r60653. sandis.neilands CLOSED. andris Fixed in version:
|
Comment by Sandis Neilands (Inactive) [ 2016 Jun 16 ] |
Successfully tested. |
Comment by Sandis Neilands (Inactive) [ 2016 Jun 17 ] |
(11) Tonight's Coverity scan didn't close the issue. There is a memory leak of logfiles_new in another branch. CID 149415 is still open. Scenario:
Output from Valgrind: ==6224== 16,965 (16,896 direct, 69 indirect) bytes in 3 blocks are definitely lost in loss record 232 of 233 ==6224== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==6224== by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==6224== by 0x44DB0F: zbx_realloc2 (misc.c:499) ==6224== by 0x41961E: add_logfile (logfiles.c:1059) ==6224== by 0x419F04: pick_logfile (logfiles.c:1196) ==6224== by 0x419D9C: pick_logfiles (logfiles.c:1292) ==6224== by 0x415914: make_logfile_list (logfiles.c:1391) ==6224== by 0x41480A: process_logrt (logfiles.c:2374) ==6224== by 0x40FA09: process_log_check (active.c:1177) ==6224== by 0x40EABC: process_active_checks (active.c:1648) ==6224== by 0x40D739: active_checks_thread (active.c:1762) ==6224== by 0x443010: zbx_thread_start (threads.c:128) BTW, the path that Coverity is showing in process_logrt() doesn't lead to the problem - ret must not be SUCCESS, e.g. process_log() --> zbx_read2() must fail. As to why these problems are almost impossible to find analytically: the path through the code to this problem contains more than 100 decisions spanning multiple levels of functions. Humans can't follow that many possibilities. andris RESOLVED in new dev branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-2430 r60681. sandis.neilands CLOSED. andris Fixed in version:
|