[ZBXNEXT-444] Improvement processing of event logs when Agent start and stop (value "lastlogsize" optimization) Created: 2010 Jul 07 Updated: 2015 Dec 11 Resolved: 2015 Feb 20 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Agent (G), Proxy (P), Server (S) |
Affects Version/s: | 1.8.2 |
Fix Version/s: | 2.5.0 |
Type: | Change Request | Priority: | Major |
Reporter: | Oleksii Zagorskyi | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 6 |
Labels: | eventlog, logmonitoring, windows | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: | Improve processing of event logs when Agent start and stop (value 'lastlogsize' optimization).txt | ||||||||||||||||||||||||||||||||
Issue Links: |
|
Description |
As I promised - this is my third serious post about the work Zabbix agent with the Windows event logs. But this feature is definitely indicates the agent to handle only those log entries that occurred after starting the agent. And I do not want to miss the events that occurred during the period when the Agent is not working. For me it basically and I use this opportunity will not, though for others it may be very useful. As in its previous reports ZBXNEXT-443 and Example ? 1. We must catch rare messages from the Security event log - events "Account lockout" on the domain controller. Example ? 2. It is necessary to catch the event of the results of disk checking at boot (chkdsk - Checking file system) Specificity of both examples is that the events occur between very large count of other events. The same for the Example ?2. For example a workstation which is boot every day. So every morning, the agent will process the entire System eventlog such as a few years. Is bad !!! And if a few similar checks ? In short what I am - propose made changes to the work of the agent: Another innovation, which I propose (again, the future of mode -> skip is not for us !!!): Well, that's all my friends. This was my third and last message. Next will be about the triggers of the Event Log Specification: https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-444 |
Comments |
Comment by richlv [ 2010 Jul 07 ] |
i'd say here are only two suggestions 1. sending of lastlogsize to server upon agent shutdown (and i'd also say - periodic sending of lastlogsize even if no new incoming messages for server have appeared); this is also slightly similar to |
Comment by Oleksii Zagorskyi [ 2010 Jul 20 ] |
I noticed that when changing the item key is the value lastlogsize set to 0. This makes sense for example if the user changes the key, such as a switch from the log Application to log System. But for example in the case of the key changes from 'eventlog[Application,,,,16|66]' to 'eventlog[Application,,,,16|66|703]', ie we "extend the mask" for events capture, is no need to set 'lastlogsize=0'. Setting 'lastlogsize=0' instruct the Agent to re-read the entire log and re-sending all the events with ID=16,66 and 703, which incidentally we do not need also. Here, in this case very much would help option which I ask for in paragraph № 2 (in comment from Richlv). Then, documentation can make a point - if you change the key and do not want to re-sending entire log, then check the checkbox ...".." zalex_ua as stated below, it's not true starting from 2.0.0, so -> CLOSED. |
Comment by Oleksii Zagorskyi [ 2010 Aug 27 ] |
I've already forgotten, and again caught the fact that when changing a item key 'lastlogsize' is reset to 0. This warning will sound like this: zalex_ua as stated below, it's not true starting from 2.0.0, so -> CLOSED! |
Comment by Oleksii Zagorskyi [ 2012 Aug 27 ] |
Just another issue which similar to this one in a similar aspect: the agent should report "something" to the server before the correct stop. |
Comment by Oleksii Zagorskyi [ 2012 Dec 19 ] |
Here is a description of some not very clear case. Suppose we have a key: The text "unique-text" appears VERY rarely in the log. As a result the item becomes visually not supported in frontend, but actually after 10 minutes (refrech unsupported) it continue to be REALLY monitored by agent. The item will return to supported state (visually) only if it will find a line with text "unique-text" in the log. But all this time (hours+, days+) it will look in frontend as notsupported and it's ABSOLUTELY NOT CLEAR WHAT'S GOING ON. I have no idea how it could be improved. |
Comment by Oleksii Zagorskyi [ 2013 Aug 22 ] |
The problem about updating "lastlogsize" after changing item key also reported in |
Comment by Oleksii Zagorskyi [ 2014 Jul 02 ] |
Rich once said in IRC and now I checked it - starting from frontend 2.0.0 the "lastlogsize" column is not reseted to 0 if we update item key. |
Comment by Aleksandrs Saveljevs [ 2014 Oct 23 ] |
Specification available at https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-444 . |
Comment by dimir [ 2014 Dec 02 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-444 . |
Comment by dimir [ 2014 Dec 02 ] |
(1) One question. Currently agent would send meta information update with every new line in the log file. E. g.
$ echo -e "foo\nbar\nbaz" >> /tmp/my.log
Would generate 3 meta information update packets, each with 4 bytes change in lastlogsize, no matter if the line matched given pattern or not. Is that correct behavior or we would want to send 1 update packed with 12 bytes change in lastlogsize? zalex_ua "3 meta information update packets" looks redundant to me. Would be better to send "1 update packed with 12 bytes change". <dimir> So I guess that would be something like MetaUpdateLines, number of lines required to send meta information update packet. <dimir> We decided to send meta information update at most once per log check. If a user sets delay to 10 seconds I guess he is OK with having one packet every 10 seconds even if there is no match. zalex, if you feel there is a need for configuration option to control that process perhaps you could create a zbxnext? <richlv> users commonly set (and we suggest them to) log monitoring item interval to 1 second. it's not a big deal (or at least it does not seem to be so currently), but maybe a new zbxnext to send metainfo only every 10 or so seconds would be useful asaveljevs Quoting the specification:
In Zabbix agent configuration file, this roughly corresponds to BufferSend configuration option. It has not been explicitly stated in the specification, but the way I imagined how it would work is that at most one meta information update for a single log item would be sent per BufferSend seconds (only in case something has changed). If there are any technical difficulties achieving this behavior, please describe them. <dimir> RESOLVED in r50989 asaveljevs CLOSED |
Comment by dimir [ 2014 Dec 02 ] |
OK, just had conversation with sasha and we agreed that this is not good. In case of no match only one meta information update packet should be sent by agent. Will add this to spec. |
Comment by dimir [ 2014 Dec 04 ] |
(2) I noticed one more thing when working with this issue. Agent would not send state change of NOTSUPPORTED log item before the size or mtime of log file changes. E. g. scenario:
Should we change agent behavior to send log item state change as soon as there is no error any more? <richlv> that would be very good, but what data would we send if none of those values changes ? asaveljevs Yes, we should send "state" change, even if "lastlogsize" and "mtime" have not changed. Specification is explicit about it:
<dimir> RESOLVED in r51075 asaveljevs If server's "Refresh unsupported items" setting is sufficiently low (e.g., 1 second) so that unsupported items are always included in the active check list, then the problem described above by dimir is still observed. REOPENED. <dimir> RESOLVED in r51717 asaveljevs "goto" was unnecessary, because it did not jump over anything. RESOLVED in r51721. <dimir> Agreed. CLOSED |
Comment by richlv [ 2014 Dec 04 ] |
(3) spec says this should resolve <dimir> Yep, that is already there, in the dev branch. But I'll note about it resolved explicitly bit later. <dimir> RESOLVED in r50976 asaveljevs CLOSED |
Comment by dimir [ 2014 Dec 05 ] |
(4) [F] Seems when changing log item key (even e. g. logrt[*] -> log[*]) lastlogsize is not updated in the items table. Was that always like that? Should we consider resetting lastlogsize, mtime and state when item key is changed? zalex_ua read all comments above <dimir> Right, I was also thinking that would be the best choice. CLOSED |
Comment by dimir [ 2014 Dec 08 ] |
FYI, new feature request, respect "refresh unsupported" by active agent: ZBXNEXT-2633 . |
Comment by Aleksandrs Saveljevs [ 2014 Dec 15 ] |
(5) Currently, in function proxy_get_history_data(), we always send "lastlogsize" and "mtime". Would it make sense to only send these for log items? <dimir> Seems like rather question to sasha. As for me, I think it's good to avoid sending useless data. <dimir> sasha agreed on this idea so I'm going to implement it. <dimir> RESOLVED in r51239 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
(6) In src/zabbix_agent/active.c, there is a new function: static int metric_ready_to_process(const ZBX_ACTIVE_METRIC *metric) { if (ITEM_STATE_NORMAL != metric->state) { if (ITEM_STATE_NOTSUPPORTED != metric->state || 0 == metric->refresh_unsupported) return FAIL; } return SUCCEED; } If metric->state is neither ITEM_STATE_NORMAL nor ITEM_STATE_NOTSUPPORTED, then what is it? <dimir> I made it so for possible new states in the future. If those appear those would mean not ready by default. And we would have a single entry point where to consider those, not go throughout the code and adjust checks there. RESOLVED asaveljevs I am not aware of any place where we protect against hypothetical code changes in the future - code should reflect its current state. In this particular case, if we add a new item state (which is currently not being planned), we would still have to go through the code and check every place, including this one. Moreover, the change assumes that the hypothetical item state that we will add will mean "not ready for processing", but it might as well be the other way round. asaveljevs Perhaps discussable further with sasha. REOPENED. <dimir> Agreed. RESOLVED in r51372 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
(7) In parse_list_of_checks(), we can return from the function without destroying "added_metrics" vector. In the current implementation, this does not leak memory, but from the coding style perspective, this is not good. <dimir> RESOLVED in r51202 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
(8) In parse_list_of_checks(), where we remove active checks that were not received, we can erroneously skip some, because after zbx_vector_ptr_remove() variable "i" should be the same on the next iteration. Also, if the metric order is not important, consider using zbx_vector_ptr_remove_noorder() for efficiency. <dimir> RESOLVED in r51204 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
(9) The following comment was reformatted, but the line is too long now: if (NULL != mtime) /* will be null for "eventlog" and "log" and the value will be 0, only "logrt" matters */ el->mtime = *mtime; While fixing that, it would be good to paraphrase it to be more understandable. <dimir> I agree on rephrasing, but the length is exactly 120 characters, which is not over limit? RESOLVED in r51206 asaveljevs Based on r51205: $ grep will.be.0 src/zabbix_agent/active.c | wc -L 121 asaveljevs Removed the comment completely in r51310. It was not necessary. RESOLVED. <dimir> CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
(10) I wonder why the following change was done in process_eventlog_check() function: Index: src/zabbix_agent/active.c =================================================================== --- src/zabbix_agent/active.c (revision 51138) +++ src/zabbix_agent/active.c (working copy) @@ -1047,7 +1114,6 @@ goto out; } - s_count = 0; p_count = 0; lastlogsize = metric->lastlogsize; This seems to leave "s_count" variable uninitialized. <dimir> Unintentional changes, don't know how that could happen. RESOLVED in r51205 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
(11) Please review stylistic suggestions in r51203. <dimir> Looks great, thanks! CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 29 ] |
(12) In process_active_checks() function, in case an item becomes not supported, we call process_value() with persistent=0. However, when a meta information update is needed, we call process_value() with persistent=1. Shouldn't persistent be 0 is the latter case, too? <dimir> Agreed. RESOLVED in r51365 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Dec 29 ] |
(13) Zabbix server crashes in the following scenario. A log[] item was created with a non-existing file. Zabbix agent correctly reported it to the server as follows: { "clock": 1419859792, "data": [ { "clock": 1419859789, "host": "Zabbix server", "key": "log[/tmp/logfile.txt]", "ns": 311836170, "state": 1, "value": "Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory" } ], "ns": 312502657, "request": "agent data" } After some time, the agent sent another report that the item is still not supported: { "clock": 1419859910, "data": [ { "clock": 1419859909, "host": "Zabbix server", "key": "log[/tmp/logfile.txt]", "ns": 341259525, "state": 1 } ], "ns": 341554742, "request": "agent data" } This, I believe, should not have happened. Also note the absence of the error message. It is at this point that the server crashed with the following backtrace: 31039:20141229:153150.341 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ... ... 31039:20141229:153150.343 15: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](zbx_strlen_utf8_nchars+0x67) [0x49dde5] 31039:20141229:153150.343 14: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](zbx_db_strlen_n+0x23) [0x4e3c74] 31039:20141229:153150.343 13: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data]() [0x470ad7] 31039:20141229:153150.343 12: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](dc_add_history+0x41) [0x470c4e] 31039:20141229:153150.343 11: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](process_mass_data+0x6f3) [0x4c7ecd] 31039:20141229:153150.343 10: /home/asaveljevs/projects/zabbix-bin/sbin/zabbix_server: trapper #5 [processing data](process_hist_data+0x6e4) [0x4c8dc5] <dimir> My bad, the second packet shouldn't have been sent. RESOLVED in r51369 asaveljevs The diff looks like it only changes agent code. Therefore, it is still possible to crash the server by sending bad data. REOPENED. <dimir> Currently we do not handle the case when proxy/server receives unsupported item without an error message. It seems to me such handling shouldn't be introduced within the scope of this issue. RESOLVED asaveljevs Before implementing this issue, server and proxy did not crash when they received an item without an error message. So you plan to introduce a way for any malicious attacker to crash the server within the scope of this issue? No, it will not do. Such messages should be rejected. REOPENED. <dimir> My bad, you are totally right. RESOLVED in r51405,r51442 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ] |
(14) Suppose we are monitoring a log file and fill it as follows: $ echo -ne 'line 1\nline 2\nline 3\nline 4\nline 5\n' > /tmp/logfile.txt Zabbix agent then sends the following two messages with BufferSend seconds in between: 5675:20150108:120757.049 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 1","lastlogsize":7,"clock":1420711677,"ns":49700463}],"clock":1420711677,"ns":49771574}] 5675:20150108:120802.053 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 2","lastlogsize":14,"clock":1420711677,"ns":50627725},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 3","lastlogsize":21,"clock":1420711677,"ns":50790296},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 4","lastlogsize":28,"clock":1420711677,"ns":50911200},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line 5","lastlogsize":35,"clock":1420711677,"ns":51042818}],"clock":1420711682,"ns":53654876}] It can be seen that the first line is sent in a separate message, while the rest of the lines are sent in the second message. The agent should probably send all lines in one message. asaveljevs Although that seems to be the behavior in latest 2.2 and current trunk, so fixing this is optional. asaveljevs Discussed with sasha: will create a new ZBX to fix in trunk only, if not fixed here. asaveljevs Moved out to |
Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ] |
(15) If a file has 0 size (think 14710:20150108:124238.062 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","clock":1420713757,"ns":62134475}],"clock":1420713758,"ns":62382604}] This happens to work, but I do not think this is according to protocol. The agent should include "lastlogsize" of 0. <dimir> RESOLVED in r51500 asaveljevs "lastlogsize" and "mtime" are now sent for every item, but it would probably be better to send them for log items only. REOPENED. <dimir> Send lastlogsize and mtime only for active items of type log. Proposing to use flags for active items buffer elements. RESOLVED in r51565 asaveljevs Commit in r51617 renames ZBX_FLAG_ELEMENT_PERSISTENT to ZBX_BUFFER_ELEMENT_PERSISTENT and moves these flag constants to a different (more appropriate) header. Please take a look. RESOLVED. <dimir> I have yet renamed the set of flags to ZBX_METRIC_FLAG_* . It seems to sound better along with the new flags that I have added. Also the type of metric is now stored in flags and we do not need to strncmp() item keys during every check. Please check part of the commit in r51646. RESOLVED. asaveljevs It is possible to define ZBX_METRIC_FLAG_LOG differently, saving bit space, as is done in src/libs/zbxdbhigh/lld_item.c, for instance. RESOLVED in r51650 and r51653. <dimir> CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ] |
(16) Suppose we are monitoring two items: log[/tmp/logfile.txt] and log[/tmp/logfile.txt,SENDME]. We then create a new file as follows: $ echo -ne 'line a\n!SENDME!\nline b\n' > logfile.txt Zabbix agent will then send the following: 15134:20150108:125915.436 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"!SENDME!","lastlogsize":16,"clock":1420714755,"ns":435783934}],"clock":1420714755,"ns":435903828}] 15134:20150108:125920.441 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","lastlogsize":23,"clock":1420714755,"ns":436784750},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line a","lastlogsize":7,"clock":1420714755,"ns":437154119},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"!SENDME!","lastlogsize":16,"clock":1420714755,"ns":437278892},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line b","lastlogsize":23,"clock":1420714755,"ns":437403214}],"clock":1420714760,"ns":440947476}] This seems to be correct. However, in "Latest data" we see only "line a" and "line b" for the first item and "!SENDME!" and an empty value for the second item. <dimir> RESOLVED in r51520 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ] |
(17) Perform the same scenario as 16, wait a second, and then remove "logfile.txt". Zabbix agent will send the following, which is correct: 15134:20150108:130950.879 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"!SENDME!","lastlogsize":39,"clock":1420715390,"ns":879491221}],"clock":1420715390,"ns":879613275}] 15134:20150108:130955.885 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","lastlogsize":46,"clock":1420715390,"ns":880508379},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line a","lastlogsize":30,"clock":1420715390,"ns":880879926},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"!SENDME!","lastlogsize":39,"clock":1420715390,"ns":881004172},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line b","lastlogsize":46,"clock":1420715390,"ns":881127530},{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715394,"ns":885074194},{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715394,"ns":885253293}],"clock":1420715395,"ns":885555347}] However, the SENDME item stays supported in the frontend. Moreover, even when the agent sends the following lines later, the SENDME items is still supported in the frontend: 15134:20150108:131051.899 sending [{"request":"active checks","host":"Zabbix server"}] 15134:20150108:131056.902 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt,SENDME]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715456,"ns":902500077}],"clock":1420715456,"ns":902665723}] 15134:20150108:131101.904 JSON before sending [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":46,"clock":1420715456,"ns":903478041}],"clock":1420715461,"ns":904777423}] <dimir> I tried two times and couldn't reproduce it. Could you please give more details, e. g. how long to wait before removing the file etc. asaveljevs Might have been fixed by (16). CANNOT REPRODUCE. <dimir> Actually I tried it before deploying fix for (16). |
Comment by Aleksandrs Saveljevs [ 2015 Jan 08 ] |
(18) Please review my suggestions in r51450. <dimir> CLOSED |
Comment by Oleksii Zagorskyi [ 2015 Jan 09 ] |
Check |
Comment by Aleksandrs Saveljevs [ 2015 Jan 13 ] |
(19) Suppose we are monitoring a file using log[/tmp/logfile.txt] and we create this log file using the following command:
$ echo -ne 'line\n' > /tmp/logfile.txt
Agent sends the following: [{"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line","lastlogsize":5,"mtime":0,"clock":1421156171,"ns":932857930}],"clock":1421156172,"ns":933155449} This is correct. If we now remove the file, then agent sends: {"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"Cannot obtain information for file \"/tmp/logfile.txt\": [2] No such file or directory","state":1,"lastlogsize":5,"mtime":0,"clock":1421156181,"ns":938771840}],"clock":1421156182,"ns":939035400} This is correct. However, if we now recreate the file using the same command above, then we get the following: {"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","value":"line","lastlogsize":5,"mtime":0,"clock":1421156241,"ns":953114015}],"clock":1421156241,"ns":953163636} {"request":"agent data","data":[{"host":"Zabbix server","key":"log[/tmp/logfile.txt]","lastlogsize":5,"mtime":0,"clock":1421156241,"ns":953593188}],"clock":1421156246,"ns":956022902} Two messages are sent, the second one being redundant. <dimir> Let's send state along with list of active checks. RESOLVED in r51568,r51570 asaveljevs The solution is suspicious, because it changes server and agent communication protocol without a specification. This is undesirable. asaveljevs Also, the following in function add_check() does not look good: ... metric->state = ITEM_STATE_NORMAL; metric->refresh_unsupported = 0; metric->lastlogsize = lastlogsize; metric->mtime = mtime; metric->state = state; ... asaveljevs As described by dimir, the above solution was implemented to address the following scenario:
asaveljevs It is suggested that when an item is first received from the server after agent startup, its update is always sent to the server. REOPENED. <dimir> Introduced new metric flag: ZBX_METRIC_FLAG_NEW which allows always sending at least meta information update packet after agent startup. This fixes situation when unsupported log item changes state without changing meta information during restart. RESOLVED in r51646. asaveljevs Unfortunately, the problem was not solved: two messages are still sent in the above scenario. REOPENED. <dimir> Sorry, there was one fault in the logic of determining if we need to send meta information update packet. RESOLVED in r51673 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 16 ] |
(20) According to the specification, only logrt[] should send "mtime". Currently, all log items send it. <dimir> RESOLVED in r51674 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 19 ] |
(21) Please review changes in r51682, r51684, r51685, and r51694. In particular, r51684 updates to the latest trunk and resolves conflicts. <dimir> CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 19 ] |
(22) It is possible to crash Zabbix server by sending in the following JSON using misc/debug/sender.pl: { "clock": 1421658777, "data": [ { "clock": 1421658771, "host": "Zabbix server", "key": "LOG[\"/tmp/logfile.txt\"]", "lastlogsize": 108, "ns": 944843829 } ], "ns": 948056387, "request": "agent data" } Here, item type is set to "Character" and JSON does not include "value" tag (like in a meta update). Crash backtrace: 25503:20150119:112742.799 === Backtrace: === 25503:20150119:112742.800 16: zabbix_server: trapper #2 [processing data](print_fatal_info+0x117) [0x491e88] 25503:20150119:112742.800 15: zabbix_server: trapper #2 [processing data]() [0x492221] 25503:20150119:112742.800 14: /lib/x86_64-linux-gnu/libc.so.6(+0x35180) [0x7fc335912180] 25503:20150119:112742.800 13: zabbix_server: trapper #2 [processing data](zbx_replace_invalid_utf8+0x3b1) [0x49f956] 25503:20150119:112742.800 12: zabbix_server: trapper #2 [processing data](set_result_type+0x2d8) [0x452525] 25503:20150119:112742.800 11: zabbix_server: trapper #2 [processing data](process_mass_data+0x7a5) [0x4c9429] 25503:20150119:112742.800 10: zabbix_server: trapper #2 [processing data](process_hist_data+0x6ea) [0x4ca275] 25503:20150119:112742.800 9: zabbix_server: trapper #2 [processing data]() [0x43c3f0] 25503:20150119:112742.800 8: zabbix_server: trapper #2 [processing data]() [0x43d3eb] 25503:20150119:112742.800 7: zabbix_server: trapper #2 [processing data]() [0x43d9f0] 25503:20150119:112742.800 6: zabbix_server: trapper #2 [processing data](trapper_thread+0x1a9) [0x43db9b] 25503:20150119:112742.800 5: zabbix_server: trapper #2 [processing data](zbx_thread_start+0x37) [0x492cf5] 25503:20150119:112742.800 4: zabbix_server: trapper #2 [processing data](MAIN_ZABBIX_ENTRY+0x587) [0x41a0e1] 25503:20150119:112742.800 3: zabbix_server: trapper #2 [processing data](daemon_start+0x2ab) [0x49140c] 25503:20150119:112742.800 2: zabbix_server: trapper #2 [processing data](main+0x268) [0x419b58] 25503:20150119:112742.800 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc3358feb45] 25503:20150119:112742.800 0: zabbix_server: trapper #2 [processing data]() [0x413119] <dimir> RESOLVED in r51703 asaveljevs A better place for the condition seems to be a bit lower, so that more frequent conditions are executed first. RESOLVED in r51704. <dimir> CLOSED |
Comment by Aleksandrs Saveljevs [ 2015 Jan 20 ] |
(23) As pointed out by sasha, we have missed one important thing in the specification: we have made it so that proxy sends "lastlogsize" and "mtime" to the server, but not the other way round. By itself, sending of meta fields from proxy to server is only useful if the agent is later switched from proxy to server. However, this is less likely. The real motivation behind sending of meta fields from proxy to server was that, if proxy database is removed, proxy should get correct configuration from server. So the general idea is that server should now send meta fields to proxy (for log items only?) and proxy should act in a way similar to the agent - if it already has configuration data for these items, it should ignore these fields. Otherwise, it should create a new item with these meta values. <dimir> Will send lastlogsize and mtime for all items. RESOLVED in r51743, r51746 sasha Please take a look at my solution in r52136. <dimir> Looks great! CLOSED |
Comment by Oleksii Zagorskyi [ 2015 Jan 29 ] |
(24) yesterday I discovered one edge case (not related to current development) and want to discuss it. When a host goes to maintenance (without data collection), server still responds with list of agent's active checks. 3376:20150129:020223.064 JSON before sending [{"request":"agent data","data":[{"host":"core2","key":"eventlog[Test]","value":"RandomID(8238)","lastlogsize":9,"timestamp":1422489740,"source":"TestSRC2","severity":4,"eventid":70,"clock":1422489742,"ns":182718694}],"clock":1422489743,"ns":204912928}] 3376:20150129:020223.064 JSON back [{ "response":"success", "info":"processed: 0; failed: 1; total: 1; seconds spent: 0.000098"}] 3376:20150129:020223.064 In check_response() response:'{ "response":"success", "info":"processed: 0; failed: 1; total: 1; seconds spent: 0.000098"}' Because of the reject, server for next request of list of active checks still responds with previous "lastlogsize":8 for the eventlog key. 3376:20150129:020247.299 sending [{"request":"active checks","host":"core2"}] 3376:20150129:020247.299 before read 3376:20150129:020247.314 got [{ "response":"success", "data":[ { "key":"eventlog[Test]", "delay":5, "lastlogsize":8, "mtime":0}, { "key":"system.uptime", "delay":30, "lastlogsize":0, "mtime":0}]}] After the maintenance is finished there can be two different cases: a) if agent is still running and found new eventlog value - it sends it to server with "lastlogsize":10 and everything is fine. Question is - will be the b) case "fixed" in current development? <dimir> Interesting use case, thanks. So, is the proper solution here to allow meta information update for hosts in maintenance mode? zalex_ua hard question. I don't think we need to change something in server code related to maintenance. <dimir> As richlv mentioned, but what if hundreds of events (in the eventlog) were generated while host in maintenance. Do you find it normal that after maintenance period all these alerts will be triggered? I'd still consider updating meta information during host maintenance. This probably sounds scary, but if you think about it, the behavior is logical. zalex_ua of course accepting the meta information even during maintenance is the most "robust" approach. I'd don't have any objections to do it, but it would requires server code changes and I'm not sure you would happy to do that. <dimir> It's not that complex actually. We already have a check where we skip incoming data if it is for host in maintenance status with no data collection. We would just add an exception for this case, if it's a log item we'd drop the value (if it exists) and continue processing the data (in other words convert real value packet to meta information update packet). zalex_ua ok, feel free to do it in such a way. <dimir> In this case "processed" would be returned. I think it sounds right. That will introduce a small change though, before, in such case a user would get "failed", after this fix he'd get "processed". zalex_ua things become more and more complex , heeeh. <dimir> Discussed it with sasha. The proposed solution is still kind of "hack". If agent is down during the maintenance we'd still collect all the events generated during maintenance period. Moreover, we can not depend on the maintenance period because some log messages do not contain timestamps. So the conclusion is that we will not fix this issue in this task as we do not see a simple and clear solution here. The problem of collecting events generated during host maintenance if agent gets restarted afterwords will still exist. Feel free to open another issue. CLOSED |
Comment by dimir [ 2015 Feb 11 ] |
Fixed in pre-2.5.0 (r52202). |
Comment by dimir [ 2015 Feb 11 ] |
(25) [D] Document new feature (a hint: search for "meta information").
RESOLVED <richlv> thanks for the protocol docs. some notes :
<dimir> I promise to fill them later when we close this issue
<dimir> No, we do not send "mtime" for log items. Thanks, fixed. RESOLVED <sasha> increasing the size of proxy configuration data should be mentioned in the "Upgrade notes" REOPENED <dimir> Correct.
RESOLVED sasha This will result in 1-2 % increase depending on amount of log items. - it is incorrect. log must be removed because these fields are added to all items. REOPENED <dimir> I wanted to specify that for configuration with no log items it will increase by 1%, otherwise by 2. But I agree this might be not clear, simplified the explanation. RESOLVED <sasha> Thanks a lot! CLOSED |
Comment by dimir [ 2015 Mar 09 ] |
Memory leak in active agent on receiving list of active checks fixed in r52639. Thanks, andris! |
Comment by Oleksii Zagorskyi [ 2015 Mar 11 ] |
I've changed one sentence on the whatsnew300 page: |
Comment by Oleksii Zagorskyi [ 2015 Sep 25 ] |
I apparently was wrong 5 years ago saying (google translate is not perfect, but logic is clear):
no one fixed me ... sad. documentation is not detailed in this regard, it just says:
sad ... thankfully to lines:
/* existing log[] and eventlog[] data can be skipped */
metric->skip_old_data = (0 != metric->lastlogsize ? 0 : 1);
The logic has been implemented in very first commit r9669 of DEV-444 and it's actual |
Comment by Oleksii Zagorskyi [ 2015 Sep 29 ] |
|
Comment by Alexander Vladishev [ 2015 Nov 12 ] |