[ZBX-17103] Missing data in log monitoring Created: 2019 Dec 19 Updated: 2020 Jan 08 Resolved: 2020 Jan 08 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Templates (T) |
Affects Version/s: | 4.2.7 |
Fix Version/s: | None |
Type: | Problem report | Priority: | Trivial |
Reporter: | FK | Assignee: | Andris Mednis |
Resolution: | Won't fix | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Production |
Attachments: |
![]() ![]() ![]() |
Description |
Hello, I have a ** very basic agent active logfile monitoring item with this simple item key : log[/logiciels/zabbix/logs/NodeListSurv.log]). It works fine most of the time, returning everything in my file. But from time to time, I only get a few of the last characters in the file, while the file contains much more. You can see int he screenshot that I get datas at 09:53 and 10:03, but at 09:57 there is only the 3 last characters of the file. Has someone ever encountered this issue please ? Steps to reproduce:
Result: |
Comments |
Comment by Andris Mednis [ 2019 Dec 19 ] |
How is it recreated? Just overwritten with a new file? If every 5 minutes the log file is overwritten with a new one but Zabbix agent is told to analyze the log file every 30 sec then it could lose some data. You can try to decrease update interval to, say, 3 sec and see how it goes. Is it possible to rotate the logfile? Or recreate it less often? |
Comment by FK [ 2019 Dec 19 ] |
The logfile is overwritten indeed. I'll try to set the update interval of the item to 3sec, but I don't get why it could solve the problem btw... I would understand if it showed sometimes only the begining of the file, cause it might be written at the same time, but here I get only a few of the last characters... |
Comment by Andris Mednis [ 2019 Dec 19 ] |
3 sec update interval (or you can set even 1 sec) will not solve the problem, I hope that less data will be missed due to log recreation. As of showing only few last characters in the log file - nobody reported such bug. To investigate how it happens you can set DebugLevel=4 in zabbix_agentd.conf or increase DebugLevel just for "active checks" process with -R option at runtime. Maybe increase LogFileSize, too. What operating system, what type of file system is used? How many records are written into log file per minute (approximately)? |
Comment by FK [ 2019 Dec 19 ] |
The logfile is on a RHEL 7.5, on a xfs filesyste. File is overwritten every 5 minutes with about 120 lines. |
Comment by FK [ 2019 Dec 20 ] |
Debug mode doesn't show anything in the agent log : For instance, I got the problem at 12:44 (with no character returned at all) and in the agent log, I don't get any entry at that time : 64055:20191220:122950.946 after changing modification time the size of log file "/logiciels/zabbix/logs/NodeListSurv.log" still has not been updated, consider it to be a new file 64055:20191220:123442.770 the modification time of log file "/logiciels/zabbix/logs/NodeListSurv.log" has been updated without changing its size, try checking again later 64055:20191220:123445.791 after changing modification time the size of log file "/logiciels/zabbix/logs/NodeListSurv.log" still has not been updated, consider it to be a new file 64055:20191220:123953.267 the modification time of log file "/logiciels/zabbix/logs/NodeListSurv.log" has been updated without changing its size, try checking again later 64055:20191220:123956.273 after changing modification time the size of log file "/logiciels/zabbix/logs/NodeListSurv.log" still has not been updated, consider it to be a new file 64055:20191220:124959.404 the modification time of log file "/logiciels/zabbix/logs/NodeListSurv.log" has been updated without changing its size, try checking again later 64055:20191220:125002.410 after changing modification time the size of log file "/logiciels/zabbix/logs/NodeListSurv.log" still has not been updated, consider it to be a new file 64055:20191220:125458.189 the modification time of log file "/logiciels/zabbix/logs/NodeListSurv.log" has been updated without changing its size, try checking again later 64055:20191220:125501.206 after changing modification time the size of log file "/logiciels/zabbix/logs/NodeListSurv.log" still has not been updated, consider it to be a new file |
Comment by Andris Mednis [ 2019 Dec 20 ] |
It does not look like a log with DebugLevel=4, there should be much more details. |
Comment by FK [ 2019 Dec 30 ] |
I got the problem this morning again (regularly). You'll find in attachement an extract of the agent log with DebugLevel=4. Let me know if it helps. You can see at some point that the value is just '2%;' instead of a whole line like "knl1764.krj.gie;250;250;26;224;1450m;40%;28;777%;3544Mi;11%;30100Mi;97%;" |
Comment by FK [ 2020 Jan 07 ] |
Hello, Let me know if you need more information please. |
Comment by Andris Mednis [ 2020 Jan 07 ] |
Hi! Thanks for the log file. It shows how the issue happened: 103458.991 process_logrt() old file list: 103458.991 nr:0 filename:'/logiciels/zabbix/logs/NodeListSurv.log' mtime:1577698199 size:8489 processed_size:8489 seq:1 incomplete:0 dev:64772 ino_hi:0 ino_lo:286092 md5size:512 md5buf:20c0b3c29506a80e2866d68ab4ed59a4 103458.991 process_logrt() new file list: (mtime:0 lastlogsize:8489 start_idx:0) 103458.991 nr:0 filename:'/logiciels/zabbix/logs/NodeListSurv.log' mtime:1577698497 size:8493 processed_size:8489 seq:0 incomplete:0 dev:64772 ino_hi:0 ino_lo:286092 md5size:512 md5buf:20c0b3c29506a80e2866d68ab4ed59a4 103458.991 In process_log() filename:'/logiciels/zabbix/logs/NodeListSurv.log' lastlogsize:8489 mtime:0 103458.991 In process_value() key:'knl1063.krj.gie:log[/logiciels/zabbix/logs/NodeListSurv.log]' value:'2%;' mtime 1577698199 (GMT: 2019. 30. December, Monday, 09:29:59) changed to 1577698497 (GMT: 2019. 30. December, Monday, 09:34:57) - about 5 minutes ahead. As file's inode and md5sum of the initial block did not change it was handled as the same file with only 4 bytes appended. So, it works as expected by design. If you are sure that it should have been like "knl1764.krj.gie;250;250;26;224;1450m;40%;28;777%;3544Mi;11%;30100Mi;97%;" - can you look into application writing this log file? Is there only one application which writes into the log file? Is it a multi-threaded or multi-process application? If so, does the application use log file locking to prevent several threads or processes from writing into the log file at the same time? |
Comment by FK [ 2020 Jan 07 ] |
Hi ! The "application" writing the log file is just another zabbix agent active item from the same template. That item runs a script which (roughly) executes commands to get pods information and writes the result in the logfile. item key : system.run[/etc/zabbix/zabbix_agentd.d/scripts/Template_SYS_ZROC_Openshift/ocp_list_nodes_surv.sh,nowait] That script is a bit long to execute (more than 30 sec, that's why I used option nowait and another item to read the result). To be sure the reading item doesn't try to read while the logfile is written, my script writes the result in a temporary file, and copies the temp file as the final logfile at the end. Basically :
My script doesn't handle lock, but isn't multithread ou multiprocess either. To be sure the script was the only one writting my log, I deactivated my item during 1/2h and the logfile didn't change (the error appears roughly every 15 minutes). I hope my explanation was clear though ^^ |
Comment by Andris Mednis [ 2020 Jan 07 ] |
Thanks for explanation! Currently your specific situation is not friendly to log[]-item algorithm because the whole log file gets overwritten quickly and often and with the same (or almost the same) content. Every time in this situation log[] item algorithm needs to decide - is it the same log file or a new one? Should it be analyzed from the start or from the last position (only appended bytes). Seems like "cp $logtmp $logfinal" copying happened to overwrite the log file just into the same i-node, with almost the same content (md5sum of the first 512 bytes was the same). Only mtime changed and it was 4 bytes longer. So, the agent reasonably decided that only the last 4 bytes were appended. Can you consider appending straight to $logfinal:
This could be more log[]-item friendly. Of course, you need to truncate /logiciels/zabbix/logs/NodeListSurv.log from time to time, maybe once a day. |
Comment by FK [ 2020 Jan 07 ] |
Without skip mode, shouldn't the log file be read from the beginning each time ? I didn't want to append the final logfile directly beause my script takes about 40 sec to write it completely, so I didn't want it to be written while it was read by my other item. |
Comment by Andris Mednis [ 2020 Jan 07 ] |
'skip' mode is in effect only for a new log[] item, .i.e. only once. When the log file is being monitored, the 'skip' mode has no effect, even if the agent is restarted. I think there is no need to worry about log file being modified while the agent analyzes it. log[]-item algorithm should handle it reasonably. It should not analyze incomplete, partly written log records (without newline at the end). |
Comment by FK [ 2020 Jan 07 ] |
Alright, it seems to work fine that way, indeed. Thank you very much ! |
Comment by Andris Mednis [ 2020 Jan 08 ] |
Thanks! Nice to hear |