[ZBX-22943] Massive Memory Leak in Agent2 on Logfile Monitoring Created: 2023 Jun 08 Updated: 2024 Apr 10 Resolved: 2023 Sep 12 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 6.4.2, 6.4.3, 7.0.0alpha5 |
Fix Version/s: | 6.4.7rc1, 7.0.0alpha5, 7.0 (plan) |
Type: | Problem report | Priority: | Critical |
Reporter: | Daniel Hafner | Assignee: | Artjoms Rimdjonoks |
Resolution: | Fixed | Votes: | 2 |
Labels: | agent2 | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Tested under: OEL 7-9/RHEL 7-9/CENTOS 7,8, x86_64 |
Attachments: |
Issue Links: |
|
Team: |
Sprint: | Sprint 104 (Sep 2023) |
Story Points: | 1 |
Description |
Description: There seems to be a memory leak in Agent 2. How severe it is depends on how many logfiles are monitored and how aggressively. Currently we need to restart the agent multiple times a day. Tests already tried:
CallStack[12]: may-leak=66 (4833 bytes) expired=66 (4833 bytes), free_expired=0 (0 bytes)
alloc=452 (33265 bytes), free=275 (20277 bytes)
freed memory live time: min=0 max=4 average=0
un-freed memory live time: max=15
0x00007f4c2e467740 libc-2.17.so   malloc()+0
0x0000000000bd52e7 zabbix_agent2  zbx_malloc2()+103
0x0000000000aa0b43 zabbix_agent2  __zbx_zabbix_log()+245
0x0000000000bf4039 zabbix_agent2  process_log_check()+12265
0x0000000000aa17c9 zabbix_agent2  _cgo_c762f3fe2651_Cfunc_process_log_check()+160
0x000000000048b304 zabbix_agent2
Measurement: watch -n 1 "pmap -x $(pgrep zabbix_agent2) | tail;echo; ps -Tf -p $(pgrep zabbix_agent2) | wc -l"
Check the RSS and Dirty values; both increase slowly but steadily. The usage wobbles slightly, around +-2 MB.
Refer to:
https://www.zabbix.com/forum/zabbix-help/465504-zabbix-agent-2-memory-leak-due-logfile-monitoring
https://discord.com/channels/713327720528085042/1116380766511837225
Steps to reproduce: Enable logfile monitoring and wait some time (~5-15 min).
Result: The memory usage (RSS & Dirty) reported by pmap will show massive values. Depending on how many logfiles are monitored, I've seen the following memory values:
Tested today (see the Discord link). The same test under Agent 1 does not result in any memory issue.
Expected:
Current Workaround: |
Comments |
Comment by Artjoms Rimdjonoks [ 2023 Jun 09 ] |
chirrut Previously I have encountered that it is native Go behavior to continuously use more and more RSS memory. This could very well be unrelated, and your report could be a legitimate issue. Please check this flag while I investigate what else could be the problem. |
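For context, a minimal standalone sketch of what such a check can look like; it is not taken from the Zabbix code base, and the 10-sample loop, interval and output format are made up for illustration. It prints the Go runtime's own heap counters next to the kernel-reported VmRSS, then asks the runtime to return unused pages to the OS: if RSS shrinks after debug.FreeOSMemory() while HeapInuse stays flat, the growth was runtime retention rather than a leak in the application logic.

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
	"strings"
	"time"
)

// readVmRSS parses the VmRSS line from /proc/self/status (Linux-specific).
func readVmRSS() string {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	for i := 0; i < 10; i++ {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// Runtime's view of the heap vs. the kernel's view of the process.
		fmt.Printf("HeapInuse=%d HeapReleased=%d Sys=%d VmRSS=%s\n",
			m.HeapInuse, m.HeapReleased, m.Sys, readVmRSS())
		// Force a GC and return as much unused memory to the OS as possible.
		debug.FreeOSMemory()
		time.Sleep(10 * time.Second)
	}
}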
Comment by Vladislavs Sokurenko [ 2023 Jun 09 ] |
The following code must be fixed; it should only call malloc after the log level has been checked and it is actually going to log something:

void __zbx_zabbix_log(int level, const char *format, ...)
{
	if (zbx_agent_pid == getpid())
	{
		va_list	args;
		char	*message = NULL;
		size_t	size;

		va_start(args, format);
		size = vsnprintf(NULL, 0, format, args) + 2;
		va_end(args);

		/* the message buffer is allocated and formatted unconditionally, */
		/* before the log level is ever checked */
		message = (char *)zbx_malloc(NULL, size);

		va_start(args, format);
		vsnprintf(message, size, format, args);
		va_end(args);

		handleZabbixLog(level, message);
		zbx_free(message);
	}
}

<arimdjonoks> This does look like something we could fix in this ticket as an extra, but it should not be the cause of the memory leak. <andris> Successfully tested. CLOSED |
Comment by Daniel Hafner [ 2023 Jun 09 ] |
Hi! The agent has been running for ~30 min and its memory increased from 25 MB to 48 MB. br, |
Comment by Thic Drinklots [ 2023 Jul 06 ] |
Hi! Kind regards |
Comment by Artjoms Rimdjonoks [ 2023 Jul 06 ] |
ThickDrinkLots, there seems to be no issue. |
Comment by Thic Drinklots [ 2023 Jul 06 ] |
Sorry, now I see this issue is related to log monitoring, which I don't have enabled. But I had version 6.4.4 installed on several AWS EC2s, and on all of them zabbix-agent slowly drained memory. On machines with 1 GB of RAM it caused some OOM messages to be thrown (not by the agent itself, but by locate's updatedb, for example) and left the machine completely unresponsive. A restart of the zabbix-agent2 service is a solution for about 5-7 hours. As a workaround I installed version 6.0 LTS, but I can reinstall 6.4.4 on some test machines to reproduce this strange behavior. If you need me to run some commands to provide more details, please let me know. Sorry for the confusion. |
Comment by Daniel Hafner [ 2023 Jul 06 ] |
What do you mean by "there seems to be no issue"? We have at least 20 systems with exactly the same problem. The memory usage of these agents is annoyingly high.
Excerpt from today (24 h runtime!):
Address Kbytes RSS Dirty Mode Mapping
If you need some more debugging info, just ask... <arimdjonoks> Please provide me with more detailed information; the only thing I see is output from an unknown tool for unknown processes. I need to know exactly how you measured the memory used (tool used, its version, its parameters, etc.). Also, which templates/items do you use? I have been testing agent2 memory usage myself with standard templates and various log items, and I got a result like: |
Comment by Daniel Hafner [ 2023 Jul 06 ] |
Hi, please refer to the Ticket header:
Measurement: watch -n 1 "pmap -x $(pgrep zabbix_agent2) | tail;echo; ps -Tf -p $(pgrep zabbix_agent2) | wc -l"
Check the RSS and Dirty values; both increase slowly but steadily. The usage wobbles slightly, around +-2 MB.
Version: [root@ ~]# pmap -V
Graph: About a graph, I'm going to generate one.
Templates: No, we have created some log monitors for Oracle logfile monitoring.
Full key: log[/u01/app/oracle/diag/rdbms/.../trace/alert_XYZ.log,@alertlog_selected]
It is OK if the memory peaks a little, but not by multiple gigabytes. As mentioned above, the same configuration does not raise any memory usage issue with Agent 1. |
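As an aside, the graph data mentioned above can also be collected outside Zabbix. Below is a minimal standalone sketch, not part of Zabbix, in which the 60-second interval, output format and usage string are arbitrary choices; it prints timestamped VmRSS/VmData samples for a given PID from /proc/<pid>/status, ready to be redirected to a file and plotted.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// sample reads VmRSS and VmData for the given PID from /proc/<pid>/status.
func sample(pid string) (rss, data string) {
	rss, data = "?", "?"
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "VmRSS:"):
			rss = strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
		case strings.HasPrefix(line, "VmData:"):
			data = strings.TrimSpace(strings.TrimPrefix(line, "VmData:"))
		}
	}
	return
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: sampler <pid>")
		os.Exit(1)
	}
	for {
		rss, data := sample(os.Args[1])
		fmt.Printf("%s VmRSS=%s VmData=%s\n", time.Now().Format(time.RFC3339), rss, data)
		time.Sleep(60 * time.Second)
	}
}

For example (file name assumed): go run sampler.go $(pgrep zabbix_agent2) >> agent2_mem.log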
Comment by Thic Drinklots [ 2023 Jul 07 ] |
Update: at least in my case the issue may not be related to zabbix-agent2 version 6.4.4. After downgrading to 6.0.19 I'm still seeing similar symptoms. |
Comment by Artjoms Rimdjonoks [ 2023 Jul 12 ] |
Investigation 2
1) I created 1000 items (using LLD):
log[/tmp/zabbix_agent2.log,0,,,,,,,]
log[/tmp/zabbix_agent2.log,1,,,,,,,]
log[/tmp/zabbix_agent2.log,2,,,,,,,]
...
2) The result graph for vfs.file.regexp[/proc/734147/status,"VmData"] does look like memory is indeed increasing.
3) Generated a pprof report using:
go tool pprof -callgrind -output callgrind.out http://localhost:6060/debug/pprof/heap
gprof2dot --format=callgrind --output=out.dot ./callgrind.out
dot -Tpng out.dot -o graph.png
pprof callgrind heap report1: and, with a 40-minute difference, report2: notice how ProcessLogCheck increased from 4% to 27%; this does indeed look suspicious. After several hours I deleted all items and then rechecked the pprof graph: there is no ProcessLogCheck present anymore, the heap memory with it was freed by Go. So there is no heap memory occupied by the Zabbix Agent 2 log processing logic, which indicates to me there is no memory leak.
4) Heap-memory tracking
# runtime.MemStats
# Alloc = 45716592
# TotalAlloc = 6777223921304
# Sys = 4084216008
# Lookups = 0
# Mallocs = 21168956437
# Frees = 21168316090
# HeapAlloc = 45716592
# HeapSys = 3975741440
# HeapIdle = 3928309760
# HeapInuse = 47431680
# HeapReleased = 3909525504
# HeapObjects = 640347
# Stack = 4653056 / 4653056
# MSpan = 396160 / 6968640
# MCache = 9600 / 15600
# BuckHashSys = 14950706
# GCSys = 79958792
# OtherSys = 1927774
# NextGC = 60796560
# LastGC = 1689230997948570166
I have been tracking the HeapInuse field with Zabbix. After I delete the items, it goes back:
5) The vfs.file.regexp[/proc/1275615/status,"VmData"] data, after 12 hr of extensive testing and deletion of all test items, increased to 7 Mb in the meantime.
Conclusion
Feel free to comment on my findings and suggest improvements. (I actually found a small memory leak, but it was a relatively edge case and could not be the cause of any significant memory consumption. This will be fixed as
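For readers unfamiliar with the tooling referenced above: the localhost:6060 address queried by go tool pprof is the standard net/http/pprof endpoint that a Go process has to expose itself. A minimal sketch of that wiring is shown below; it is not the actual agent2 code, and the port, one-minute interval and log format are assumptions. It also logs the same HeapInuse counter that was tracked via Zabbix in this investigation.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// Serve the profiling endpoints that
	// "go tool pprof http://localhost:6060/debug/pprof/heap" reads.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Periodically record heap counters from runtime.MemStats.
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("HeapInuse=%d HeapObjects=%d NumGC=%d", m.HeapInuse, m.HeapObjects, m.NumGC)
	}
}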
Comment by moosup [ 2023 Sep 01 ] |
Sorry the attachments got uploaded by accident. https://support.zabbix.com/browse/ZBX-23349 this is where you can find them if you need them. If you want me to also upload them here please let me know.
|
Comment by Vladislavs Sokurenko [ 2023 Sep 01 ] |
Caused by DEV-2137 |
Comment by Vladislavs Sokurenko [ 2023 Sep 01 ] |
Is it possible to test whether the leak is gone if we provide a patch, or is it better to wait until a new version is released? martsupplydrive |
Comment by moosup [ 2023 Sep 01 ] |
Yes, I am happy to assist. If there is a patch available I will apply it to some of my servers. I may need some help updating the agents from source, since I normally update mine from packages. |
Comment by Artjoms Rimdjonoks [ 2023 Sep 06 ] |
martsupplydrive We are planning to include the fixes in the closest releases, but if you could test any of them and provide early feedback, that would be great. Thank you. |
Comment by Artjoms Rimdjonoks [ 2023 Sep 06 ] |
Available in versions:
Note: the fixes cover TLS connections, log items and eventlog (Windows) items. |
Comment by moosup [ 2023 Sep 07 ] |
I will start testing today and give an update in a few hours to see if I can notice a difference.
|
Comment by moosup [ 2023 Sep 07 ] |
I installed the 6.4.7RC1 build on 2 of my servers. I restarted one of the "older" (6.4.6) agents on another server with similar logfile checks so I can compare them. With the previous version the memory increased by around 250-300 MB a day, and so far I haven't seen that. I don't want to say for sure that it has been solved, since it has only been around 8 hours since the agents started, but so far so good. I will monitor the agents' performance for the next couple of days and post updates here. |
Comment by moosup [ 2023 Sep 11 ] |
New Update: |