[ZBX-22943] Massive Memory Leak in Agent2 on Logfile Monitoring Created: 2023 Jun 08  Updated: 2024 Apr 10  Resolved: 2023 Sep 12

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 6.4.2, 6.4.3, 7.0.0alpha5
Fix Version/s: 6.4.7rc1, 7.0.0alpha5, 7.0 (plan)

Type: Problem report Priority: Critical
Reporter: Daniel Hafner Assignee: Artjoms Rimdjonoks
Resolution: Fixed Votes: 2
Labels: agent2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Tested under: OEL 7-9 / RHEL 7-9 / CentOS 7-8, x86_64
Packaging: official RPMs and self-compiled builds
Agent2 versions tested: 6.2 up to 6.4.3


Attachments: JPEG File RRR.jpg     PNG File Screenshot 2023-07-12 at 09.12.41.png     PNG File graph-1.png     PNG File graph-2.png     PNG File graph.png     PNG File graph2.png     PNG File graph_2.png     PNG File heap_in_use_12_hr_2.png     PNG File heap_in_use_12hr.png     PNG File image-2023-07-06-14-39-44-738.png    
Issue Links:
Causes:
Duplicate: is duplicated by ZBX-23349 Potential Memory Leak in Zabbix Agent2 (Closed)
Sub-task: depends on ZBX-23107 Small Memory Leak in Zabbix Agent 2 i... (Closed)
Team: Team C
Sprint: Sprint 104 (Sep 2023)
Story Points: 1

 Description   

Description:

There seems to be a memory leak in the Agent 2 version. How severe it is depends on how many logfiles are monitored and how aggressively.

Currently we need to restart the agent multiple times a day.

Tests already tried (an example zabbix_agent2.conf excerpt with the parameters involved follows the memleax output below):

  • Disabled/enabled agent encryption: no change
  • BufferSize=1: slows the issue down
  • Enabled the persistent buffer: no change
  • Increased Plugins.Log.MaxLinesPerSecond to 150, 500, 1000: no change
  • Reduced monitoring to a single logfile: memory still increases, just more slowly (~1h)
  • memleax says:
CallStack[12]: may-leak=66 (4833 bytes)
 expired=66 (4833 bytes), free_expired=0 (0 bytes)
 alloc=452 (33265 bytes), free=275 (20277 bytes)
 freed memory live time: min=0 max=4 average=0
 un-freed memory live time: max=15
 0x00007f4c2e467740 libc-2.17.so malloc()+0
 0x0000000000bd52e7 zabbix_agent2 zbx_malloc2()+103
 0x0000000000aa0b43 zabbix_agent2 __zbx_zabbix_log()+245
 0x0000000000bf4039 zabbix_agent2 process_log_check()+12265
 0x0000000000aa17c9 zabbix_agent2 _cgo_c762f3fe2651_Cfunc_process_log_check()+160
 0x000000000048b304 zabbix_agent2
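
For reference, the zabbix_agent2.conf parameters varied in the tests above look roughly like the excerpt below. This is only a sketch with placeholder values; the persistent-buffer and PSK parameter names (EnablePersistentBuffer, PersistentBufferFile, TLS*) are the standard agent options and are assumed here, not copied from the affected hosts.

# Example zabbix_agent2.conf excerpt (placeholder values) covering the parameters varied above
BufferSize=1                         # reduced; per the tests above this only slows the growth
EnablePersistentBuffer=1             # persistent buffer on/off made no difference
PersistentBufferFile=/var/lib/zabbix/agent2-buffer.db
Plugins.Log.MaxLinesPerSecond=500    # tested with 150, 500 and 1000
TLSConnect=psk                       # encryption on/off made no difference
TLSAccept=psk
TLSPSKIdentity=agent-psk
TLSPSKFile=/etc/zabbix/agent.psk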

 

Measurement:

 watch -n 1 "pmap -x $(pgrep zabbix_agent2) | tail;echo; ps -Tf -p $(pgrep zabbix_agent2) | wc -l" 

Check the RSS and Dirty values: both increase slowly but steadily. The usage wobbles by roughly ±2 MB.

Refer to:

https://www.zabbix.com/forum/zabbix-help/465504-zabbix-agent-2-memory-leak-due-logfile-monitoring

https://discord.com/channels/713327720528085042/1116380766511837225

Steps to reproduce:

Enable logfile monitoring and wait some time (~5-15 min).

Result:

pmap reports massive RSS and Dirty memory usage values. Depending on how many logfiles are monitored, I have seen the following values:

  • After 24 h: ~500 MB-1.5 GB
  • After 14 days: ~15-20 GB

Tested today (see the Discord link). The same test with Agent 1 does not result in any memory issue.

Expected:
Stable memory usage, less than several hundred MB.

Current workaround:
Switch to Agent 1.



 Comments   
Comment by Artjoms Rimdjonoks [ 2023 Jun 09 ]

chirrut
Is the issue reproducible with the "GODEBUG=madvdontneed=1" flag?
e.g. "GODEBUG=madvdontneed=1 ./sbin/zabbix_agent2 -c /home/arimdjonoks/zabbix/etc/zabbix_agent2.conf"

Previously I have encountered this as native Go behavior: the process appears to use more and more RSS memory over time.
Essentially it comes down to the Go garbage collector marking freed memory with MADV_FREE, which tells the OS: "I am not using this memory, but keep it assigned to me; once other processes request it, you can deallocate it from me entirely."
This behavior can be disabled with the madvdontneed flag.

This could very well be unrelated and the report could be a legitimate issue. Please check this flag while I investigate what else could be the problem.
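
For reference, rising RSS caused by MADV_FREE'd pages can also be told apart from a real leak from inside a Go process by comparing HeapInuse against HeapReleased in runtime.MemStats, or by forcing pages back to the OS with debug.FreeOSMemory(). A minimal standalone sketch (not agent2 code):

// Standalone sketch: print the Go runtime's own view of heap memory.
// If HeapReleased is large while HeapInuse stays flat, the growing RSS is
// mostly MADV_FREE'd pages the kernel has not reclaimed yet, not a leak.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("HeapInuse=%d HeapReleased=%d Sys=%d\n",
			m.HeapInuse, m.HeapReleased, m.Sys)

		// Optional: return freed pages to the OS immediately (MADV_DONTNEED),
		// which should make RSS drop if there is no real leak.
		debug.FreeOSMemory()

		time.Sleep(30 * time.Second)
	}
}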

Comment by Vladislavs Sokurenko [ 2023 Jun 09 ]

The following code must be fixed: it should only call malloc after the log level has been checked and it is actually going to log something:

void __zbx_zabbix_log(int level, const char *format, ...)
{
	if (zbx_agent_pid == getpid())
	{
		va_list	args;
		char *message = NULL;
		size_t size;

		va_start(args, format);
		size = vsnprintf(NULL, 0, format, args) + 2;
		va_end(args);
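		/* note: the size probe above and the malloc below run unconditionally,
		   even when 'level' is below the configured log level */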
		message = (char *)zbx_malloc(NULL, size);
		va_start(args, format);
		vsnprintf(message, size, format, args);
		va_end(args);

		handleZabbixLog(level, message);
		zbx_free(message);
	}
}

<arimdjonoks> This does look like something we could fix in this ticket as an extra, but this should not be the cause of the memory leak.
RESOLVED in 5e414711a2f

<andris> Successfully tested. CLOSED

Comment by Daniel Hafner [ 2023 Jun 09 ]

Hi!
GODEBUG=madvdontneed=1 tested:

The agent has been running for ~30 min and memory has increased from 25 MB to 48 MB.

br,
Daniel

Comment by Thic Drinklots [ 2023 Jul 06 ]

Hi!
I have similar issues with zabbix-agent2 in version 6.4.4, revision a749236b3d9 on Ubuntu 22.04

Kind regards

Comment by Artjoms Rimdjonoks [ 2023 Jul 06 ]

ThickDrinkLots, there seems to be no issue.
Please provide information that would suggest otherwise.

Comment by Thic Drinklots [ 2023 Jul 06 ]

Sorry, now I see this issue is related to log monitoring, which I don't have enabled.

But I had version 6.4.4 installed on several AWS EC2 instances, and on all of them zabbix-agent2 slowly drained memory. On machines with 1 GB of RAM this caused OOM messages (not from the agent itself, but from locate's updatedb, for example) and left the machine completely unresponsive. Restarting the zabbix-agent2 service helps for about 5-7 hours.

As a workaround, I installed version 6.0 LTS, but I can reinstall 6.4.4 on some test machines to reproduce this strange behavior. If you need me to run some commands to provide more details, please let me know.

Sorry for the confusion.

Comment by Daniel Hafner [ 2023 Jul 06 ]

@arimdjonoks

What do you mean by "there seems to be no issue"?

We have at least 20 systems with exactly the same problem. The memory usage of these agents is annoyingly high.

 

Excerpt from today (24 h runtime!):

Address           Kbytes     RSS   Dirty Mode  Mapping
---------------- ------- ------- -------
total kB         9656396 4969084 4952456
total kB         6277280 3096092 3079392
total kB         5954220 3108472 3091504
total kB         6154340 3111732 3094716
total kB         6514800 3154036 3137476
total kB         6193992 3154872 3138204
total kB         5800484 3650960 3634136
total kB         7172020 3153072 3136404
total kB         9745384 5117148 5100508
total kB         6548076 3088624 3071780
total kB         8096352 5147700 5130960
total kB         7329760 3108960 3091904
total kB         8472232 5212468 5195740
total kB         5750296 3151756 3135148
total kB         12170964 9066636 9050024
total kB         7229616 3155976 3139372
total kB         4362528 2329532 2312856
total kB         5764216  363108  322980
total kB         5651572  362068  323808
total kB         3194960 1311408 1304536

If you need some more debugging info just ask...

<arimdjonoks>
I am saying "there seems to be no issue" because the mere fact that an application uses more and more memory does not on its own prove there is a memory leak. I need to see that memory keeps increasing over a long period of time and is never recovered.

Please provide more detailed information; the only thing I see is output from an unknown tool for unknown processes.

I need to know exactly how you measured the memory used (tool, its version, its parameters, etc.).
I need to see some measurements over the last 24 hours (a graph would be ideal).

Also, which templates/items do you use?
Have you noticed whether particular items could be causing this?

I have been testing agent2 memory usage myself with standard templates and various log items, and I got results like this:

(This is basically a Zabbix item that reads "/proc/<agent 2 PID>/status" and extracts VmData.)
This does contain large spikes of memory usage, but then it goes back down (which appears to be how garbage collection works in Go).

Comment by Daniel Hafner [ 2023 Jul 06 ]

Hi,

please refer to the Ticket header:

 

Measurement:

 watch -n 1 "pmap -x $(pgrep zabbix_agent2) | tail;echo; ps -Tf -p $(pgrep zabbix_agent2) | wc -l" 

Check the RSS and Dirty values: both increase slowly but steadily. The usage wobbles by roughly ±2 MB.

Version:

[root@ ~]# pmap -V
pmap from procps-ng 3.3.10

Graph:

Regarding a graph: I am going to generate one.

Templates:

No, we have created some log monitors for Oracle logfile monitoring.

key full: log[/u01/app/oracle/diag/rdbms/.../trace/alert_XYZ.log,@alertlog_selected]

 

It is OK if the memory peaks a little bit, but not by multiple gigabytes. As mentioned above, the same configuration does not raise any memory issue with Agent 1.

Comment by Thic Drinklots [ 2023 Jul 07 ]

Update: at least in my case the issue may not be related to zabbix-agent2 version 6.4.4. After downgrading to 6.0.19 I am still seeing similar symptoms.

Comment by Artjoms Rimdjonoks [ 2023 Jul 12 ]

Investigation 2
I extensively tested items with the log[<x>,<y>,...] pattern on Zabbix Agent 2 using pprof (Go's standard tool for investigating memory usage and leaks), and I see no signs of a memory leak.

1) I created 1000 items (using LLD):

log[/tmp/zabbix_agent2.log,0,,,,,,,]
log[/tmp/zabbix_agent2.log,1,,,,,,,]
log[/tmp/zabbix_agent2.log,2,,,,,,,]
...

2) The result graph for the item

vfs.file.regexp[/proc/734147/status,"VmData"]

does look like memory is indeed increasing.

3) Generated pprof report using:

go tool pprof -callgrind -output callgrind.out http://localhost:6060/debug/pprof/heap
gprof2dot --format=callgrind --output=out.dot ./callgrind.out
dot -Tpng out.dot -o graph.png
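
(For context, the /debug/pprof/heap URL queried above is the standard net/http/pprof endpoint. A generic sketch of how a Go binary exposes it on localhost:6060 follows; exactly how the agent2 test build enabled it is not shown in this ticket and is an assumption here.)

// Generic sketch of exposing the standard pprof endpoints on localhost:6060.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // placeholder: the real program would do its actual work here
}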

pprof callgrind heap report1:

and, with a 40-minute difference, report2:

Notice how ProcessLogCheck increased from 4% to 27%; this does indeed look suspicious.

After several hours I deleted all items and then rechecked the pprof graph:

ProcessLogCheck is no longer present; the heap memory associated with it was reclaimed by Go. So no heap memory is occupied by the Zabbix Agent 2 log-processing logic, which indicates to me that there is no memory leak.

4) Heap-memory tracking
pprof produces heap reports with actual numbers like:

# runtime.MemStats
# Alloc = 45716592
# TotalAlloc = 6777223921304
# Sys = 4084216008
# Lookups = 0
# Mallocs = 21168956437
# Frees = 21168316090
# HeapAlloc = 45716592
# HeapSys = 3975741440
# HeapIdle = 3928309760
# HeapInuse = 47431680
# HeapReleased = 3909525504
# HeapObjects = 640347
# Stack = 4653056 / 4653056
# MSpan = 396160 / 6968640
# MCache = 9600 / 15600
# BuckHashSys = 14950706
# GCSys = 79958792
# OtherSys = 1927774
# NextGC = 60796560
# LastGC = 1689230997948570166

I have been tracking the HeapInuse field with Zabbix.
Result over the last 12 hours with a thousand items:

After I delete them, it goes back down:

5) The vfs.file.regexp[/proc/1275615/status,"VmData"] data, after 12 hours of extensive testing and the deletion of all test items, had increased to 7 MB in the meantime.

Conclusion
I see no evidence of a memory leak. I see that Go is slowly consuming memory, but the rate at which it does so does not look suspicious to me.

Feel free to comment on my findings and suggest improvements.

(I actually found a small memory leak, but it is a relative edge case and could not be the cause of any significant memory consumption. It will be fixed as part of this ticket; see ZBX-23107.)

Comment by moosup [ 2023 Sep 01 ]

Sorry, the attachments got uploaded by accident. You can find them at https://support.zabbix.com/browse/ZBX-23349 if you need them.

If you want me to also upload them here please let me know.

 

Comment by Vladislavs Sokurenko [ 2023 Sep 01 ]

Caused by DEV-2137
https://git.zabbix.com/projects/ZBX/repos/zabbix/commits/ad7cb6607
Also:
https://git.zabbix.com/projects/ZBX/repos/zabbix/commits/f583f58541cd9add667d0c2d3544f5068571f8f6#src/go/pkg/zbxlib/logfile.go
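
(For readers wondering why the cgo bridge in logfile.go is a plausible place for a leak: memory allocated on the C side from Go is never touched by the Go garbage collector, so it must be freed explicitly. A generic illustration of that pattern follows; it is not the actual Zabbix code or diff.)

// Generic cgo illustration (not the actual Zabbix code): every C.CString call
// mallocs on the C heap, and Go's GC never reclaims it, so each call leaks
// unless the allocation is freed explicitly.
package main

/*
#include <stdlib.h>
*/
import "C"

import "unsafe"

func processLine(line string) {
	cLine := C.CString(line)            // malloc'ed on the C heap
	defer C.free(unsafe.Pointer(cLine)) // without this free, memory leaks on every call
	// ... hand cLine to the C log-processing code here ...
}

func main() {
	processLine("example log line")
}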

Comment by Vladislavs Sokurenko [ 2023 Sep 01 ]

Is it possible for you to test whether the leak is gone if we provide a patch, or is it better to wait until a new version is released? martsupplydrive

Comment by moosup [ 2023 Sep 01 ]

Yes, I am happy to assist. If there is a patch available I will apply it to some of my servers. I may need some help updating the agents from source, since I normally update mine from packages.

Comment by Artjoms Rimdjonoks [ 2023 Sep 06 ]

martsupplydrive
we have the fix available for testing in the following branches:
feature/ZBX-22943-6.5
feature/ZBX-22943-6.4
feature/ZBX-22943-6.0

We are planning to include the fixes in the closest releases, but if you could test any of them and provide early feedback, that would be great. Thank you.

Comment by Artjoms Rimdjonoks [ 2023 Sep 06 ]

Available in versions:

Note: the fixes cover TLS connections, log items, and eventlog (Windows) items.

Comment by moosup [ 2023 Sep 07 ]

I will start testing today and will give an update in a few hours to see if I can notice a difference.

 

Comment by moosup [ 2023 Sep 07 ]

I installed the 6.4.7rc1 build on 2 of my servers. I also restarted one of the older (6.4.6) agents on another server with similar logfile checks so I can compare them.
So far it looks like the RC build performs better: the amount of memory used by the RC is around half of what 6.4.6 uses at the moment. The servers are similar but not identical, so I don't want to draw an early conclusion.

With the previous version, memory increased by around 250-300 MB a day, and so far I haven't seen that. I don't want to say for sure that it has been solved, since it has only been around 8 hours since the agents were started. But so far, so good.

I will monitor the agents' performance for the next couple of days and post updates here.

Comment by moosup [ 2023 Sep 11 ]

New update:
Looking at the data from this weekend, I can clearly see a major improvement. The memory usage is as expected and it is starting to level off like it did before.
The 6.4.6 agent I restarted at the same time as the 6.4.7rc1 agent uses 4 times as much memory on a similar system with similar settings.
