[ZBX-4195] Windows zabbix_agentd Working Set size steadily growing on Windows Server 2008 64-bit boxes Created: 2011 Oct 03  Updated: 2017 May 30  Resolved: 2012 Mar 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 1.8.8
Fix Version/s: 1.8.12rc1, 2.0.0rc3

Type: Incident report Priority: Blocker
Reporter: Kam Lane Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: agent, memoryleak, windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows Server 2008 64-bit


Attachments: File Template_Windows_Server.conf     XML File Template_Windows_Server.zbx.xml     File Template_Windows_Server_2008.conf     XML File Template_Windows_Server_2008.zbx.xml     JPEG File after_fix.jpg     PNG File agent1.9.8_memleak4.png     PNG File agent_memleak_in_r19318.png     PNG File agent_memleak_with_pause_r19318.png     PNG File clear_test-damn_leak.png     PNG File updated_graph.PNG     PNG File zabbix_agentd.win.working_set.PNG     PNG File zabbix_agentd_running_away.PNG    
Issue Links:
Duplicate

 Description   

I'm noticing that the working set size for the zabbix_agentd Windows process keeps growing and is never released/cleaned up. I started running the agent on all these boxes on 2011.09.30. Here is the 4-day weekend view. I am doing a lot of active checks calling perf_counters. I've been building a new Windows template to share, but am still trying to get a few things to add up before I do. Also note that "Template_Windows", which is listed as a template in my selected group, was pulled into the graph... is it supposed to do that?

'Scale' & 'Average by' are both set to 'Daily' for the graph.
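
For context, the active checks are perf_counter items; a representative key (illustrative only, not taken from the attached templates) looks like:

perf_counter[\Processor(_Total)\% Processor Time,60]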



 Comments   
Comment by richlv [ 2011 Oct 03 ]

template included in a bar report seems to be a bug, could you please create a new ZBX report?

Comment by Kam Lane [ 2011 Oct 03 ]

Are you referring to a new ZBX screenshot report for this issue or a new ZBX bug report?

Comment by Kam Lane [ 2011 Oct 04 ]

Something else I noticed: of the hosts in the screenshot above, the ones with the larger memory footprint are the hosts that have 12 unsupported active performance checks apiece, where the template is looking for a drive [D:] that doesn't exist.

Comment by Kam Lane [ 2011 Oct 04 ]

I opened ZBX-4201 for the template included in a bar report bug.

Comment by Kam Lane [ 2011 Nov 07 ]

You can see where I've restarted the agent every day of the business week, but over the weekend the process runs away. The attached screenshot is for a 32-bit Windows Server 2003 host. The memory leak appears to be more evident on this host than on the 2008 hosts.

Comment by Kam Lane [ 2011 Nov 07 ]

This is my default Windows template that works across all Windows hosts from XP to 2008.

Comment by Kam Lane [ 2011 Nov 07 ]

This is the add-on template for Windows Server 2008 hosts that links to Template_Windows_Server.zbx.xml

Comment by Kam Lane [ 2011 Nov 07 ]

This is the default "Include=" file for all Windows hosts, referenced from my main configuration file.

Comment by Kam Lane [ 2011 Nov 07 ]

This one is only included on Windows Server 2008 hosts where the 2008 template is linked.
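
For clarity, the layering described in these comments would look roughly like this in zabbix_agentd.win.conf (the paths are illustrative, not taken from the attachments):

Include=C:\zabbix\Template_Windows_Server.conf
# only on Windows Server 2008 hosts:
Include=C:\zabbix\Template_Windows_Server_2008.conf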

Comment by Kam Lane [ 2011 Nov 07 ]

I thought attaching these might make tracking down this leak easier. They're also pretty detailed as far as the statistics they pull, so the rest of the community may benefit from them. I dumped the `typeperf -qx` output of a bunch of different hosts and cross-referenced them via joins in a MySQL table to generate the "common" Template_Windows_Server.

Comment by Kam Lane [ 2011 Nov 07 ]

My suspicion is that the leak has something to do with the handling of the non-existent checks, for example on a host that has no D: drive: "Active check [perf.logicaldisk._D.counter[avg_disk_queue_length]] is not supported. Disabled."
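
Keys like that are presumably derived counters defined in the attached .conf files; a hypothetical definition using the agent's PerfCounter directive (key name and interval purely illustrative) might look like:

PerfCounter=logicaldisk_d_avg_queue,"\LogicalDisk(D:)\Avg. Disk Queue Length",60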

Comment by Oleksii Zagorskyi [ 2012 Jan 31 ]

I may have a similar memory leak in agent v1.9.8 r23548 on a 32-bit Windows XP host.

After restarting the agent, rebooting the host, and changing DebugLevel from 4 to 3, the memory leak is still reproducible.
The problem appeared approximately one week ago. The WinXP32 host is used for various debugging/tests, so maybe some checks were added a week ago.

The memory grows at roughly 0.6 MB/hour, and the growth is noticeable immediately after the agent starts.
I continue investigating.

At the same time, v1.9.8 r23548 works on four 32-bit Server 2003 hosts without any problems.

<zalex> Note: on the Server 2003 hosts above, only existing perf counters are monitored. This explains my next comment.

Comment by Oleksii Zagorskyi [ 2012 Feb 02 ]

Huh, problem reproduced.
I have determined exactly where and when it happens.
Broken in ZBX-3547.

I tested the two available revisions of zabbix_agentd from the 1.8 SVN branch - 19048 and 19318 (with ZBX-3547 merged in).
See the attached picture "agent_memleak_in_r19318.png".
I used two missing perf counters (it doesn't matter which exactly).

Note that before ZBX-3547 was implemented, the Zabbix agent returned "0" for missing counters; after ZBX-3547 it returns "NOT SUPPORTED" for missing counters.

The number of missing monitored counters has a direct impact on the leak speed!

Additionally, see "agent1.9.8_memleak4.png". I hope everything is clear.
The bug is CONFIRMED.

Comment by Kam Lane [ 2012 Feb 02 ]

I also see the memory leak on 32-bit boxes; it just grows at a slower rate. As a workaround, I've created a scheduled job/task that executes the following batch script every day at noon to restart the agent and reduce the working set size:

@echo off
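REM Restart the Zabbix agent Windows service: -x stops it, -s starts it again.
REM The PROCESSOR_ARCHITECTURE check picks the matching 32-bit or 64-bit binary.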

IF /I "%PROCESSOR_ARCHITECTURE%"=="x86" GOTO WIN32
IF /I "%PROCESSOR_ARCHITECTURE%"=="AMD64" GOTO WIN64
GOTO:eof

:WIN32
ECHO Restarting 32-bit Zabbix Agent Service
C:\zabbix\win32\zabbix_agentd -c C:\zabbix\zabbix_agentd.win.conf -x
C:\zabbix\win32\zabbix_agentd -c C:\zabbix\zabbix_agentd.win.conf -s
GOTO:eof

:WIN64
ECHO Restarting 64-bit Zabbix Agent Service
C:\zabbix\win64\zabbix_agentd -c C:\zabbix\zabbix_agentd.win.conf -x
C:\zabbix\win64\zabbix_agentd -c C:\zabbix\zabbix_agentd.win.conf -s
GOTO:eof
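
The scheduled task itself can be created along these lines (the task name and script path here are placeholders, not my exact values):

schtasks /Create /SC DAILY /ST 12:00 /TN "Restart Zabbix Agent" /TR "C:\zabbix\restart_zabbix_agent.bat" /RU SYSTEM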

Comment by Oleksii Zagorskyi [ 2012 Feb 03 ]

I have important additions:
In all the cases I described above, I used these keys for the missing perf counters:
perf_counter[\Процесс(opera)\Байт виртуальной памяти,60]
perf_counter[\Процесс(opera)\Байт файла подкачки,60]
Yes, this host runs a Russian Windows (the English counter names are roughly the Process object's "Virtual Bytes" and "Page File Bytes").

Note the second item key parameter: 60 seconds (the average value is calculated on the Zabbix agent side).
It's an important factor!

I've just tested these keys without the 2nd parameter - and I do NOT see memory leaks.
But I observed some strange things when I tried MANY times to manually retrieve (using zabbix_get ...) different missing perf counters with different values of the 2nd parameter -> I finished my attempts/experiments, but memory continued to grow (at 18:05-18:30), and that was strange.
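
To make the difference concrete, in English-counter terms the two forms would be (hypothetical counter names, for illustration only; as I understand it, only the second form registers the counter with the agent's internal collector):

perf_counter[\Process(opera)\Virtual Bytes]
perf_counter[\Process(opera)\Virtual Bytes,60]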

I decided to perform a clean experiment.
See the picture "clear_test-damn_leak.png".
Those two keys, but WITHOUT the 2nd parameter, continued to be monitored for a long time without memory leaks.
At 23:13 I performed a manual check ONLY ONCE (using zabbix_get) with the key "perf_counter[\Процесс(opera)\Байт виртуальной памяти,60]".
After some time (approx. 30 minutes) memory started to grow constantly.

So it seems memory continues leaking even WITHOUT requests from the Zabbix server.
Huh.
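
For reference, that one-off manual check was of this form (the host address is a placeholder):

zabbix_get -s 127.0.0.1 -k "perf_counter[\Процесс(opera)\Байт виртуальной памяти,60]"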

Here is the corresponding part of the Windows zabbix_agentd.log (r19318); the Russian error text "Указанный элемент не найден" means "The specified element was not found":
5076:20120202:231246.385 Processing request.
5076:20120202:231246.510 Requested [perf_counter[\Процесс(opera)\Байт виртуальной памяти,60]]
5076:20120202:231246.510 In PERF_COUNTER()
5076:20120202:231246.510 In add_perf_counter() counter:'\Процесс(opera)\Байт виртуальной памяти' interval:60
5076:20120202:231246.510 add_perf_counter(): unable to add PerfCounter '\Процесс(opera)\Байт виртуальной памяти': Указанный элемент не найден.
5076:20120202:231246.526 PERF_COUNTER(): unable to add PerfCounter '\Процесс(opera)\Байт виртуальной памяти': Указанный элемент не найден.
5076:20120202:231246.526 End of PERF_COUNTER()
5076:20120202:231246.526 Sending back [ZBX_NOTSUPPORTED]

Comment by Oleksii Zagorskyi [ 2012 Feb 06 ]

As additional proof, see the attached "agent_memleak_with_pause_r19318.png".
I have finished my experiment/test.

Comment by dimir [ 2012 Mar 28 ]

So this is not reproducible in trunk?

<zalex> Hmm, why do you think so?
Trunk is affected too, because ZBX-3547 has been merged into trunk as well.

Comment by Alexander Vladishev [ 2012 Mar 30 ]

Fixed in the development branch svn://svn.zabbix.com/branches/dev/ZBX-4195

Test results: https://support.zabbix.com/secure/attachment/18486/after_fix.jpg

<zalex> Thanks, I'll retest it too, just to give additional confirmation.
<zalex> I confirm that the leak is not reproducible in the dev branch. I performed the same in-depth tests as before. TESTED.

Comment by dimir [ 2012 Apr 03 ]

Successfully tested.

Comment by Alexander Vladishev [ 2012 Apr 03 ]

Fixed in version pre-1.8.12 r26593 and pre-2.0.0rc3 r26594.
