[ZBX-15919] Possible memory leak from zabbix server Created: 2019 Apr 01 Updated: 2019 Dec 10 |
|
Status: | Need info |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 4.0.5 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Trivial |
Reporter: | Patrick Lachance | Assignee: | Zabbix Development Team |
Resolution: | Unresolved | Votes: | 4 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Production |
Attachments: |
Description |
Steps to reproduce:
I have around 4000 NVPS and 32 GB of memory on the Zabbix server. The database is hosted on a different server, and the setup runs with 3 proxies. We have had issues for a while now, though not consistently: the server was crashing every couple of weeks with out-of-memory errors. It is now happening much more often, at least weekly; it happened last Friday and again this morning. The graphs show that memory usage grows steadily after a reboot, from 40% to 90%, at which point I do a preventive restart of the Zabbix server via a trigger so we are not impacted too badly. I am running on RHEL 7 and will be happy to provide anything you need for the investigation very quickly. I am attaching a couple of screenshots showing the memory usage of the server. Thank you for your time! |
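[Editor's sketch] A preventive restart like the one described is typically driven by a trigger on available memory. A hypothetical expression in Zabbix 4.0 trigger syntax (the host name and the 10% threshold are assumptions, not taken from this report):

```
{zabbix-server-host:vm.memory.size[pavailable].last()}<10
```

An action with a remote command can then restart the zabbix-server service when the trigger fires.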
Comments |
Comment by Edmunds Vesmanis [ 2019 Apr 02 ] |
Hi Patrick, please attach screenshots of all graphs for a 1-day period from "Template App Zabbix Server",
plus a screenshot of Administration -> Queue and the Zabbix server config and log file:
Regards, |
Comment by Edmunds Vesmanis [ 2019 Apr 02 ] |
P.s. don't forget to strip out sensitive information! |
Comment by Patrick Lachance [ 2019 Apr 02 ] |
Hi Edmunds, Everything was attached except logs which are too big. I will try and find something that I can access from the network here to attach it somewhere. Thanks |
Comment by Patrick Lachance [ 2019 Apr 02 ] |
Actually, I have no means to upload the logs right now, as everything is blocked by company policies, but I will do it from home later today. I have provided the rest in the meantime. Thanks |
Comment by Patrick Lachance [ 2019 Apr 02 ] |
Here is the link for zabbix_server.log |
Comment by Glebs Ivanovskis [ 2019 Apr 02 ] |
Make sure it is not |
Comment by Patrick Lachance [ 2019 Apr 02 ] |
It does not seem to be. The heap stays steady, and as commented at the bottom of ZBX-10486, I have:
[root@slmonzbxsp1 ~]# rpm -qa | grep -i curl
So I am on the latest stable curl version, which should not cause the issue. |
Comment by Patrick Lachance [ 2019 Apr 02 ] |
Also, I had only 3 web scenarios running, and I removed them all to make sure, as they were not used. |
Comment by Glebs Ivanovskis [ 2019 Apr 02 ] |
Not sure what you mean by that. |
Comment by Patrick Lachance [ 2019 Apr 03 ] |
I meant that I ran the script, and it did not look like that was the issue. |
Comment by Patrick Lachance [ 2019 Apr 04 ] |
Any more ideas about this issue? Do you need any more information? |
Comment by Patrick Lachance [ 2019 Apr 23 ] |
@Edmunds Vesmanis , what are the next steps for this investigation? |
Comment by Edgars Melveris [ 2019 May 29 ] |
Hello, could you create an item proc.mem[zabbix_server,,,,rss] on the Zabbix server? Let it gather data for some time and upload the results here. |
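[Editor's sketch] The item's readings can be cross-checked from the shell: ps reports per-process RSS in kB, one line per zabbix_server process, and awk can sum them. Against a live server the command would be `ps -C zabbix_server -o rss= | awk '{s += $1} END {print s " kB"}'`; the sketch below runs the same pipeline on synthetic ps output so it is self-contained.

```shell
# Sum per-process RSS values the way ps would report them (kB, one per line).
# The three numbers below stand in for the output of:
#   ps -C zabbix_server -o rss=
printf '%s\n' 410000 380000 12000 | awk '{s += $1} END {print s " kB"}'
# prints: 802000 kB
```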
Comment by Patrick Lachance [ 2019 May 29 ] |
I have added the item, but I am not sure how useful it will be: the leak now makes the server run out of memory every day, so I have scripted a preventive restart of the Zabbix server daily. |
Comment by Patrick Lachance [ 2019 May 30 ] |
Hi Edgars, I have attached a graph of the requested metric. As you can see, it keeps climbing until the server is restarted at 10 AM EST. |
Comment by Glebs Ivanovskis [ 2019 May 30 ] |
golf4r, could you please save the values of this item as plain text (the As plain text button in Latest data)? The graph has such an interesting shape... Do you have Database monitor items? Is ODBC pooling enabled? |
Comment by Patrick Lachance [ 2019 May 30 ] |
I have attached the values as requested. We have ODBC monitoring, but it is not running on the server; it runs on the proxies. |
Comment by Edgars Melveris [ 2019 Jun 10 ] |
How were the values exported? Do they correspond to the same time frame as the graph? |
Comment by Patrick Lachance [ 2019 Jun 10 ] |
Yes, they correspond, and they were exported as you requested... saved as plain text. |
Comment by TD Fabrice [ 2019 Jun 20 ] |
Hi, I can confirm this issue since our migration to version 4.0; we have noticed that memory consumption is abnormal. Like Patrick, to avoid a crash we scripted a restart of the server every 4 hours. |
Comment by Vladislavs Sokurenko [ 2019 Jun 21 ] |
Could you please try to find out which exact process seems to be leaking:
ps -aux
cat /proc/PID/smaps
Another option is to install Zabbix server built with the following flags and provide the log after Zabbix server is stopped:
./bootstrap.sh ; ./configure CFLAGS="-fsanitize=leak -g" LDFLAGS="-fsanitize=leak" --enable-server --with-mysql --prefix=$(pwd) |
Comment by Martin Mørch [ 2019 Sep 30 ] |
Wanted to add information to this. We're seeing similar issues since upgrading to 4.0. Currently using zabbix-server-mysql-4.0.12-1.el6.x86_64 on CentOS 6. On Thursday, inspired by this report, I started collecting memory metrics for zabbix_server. Data is gathered using proc.mem[zabbix_server,,,,xxxx] |
Comment by Vladislavs Sokurenko [ 2019 Sep 30 ] |
Thank you for reporting this, martinmorch. Could you please provide the additional information requested in the previous comment? It is currently not clear where the leak happens; are you using SNMP items? |
Comment by Martin Mørch [ 2019 Sep 30 ] |
Vladislavs Sokurenko It seems to be primarily the poller processes. Here's an smaps paste from a poller process that grew 1.5 MB in about 15 minutes. I know this can simply be variance, but it seems they all grew 0.5-1.5 MB in that period. We have 60 poller processes, so it adds up. Yes, we are using SNMP a lot, as well as agent, Database, external scripts and internal items. Update |
Comment by Vladislavs Sokurenko [ 2019 Sep 30 ] |
Indeed, usage has grown here:
564c7413f000-564c74b64000 rw-p 00000000 00:00 0
Size:              10388 kB
Rss:               10240 kB
Pss:               10240 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:     10240 kB
Referenced:         9912 kB
Anonymous:         10240 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
VmFlags: rd wr mr mw me ac
Could you please connect to that PID using gdb and check what this memory is filled with? Maybe it will give some clues. Unfortunately we currently cannot reproduce the issue, but my main suspicion is SNMP discovery.
gdb -p PID
dump memory /tmp/dump.dump 0x564c7413f000 0x564c74b64000
hexdump -C /tmp/dump.dump |
Comment by Martin Mørch [ 2019 Sep 30 ] |
Can you provide me with your direct e-mail information somehow so I can send you the hex dump? I see that it is packed with sensitive information, including the agent PSK |
Comment by Vladislavs Sokurenko [ 2019 Sep 30 ] |
Sure, it's [email protected], but maybe you have spotted something already in the dump, especially something that repeats over and over again? |
Comment by Martin Mørch [ 2019 Sep 30 ] |
This reoccurs a ton: |8..tLV..........| |...........tLV..| |h..tLV..........| |...........tLV..| |...tLV..........| |........0..tLV..| |...tLV..........| |H..tLV..........| |x..tLV..........| |................| |................| In many different variations: |.&%tLV...t.tLV..| |........1.......| |................| |.-%tLV...u.tLV..| |........1.......| |................| |.4%tLV..0u.tLV..| |........1.......| |................| |P<%tLV..`u.tLV..| |........1.......| |................| |.D%tLV...u.tLV..| |........1.......| |................| |0O%tLV...u.tLV..| |........1.......| |................| |PZ%tLV...u.tLV..| |........1.......| |
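[Editor's sketch] To get a feel for how dominant such a pattern is, one can simply count matching lines in the `hexdump -C` output. In practice the input would be the /tmp/dump.dump file produced by the gdb instructions above; the three sample lines here are synthetic so the command is self-contained.

```shell
# Count hexdump lines containing the repeating byte pattern "tLV".
# Real usage: hexdump -C /tmp/dump.dump | grep -c 'tLV'
printf '%s\n' '|8..tLV.....|' '|...........|' '|h..tLV.....|' | grep -c 'tLV'
# prints: 2
```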
Comment by Vladislavs Sokurenko [ 2019 Sep 30 ] |
I am sorry, but it is not clear where that "tLV" comes from. It would be nice to identify the items that leak; that way it might become reproducible. I would start with SNMP discovery, though I am not sure if it's an option for you to experiment with disabling some items or moving them to a Zabbix proxy. |
Comment by Martin Mørch [ 2019 Sep 30 ] |
Unfortunately I cannot play around too much since this is our production setup. However, we have SNMP discovery split across 1 server and 2 proxies for 3 locations and zabbix_proxy is not exhibiting this behaviour on our 2 proxies. |
Comment by Shishaev Yuriy [ 2019 Dec 04 ] |
Vladislavs Sokurenko, what info do you need? I have the same issue. When I upgraded the Zabbix server 3.4 > 4.2 > 4.4, I got the same memory leak. I am restarting the server every 10 days. My configuration: OS - Debian 9.9;
dpkg -l | grep curl
curl 7.52.1-5+deb9u9
libcurl3:amd64 7.52.1-5+deb9u9
libcurl3-gnutls:amd64 7.52.1-5+deb9u9
RAM - 4 GB; Zabbix server version - 4.4.1. Attached is 1 day of memory history. If needed, I can give more than 1 day of history. |
Comment by Vladislavs Sokurenko [ 2019 Dec 04 ] |
It would be best to pinpoint the passive items that cause the issue. Are you using SNMP, Lucefron? You could try moving those to a Zabbix proxy and seeing if the issue improves for the Zabbix server. Then, if SNMP items are to blame, you could try launching Zabbix proxy under a memory analyzer such as valgrind. For example, analyzing Zabbix server as below will produce out files that can be examined, but the analyzer will decrease performance, so it should be used with care.
valgrind --tool=massif zabbix_server -c /etc/zabbix/zabbix_server.conf --foreground |
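[Editor's sketch] massif writes its profile to massif.out.&lt;pid&gt;, which is normally read with `ms_print massif.out.<pid>`. For a quick peak-heap number, the snapshot fields can also be pulled out directly; the mem_heap_B field is part of massif's output format, but the sample file below is synthetic so the command can be shown standalone.

```shell
# Find the largest heap snapshot in a massif output file.
# Real input would be massif.out.<pid>; a synthetic sample is written here.
cat > /tmp/massif.sample <<'EOF'
mem_heap_B=1048576
mem_heap_B=4194304
mem_heap_B=2097152
EOF
awk -F= '/^mem_heap_B/ {if ($2+0 > max) max = $2+0} END {print max}' /tmp/massif.sample
# prints: 4194304
```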
Comment by Shishaev Yuriy [ 2019 Dec 04 ] |
Yes, I use SNMP (trap and SNMPv2 agent). The highest usage is SNMPv2 agent = 2459 items on the server, the rest on the proxy. I can try moving all SNMP to a testing proxy server, but it will take a long time because it is a prod server. Vladislavs Sokurenko, do you suppose the problem is in SNMP? |
Comment by Vladislavs Sokurenko [ 2019 Dec 06 ] |
I understand that it can be problematic. It's not necessary to move all items; you could move one half or a quarter and see if it still leaks, or even duplicate that quarter to a test Zabbix proxy. Unfortunately it looks like SNMP, but we could not narrow it down yet. |
Comment by Ingus Vilnis [ 2019 Dec 06 ] |
Hi guys, a bit late to the party, but here are my two cents regarding this issue.
The "memory leak" is generally observed for pollers and history syncers if you monitor the "rss" memory for each of them. These readings can easily add up to more than the total amount of memory of the machine.
Having glanced over the attached graphs from various users, I can say that many of you are looking at the wrong metrics. USED memory % is irrelevant for Linux machines; this is not Windows. Used memory should be close to 100%, otherwise you are wasting precious RAM on the server. AVAILABLE memory is the metric to watch; keeping around 1 GB available on a dedicated Zabbix server is enough to be on the safe side.
Another thing: how much memory have you allocated to the caches? Yes, you have 32 GB of RAM, so you max out all Zabbix caches just because you can. Instead, add the internal items that show how many bytes of these caches are actually used, then tune the values accordingly.
And the last one: look at the near-0% busy data collector processes. Zabbix allows you to start 1000 pollers and enjoy a graph where they are 0% busy, but then you have a "memory leak" because these 1000 pollers still need memory.
For comparison, I am now looking at an instance with 2500 real NVPS, 30 pollers at 22% busy on average, and 8 GB of RAM with 4 GB of it available. Its memory behavior is identical to what is reported here, yet the instance has zero issues, and the "leak" tends to level off after a month of uptime. |
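[Editor's note] The cache usage mentioned above can be tracked with Zabbix internal items; these keys exist at least in the 4.0 line and report the percentage of each cache actually in use:

```
zabbix[rcache,buffer,pused]    configuration cache, % used
zabbix[wcache,history,pused]   history cache, % used
zabbix[wcache,index,pused]     history index cache, % used
zabbix[wcache,trend,pused]     trend cache, % used
zabbix[vcache,buffer,pused]    value cache, % used
```

If a cache sits far below 100% used, its configured size can usually be reduced.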
Comment by Vladislavs Sokurenko [ 2019 Dec 06 ] |
Thank you ingus.vilnis, you are completely right. RSS can't be trusted because it includes shared memory: for example, if HistoryCacheSize is set to 1G and is filled by different history syncers, they will all show usage of 1 gigabyte, but in reality the memory is shared between them. You could try looking at PSS instead. |
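[Editor's sketch] PSS can be read per process by summing the Pss fields of /proc/&lt;pid&gt;/smaps. A self-contained sketch (it writes a two-line sample first; against a live process you would point awk at /proc/&lt;pid&gt;/smaps directly):

```shell
# Sum all Pss: lines of a smaps file to get the proportional set size in kB.
cat > /tmp/smaps.sample <<'EOF'
Pss:     10240 kB
Pss:        12 kB
EOF
awk '/^Pss:/ {sum += $2} END {print sum " kB"}' /tmp/smaps.sample
# prints: 10252 kB
```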
Comment by Shishaev Yuriy [ 2019 Dec 06 ] |
ingus.vilnis thanks for the info, but I did upload an available-memory graph. My server has 4 GB of RAM, 150 pollers, and 1779 real NVPS; looking at how busy the pollers are, I'd say ~140. When the server is fresh (right after restarting Zabbix) it has 1.5 GB of available memory, but after 1 week only 500 MB... Also, when I was on Zabbix 3.4 with the same config I did not have a memory leak, but after I upgraded to 4.0, without adding hosts/items, I watched the memory disappear. |