[ZBX-9084] Zabbix Proxy allocates more memory than necessary when it starts Created: 2014 Nov 27 Updated: 2024 Apr 10 Resolved: 2019 Dec 19 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P) |
Affects Version/s: | 2.4.1, 2.4.2, 4.0.12 |
Fix Version/s: | 4.0.16rc1, 4.4.4rc1, 5.0.0alpha1, 5.0 (plan) |
Type: | Problem report | Priority: | Trivial |
Reporter: | Alisson Ramos de Oliveira | Assignee: | Vladislavs Sokurenko |
Resolution: | Fixed | Votes: | 11 |
Labels: | PGSQL, sqlite | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Zabbix Proxy using SQLite3 installed by rpm |
Attachments: |
Issue Links: |
Team: |
Sprint: | Sprint 58 (Nov 2019), Sprint 59 (Dec 2019) |
Story Points: | 0.5 |
Description |
When zabbix_proxy starts without a database and creates it, everything is fine: it only allocates the memory needed to proceed with collecting data. On the other hand, if the database already exists and is, for example, 100 MB, zabbix_proxy allocates about 100 MB for each process. On a host with 2 GB of RAM and, for example, 40 pollers, zabbix_proxy is killed by the "out of memory" killer because it has allocated far too much memory. If the database is removed and zabbix_proxy is restarted, the daemon starts normally and allocates much less memory. Restarting the process again causes out of memory again. |
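A quick way to observe the per-process cost described above is to total the resident memory of all zabbix_proxy processes; a minimal sketch using standard ps and awk (the grep pattern follows the one used later in this thread and may need adjusting, e.g. for a zabbix_proxy_sqlite3 binary name):

# Total resident memory (RSS, column 6 of "ps aux", in kB) across all proxy processes
ps aux | awk '/[z]abbix_proxy/ {sum += $6} END {printf "total RSS: %.0f MB\n", sum/1024}'

# Per-process view, largest consumers first
ps aux | grep "[z]abbix_proxy" | sort -rn -k6 | head -20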
Comments |
Comment by richlv [ 2014 Nov 28 ] |
just to confirm, that's not shared memory between all of those processes ? |
Comment by Aleksandrs Saveljevs [ 2014 Nov 28 ] |
With an existing database, how long does the proxy work after start, before it gets terminated by OOM killer? Could it be a memory leak? Could you please show cache settings from your proxy configuration file? How many items are monitored by proxy? |
Comment by Alisson Ramos de Oliveira [ 2014 Nov 28 ] |
richlv, if you mean /dev/shm, then no; the database is in /etc/zabbix. The shared memory (IPC) segments are:

------ Shared Memory Segments --------
key        shmid    owner  perms  bytes     nattch  status
0x68021a2d 3801088  zabbix 600    8388575   74
0x78021a2d 3833857  zabbix 600    8388608   74
0x67021a2d 3866626  zabbix 600    43673191  74
0x73021a2d 3899395  zabbix 600    7707033   74
0x53021a2d 3932164  zabbix 600    18688     74

The result is the same without the database, but then the processes do not die. The data received from the server (when the database does not exist):

27776:20141128:081805.798 received configuration data from server, datalen 20327247

The items by status (at this point I think the proxy should not receive the disabled items, but that is another problem):

sqlite> SELECT status,COUNT(*) from items GROUP BY 1;
0|93865
1|3

The log is attached. The CacheSize parameter is set to 49M; if I change it to a lower value, the proxy dies saying "Increase the CacheSize parameter". The poller settings are as follows:

StartPollers=30
StartPingers=20
StartDiscoverers=3
StartSNMPTrapper=1

I suspect the problem is the fork(): if the parent process has a large memory footprint (49M in this case), the child will have the same footprint at the moment of the fork. But that is just a suspicion. One last piece of information: the sqlite database is 168M now, but it started at around 32M. |
Comment by Alisson Ramos de Oliveira [ 2014 Nov 28 ] |
I made a new test. The command "ps aux | grep zabbix_proxy" shows that the RSS memory usage is very different with and without the database. The file "last_ps_with_database.log" shows the last "ps" output before the processes die. The "last_ps_without_database.log" was taken just before all of the processes had started. |
Comment by Aleksandrs Saveljevs [ 2014 Nov 28 ] |
This looks like it might be a memory leak. Would it be possible for you to run Zabbix proxy under Valgrind? It can be done in the following manner:
|
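The exact instructions were not preserved in this export; a typical Valgrind invocation for a daemon like the proxy looks roughly like the following (a sketch, not necessarily the manner originally suggested; paths must match the local installation):

# Follow the daemon's child processes and write one log per process (%p expands to the PID)
valgrind --tool=memcheck --leak-check=full --show-reachable=yes \
         --trace-children=yes --log-file=/tmp/valgrind_proxy.%p.log \
         /usr/sbin/zabbix_proxy -c /etc/zabbix/zabbix_proxy.conf

# After stopping the proxy, look for real leaks:
grep -l "definitely lost" /tmp/valgrind_proxy.*.log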
Comment by Alisson Ramos de Oliveira [ 2014 Nov 28 ] |
The requested file follows. The problem with Valgrind is that when one process dies, zabbix_proxy does not stop... I had to kill all the processes manually. |
Comment by Aleksandrs Saveljevs [ 2014 Nov 28 ] |
Attached Valgrind log seems to show quite a number of "possibly lost" bytes:

$ grep lost valgrind_proxy_with.log | grep possibly.*,.*bytes
==11345== possibly lost: 388,704 bytes in 4,312 blocks
==11349== possibly lost: 393,608 bytes in 4,318 blocks
==11346== possibly lost: 388,704 bytes in 4,312 blocks
==11347== possibly lost: 388,704 bytes in 4,312 blocks
==11348== possibly lost: 388,704 bytes in 4,312 blocks
==11341== possibly lost: 286,336 bytes in 4,234 blocks
==11283== possibly lost: 287,600 bytes in 4,234 blocks
==11316== possibly lost: 394,304 bytes in 4,321 blocks
==11317== possibly lost: 287,600 bytes in 4,234 blocks
==11315== possibly lost: 287,600 bytes in 4,234 blocks
==11319== possibly lost: 389,984 bytes in 4,313 blocks

However, according to http://stackoverflow.com/questions/6771610/what-does-possible-lost-means-in-valgrind , this seems to be OK. There are two instances of the following leak:

==11317== 16 bytes in 1 blocks are definitely lost in loss record 33 of 228
==11317==    at 0x4A06BE0: realloc (vg_replace_malloc.c:662)
==11317==    by 0x359C81DD25: ??? (in /usr/lib64/libsqlite3.so.0.8.6)
==11317==    by 0x359C809670: ??? (in /usr/lib64/libsqlite3.so.0.8.6)
==11317==    by 0x359C866D73: sqlite3_get_table (in /usr/lib64/libsqlite3.so.0.8.6)
==11317==    by 0x49CE5C: zbx_db_vselect (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x478A8F: __zbx_DBselect (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x42E375: ??? (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x42E82F: send_list_of_active_checks_json (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x42CDCE: ??? (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x42D417: trapper_thread (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x459381: zbx_thread_start (in /usr/sbin/zabbix_proxy_sqlite3)
==11317==    by 0x412B1F: MAIN_ZABBIX_ENTRY (in /usr/sbin/zabbix_proxy_sqlite3)

They should be investigated. However, they are very small and cannot serve as an explanation for going out of memory. |
Comment by Alisson Ramos de Oliveira [ 2014 Nov 28 ] |
I made a new test, this time I don't remove the proxy.db. Previously, the test was started with no sqlite database created. For mapping what process is what, follow the start of the test: 16349:20141128:131309.174 proxy #0 started [main process] 16366:20141128:131309.221 proxy #1 started [configuration syncer #1] 16367:20141128:131309.287 proxy #2 started [heartbeat sender #1] 16368:20141128:131309.394 proxy #3 started [data sender #1] 16369:20141128:131309.514 proxy #4 started [poller #1] 16370:20141128:131309.674 proxy #5 started [poller #2] 16371:20141128:131309.792 proxy #6 started [poller #3] 16372:20141128:131310.060 proxy #7 started [poller #4] 16373:20141128:131310.118 proxy #8 started [poller #5] 16374:20141128:131310.267 proxy #9 started [poller #6] 16375:20141128:131310.366 proxy #10 started [poller #7] 16376:20141128:131310.657 proxy #11 started [poller #8] 16380:20141128:131311.306 proxy #15 started [poller #12] 16381:20141128:131311.847 proxy #16 started [poller #13] 16377:20141128:131311.986 proxy #12 started [poller #9] 16378:20141128:131312.496 proxy #13 started [poller #10] 16386:20141128:131312.636 proxy #21 started [poller #18] 16384:20141128:131312.603 proxy #19 started [poller #16] 16382:20141128:131312.698 proxy #17 started [poller #14] 16383:20141128:131312.657 proxy #18 started [poller #15] 16391:20141128:131313.000 proxy #26 started [poller #23] 16393:20141128:131312.994 proxy #28 started [poller #25] 16385:20141128:131312.864 proxy #20 started [poller #17] 16387:20141128:131313.403 proxy #22 started [poller #19] 16379:20141128:131313.400 proxy #14 started [poller #11] 16394:20141128:131313.499 proxy #29 started [poller #26] 16389:20141128:131313.509 proxy #24 started [poller #21] 16396:20141128:131313.537 proxy #31 started [poller #28] 16397:20141128:131313.544 proxy #32 started [poller #29] 16390:20141128:131313.645 proxy #25 started [poller #22] 16388:20141128:131313.731 proxy #23 started [poller #20] 16392:20141128:131313.679 proxy #27 started [poller #24] 16398:20141128:131313.733 proxy #33 started [poller #30] 16395:20141128:131313.807 proxy #30 started [poller #27] 16399:20141128:131314.084 proxy #34 started [unreachable poller #1] 16400:20141128:131315.964 proxy #35 started [trapper #1] 16407:20141128:131319.182 proxy #42 started [icmp pinger #3] 16405:20141128:131319.178 proxy #40 started [icmp pinger #1] 16406:20141128:131319.518 proxy #41 started [icmp pinger #2] 16403:20141128:131319.705 proxy #38 started [trapper #4] 16402:20141128:131319.743 proxy #37 started [trapper #3] 16401:20141128:131319.701 proxy #36 started [trapper #2] 16404:20141128:131320.073 proxy #39 started [trapper #5] 16408:20141128:131320.050 proxy #43 started [icmp pinger #4] 16409:20141128:131320.146 proxy #44 started [icmp pinger #5] 16410:20141128:131320.203 proxy #45 started [icmp pinger #6] 16411:20141128:131320.956 proxy #46 started [icmp pinger #7] 16412:20141128:131323.163 proxy #47 started [icmp pinger #8] 16416:20141128:131323.125 proxy #50 started [icmp pinger #11] 16419:20141128:131323.129 proxy #53 started [icmp pinger #14] 16414:20141128:131323.119 proxy #48 started [icmp pinger #9] 16417:20141128:131323.133 proxy #51 started [icmp pinger #12] 16418:20141128:131323.145 proxy #52 started [icmp pinger #13] 16415:20141128:131323.169 proxy #49 started [icmp pinger #10] 16420:20141128:131323.323 proxy #54 started [icmp pinger #15] 16422:20141128:131323.423 proxy #56 started [icmp pinger #17] 16421:20141128:131323.644 proxy #55 started [icmp pinger #16] 
16423:20141128:131323.877 proxy #57 started [icmp pinger #18] 16424:20141128:131324.196 proxy #58 started [icmp pinger #19] 16425:20141128:131324.707 proxy #59 started [icmp pinger #20] 16426:20141128:131325.180 proxy #60 started [housekeeper #1] 16427:20141128:131326.082 proxy #61 started [http poller #1] 16433:20141128:131328.470 proxy #66 started [history syncer #2] 16432:20141128:131328.470 proxy #65 started [history syncer #1] 16434:20141128:131328.674 proxy #67 started [history syncer #3] 16436:20141128:131328.947 proxy #69 started [snmp trapper #1] 16437:20141128:131328.803 proxy #70 started [self-monitoring #1] 16435:20141128:131329.495 proxy #68 started [history syncer #4] 16349:20141128:131355.239 One child process died (PID:16367,exitcode/signal:9). Exiting ... |
Comment by Aleksandrs Saveljevs [ 2014 Dec 01 ] |
The new log shows no sign of a memory leak either. |
Comment by Alisson Ramos de Oliveira [ 2014 Dec 10 ] |
I discovered the problem. The OOM occurs only on VMware's hypervisor. I ran a test with the same zabbix_proxy.conf and sqlite3 database on vSphere 5.1 and Xen Server 6.2. On Xen Server the proxy uses ~800 MB when it starts (71 pollers). On VMware it uses all the memory and is killed by the OOM killer. The test used 2 GB of RAM and a proxy database of 84 MB. Nevertheless, I think ~800 MB is a lot of memory for the proxy to use. But OK. Now I will investigate the memory allocation on VMware. |
Comment by Sergey Usov [ 2015 Apr 14 ] |
Same error on hyper-v Zabbix proxy v2.4.4 (revision 52341) (23 February 2015) 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 |
Comment by Igor Golubkov [ 2017 Apr 10 ] |
Same problem with 3.2.4. I see about 4x memory usage if an sqlite database is present at proxy start. This happens only with sqlite; pgsql works normally. |
Comment by Kaue Pereira [ 2019 Mar 27 ] |
Still occurs on 4.0.5 proxies with sqlite and virtualized environments such as Hyper-V. |
Comment by Daniel [ 2019 Jul 20 ] |
We're seeing similar issues on 4.2.4 running on CentOS (Hyper-V). |
Comment by James Cook [ 2019 Aug 30 ] |
I'm seeing issues like this, and I have tried absolutely everything to resolve them (https://support.zabbix.com/browse/ZBX-16582)
Centos 7.6 PostgreSQL 11 Zabbix 4.0.7 |
Comment by Vladislavs Sokurenko [ 2019 Aug 30 ] |
Is the problem only present when using some kind of virtualization? Could you please be so kind as to describe your environment further (configuration file and VMware version) so we can try to reproduce the issue? Have you tried the appliance supplied by Zabbix? Have you tried MySQL? |
Comment by Daniel [ 2019 Aug 30 ] |
We didn't try the appliance for proxies - especially not when using sqlite, as we're using mysql when proxies process more values. Our current solution is to change the installation from sqlite to mysql, because the issue no longer occurs there. We've seen this issue running on Hyper-V based on Windows Server 2016. |
Comment by Vladislavs Sokurenko [ 2019 Aug 30 ] |
It could help if you provide the memory usage for the Zabbix proxy process IDs: cat /proc/$pid/smaps |
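Since Rss in smaps counts shared pages in every process that maps them, Pss (proportional set size) is the more honest per-process number; a small sketch to total it (standard /proc/PID/smaps fields, assuming $pid holds the process of interest):

# Proportional set size (Pss) of a single process, in MB
awk '/^Pss:/ {sum += $2} END {printf "Pss total: %.1f MB\n", sum/1024}' /proc/$pid/smaps

# The same, for every zabbix_proxy process
for p in $(pgrep zabbix_proxy); do
    awk -v pid=$p '/^Pss:/ {s += $2} END {printf "pid %s: %d kB Pss\n", pid, s}' /proc/$p/smaps
done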
Comment by Daniel [ 2019 Aug 30 ] |
I can check this when we have the issue again - currently all systems have been ported to mysql to bypass the issue. |
Comment by James Cook [ 2019 Aug 31 ] |
If this helps: I have discovered that the only way the proxy can be started is by using the following method:
I have had the VMWare infrastructure team check the environment and they said everything is operating normally and no changes have been made on their side. Environment: Centos 7.6 PostgreSQL 11 Zabbix 4.0.7 |
Comment by James Cook [ 2019 Aug 31 ] |
I will check out Vladislavs's comment on Monday when I am back at work, as I can reproduce this with Zabbix 4.0.7 and Postgres 11... I have also updated the issue with debug level 5 logs, if that helps |
Comment by James Cook [ 2019 Sep 02 ] |
Hi Vladislavs,
I have uploaded 3 files:
* zabbix_proxy_process_listing_james_cook.txt - Process listing of all Zabbix Proxy processes
* zabbix_proxy_smaps_james_cook.tar.gz - Compressed tar file containing all the smap details
* zabbix_proxy_configure_options_james_cook - Configure options used to compile our Zabbix proxy binary
We use a compiled binary (see configure options) and load a loadable module that utilises python which is why you will see zabbix_proxy_custom and references to python in the smap details.
I have tested and confirmed that this issue still occurs using the out of the box Zabbix proxy binary with no loaded modules present.
One other piece of information is the size of the configuration sent from the server which is 'received configuration data from server at "XXXXXXXX", datalen 85835245'
Cheers James |
Comment by Vladislavs Sokurenko [ 2019 Sep 02 ] |
You could try experimenting with minimal set of flags and then increase and see if memory increases with the addition of some flag ./bootstrap.sh ; ./configure --prefix=$(pwd) --enable-proxy --with-postgresql In my case it was very low memory consumption vso 38833 0.0 0.0 23148 7368 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy -c /etc/zabbix/zabbix_proxy.conf --foreground vso 38837 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: configuration syncer [loading configuration] vso 38838 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: heartbeat sender [sending heartbeat message failed in 0.000169 sec, idle 60 sec] vso 38839 0.0 0.0 23148 4084 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: data sender [sent 0 values in 0.000000 sec, sending data] vso 38840 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: housekeeper [startup idle for 30 minutes] vso 38841 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: http poller #1 [got 0 values in 0.000437 sec, idle 5 sec] vso 38842 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: discoverer #1 [processed 0 rules in 0.001415 sec, idle 60 sec] vso 38844 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: history syncer #1 [processed 0 values in 0.000036 sec, idle 1 sec] vso 38846 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: self-monitoring [processed data in 0.000033 sec, idle 1 sec] vso 38848 0.0 0.0 23148 2672 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: task manager [processed 0 task(s) in 0.000778 sec, idle 5 sec] vso 38849 0.0 0.0 25312 5956 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: poller #1 [got 0 values in 0.000013 sec, idle 5 sec] vso 38851 0.0 0.0 25312 5956 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: poller #2 [got 0 values in 0.000016 sec, idle 5 sec] vso 38852 0.0 0.0 25312 5956 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: poller #3 [got 0 values in 0.000011 sec, idle 5 sec] vso 38854 0.0 0.0 25312 5956 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: poller #4 [got 0 values in 0.000015 sec, idle 5 sec] vso 38856 0.0 0.0 25312 5956 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: poller #5 [got 0 values in 0.000015 sec, idle 5 sec] vso 38858 0.0 0.0 25312 5952 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: unreachable poller #1 [got 0 values in 0.000014 sec, idle 5 sec] vso 38860 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #1 [processed data in 0.000000 sec, waiting for connection] vso 38861 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #2 [processed data in 0.000000 sec, waiting for connection] vso 38867 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #3 [processed data in 0.000000 sec, waiting for connection] vso 38869 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #4 [processed data in 0.000000 sec, waiting for connection] vso 38870 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #5 [processed data in 0.000000 sec, waiting for connection] vso 38871 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #6 [processed data in 0.000000 sec, waiting for connection] vso 38872 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #7 [processed data in 0.000000 sec, waiting for connection] vso 38873 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #8 [processed data in 0.000000 sec, waiting for connection] vso 38874 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #9 [processed data in 0.000000 sec, waiting for connection] vso 38875 0.0 0.0 23148 
2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #10 [processed data in 0.000000 sec, waiting for connection] vso 38876 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #11 [processed data in 0.000000 sec, waiting for connection] vso 38877 0.0 0.0 23148 2612 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: trapper #12 [processed data in 0.000000 sec, waiting for connection] vso 38878 0.0 0.0 25304 5040 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: icmp pinger #1 [got 0 values in 0.000012 sec, idle 5 sec] vso 38879 0.0 0.0 23148 2708 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: preprocessing manager #1 [queued 0, processed 0 values, idle 5.005142 sec during 5.005226 sec] vso 38880 0.0 0.0 23148 2840 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: preprocessing worker #1 started vso 38881 0.0 0.0 23148 2840 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: preprocessing worker #2 started vso 38883 0.0 0.0 23148 2840 pts/0 S+ 10:49 0:00 ./sbin/zabbix_proxy: preprocessing worker #3 started Also not sure if issue is present in Centos 7.6 have you tried other operating systems ? We should try to reproduce the issue with same set of flags that you use. I am using Fedora release 30 (Thirty) Linux version 5.2.9-200.fc30.x86_64 ([email protected]) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Aug 16 21:37:45 UTC 2019 |
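One possible way to act on this suggestion is to rebuild the proxy several times, adding one optional library per build, and compare poller memory after each run (a sketch under the assumption that the source tree and bootstrap.sh from the comment above are available; the flag names are standard Zabbix configure options, and the full set actually used is in the attached configure-options file):

# Baseline build with the minimal flag set mentioned above
./bootstrap.sh
./configure --prefix=$(pwd) --enable-proxy --with-postgresql && make -s install

# Rebuild, adding one optional library at a time, e.g.:
./configure --prefix=$(pwd) --enable-proxy --with-postgresql --with-libcurl && make -s install
./configure --prefix=$(pwd) --enable-proxy --with-postgresql --with-libcurl --with-net-snmp && make -s install

# After each build, start the proxy and compare poller memory:
ps aux | grep "[z]abbix_proxy: poller #1"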
Comment by DaneT [ 2019 Sep 02 ] |
Hello James On the VirtualBox, compiled proxy with the same options as your memory usage stays low: zabbix 12302 0.0 0.2 158232 5792 ? S 13:32 0:00 sbin/zabbix_proxy --config /etc/zabbix/zabbix_proxy.conf zabbix 12306 0.0 0.3 158464 6572 ? S 13:32 0:00 sbin/zabbix_proxy: configuration syncer [synced config 20671 bytes in 0.213440 sec, idle 3600 sec] zabbix 12307 0.0 0.2 158232 5372 ? S 13:32 0:00 sbin/zabbix_proxy: heartbeat sender [sending heartbeat message success in 0.011470 sec, idle 60 sec] zabbix 12308 0.0 0.2 158284 6000 ? S 13:32 0:00 sbin/zabbix_proxy: data sender [sent 1 values in 0.002565 sec, idle 1 sec] zabbix 12309 0.0 0.1 158232 3120 ? S 13:32 0:00 sbin/zabbix_proxy: housekeeper [startup idle for 30 minutes] zabbix 12310 0.0 0.2 158232 5300 ? S 13:32 0:00 sbin/zabbix_proxy: http poller #1 [got 0 values in 0.000708 sec, idle 5 sec] zabbix 12311 0.2 0.4 262736 9680 ? S 13:32 0:00 sbin/zabbix_proxy: discoverer #1 [processed 0 rules in 0.001292 sec, idle 60 sec] zabbix 12312 0.0 0.2 158232 5300 ? S 13:32 0:00 sbin/zabbix_proxy: history syncer #1 [processed 0 values in 0.000018 sec, idle 1 sec] zabbix 12313 0.0 0.2 158232 5300 ? S 13:32 0:00 sbin/zabbix_proxy: history syncer #2 [processed 1 values in 0.001205 sec, idle 1 sec] zabbix 12314 0.0 0.2 158232 5300 ? S 13:32 0:00 sbin/zabbix_proxy: history syncer #3 [processed 0 values in 0.000014 sec, idle 1 sec] zabbix 12315 0.0 0.2 158232 5300 ? S 13:32 0:00 sbin/zabbix_proxy: history syncer #4 [processed 0 values in 0.000018 sec, idle 1 sec] zabbix 12316 0.0 0.2 158232 4604 ? S 13:32 0:00 sbin/zabbix_proxy: self-monitoring [processed data in 0.000037 sec, idle 1 sec] zabbix 12317 0.0 0.2 158232 5796 ? S 13:32 0:00 sbin/zabbix_proxy: task manager [processed 0 task(s) in 0.000414 sec, idle 5 sec] zabbix 12318 0.2 0.4 266080 9744 ? S 13:32 0:00 sbin/zabbix_proxy: poller #1 [got 1 values in 0.000047 sec, idle 2 sec] zabbix 12319 0.2 0.4 266060 9744 ? S 13:32 0:00 sbin/zabbix_proxy: poller #2 [got 0 values in 0.000007 sec, idle 2 sec] zabbix 12320 0.2 0.4 266064 9744 ? S 13:32 0:00 sbin/zabbix_proxy: poller #3 [got 0 values in 0.000006 sec, idle 2 sec] zabbix 12321 0.2 0.4 266060 9744 ? S 13:32 0:00 sbin/zabbix_proxy: poller #4 [got 0 values in 0.000006 sec, idle 2 sec] zabbix 12322 0.2 0.4 266068 9744 ? S 13:32 0:00 sbin/zabbix_proxy: poller #5 [got 0 values in 0.000008 sec, idle 2 sec] zabbix 12323 0.2 0.4 266060 9744 ? S 13:32 0:00 sbin/zabbix_proxy: unreachable poller #1 [got 0 values in 0.000031 sec, idle 5 sec] zabbix 12324 0.0 0.2 158232 5380 ? S 13:32 0:00 sbin/zabbix_proxy: trapper #1 [processed data in 0.000000 sec, waiting for connection] zabbix 12325 0.0 0.2 158232 5380 ? S 13:32 0:00 sbin/zabbix_proxy: trapper #2 [processed data in 0.000000 sec, waiting for connection] zabbix 12326 0.0 0.2 158232 5380 ? S 13:32 0:00 sbin/zabbix_proxy: trapper #3 [processed data in 0.000000 sec, waiting for connection] zabbix 12327 0.0 0.2 158232 5380 ? S 13:32 0:00 sbin/zabbix_proxy: trapper #4 [processed data in 0.000000 sec, waiting for connection] zabbix 12328 0.0 0.2 158232 5380 ? S 13:32 0:00 sbin/zabbix_proxy: trapper #5 [processed data in 0.000000 sec, waiting for connection] zabbix 12329 0.0 0.2 161544 4604 ? S 13:32 0:00 sbin/zabbix_proxy: icmp pinger #1 [got 0 values in 0.000010 sec, idle 5 sec] postgres 12330 0.0 0.4 236292 9908 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49106) idle postgres 12331 0.0 0.5 236292 10280 ? 
Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49108) idle postgres 12332 0.0 0.4 236292 9904 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49110) idle postgres 12333 0.0 0.4 236296 9900 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49112) idle postgres 12334 0.0 0.5 236380 10480 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49114) idle postgres 12335 0.1 0.6 238596 13264 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49116) idle postgres 12336 0.1 0.5 236948 11028 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49120) idle postgres 12337 0.0 0.4 236296 9512 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49122) idle postgres 12338 0.0 0.5 236992 10900 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49124) idle postgres 12339 0.0 0.4 236296 9516 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49126) idle postgres 12340 0.0 0.4 236296 9508 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49132) idle postgres 12341 0.0 0.4 236296 9512 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49134) idle postgres 12342 0.0 0.4 236296 9512 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49136) idle postgres 12343 0.0 0.4 236296 9516 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49138) idle postgres 12344 0.0 0.4 236296 9512 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49140) idle postgres 12345 0.0 0.4 236296 9512 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49142) idle postgres 12346 0.0 0.4 236296 9516 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49144) idle postgres 12347 0.0 0.4 236296 9508 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49146) idle postgres 12348 0.0 0.4 236296 9516 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49148) idle postgres 12349 0.0 0.5 236516 10696 ? Ss 13:32 0:00 postgres: zabbix_proxy zabbix_proxy 127.0.0.1(49150) idle I also attached smaps of one of the poller processes |
Comment by Vladislavs Sokurenko [ 2019 Sep 02 ] |
This looks like the most suspicious place:

7fdd864b5000-7fddc64b5000 rw-s 00000000 00:04 393222   /SYSV00000000 (deleted)
Size:            1048576 kB
Rss:               91908 kB
Pss:                 605 kB
Shared_Clean:          0 kB
Shared_Dirty:      91908 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:        91908 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr sh mr mp me ms sd

Could you please share the values of the following parameters?

HistoryIndexCacheSize
CacheSize
HistoryCacheSize

Is HistoryCacheSize set to 1G? |
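For reference, the parameters asked about are ordinary zabbix_proxy.conf settings; a configuration along these lines would produce a 1 GB shared history-cache segment like the /SYSV00000000 mapping above (illustrative values, not the reporter's actual ones):

# zabbix_proxy.conf (illustrative)
CacheSize=256M            # configuration cache
HistoryCacheSize=1G       # history write cache; appears as a large shared /SYSV... mapping
HistoryIndexCacheSize=64M # index of the history cache

Such a segment is created once at startup and shared by all forked processes, which is why it shows up in each process's Rss while the Pss share stays small.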
Comment by DaneT [ 2019 Sep 02 ] |
Compiled proxy running on VMWare, using all the provided keys, except for jabber: zabbix 101654 0.0 0.1 898680 3028 ? S 15:58 0:00 zabbix_proxy --config /etc/zabbix/zabbix_proxy.conf zabbix 101659 0.0 0.2 899028 3780 ? S 15:58 0:00 zabbix_proxy: configuration syncer [synced config 18485 bytes in 0.049150 sec, idle 3600 sec] zabbix 101660 0.0 0.1 898796 3080 ? S 15:58 0:00 zabbix_proxy: heartbeat sender [sending heartbeat message success in 0.025432 sec, idle 60 sec] zabbix 101661 0.0 0.1 898848 3540 ? S 15:58 0:00 zabbix_proxy: data sender [sent 1 values in 0.003230 sec, idle 1 sec] zabbix 101662 0.0 0.1 898680 1872 ? S 15:58 0:00 zabbix_proxy: housekeeper [startup idle for 30 minutes] zabbix 101663 0.0 0.1 898680 2268 ? S 15:58 0:00 zabbix_proxy: http poller #1 [got 0 values in 0.000572 sec, idle 5 sec] zabbix 101664 0.0 0.2 1003200 4628 ? S 15:58 0:00 zabbix_proxy: discoverer #1 [processed 0 rules in 0.001561 sec, idle 60 sec] zabbix 101665 0.0 0.1 898680 2280 ? S 15:58 0:00 zabbix_proxy: history syncer #1 [processed 0 values in 0.000008 sec, idle 1 sec] zabbix 101666 0.0 0.1 898680 2280 ? S 15:58 0:00 zabbix_proxy: history syncer #2 [processed 1 values in 0.000656 sec, idle 1 sec] zabbix 101667 0.0 0.1 898680 2280 ? S 15:58 0:00 zabbix_proxy: history syncer #3 [processed 0 values in 0.000009 sec, idle 1 sec] zabbix 101668 0.0 0.1 898680 2280 ? S 15:58 0:00 zabbix_proxy: history syncer #4 [processed 0 values in 0.000010 sec, idle 1 sec] zabbix 101669 0.0 0.1 898680 1996 ? S 15:58 0:00 zabbix_proxy: self-monitoring [processed data in 0.000023 sec, idle 1 sec] zabbix 101670 0.0 0.1 898812 3224 ? S 15:58 0:00 zabbix_proxy: task manager [processed 0 task(s) in 0.000311 sec, idle 5 sec] zabbix 101671 0.0 0.2 1006540 4888 ? S 15:58 0:00 zabbix_proxy: poller #1 [got 0 values in 0.000010 sec, idle 1 sec] zabbix 101672 0.0 0.2 1006520 4628 ? S 15:58 0:00 zabbix_proxy: poller #2 [got 0 values in 0.000012 sec, idle 1 sec] zabbix 101673 0.0 0.2 1006540 4888 ? S 15:58 0:00 zabbix_proxy: poller #3 [got 0 values in 0.000029 sec, idle 1 sec] zabbix 101674 0.0 0.2 1006540 4888 ? S 15:58 0:00 zabbix_proxy: poller #4 [got 0 values in 0.000009 sec, idle 1 sec] zabbix 101675 0.0 0.2 1006524 4888 ? S 15:58 0:00 zabbix_proxy: poller #5 [got 1 values in 0.000069 sec, idle 1 sec] zabbix 101676 0.0 0.2 1006520 4628 ? S 15:58 0:00 zabbix_proxy: unreachable poller #1 [got 0 values in 0.000051 sec, idle 5 sec] zabbix 101677 0.0 0.1 898812 3012 ? S 15:58 0:00 zabbix_proxy: trapper #1 [processed data in 0.000000 sec, waiting for connection] zabbix 101678 0.0 0.1 898812 3012 ? S 15:58 0:00 zabbix_proxy: trapper #2 [processed data in 0.000000 sec, waiting for connection] zabbix 101679 0.0 0.1 898812 3012 ? S 15:58 0:00 zabbix_proxy: trapper #3 [processed data in 0.000000 sec, waiting for connection] zabbix 101680 0.0 0.1 898812 3012 ? S 15:58 0:00 zabbix_proxy: trapper #4 [processed data in 0.000000 sec, waiting for connection] zabbix 101681 0.0 0.1 898812 3012 ? S 15:58 0:00 zabbix_proxy: trapper #5 [processed data in 0.000000 sec, waiting for connection] zabbix 101682 0.0 0.1 901992 1988 ? S 15:58 0:00 zabbix_proxy: icmp pinger #1 [got 0 values in 0.000060 sec, idle 5 sec] postgres 101683 0.0 0.2 236284 5180 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49154) idle postgres 101684 0.0 0.3 236288 5980 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49156) idle postgres 101685 0.0 0.2 236288 5184 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49158) idle postgres 101686 0.0 0.3 236984 6488 ? 
Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49160) idle postgres 101687 0.0 0.4 238580 9172 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49162) idle postgres 101688 0.2 0.3 236940 6764 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49166) idle postgres 101689 0.0 0.3 236288 5976 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49168) idle postgres 101690 0.0 0.3 236288 5980 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49170) idle postgres 101691 0.0 0.3 236288 5980 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49172) idle postgres 101692 0.0 0.3 236372 6016 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49174) idle postgres 101693 0.0 0.2 236288 5188 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49176) idle postgres 101694 0.0 0.2 236288 5184 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49178) idle postgres 101695 0.0 0.2 236288 5184 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49180) idle postgres 101696 0.0 0.2 236288 5184 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49182) idle postgres 101697 0.0 0.3 236508 6312 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49184) idle postgres 101698 0.0 0.2 236288 5188 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49186) idle postgres 101699 0.0 0.2 236288 5184 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49188) idle postgres 101700 0.0 0.2 236288 5188 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49190) idle postgres 101701 0.0 0.2 236288 5188 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49192) idle postgres 101702 0.0 0.2 236288 5184 ? Ss 15:58 0:00 postgres: zabbix zabbix_proxy 127.0.0.1(49194) idle |
Comment by James Cook [ 2019 Sep 02 ] |
Hi Vladislavs, Yes, the cache size is set to 1 GB. Is that an issue? It has been configured like this for months and restarted many times. It's super weird that this only started the other day. Cheers |
Comment by Vladislavs Sokurenko [ 2019 Sep 02 ] |
It wouldn't hurt to monitor the write cache on the Zabbix proxy: zabbix[wcache,<cache>,<mode>]. My main suspicion is that this is related to the write cache filling up due to a slow database, thus increasing memory usage. Is it possible that the majority of memory is consumed not by the Zabbix proxy but by something else? Could you please show the output of top? Also, can you confirm that decreasing HistoryCacheSize helps with the issue? |
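The internal item mentioned above can be attached to the proxy host on the server; typical keys, as documented for Zabbix internal checks (the exact set of modes depends on the version), include:

zabbix[wcache,history,pfree]   - free space in the history write cache, in percent
zabbix[wcache,index,pfree]     - free space in the history index cache, in percent
zabbix[wcache,values]          - number of values processed by the write cache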
Comment by Leonardo Vendrusculo [ 2019 Sep 02 ] |
I don't know very well how it works, but for me the problem is: sqlite database size * StartPollers = RAM needed to start the proxy. All this memory is not allocated during the execution of the proxy; it is demanded just to start the service, maybe something related to the fork process. In my case, if I start the proxy service with an existing sqlite database, about 12 GB of RAM is needed and Zabbix never frees this memory; if I start the service without a previous database, the usage is up to 1.5 GB. |
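Read purely as a back-of-the-envelope illustration (an assumption, using figures reported earlier in this thread rather than measured data), that model would give roughly:

168 MB (grown sqlite file)  x  ~70 proxy processes  =  about 11.5 GB at startup
 32 MB (fresh sqlite file)  x  ~70 proxy processes  =  about  2.2 GB at startup

which is in the same ballpark as the 12 GB versus 1.5 GB reported above.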
Comment by Daniel [ 2019 Sep 02 ] |
We have exactly the same behaviour as Leonardo - the allocation happens only on startup with an existing database. So the error mostly triggers after the proxy has been running for a while and is restarted during some maintenance/update. |
Comment by James Cook [ 2019 Sep 03 ] |
Hi Vladislavs, I have attached a screenshot of the last 7 days for the configuration, history and history index caches (Zabbix Proxy Cache Utilisation James Cook). You can see there is no utilisation issue here; the gaps are caused by the outage we have had due to this issue. The issue does look to be as described by Daniel and Leonardo: it only occurs on startup, when the proxies are forked and have enabled monitoring to perform. This is why a clean database or my workaround (disabling hosts) works; I suspect you could also reassign the monitored hosts to a different proxy, restart the proxy and then reassign the hosts. Looking at the debug logs in my issue (https://support.zabbix.com/browse/ZBX-16582) I saw the following behaviour:
In terms of forking, memory usage follows the CoW (copy-on-write) method, which means that the forked process uses the parent's memory as shared memory and only duplicates it if a write is performed. My suspicion is that for some reason the Zabbix pollers, when initially forked, reference a large amount of the parent's shared memory and then perform some write operation that suddenly consumes all of the system memory as the pages need to be duplicated. An interesting article regarding the pitfalls of forking is: https://devarea.com/linux-fork-system-call-and-its-pitfalls The behaviour I observe when the fork error message does occur is that at least some Zabbix pollers did succeed in starting and the rest failed due to the memory allocation. Possible solutions which may resolve this:
The other thing to keep in mind that may contribute to this issue is enabling/disabling memory overcommit at the OS or hypervisor layer; it is enabled by default at the OS layer. In our case we disabled overcommit at the OS layer, as we deem it a safer option, and we reserved the memory within the VMware hypervisor. With that in mind, any process needs to be as slim as possible in terms of memory consumption. Hope this helps. Cheers James |
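One way to check whether copy-on-write duplication is what consumes the memory is to compare shared versus private resident pages for each proxy process right after startup (a sketch; the smaps field names are standard, the pgrep pattern is an assumption):

# For every zabbix_proxy process, show how much resident memory is still shared
# (CoW pages inherited from the parent) versus private (pages already duplicated
# by a write in the child).
for p in $(pgrep zabbix_proxy); do
    awk -v pid=$p '
        /^Shared_(Clean|Dirty):/  {shared  += $2}
        /^Private_(Clean|Dirty):/ {private += $2}
        END {printf "pid %s: shared %d kB, private %d kB\n", pid, shared, private}
    ' /proc/$p/smaps
done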
Comment by Vladislavs Sokurenko [ 2019 Sep 03 ] |
Can you please decrease these variables one by one and see which one, if any, helps to decrease the overall memory usage? HistoryIndexCacheSize CacheSize HistoryCacheSize |
Comment by James Cook [ 2019 Sep 04 ] |
Hi Vladislavs, While we were having the original issue, prior to the workaround, I did in fact lower these to 256M each, which allowed me to start more Zabbix pollers at the time, however only 10 or so. Yesterday we enabled memory overcommit on our CentOS boxes, which is allowing them to start up with 100 pollers; however, the question I have still not answered is why this suddenly occurred on our proxies and was reproducible. To enable memory overcommit on CentOS the following was added to sysctl:
To disable memory overcommit on CentOS the following was added to sysctl:
It may be worth anyone who is experiencing this issue enabling memory overcommit and confirming that this allows the Zabbix proxy to start; or, for yourself, disabling memory overcommit and seeing whether this then affects your Zabbix proxy. NOTE: You can also change the ratio when enabling memory overcommit, i.e. vm.overcommit_ratio = 1 (allow 101% configured memory) - so you may be able to force the issue by lowering this while memory overcommit is enabled. So the question is still why the Zabbix proxy requires so much memory upon start, when after the first configuration sync it has no issues. I suspect it is the fork and a write somewhere in memory causing the memory to be duplicated. Cheers James
|
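The exact sysctl lines referenced above are not preserved in this export; the standard kernel knobs are as follows (documented Linux behaviour, shown as a sketch of a drop-in sysctl file with a hypothetical name):

# /etc/sysctl.d/90-overcommit.conf (hypothetical file name)
# vm.overcommit_memory = 0  -> heuristic overcommit (kernel default)
# vm.overcommit_memory = 1  -> always overcommit, never refuse an allocation
# vm.overcommit_memory = 2  -> strict accounting: commit limit = swap + overcommit_ratio % of RAM
vm.overcommit_memory = 1
# Only relevant in mode 2; the kernel default is 50
# vm.overcommit_ratio = 50

# Apply without a reboot:
sysctl --system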
Comment by Vladislavs Sokurenko [ 2019 Sep 04 ] |
It would be nice to understand what this memory is used for and whether it is memory for the write cache ("HistoryCacheSize"). Can you confirm that decreasing HistoryCacheSize makes the pollers consume a lot less memory, and attach smaps, please? A quick update, [email protected]: from your smaps I see that Pss is equal to 2191492 KB, which is roughly 2 GB. |
Comment by Vladislavs Sokurenko [ 2019 Sep 04 ] |
If anyone experiences the issue, please attach smaps for the process in question and the output of htop |
Comment by James Cook [ 2019 Sep 09 ] |
Hi Vladislavs, I checked this and yes, it is roughly 2 GB, and our system memory is 8 GB (only when the workaround is in place). This is why I think the Zabbix proxy has memory allocation issues upon startup if the host configuration is in place and enabled. When Broken:
When Working:
I have a spare proxy at work now, so I will attempt to force a failure and collect smaps in that condition this week or next (as I am on training).
Cheers James |
Comment by Vladislavs Sokurenko [ 2019 Sep 10 ] |
The main process performs the forks, so it should be its memory that gets duplicated; can you please be so kind as to provide smaps for it? And maybe even a dump of the biggest regions?

gdb -p PID
dump memory /tmp/memory.dump 0x0266b000 0x02753000

The main suspicion is that it performs a configuration sync once in the main process and some memory is left over for the other processes. |
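To find candidate address ranges for the gdb dump above, one can list the largest mappings of the main process from smaps (a sketch; the pgrep pattern is an assumption about how the proxy was started and may need adjusting):

# List the largest mappings (by Rss) of the main zabbix_proxy process so the
# address ranges can be fed to gdb's "dump memory" command.
PID=$(pgrep -o zabbix_proxy)    # -o: oldest matching process, i.e. the parent
awk '/^[0-9a-f]+-[0-9a-f]+ / {range=$1; perms=$2; name=$NF}
     /^Rss:/ {printf "%8d kB  %-4s  %s  %s\n", $2, perms, range, name}' \
    /proc/$PID/smaps | sort -rn | head -15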
Comment by James Cook [ 2019 Sep 12 ] |
Hi Vladislavs, I'm currently on training and will get some additional data for you next week. There are no invalid update intervals reported in the logs or anything like that. Regards James
|
Comment by Christian Anton [ 2019 Sep 17 ] |
We are apparently having the same issue here. One of our proxies handles roughly 1000 hosts and 600 NVPS with a CacheSize of 256M, HistoryCacheSize of 128M and HistoryIndexCacheSize of 64M. The system works nicely when started without an sqlite database file, and it has enough available memory: zmonitor-prx-fr1 root ~]# free -m When restarting the proxy, it recognizes that the database is there, begins starting its processes and immediately gets killed by the OOM killer. |
Comment by Vladislavs Sokurenko [ 2019 Sep 17 ] |
It might be related to the main process using the database, with the forked processes then getting the same memory. |
Comment by Aigars Kadikis [ 2019 Nov 19 ] |
Hello [email protected]. You have done some good investigation so far. Would you mind continuing the process by providing smaps for it? And maybe even a dump of the biggest regions? For example:

gdb -p PID
dump memory /tmp/memory.dump 0x0266b000 0x02753000
|
Comment by Aigars Kadikis [ 2019 Nov 20 ] |
For those who are relatively new to this thread but would like to investigate and understand what kind of content is lying in memory, here are sequential steps showing how to look at the memory heap of "poller process #1":

# take a note of the process id
ps aux | grep "[z]abbix_proxy: poller #1 "
# output
zabbix 13349 0.8 2.0 156260 20968 ? S Nov06 157:33 /usr/sbin/zabbix_proxy: poller #1 [got 0 values in 0.000049 sec, idle 1 sec]

# take a look at the smaps registry
cat /proc/13349/smaps

# look for the section containing the heap with a decent size, for example "1572 kB":
026c8000-02851000 rw-p 00000000 00:00 0   [heap]
Size:              1572 kB
KernelPageSize:       4 kB
MMUPageSize:          4 kB
Rss:               1316 kB
Pss:               1316 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:     1316 kB
Referenced:        1316 kB
Anonymous:         1316 kB
LazyFree:             0 kB
AnonHugePages:        0 kB
ShmemPmdMapped:       0 kB
Shared_Hugetlb:       0 kB
Private_Hugetlb:      0 kB
Swap:                 0 kB
SwapPss:              0 kB
Locked:               0 kB
VmFlags: rd wr mr mw me ac

# enter gdb mode for the same process we are debugging right now:
gdb -p 13349

# dump the memory block, using the address range from the previous output with '0x' prepended to each address:
dump memory /tmp/13349.dump 0x026c8000 0x02851000
quit

# make the memory block more readable to the human eye:
hexdump -C /tmp/13349.dump > /tmp/13349.dump.content

# see what is inside:
cat /tmp/13349.dump.content | less |
Comment by Vladislavs Sokurenko [ 2019 Nov 24 ] |
Please try patch ZBX-9084-test.patch |
Comment by Vladislavs Sokurenko [ 2019 Dec 04 ] |
The issue is reproduced on CentOS Linux release 7.7.1908, VirtualBox Version 6.0.14 r133895 and SQLite 3.7.17. To reproduce the issue, simply discover 200k trapper items on one host, restart the Zabbix proxy after it has received the configuration from the Zabbix server, and observe the out-of-memory error. Upgrading SQLite on CentOS Linux release 7.7.1908 to SQLite 3.30.1 2019-10-10 20:19:45 also solves the issue. The issue is also reproducible on Fedora release 30 (Thirty) with SQLite 3.7.17 2013-05-20. |
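Since the behaviour depends on the SQLite version, it is worth checking what is actually installed before and after upgrading (standard commands on RPM-based systems; the proxy binary path may differ):

# SQLite version shipped by the distribution
rpm -q sqlite

# Version reported by the CLI/library
sqlite3 --version

# Library the proxy binary is actually linked against
ldd /usr/sbin/zabbix_proxy | grep -i sqlite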
Comment by Vladislavs Sokurenko [ 2019 Dec 10 ] |
Fixed in:
|
Comment by Vladislavs Sokurenko [ 2019 Dec 10 ] |
(1) [D] High memory usage during process startup with SQLite 3.7.17 on CentOS/RHEL 7. The startup process has been improved in 4.0.16 and 4.4.4 to avoid similar issues with other databases as well. martins-v Added to known issues for 4.0, 4.4. RESOLVED vso CLOSED |