-
Incident report
-
Resolution: Cannot Reproduce
-
Critical
-
None
-
4.0.0beta1
-
None
-
Ubuntu 18.04
Tried both SQLite3 and MariaDB 10.1: same issue.
There are possibly two unrelated memory-leak issues here.
Steps to reproduce:
- Update config for the proxy:
  CacheSize=256M
  HistoryCacheSize=256M
  HistoryIndexCacheSize=256M
  StartVMwareCollectors=32
  VMwareFrequency=30
  VMwarePerfFrequency=30
  VMwareCacheSize=2G
- Enable VMware monitoring via vCenter v5.5.0 (37 hosts)
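For context when reading the memory figures later in this report, the configured cache sizes can be added up. The following is a minimal sketch, assuming that these four caches are the ones the proxy allocates as shared memory and that the K/M/G suffixes are binary multiples; the helper names `parse_size` and `total_cache_bytes` are hypothetical and not part of Zabbix.

```python
# Sizes copied from the proxy config in the steps above.
CACHE_SETTINGS = {
    "CacheSize": "256M",
    "HistoryCacheSize": "256M",
    "HistoryIndexCacheSize": "256M",
    "VMwareCacheSize": "2G",
}

# Assumption: suffixes are binary multiples (1M = 1024 * 1024 bytes).
UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value: str) -> int:
    """Convert a size such as '256M' or '2G' to bytes; plain digits mean bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNIT_BYTES:
        return int(value[:-1]) * UNIT_BYTES[suffix]
    return int(value)


def total_cache_bytes(settings: dict) -> int:
    """Sum all configured cache sizes in bytes."""
    return sum(parse_size(v) for v in settings.values())


if __name__ == "__main__":
    total = total_cache_bytes(CACHE_SETTINGS)
    print(f"configured caches: {total // 1024**2} MiB")
```

Comparing this total with the shmem-rss value reported by the kernel may help separate expected shared-memory usage from the leak itself.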
Result:
Memory starts leaking slowly (the growth comes from a single VMware collector). After 5 hours of this slow leak, all memory is suddenly used up (free memory = 0 and all swap in use), possibly for a different reason (see below).
As a result, about 40 minutes later one zabbix-proxy subprocess is killed by the OOM killer, see below:
Aug 30 12:48:41 aumelzbxproxy01 kernel: [20245.087653] [ 2020] 111 2020 771956 164 339968 1665 0 zabbix_proxy
Aug 30 12:48:41 aumelzbxproxy01 kernel: [20245.087655] Out of memory: Kill process 1933 (zabbix_proxy) score 418 or sacrifice child
Aug 30 12:48:41 aumelzbxproxy01 kernel: [20245.090069] Killed process 1933 (zabbix_proxy) total-vm:4844788kB, anon-rss:1754648kB, file-rss:1728kB, shmem-rss:1015156kB
Aug 30 12:48:41 aumelzbxproxy01 kernel: [20245.210083] oom_reaper: reaped process 1933 (zabbix_proxy), now anon-rss:0kB, file-rss:0kB, shmem-rss:1015156kB
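The memory fields in the "Killed process" line are easier to compare once extracted. Below is a minimal sketch that parses that one kernel line with regular expressions; the function name `parse_oom_kill` is hypothetical, and the pattern is only assumed to match the line format shown in this report.

```python
import re

# The "Killed process" line copied from the kernel log above.
KILLED_LINE = (
    "Aug 30 12:48:41 aumelzbxproxy01 kernel: [20245.090069] Killed process 1933 "
    "(zabbix_proxy) total-vm:4844788kB, anon-rss:1754648kB, file-rss:1728kB, "
    "shmem-rss:1015156kB"
)

FIELD_RE = re.compile(r"(total-vm|anon-rss|file-rss|shmem-rss):(\d+)kB")


def parse_oom_kill(line: str) -> dict:
    """Return the PID, process name and memory fields (in kB) of a kill line."""
    pid_match = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if pid_match is None:
        raise ValueError("not a 'Killed process' line")
    fields = {name: int(kb) for name, kb in FIELD_RE.findall(line)}
    return {"pid": int(pid_match.group(1)), "name": pid_match.group(2), **fields}


if __name__ == "__main__":
    info = parse_oom_kill(KILLED_LINE)
    for key in ("total-vm", "anon-rss", "file-rss", "shmem-rss"):
        print(f"{key}: {info[key] / 1024:.1f} MiB")
```

The anon-rss value is the private memory of the killed process, which is the figure most relevant to a per-process leak.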
The process killed by the OOM killer is the following one:
zabbix 1933 1423 0 07:11 ? 00:00:00 /usr/sbin/zabbix_proxy: poller #24 [got 24 values in 1.331793 sec, getting values]
After it dies, "syncing history data" starts and takes about 4 hours, during which zabbix-proxy does NOT send any data to the server. zabbix-proxy is then restarted by systemd (see the log):
1908:20180830:124831.130 slow query: 4.665959 sec, "select taskid,type,clock,ttl from task where status=1 and type in (2, 6) order by taskid"
1951:20180830:124831.813 slow query: 9.917240 sec, "commit;"
1946:20180830:124831.868 slow query: 10.056336 sec, "commit;"
1973:20180830:124831.870 slow query: 10.056378 sec, "commit;"
1857:20180830:124831.870 slow query: 10.054214 sec, "commit;"
1942:20180830:124832.230 slow query: 10.579126 sec, "commit;"
1942:20180830:124832.381 resuming Zabbix agent checks on host "AUSYDADMIN01": connection restored
1857:20180830:124837.434 slow query: 5.529120 sec, "begin;"
1856:20180830:124838.693 slow query: 6.580776 sec, "begin;"
1952:20180830:124838.750 slow query: 6.372370 sec, "begin;"
1978:20180830:124840.334 slow query: 19.614870 sec, "insert into proxy_autoreg_host (clock,host,listen_ip,listen_dns,listen_port,host_metadata) values (1535633300,'AUTSTMYDOTPDV01','10.60.16.32','autstmydotpd v01.myobtest.net',10050,'Windows AUTSTMYDOTPDV01 6.1.7601 Microsoft Windows Server 2008 R2 Standard Service Pack 1 x64')"
1970:20180830:124842.059 slow query: 25.347796 sec, "update hosts set errors_from=1535633296,disable_until=1535633299 where hostid=10445"
1423:20180830:124842.065 One child process died (PID:1933,exitcode/signal:9). Exiting ...
zabbix_proxy [1423]: Error waiting for process with PID 1933: [10] No child processes
1423:20180830:124844.659 syncing history data...
1423:20180830:124854.005 syncing history data... 0.046460%
1423:20180830:124904.005 syncing history data... 0.093412%
.....
1423:20180830:170724.004 syncing history data... 99.933678%
1423:20180830:170734.003 syncing history data... 99.999017%
1423:20180830:170734.521 syncing history data done
1423:20180830:170734.557 Zabbix Proxy stopped. Zabbix 4.0.0beta1 (revision 84219).
6877:20180830:170745.113 Starting Zabbix Proxy (active) [aumelzbxproxy01]. Zabbix 4.0.0beta1 (revision 84219).
6877:20180830:170745.123 **** Enabled features ****
6877:20180830:170745.123 SNMP monitoring: YES
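The exact length of the history sync can be read off the first and last "syncing history data" entries in the log above. A minimal sketch, assuming the proxy log prefix has the form PID:YYYYMMDD:HHMMSS.mmm as in this excerpt; the helper names `log_time` and `sync_duration_seconds` are hypothetical.

```python
from datetime import datetime

# First and last sync entries copied from the proxy log above.
SYNC_START = "1423:20180830:124844.659 syncing history data..."
SYNC_DONE = "1423:20180830:170734.521 syncing history data done"


def log_time(line: str) -> datetime:
    """Extract the timestamp from a 'PID:YYYYMMDD:HHMMSS.mmm message' line."""
    _pid, date, time_and_rest = line.split(":", 2)
    clock = time_and_rest.split(" ", 1)[0]
    return datetime.strptime(f"{date}:{clock}", "%Y%m%d:%H%M%S.%f")


def sync_duration_seconds(start_line: str, done_line: str) -> float:
    """Seconds elapsed between two log lines."""
    return (log_time(done_line) - log_time(start_line)).total_seconds()


if __name__ == "__main__":
    seconds = sync_duration_seconds(SYNC_START, SYNC_DONE)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    print(f"history sync took {int(hours)}h {int(minutes)}m {secs:.3f}s")
```

For the two entries above this works out to roughly 4 hours 19 minutes, consistent with the "4 hours" stated in this report.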
See screenshot... Times are GMT+10.
See log file... Times are UTC.
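Because the screenshot and the log use different time zones, a log timestamp has to be shifted by ten hours before it can be matched against the graph. A minimal sketch using a fixed UTC+10 offset (an assumption that ignores any daylight-saving rules); the function name `log_to_screenshot_time` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the screenshot uses a fixed UTC+10 offset.
SCREENSHOT_TZ = timezone(timedelta(hours=10))


def log_to_screenshot_time(utc_stamp: str) -> datetime:
    """Convert a proxy log stamp 'YYYYMMDD:HHMMSS.mmm' (UTC) to UTC+10."""
    utc = datetime.strptime(utc_stamp, "%Y%m%d:%H%M%S.%f")
    return utc.replace(tzinfo=timezone.utc).astimezone(SCREENSHOT_TZ)


if __name__ == "__main__":
    # Timestamp of the "One child process died" entry in the proxy log.
    local = log_to_screenshot_time("20180830:124842.065")
    print(local.strftime("%Y-%m-%d %H:%M:%S %z"))
```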
Expected:
1) No slow memory leak and no sudden memory spike that brings free memory to 0.
2) When a subprocess dies, zabbix-proxy should not spend 4 hours on "syncing history data" while sending nothing to zabbix-server.