[ZBX-12919] Preprocessing Manager - extreme memory usage Created: 2017 Oct 22  Updated: 2017 Nov 06  Resolved: 2017 Nov 06

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.4.2, 3.4.3
Fix Version/s: None

Type: Incident report Priority: Critical
Reporter: Andreas Biesenbach Assignee: Unassigned
Resolution: Workaround proposed Votes: 0
Labels: memory, preprocessing, zabbix_server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating system: Red Hat Enterprise Linux Server release 7.4 (Maipo)
Hardware: UCSB-B200-M4, 64GB Memory, 1Socket/12Cores
Database: Percona Server 5.7 (local)

Latest software as well as firmware updates applied.


Attachments: PNG File Dataloss.png     PNG File Status_of_zabbix.png     PNG File Zabbix_environment.PNG     PNG File atop_preprocessingcpu.PNG     PNG File atop_preprocessingmem.PNG     PNG File oom_report.png     PNG File solved_internal_processes.PNG     PNG File solved_preprocessing_queue.PNG     PNG File zabbix_server.conf.png    

 Description   

Hi all,

since we updated our Zabbix server to version 3.4 we have been having a lot of trouble with the memory usage of the preprocessing manager. The memory usage of this process sporadically rises to up to 90% of total server memory and results in OOM process kills on OS level -> in most cases the server itself or the database gets killed, with data loss as a result.

Strange fact: this mostly happens on Saturdays (I already checked cron-/anacrontab but couldn't find anything).

Steps to reproduce:
-

Result:

  1. High CPU and memory usage of the preprocessing manager (see "atop_preprocessingmem.PNG" and "atop_preprocessingcpu.PNG")
  2. OOM Killer killing processes (see "oom_report.png")
  3. Loss of data within zabbix (see "Dataloss.png")

Seems like I am not the only one with this issue: https://www.zabbix.com/forum/showthread.php?p=203975#post203975

Tomorrow I will raise the debug level for the preprocessing processes as mentioned in ZBX-12791.
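
For what it's worth, a minimal sketch of how that can be done at runtime, assuming the standard 3.4 runtime control targets for the preprocessing processes (my assumption, not necessarily the exact procedure from ZBX-12791):

# each call raises the log level of the target processes by one step; repeat until debug (level 4) is reached
zabbix_server -R log_level_increase="preprocessing manager"
zabbix_server -R log_level_increase="preprocessing worker"

# and back down again afterwards
zabbix_server -R log_level_decrease="preprocessing manager"
zabbix_server -R log_level_decrease="preprocessing worker"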

Thanks in advance!



 Comments   
Comment by Glebs Ivanovskis (Inactive) [ 2017 Oct 22 ]

You may need to increase the number of preprocessing workers. Maybe you have a lot of items with "heavy" preprocessing options scheduled on Saturdays and default StartPreprocessors=3 can't cope with that.

But maybe the root cause is a slow DB; I can't say for sure because you have an old graph of process busyness which does not feature the preprocessing manager and workers. Could you import the latest version of Template App Zabbix Server from here and upload more graphs afterwards?

Comment by Andreas Biesenbach [ 2017 Oct 23 ]

First of all: thanks for your fast reply!

But maybe the root cause is slow DB

It looks like you are right. The MySQL backup runs locally since we do not have a replication slave. As soon as the backup starts, the preprocessing manager queue gets higher and higher. I didn't think this would happen when running mysqldump with the "--single-transaction" option.

/bin/mysqldump -uroot --single-transaction --routines --triggers --events --log-error=$backup_dir/${cur_date}_mysql_backup.log zabbix > /mysql/data_zabbix/backup/backup_zabbix_db.sql

You may need to increase the number of preprocessing workers. Maybe you have a lot of items with "heavy" preprocessing options scheduled on Saturdays and default StartPreprocessors=3 can't cope with that.

And again it looks like you are right. Even after cancelling the MySQL backup the preprocessing manager is not able to work off its queue. It is currently at ~104,000,000 items and still rising.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3217 zabbix 0 -20 38.764g 0.025t 1.032g R 100.0 40.5 562:50.77 /usr/sbin/zabbix_server: preprocessing manager #1 [queued 104199936, processed 3 values, idle 0.000000 s
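
For reference, the "queued ..." counter is part of the process title, so it can be watched from a shell without raising the debug level. A minimal sketch, assuming procps ps and watch are available:

# refresh every 5 seconds and show the preprocessing manager's self-reported status line
watch -n 5 'ps -o cmd= -C zabbix_server | grep "preprocessing manager"'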

I will now stop the Zabbix server process (even if that means losing some data) and change the mentioned parameter

StartPreprocessors=3 

to a higher value. I will comment with the results as soon as I have them.
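
For reference, a minimal sketch of the change, assuming the default configuration path /etc/zabbix/zabbix_server.conf and the RHEL 7 unit name zabbix-server (both assumptions, adjust to the local installation):

# in /etc/zabbix/zabbix_server.conf: raise the number of pre-forked preprocessing workers (example value)
StartPreprocessors=20

# a restart is required for the new worker count to take effect
systemctl restart zabbix-server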

-----------------------------------------------------------------------------------------------------------------------------------------------------------

Just as an info: I changed the log level for the preprocessing manager to debug. All I can see is that it is working as expected:

3217:20171023:073252.477 In preprocessor_enqueue() itemid: 4614996
3217:20171023:073252.477 In preprocessor_enqueue_dependent() itemid: 4614996
3217:20171023:073252.477 End of preprocessor_enqueue_dependent()
3217:20171023:073252.477 End of preprocessor_enqueue()
3217:20171023:073252.478 In preprocessor_enqueue() itemid: 4567725
3217:20171023:073252.478 In preprocessor_enqueue_dependent() itemid: 4567725
3217:20171023:073252.478 End of preprocessor_enqueue_dependent()
3217:20171023:073252.478 End of preprocessor_enqueue()
3217:20171023:073252.478 In preprocessor_enqueue() itemid: 4555928
3217:20171023:073252.478 In preprocessor_enqueue_dependent() itemid: 4555928
3217:20171023:073252.478 End of preprocessor_enqueue_dependent()
3217:20171023:073252.478 End of preprocessor_enqueue()
3217:20171023:073252.478 In preprocessor_enqueue() itemid: 4555930
3217:20171023:073252.478 In preprocessor_enqueue_dependent() itemid: 4555930
3217:20171023:073252.478 End of preprocessor_enqueue_dependent()
3217:20171023:073252.478 End of preprocessor_enqueue()
3217:20171023:073252.478 In preprocessor_enqueue() itemid: 4555929
3217:20171023:073252.478 In preprocessor_enqueue_dependent() itemid: 4555929

Best regards
Andi

Comment by Glebs Ivanovskis (Inactive) [ 2017 Oct 23 ]

I hope the situation will recover eventually. Do not forget to update the template; there is a new item for preprocessing queue monitoring (without a trigger so far, unfortunately).
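
For reference, a minimal sketch of what that queue item looks like, assuming the internal check key is zabbix[preprocessing_queue] (an assumption about the template contents, check the imported template itself):

# Zabbix internal item in Template App Zabbix Server (assumed key)
Type: Zabbix internal
Key: zabbix[preprocessing_queue]
Type of information: Numeric (unsigned)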

Comment by Andris Zeila [ 2017 Oct 23 ]

Depending on the queued values, ZBX-12857 might help to improve the situation. However, it is slightly strange that the preprocessing queue grows when the history cache is full: the preprocessing manager would spend more time sleeping and waiting on the history cache instead of processing value requests and adding them to the queue. Processing data in batches (for example from proxies) would still cause the queue to grow, but the rate of growth would be slower.

Comment by Andreas Biesenbach [ 2017 Oct 23 ]

Hi all,

We set StartPreprocessors to a higher value and imported the new template as mentioned:

StartPreprocessors=20

I can't believe it. The Zabbix server is now running stably. Even with the MySQL backup running, the queue permanently stays at 0 items. The only thing that might now cause errors are the history syncers, because their pre-select might take much longer during the MySQL backup, which results in higher history cache usage - but this can be fixed by increasing the history cache.
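
For reference, a minimal sketch of that tuning, assuming HistoryCacheSize is the parameter meant here and the default config path /etc/zabbix/zabbix_server.conf:

# in /etc/zabbix/zabbix_server.conf: raise the history write cache (default 16M, maximum 2G)
HistoryCacheSize=256M

# a restart is required for the change to take effect
systemctl restart zabbix-server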

Thanks a lot!! Even though I will keep an eye on it over the next few days, I am optimistic that the OOM situations are a thing of the past now.


Comment by Ingus Vilnis [ 2017 Nov 06 ]

Hi Andreas,

From the last screenshots it looks like your issue was resolved just by some tuning. I will therefore close this ticket as Workaround proposed.

Please reopen or create a new ticket if the problems with preprocessing still exist.
