[ZBXNEXT-5113] Increase or remove configuration cache size limitation Created: 2019 Mar 16  Updated: 2020 May 25  Resolved: 2020 May 25

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Server (S)
Affects Version/s: 4.0.5
Fix Version/s: None

Type: New Feature Request Priority: Critical
Reporter: Brian Lloyd Assignee: Zabbix Support Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04
4.4.0-140-generic


Attachments: PNG File screenshot-1.png     Text File server.patch    
Issue Links:
Duplicate
is duplicated by ZBX-15956 Configuration Cache Fragmentation Closed

 Description   

CacheSize is currently configured for 8GB. Server dies when the limit is exceeded and indicates that the CacheSize should be increased. Server fails to start when CacheSize is configured for greater than 8GB.

Our Zabbix server process starts and runs fine for 10-15 days. I speculate that over that period the memory allocated for the cache becomes more and more fragmented, which increases cache utilization. I see there have been earlier issues requesting an increase when the limit was lower. Is there a way the upper limit on the cache size can be removed altogether, leaving it to the sysadmin to determine the appropriate maximum size?

1868:20190314:034026.792 __mem_malloc: skipped 48828 asked 12903760 skip_min 256 skip_max 12857992
1868:20190314:034026.792 file:dbconfig.c,line:94 zbx_mem_realloc(): out of memory (requested 12903760 bytes)
1868:20190314:034026.792 file:dbconfig.c,line:94 zbx_mem_realloc(): please increase CacheSize configuration parameter
1868:20190314:034026.792 === memory statistics for configuration cache ===
1868:20190314:034026.792 free chunks of size 32 bytes: 8
1868:20190314:034026.792 free chunks of size 48 bytes: 8
1868:20190314:034026.792 free chunks of size 80 bytes: 1
1868:20190314:034026.793 free chunks of size 96 bytes: 1592
1868:20190314:034026.793 free chunks of size 104 bytes: 3
1868:20190314:034026.793 free chunks of size 112 bytes: 1
1868:20190314:034026.805 free chunks of size >= 256 bytes: 48828
1868:20190314:034026.805 min chunk size: 32 bytes
1868:20190314:034026.805 max chunk size: 12857992 bytes
1868:20190314:034026.805 memory of total size 8589934216 bytes fragmented into 22273991 chunks

mysql> select type, count(*) from items group by type order by type;
+------+----------+
| type | count(*) |
+------+----------+
|    0 |     1351 |
|    1 |     1504 |
|    2 |      146 |
|    3 |    68211 |
|    4 |  4931716 |
|    5 |     4724 |
|    7 |       17 |
|   10 |        2 |
|   12 |       19 |
|   15 |      337 |
|   16 |       87 |
|   17 |   778099 |
+------+----------+
12 rows in set (13.30 sec)



 Comments   
Comment by Brian Lloyd [ 2019 Mar 16 ]

I'll also note that the built-in graph on the Zabbix server health screen that shows cache utilization doesn't seem to indicate that the cache is getting full. Perhaps this should be reported as a bug (let me know), or perhaps it is just something I need to troubleshoot more closely in my environment. The attached screenshot shows the period leading up to the crash message in the description.

Comment by Brian Lloyd [ 2019 Mar 16 ]

Attached patch seems to work to allow me to increase CacheSize - or at least allows zabbix to start. I haven't run across limitations elsewhere that would require modification to support the increased limit.

server.patch

Comment by Glebs Ivanovskis [ 2019 Mar 17 ]

Previous increase of the limit happened in ZBXNEXT-2137.

server.patch looks good and should be enough to fulfil your requirements.

Judging by the line

1868:20190314:034026.792 __mem_malloc: skipped 48828 asked 12903760 skip_min 256 skip_max 12857992

fragmentation seems to be the issue. For some reason Zabbix needs to allocate ~12 MB, but there is no contiguous free space big enough.
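To make the comparison in that log line explicit, here is a minimal Python sketch that extracts the counters and checks the request against the largest chunk. The function name parse_mem_malloc is hypothetical. Reading "asked" as the requested size and "skip_max" as the largest free chunk examined is an assumption based on the field names and on the "max chunk size" line in the same log.

```python
import re

# Field layout copied from the __mem_malloc log line quoted above.
PATTERN = re.compile(
    r"__mem_malloc: skipped (\d+) asked (\d+) skip_min (\d+) skip_max (\d+)"
)

def parse_mem_malloc(line):
    """Return the four counters from a __mem_malloc log line as a dict."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a __mem_malloc line")
    skipped, asked, skip_min, skip_max = (int(g) for g in match.groups())
    return {"skipped": skipped, "asked": asked,
            "skip_min": skip_min, "skip_max": skip_max}

line = ("1868:20190314:034026.792 __mem_malloc: skipped 48828 "
        "asked 12903760 skip_min 256 skip_max 12857992")
stats = parse_mem_malloc(line)

# Assumed meaning: how far the request exceeds the largest chunk seen.
shortfall = stats["asked"] - stats["skip_max"]
print(shortfall)  # 45768
```

Under that reading, the request missed the largest available chunk by about 45 KB, even though tens of thousands of smaller free chunks were skipped.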

12 MB seems too much to me and is worth investigating separately. Do you see any activity in the log file prior to the __mem_malloc() failure?

Comment by Brian Lloyd [ 2019 Mar 20 ]

The messages I see leading up to the crash include: Item became supported, item became unsupported, No such instance currently exists at this OID, and slow query, none of which is out of the ordinary when compared with historical logs.

Comment by Glebs Ivanovskis [ 2019 Mar 20 ]

As far as I know, error messages are one of the things stored in the configuration cache, so an item or trigger becoming unsupported can cause an allocation. Maybe there are error messages that are unreasonably long? If so, there is something to improve in Zabbix besides lifting the CacheSize limitation.

Comment by Brian Lloyd [ 2019 Mar 20 ]

Interesting... I'll have to dig through the logs again to see whether any long error messages could be contributing, as you've suggested. I can say there are a lot of these messages: hosts go in and out of the supported state often as polling encounters errors. I believe this is because the monitored devices run a wide variety of firmware revisions, and not all of them support the same set of monitoring items (SNMP OIDs). Many will keep throwing errors until we finish synchronizing firmware across the install base, which is a work in progress.

Any thoughts on why the zabbix internal check on the configuration cache utilization never dips below 70%, even right before the crash? I don't imagine any error message being able to fill up that large a chunk of the cache (around 5.6 G when CacheSize is configured for 8G).

One other log message I failed to mention is unmatched SNMP trap. Those wouldn't get stored in the configuration cache, would they? Those are the largest messages I recall seeing at this time.

Thanks

Comment by Glebs Ivanovskis [ 2019 Mar 20 ]

> Any thoughts on why the zabbix internal check on the configuration cache utilization never dips below 70%, even right before the crash? I don't imagine any error message being able to fill up that large a chunk of the cache (around 5.6 G when CacheSize is configured for 8G).

I believe this is purely due to fragmentation. There may be 5.6 GB of free space in total, but none of the free chunks is big enough to accommodate 12 MB. The configuration cache can't be easily defragmented because memory allocated there is accessed through ordinary pointers. This is very convenient from a developer's perspective, but it also means that allocated blocks can't be moved.
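The effect can be reproduced with a toy model. The following Python sketch is not Zabbix code; the arena size, block size, request size, and the helper free_runs are made up for illustration. It frees every other block in a full arena and shows that half of the arena is free while no contiguous run is large enough for a modest request.

```python
def free_runs(arena):
    """Return the lengths of contiguous free runs (True = used, False = free)."""
    runs, current = [], 0
    for used in arena:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs

ARENA_SIZE = 1000  # toy arena, arbitrary units
BLOCK = 10         # size of every allocated block

# Start with a completely used arena, then free every other block.
arena = [True] * ARENA_SIZE
for start in range(0, ARENA_SIZE, 2 * BLOCK):
    for i in range(start, start + BLOCK):
        arena[i] = False

runs = free_runs(arena)
total_free = sum(runs)      # free space summed over all runs
largest_free = max(runs)    # largest contiguous free run
request = 50

# Blocks cannot be moved, so only a single contiguous run can satisfy a request.
can_allocate = largest_free >= request
print(total_free, largest_free, can_allocate)  # 500 10 False
```

A utilization graph based on total free space would report 50% free here, yet the allocation of 50 units fails.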

> One other log message I failed to mention is unmatched SNMP trap. Those wouldn't get stored in the configuration cache, would they?

Log messages are not stored in configuration cache. I was talking about item/trigger/... error messages which are stored in DB and which you see in the frontend when you hover over a red cross. These are stored in configuration cache.

Comment by Brian Lloyd [ 2019 Apr 05 ]

After applying the patch and increasing CacheSize from 8GB to 12GB, the server crashed at 21 days, whereas the previous crashes all occurred between 10 and 14 days. I can continue increasing the cache, but I wonder whether I should instead open a bug to see if the memory fragmentation issue can be resolved. Can you advise?

Thanks

Comment by Glebs Ivanovskis [ 2019 Apr 05 ]

What was the __mem_malloc message this time? Do you have much free space in the cache according to graphs this time as well? If yes, I would definitely consider reporting it.

Comment by Brian Lloyd [ 2019 Apr 06 ]

I'll work on getting it reported as a bug. Thanks.

The last value pulled for value cache % free prior to the crash was 80.8084%

  2288:20190404:095916.682 __mem_malloc: skipped 193611 asked 12903760 skip_min 256 skip_max 12862152
  2288:20190404:095916.682 [file:dbconfig.c,line:94] zbx_mem_realloc(): out of memory (requested 12903760 bytes)
  2288:20190404:095916.682 [file:dbconfig.c,line:94] zbx_mem_realloc(): please increase CacheSize configuration parameter
  2288:20190404:095916.682 === memory statistics for configuration cache ===
  2288:20190404:095916.682 free chunks of size     32 bytes:       12
  2288:20190404:095916.682 free chunks of size     40 bytes:        2
  2288:20190404:095916.682 free chunks of size     48 bytes:        4
  2288:20190404:095916.682 free chunks of size     56 bytes:        1
  2288:20190404:095916.682 free chunks of size     80 bytes:        2
  2288:20190404:095916.682 free chunks of size     96 bytes:     1777
  2288:20190404:095916.682 free chunks of size    112 bytes:        1
  2288:20190404:095916.725 free chunks of size >= 256 bytes:   193611
  2288:20190404:095916.725 min chunk size:         32 bytes
  2288:20190404:095916.725 max chunk size:   12862152 bytes
  2288:20190404:095916.725 memory of total size 12884901512 bytes fragmented into 22921433 chunks
  2288:20190404:095916.725 of those, 10416940744 bytes are in   195410 free chunks
  2288:20190404:095916.725 of those, 2101217856 bytes are in 22726023 used chunks
  2288:20190404:095916.725 ================================
  2288:20190404:095916.725 === Backtrace: ===
  2288:20190404:095916.726 11: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](zbx_backtrace+0x3c) [0x4a0b7c]
  2288:20190404:095916.726 10: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](__zbx_mem_realloc+0x427) [0x49de97]
  2288:20190404:095916.727 9: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](zbx_binary_heap_insert+0xac) [0x4a23fc]
  2288:20190404:095916.727 8: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration]() [0x41dc8b]
  2288:20190404:095916.727 7: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](DCsync_configuration+0x8f1) [0x482751]
  2288:20190404:095916.727 6: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](dbconfig_thread+0xfe) [0x428fee]
  2288:20190404:095916.727 5: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](zbx_thread_start+0x3e) [0x4aa09e]
  2288:20190404:095916.727 4: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](MAIN_ZABBIX_ENTRY+0x6e6) [0x423f36]
  2288:20190404:095916.727 3: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](daemon_start+0x1bb) [0x4a060b]
  2288:20190404:095916.727 2: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](main+0x350) [0x423020]
  2288:20190404:095916.727 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fe1aeef0830]
  2288:20190404:095916.727 0: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](_start+0x29) [0x4233a9]
  1497:20190404:095917.844 One child process died (PID:2288,exitcode/signal:1). Exiting ...

Comment by Glebs Ivanovskis [ 2019 Apr 06 ]

There is definitely something wrong happening...
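The arithmetic behind that impression can be sketched in Python, using the figures from the memory statistics in the previous comment. The variable names are illustrative only.

```python
# Figures copied from the 2019-04-04 memory statistics in the previous comment.
total_size = 12884901512
free_bytes, free_chunks = 10416940744, 195410
used_bytes, used_chunks = 2101217856, 22726023
max_chunk = 12862152   # "max chunk size"
requested = 12903760   # size of the failed zbx_mem_realloc() request

free_fraction = free_bytes / total_size
avg_free_chunk = free_bytes / free_chunks
avg_used_chunk = used_bytes / used_chunks

print(f"free: {free_fraction:.1%}")
print(f"average free chunk: {avg_free_chunk:.0f} bytes")
print(f"average used chunk: {avg_used_chunk:.0f} bytes")
print(f"largest free chunk fits request: {max_chunk >= requested}")
```

Roughly four fifths of the cache is free, which is in line with the last graphed value of about 80.8%, but that space is spread over 195410 chunks and the largest of them is still smaller than the requested size.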

> I'll work on getting it reported as a bug.

Let me know if you need any help.

Comment by Brian Lloyd [ 2019 Apr 06 ]

Link to bug report: ZBX-15956

Comment by Artjoms Rimdjonoks [ 2020 May 25 ]

brian.lloyd
ZBX-15956 is done, so I will close this issue.
