[ZBX-15956] Configuration Cache Fragmentation Created: 2019 Apr 06 Updated: 2024 Apr 10 Resolved: 2019 Apr 30 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 4.0.5 |
Fix Version/s: | 4.0.8rc1, 4.2.2rc1, 4.4.0alpha1, 4.4 (plan) |
Type: | Problem report | Priority: | Blocker |
Reporter: | Brian Lloyd | Assignee: | Vladislavs Sokurenko |
Resolution: | Fixed | Votes: | 1 |
Labels: | cache, fragmentation | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Ubuntu 16.04 |
Attachments: |
![]() ![]() ![]() ![]() ![]() |
||||||||
Issue Links: |
|
||||||||
Team: | |||||||||
Sprint: | Sprint 51 (Apr 2019) | ||||||||
Story Points: | 3 |
Description |
I'm submitting this based on advice received in I've run into an issue where our zabbix_server process crashes due to exhaustion of the configuration cache - specifically, lack of an available memory chunk large enough to fulfill the allocation request. A log message suggested increasing the cachesize, but we were already at the 8GB limit. I submitted a feature request,
memory of total size 12884901512 bytes fragmented into 22921433 chunks of those, 10416940744 bytes are in 195410 free chunks of those, 2101217856 bytes are in 22726023 used chunks
I applied the patch I had proposed, built, increased the configuration cache value from 8GB to 12 GB, and waited. Previously, with the 8GB cachesize, we were getting a zabbix server crash every 10-15 days (2-3 days with cachesize previously set to 4GB). With the 12GB cachesize, we went 21 days before a crash occurred (log snippet above is from this crash). At this point I've increased the cachesize to 16GB, but I feel this probably isn't the best fix, as it seems almost all of the configuration cache is tied up in unusable fragmented memory and will inevitably lead to another out of memory scenario. Is there something that can be done to improve memory allocation so it doesn't become so fragmented? Thank you for your time. 2019-01-24 Incident - cachesize 4GB (similar instances occurred on 2019-01-22, 2019-01-18, 2019-01-15)
7054:20190124:185053.886 __mem_malloc: skipped 112009 asked 12903760 skip_min 256 skip_max 12628776 7054:20190124:185053.886 [file:dbconfig.c,line:94] zbx_mem_realloc(): out of memory (requested 12903760 bytes) 7054:20190124:185053.886 [file:dbconfig.c,line:94] zbx_mem_realloc(): please increase CacheSize configuration parameter 7054:20190124:185053.886 === memory statistics for configuration cache === 7054:20190124:185053.886 free chunks of size 24 bytes: 3 7054:20190124:185053.886 free chunks of size 48 bytes: 1 7054:20190124:185053.886 free chunks of size 56 bytes: 1 7054:20190124:185053.887 free chunks of size 64 bytes: 1 7054:20190124:185053.887 free chunks of size 80 bytes: 2 7054:20190124:185053.887 free chunks of size 96 bytes: 1614 7054:20190124:185053.887 free chunks of size 104 bytes: 2 7054:20190124:185053.908 free chunks of size >= 256 bytes: 112009 7054:20190124:185053.908 min chunk size: 24 bytes 7054:20190124:185053.908 max chunk size: 12628776 bytes 7054:20190124:185053.908 memory of total size 4294966920 bytes fragmented into 14429213 chunks 7054:20190124:185053.908 of those, 2736242192 bytes are in 113633 free chunks 7054:20190124:185053.908 of those, 1327857336 bytes are in 14315580 used chunks 7054:20190124:185053.908 ================================ 7054:20190124:185053.908 === Backtrace: === 7054:20190124:185053.909 13: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](zbx_backtrace+0x44) [0x5603ca76b9a2] 7054:20190124:185053.909 12: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](__zbx_mem_realloc+0x14d) [0x5603ca7679b0] 7054:20190124:185053.909 11: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](+0xc4eb6) [0x5603ca731eb6] 7054:20190124:185053.909 10: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](+0x1009e9) [0x5603ca76d9e9] 7054:20190124:185053.909 9: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](zbx_binary_heap_insert+0x69) [0x5603ca76df46] 7054:20190124:185053.909 8: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](+0xd0549) [0x5603ca73d549] 7054:20190124:185053.909 7: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](DCsync_configuration+0x1134) [0x5603ca73e762] 7054:20190124:185053.909 6: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](dbconfig_thread+0x168) [0x5603ca6ad0de] 7054:20190124:185053.909 5: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](zbx_thread_start+0x37) [0x5603ca7796cf] 7054:20190124:185053.909 4: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](MAIN_ZABBIX_ENTRY+0x975) [0x5603ca6a4faa] 7054:20190124:185053.909 3: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](daemon_start+0x315) [0x5603ca76b131] 7054:20190124:185053.909 2: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](main+0x305) [0x5603ca6a461f] 7054:20190124:185053.909 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f3be6306830] 7054:20190124:185053.909 0: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1548380855.229265 sec, syncing configuration](_start+0x29) [0x5603ca6a37e9] 7052:20190124:185054.437 One child process died (PID:7054,exitcode/signal:1). Exiting ...
2019-03-14 Incident - cachesize 8GB (similar instances occurred on 2019-02-28, 2019-02-19, 2019-02-05)
1868:20190314:034026.792 __mem_malloc: skipped 48828 asked 12903760 skip_min 256 skip_max 12857992 1868:20190314:034026.792 [file:dbconfig.c,line:94] zbx_mem_realloc(): out of memory (requested 12903760 bytes) 1868:20190314:034026.792 [file:dbconfig.c,line:94] zbx_mem_realloc(): please increase CacheSize configuration parameter 1868:20190314:034026.792 === memory statistics for configuration cache === 1868:20190314:034026.792 free chunks of size 32 bytes: 8 1868:20190314:034026.792 free chunks of size 48 bytes: 8 1868:20190314:034026.792 free chunks of size 80 bytes: 1 1868:20190314:034026.793 free chunks of size 96 bytes: 1592 1868:20190314:034026.793 free chunks of size 104 bytes: 3 1868:20190314:034026.793 free chunks of size 112 bytes: 1 1868:20190314:034026.805 free chunks of size >= 256 bytes: 48828 1868:20190314:034026.805 min chunk size: 32 bytes 1868:20190314:034026.805 max chunk size: 12857992 bytes 1868:20190314:034026.805 memory of total size 8589934216 bytes fragmented into 22273991 chunks 1868:20190314:034026.805 of those, 6183292960 bytes are in 50441 free chunks 1868:20190314:034026.805 of those, 2050257416 bytes are in 22223550 used chunks 1868:20190314:034026.805 ================================ 1868:20190314:034026.805 === Backtrace: === 1868:20190314:034026.811 13: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](zbx_backtrace+0x44) [0x55b0add47ec6] 1868:20190314:034026.811 12: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](__zbx_mem_realloc+0x14d) [0x55b0add43e09] 1868:20190314:034026.811 11: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](+0xc6335) [0x55b0add0e335] 1868:20190314:034026.811 10: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](+0x101f0d) [0x55b0add49f0d] 1868:20190314:034026.811 9: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](zbx_binary_heap_insert+0x69) [0x55b0add4a46a] 1868:20190314:034026.811 8: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](+0xd19a2) [0x55b0add199a2] 1868:20190314:034026.811 7: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](DCsync_configuration+0x1134) [0x55b0add1abbb] 1868:20190314:034026.811 6: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](dbconfig_thread+0x168) [0x55b0adc8883e] 1868:20190314:034026.811 5: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](zbx_thread_start+0x37) [0x55b0add55bf3] 1868:20190314:034026.811 4: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](MAIN_ZABBIX_ENTRY+0x975) [0x55b0adc8070a] 1868:20190314:034026.811 3: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](daemon_start+0x315) [0x55b0add47655] 1868:20190314:034026.811 2: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](main+0x305) [0x55b0adc7fd7f] 1868:20190314:034026.811 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f22d47a3830] 1868:20190314:034026.811 0: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1552556048.314516 sec, syncing configuration](_start+0x29) [0x55b0adc7ef49] 1488:20190314:034027.817 One child process died (PID:1868,exitcode/signal:1). Exiting ...
2019-04-04 Incident - cachesize 12GB 2288:20190404:095916.682 __mem_malloc: skipped 193611 asked 12903760 skip_min 256 skip_max 12862152 2288:20190404:095916.682 [file:dbconfig.c,line:94] zbx_mem_realloc(): out of memory (requested 12903760 bytes) 2288:20190404:095916.682 [file:dbconfig.c,line:94] zbx_mem_realloc(): please increase CacheSize configuration parameter 2288:20190404:095916.682 === memory statistics for configuration cache === 2288:20190404:095916.682 free chunks of size 32 bytes: 12 2288:20190404:095916.682 free chunks of size 40 bytes: 2 2288:20190404:095916.682 free chunks of size 48 bytes: 4 2288:20190404:095916.682 free chunks of size 56 bytes: 1 2288:20190404:095916.682 free chunks of size 80 bytes: 2 2288:20190404:095916.682 free chunks of size 96 bytes: 1777 2288:20190404:095916.682 free chunks of size 112 bytes: 1 2288:20190404:095916.725 free chunks of size >= 256 bytes: 193611 2288:20190404:095916.725 min chunk size: 32 bytes 2288:20190404:095916.725 max chunk size: 12862152 bytes 2288:20190404:095916.725 memory of total size 12884901512 bytes fragmented into 22921433 chunks 2288:20190404:095916.725 of those, 10416940744 bytes are in 195410 free chunks 2288:20190404:095916.725 of those, 2101217856 bytes are in 22726023 used chunks 2288:20190404:095916.725 ================================ 2288:20190404:095916.725 === Backtrace: === 2288:20190404:095916.726 11: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](zbx_backtrace+0x3c) [0x4a0b7c] 2288:20190404:095916.726 10: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](__zbx_mem_realloc+0x427) [0x49de97] 2288:20190404:095916.727 9: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](zbx_binary_heap_insert+0xac) [0x4a23fc] 2288:20190404:095916.727 8: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration]() [0x41dc8b] 2288:20190404:095916.727 7: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](DCsync_configuration+0x8f1) [0x482751] 2288:20190404:095916.727 6: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](dbconfig_thread+0xfe) [0x428fee] 2288:20190404:095916.727 5: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](zbx_thread_start+0x3e) [0x4aa09e] 2288:20190404:095916.727 4: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](MAIN_ZABBIX_ENTRY+0x6e6) [0x423f36] 2288:20190404:095916.727 3: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](daemon_start+0x1bb) [0x4a060b] 2288:20190404:095916.727 2: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](main+0x350) [0x423020] 2288:20190404:095916.727 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fe1aeef0830] 2288:20190404:095916.727 0: /usr/sbin/zabbix_server: configuration syncer [synced configuration in 1554393163.208568 sec, syncing configuration](_start+0x29) [0x4233a9] 1497:20190404:095917.844 One child process died (PID:2288,exitcode/signal:1). Exiting ...
|
Comments |
Comment by Glebs Ivanovskis [ 2019 Apr 06 ] | |||||||||||||||||
Whoever will be looking into it from Zabbix side, I want to draw your attention to the fact that Zabbix server tries to allocate a 12 MB chunk in configuration cache. This is not what it usually does since all the configuration is stored in hashsets of pointers and vectors of pointers to structures, and there are no structures in Zabbix as big as 12 MB. I have a suspicion that sometimes Zabbix constructs a huge error message (item became notsupported or something), probably citing the whole value. Since item errors are stored in configuration cache as well, this seems a reasonable explanation to me. | |||||||||||||||||
Comment by richlv [ 2019 Apr 06 ] | |||||||||||||||||
Very refreshing to see a public analysis, thank you so much, Gleb. | |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 06 ] | |||||||||||||||||
It might hep if objdump is provided, also if you increase log level to debug in configuration syncer only using runtime command it will print what consumes the most space and it also might be of value. How many unsupported items and triggers do you have ? | |||||||||||||||||
Comment by Brian Lloyd [ 2019 Apr 06 ] | |||||||||||||||||
I'm thinking I'll decrease the cache size to 4GB so that the error is more likely to occur within the next few days rather than having to wait weeks (judging by the used space shown in the logs, I could probably decrease to 2 or 3 in order to allow for the event to occur even more quickly), and then I'll set the increased log level as suggested. I'm not sure why it didn't occur to me to do this prior to submitting this issue. System information
| |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 06 ] | |||||||||||||||||
Please also attach graphs of free configuration cache does it grow over time or spikes at some moment ? | |||||||||||||||||
Comment by Brian Lloyd [ 2019 Apr 06 ] | |||||||||||||||||
Graphs from previous crashes are attached. They indicate that the memory usage stays pretty stable and doesn't have any spikes. 30 Day interval showing increases from 8 to 12 to 16 GB cachesize:
1 Day interval showing the most recent crash just before 10:00: | |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 08 ] | |||||||||||||||||
You can take patch mem_stats.diff | |||||||||||||||||
Comment by Oleksii Zagorskyi [ 2019 Apr 12 ] | |||||||||||||||||
one more case, another installation:
11953:20190411:133422.657 __mem_malloc: skipped 37614 asked 335664 skip_min 256 skip_max 335264
11953:20190411:133422.657 [file:dbconfig.c,line:94] zbx_mem_realloc(): out of memory (requested 335664 bytes)
11953:20190411:133422.657 [file:dbconfig.c,line:94] zbx_mem_realloc(): please increase CacheSize configuration parameter
11953:20190411:133422.657 === memory statistics for configuration cache ===
11953:20190411:133422.657 free chunks of size 32 bytes: 1
11953:20190411:133422.657 free chunks of size 40 bytes: 3
11953:20190411:133422.657 free chunks of size 48 bytes: 1
11953:20190411:133422.657 free chunks of size 64 bytes: 1
11953:20190411:133422.657 free chunks of size 80 bytes: 3
11953:20190411:133422.658 free chunks of size >= 256 bytes: 37614
11953:20190411:133422.658 min chunk size: 32 bytes
11953:20190411:133422.659 max chunk size: 335264 bytes
11953:20190411:133422.659 memory of total size 2147483272 bytes fragmented into 11289053 chunks
11953:20190411:133422.659 of those, 840311400 bytes are in 37623 free chunks
11953:20190411:133422.659 of those, 1126547040 bytes are in 11251430 used chunks
11953:20190411:133422.659 ================================
...
11933:20190411:133626.693 Zabbix Server stopped. Zabbix 4.0.5 (revision 90164).
it happened 3 times last month. | |||||||||||||||||
Comment by Oleksii Zagorskyi [ 2019 Apr 15 ] | |||||||||||||||||
debug log of config syncer cycle configuration_syncer.log | |||||||||||||||||
Comment by Brian Lloyd [ 2019 Apr 15 ] | |||||||||||||||||
Quick status update - it took me some time to coordinate the changes necessary to apply the suggested patch in our environment. I applied the mem_stats.diff patch and decreased my configuration cache to 2.5GB in an attempt to expedite the next crash. Based on previous crashes with a 4GB cache happening every few days, I expected a crash within a couple days. Oddly, it still hasn't crashed after 4.5 days. I'll update again when I have some relevant data. | |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 23 ] | |||||||||||||||||
Some info on possible cause, yet to be confirmed: Unfortunately when continuous memory "max chunk size" becomes too small then Zabbix server will crash even though there are free memory. In the issue description it Zabbix server run out of memory when performing reallocation, my suspicion is that reallocation is done when new items are discovered and added to queue, queue itself requires continuous memory. Perhaps new items are discovered continuously so that queue needs to be resized over and over again until it cannot be resized anymore. Perhaps increasing array growth factor from 3/2 to something else could improve the situation ? diff --git a/src/libs/zbxalgo/binaryheap.c b/src/libs/zbxalgo/binaryheap.c index 38531bc27e..101065e047 100644 --- a/src/libs/zbxalgo/binaryheap.c +++ b/src/libs/zbxalgo/binaryheap.c @@ -29,7 +29,7 @@ static void __binary_heap_ensure_free_space(zbx_binary_heap_t *heap); static int __binary_heap_bubble_up(zbx_binary_heap_t *heap, int index); static int __binary_heap_bubble_down(zbx_binary_heap_t *heap, int index); -#define ARRAY_GROWTH_FACTOR 3/2 +#define ARRAY_GROWTH_FACTOR 2 #define HAS_DIRECT_OPTION(heap) (0 != (heap->options & ZBX_BINARY_HEAP_OPTION_DIRECT)) Also making configuration cache reload less frequent could help with the issue | |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 24 ] | |||||||||||||||||
Currently looks like this code is to blame: zbx_binary_heap_clear(&config->timer_queue); zbx_hashset_iter_reset(&config->triggers, &iter); while (NULL != (trigger = (ZBX_DC_TRIGGER *)zbx_hashset_iter_next(&iter))) { zbx_binary_heap_elem_t elem; if (TRIGGER_STATUS_DISABLED == trigger->status) continue; if (TRIGGER_FUNCTIONAL_FALSE == trigger->functional) continue; if (ZBX_TRIGGER_TIMER_QUEUE != trigger->timer) continue; trigger->nextcheck = dc_timer_calculate_nextcheck(now, trigger->triggerid); elem.key = trigger->triggerid; elem.data = (void *)trigger; zbx_binary_heap_insert(&config->timer_queue, &elem); } Notice clearing of queue This fixes the issue but should be code reviewed: diff --git a/src/libs/zbxalgo/binaryheap.c b/src/libs/zbxalgo/binaryheap.c index 38531bc27e..7924bf9453 100644 --- a/src/libs/zbxalgo/binaryheap.c +++ b/src/libs/zbxalgo/binaryheap.c @@ -70,6 +70,7 @@ static void __binary_heap_ensure_free_space(zbx_binary_heap_t *heap) if (heap->elems_alloc != tmp_elems_alloc) { + zabbix_log(LOG_LEVEL_INFORMATION, "size %u", tmp_elems_alloc * sizeof(zbx_binary_heap_elem_t)); heap->elems = (zbx_binary_heap_elem_t *)heap->mem_realloc_func(heap->elems, tmp_elems_alloc * sizeof(zbx_binary_heap_elem_t)); if (NULL == heap->elems) @@ -322,13 +323,7 @@ void zbx_binary_heap_remove_direct(zbx_binary_heap_t *heap, zbx_uint64_t key) void zbx_binary_heap_clear(zbx_binary_heap_t *heap) { - if (NULL != heap->elems) - { - heap->mem_free_func(heap->elems); - heap->elems = NULL; - heap->elems_num = 0; - heap->elems_alloc = 0; - } + heap->elems_num = 0; if (HAS_DIRECT_OPTION(heap)) zbx_hashmap_clear(heap->key_index); | |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 25 ] | |||||||||||||||||
Fixed in development branch: | |||||||||||||||||
Comment by Vladislavs Sokurenko [ 2019 Apr 25 ] | |||||||||||||||||
Fixed in:
|