-
Incident report
-
Resolution: Duplicate
-
Major
-
None
-
5.4.5
-
None
-
CentOS 7.9, PHP 7.4.20, Zabbix 5.4.2, MySQL 5.7.34, Dell PowerEdge R640, 376GB RAM, SSDs rated at 12Gb/s
I have a rather large installation of over 4200 hosts, 1.5 million items and 1.3 million triggers. It is driven almost exclusively by Trapper items and some custom feeder code that sends many LLD rules and metrics to the Zabbix Trappers. I just upgraded to version 5.4.2, and Zabbix server has been repeatedly crashing with the following stack trace:
zabbix_server [221390]: 221391:20211026:173215.497 In zbx_trends_parse_range() param:1d:now
ERROR [file and function: <cache.c,tfc_index_add>, revision:4c8f9aabe1, line:325] Something impossible just happened.
221390:20211026:173215:498 === Backtrace: ===
221390:20211026:173215:498 13: /usr/share/zabbix/sbin/zabbix_server(zbx_backtrace+0x35) [0x56c255]
221390:20211026:173215:498 12: /usr/share/zabbix/sbin/zabbix_server() [0x5ff6ef]
221390:20211026:173215:498 11: /usr/share/zabbix/sbin/zabbix_server(zbx_tfc_put_value+0x9b) [0x5ffb8b]
221390:20211026:173215:498 10: /usr/share/zabbix/sbin/zabbix_server(zbx_trends_eval_avg+0x206) [0x5fedf6]
221390:20211026:173215:498 9: /usr/share/zabbix/sbin/zabbix_server(evaluate_function2+0x297) [0x557267]
221390:20211026:173215:498 8: /usr/share/zabbix/sbin/zabbix_server(evaluate_expressions_0x830) [0x54f2b0]
221390:20211026:173215:498 7: /usr/share/zabbix/sbin/zabbix_server() [0x50911d]
221390:20211026:173215:498 6: /usr/share/zabbix/sbin/zabbix_server(free_database_cache+0x108) [0x50cb98]
221390:20211026:173215:498 5: /usr/share/zabbix/sbin/zabbix_server(zbx_on_exit+0x96) [0x42fb36]
221390:20211026:173215:498 4: /usr/share/zabbix/sbin/zabbix_server
...and the rest of the stack is just libpthread and MAIN_ZABBIX_ENTRY, daemon_start and main().
I've noticed that the configuration syncer is throwing A LOT of these messages during configuration syncing:
cannot parse function <functionid> period base: invalid period shift expression
But when I look in the functions table and search for any of those functionids, I don't see a problem. For example:
functionid | itemid | triggerid | name     | parameter
4982378    | 238304 | 7641283   | trendavg | $,3h:now
I also noticed that the History Syncers seem to be much busier after this upgrade. Prior to the upgrade the History Syncers were running at about 22% CPU usage on average. Following the upgrade, they're at 50%.
It's hard to copy the configs over because the system is on an intranet, but our process counts and cache sizes are as follows, and they worked very well in Zabbix 5.2.4:
StartTrappers = 5
StartDBSyncers = 6
HistoryCacheSize = 250M
HistoryIndexCacheSize = 250M
TrendCacheSize = 500M
TrendFunctionCacheSize = 500M
ValueCacheSize = 6G
CacheSize = 8G
CacheUpdateFrequency = 120
StartLLDProcessors = 50
The upgrade took quite a while, as you can imagine, because of the change in trigger expression syntax between versions 5.2 and 5.4. We also build our installation from sources and package it with other software. Installation plus the database migration performed by Zabbix server took about 2 hours, so we had a backlog of a few thousand files to process once our custom code started back up. About every 10 minutes Zabbix server crashed with the above stack trace, and the history syncer processes, when they reported metrics, were very busy at or near 100% CPU. In zabbix_server.log we saw many messages of the form "item <item_name> is outside history storage period". We retain as little history as possible while still being able to evaluate the triggers, but because the upgrade took so long, many metrics were discarded as too old instead of being inserted into the database. After the backlog was worked off, the server remained healthier and did not crash as often. This seems to indicate that the problem occurs only when Zabbix server is under a heavy workload.
For reference, our custom code pushes approximately 150,000,000 metrics into Zabbix per day, or about 1736/sec on average, and everything was healthy in Zabbix 5.2.4.
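The per-second figure follows directly from the daily total. A quick check of the arithmetic (it gives only the daily average; peak rates while working off a backlog would be higher):

```python
# Average ingest rate implied by the daily metric count in this report.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400
metrics_per_day = 150_000_000           # approximate figure from the report

avg_rate = metrics_per_day / SECONDS_PER_DAY
print(f"average rate: {avg_rate:.1f} values/sec")
```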
For this upgrade we did modify many of our triggers to make more use of the trend functions, because of the new trend function cache feature.
For what it's worth, I increased the log level to debug level 5, but that slowed Zabbix server down enough that I couldn't reproduce the crash. At such a high log level it takes a lot longer for my code's sender processes to establish sockets to the Trappers. Slowing down the sending seemed to be the key to avoiding the crashes.
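Since slowing the senders appeared to avoid the crash, one way to test that without raising the log level would be to pace the feeder itself. The following is only a minimal sketch of such pacing, not our actual feeder code: `send_paced`, its parameters and the `send` callback are hypothetical names, and the callback stands in for whatever actually delivers a batch to a Trapper.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(values: Sequence, batch_size: int) -> Iterator[List]:
    """Split values into consecutive lists of at most batch_size entries."""
    for start in range(0, len(values), batch_size):
        yield list(values[start:start + batch_size])


def send_paced(values: Sequence, batch_size: int, max_per_sec: float,
               send: Callable[[List], None],
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> int:
    """Call send(batch) per batch, sleeping so the average rate stays at or
    below max_per_sec values per second. Returns the number of values sent."""
    interval = batch_size / max_per_sec   # minimum seconds between batch starts
    next_start = clock()
    sent = 0
    for batch in batches(values, batch_size):
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)                  # wait until this batch is allowed to start
        send(batch)                       # hypothetical delivery callback
        sent += len(batch)
        next_start += interval
    return sent
```

Lowering `max_per_sec` step by step from the normal rate would show whether there is a sending rate below which the crash stops reproducing.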
Is there more database contention in this version?
- duplicates
-
ZBX-21720 "Something impossible has just happened" at <cache.c,tfc_index_add>, revision:c7c3044a4a2, line:311
- Closed