[ZBX-20590] preprocessing worker utilization Created: 2022 Feb 16 Updated: 2024 Apr 10 Resolved: 2023 Oct 23 |
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 6.0.0 |
Fix Version/s: | 6.0.23rc1, 6.4.8rc1, 7.0.0alpha7, 7.0 (plan) |
Type: | Problem report | Priority: | Critical |
Reporter: | Igor Shekhanov | Assignee: | Aleksandre Sebiskveradze |
Resolution: | Fixed | Votes: | 21 |
Labels: | None | ||
Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
Attachments: |
(7 image attachments) |
Issue Links: |
Sub-Tasks: |
Team: |
Sprint: | Sprint 85 (Feb 2022), Sprint 86 (Mar 2022), Sprint 87 (Apr 2022), Sprint 88 (May 2022), Sprint 89 (Jun 2022), Sprint 90 (Jul 2022), Sprint 91 (Aug 2022), Sprint 92 (Sep 2022), Sprint 93 (Oct 2022), Sprint 94 (Nov 2022), Sprint 95 (Dec 2022), Sprint 96 (Jan 2023), Sprint 97 (Feb 2023), Sprint 98 (Mar 2023), Sprint 99 (Apr 2023), Sprint 100 (May 2023), Sprint 101 (Jun 2023), Sprint 102 (Jul 2023), Sprint 103 (Aug 2023), Sprint 104 (Sep 2023), Sprint 105 (Oct 2023) |
Story Points: | 1 |
Description |
Hi, sorry for my English (Google Translate). We updated Zabbix 5.4 => 6.0 and a bug (or unintended feature) showed up. OS: Oracle Linux 8, kernel 5.4.17-2136.304.4.1.el8uek.x86_64; database: PostgreSQL + TimescaleDB; installed from the official Zabbix yum repo. |
Comments |
Comment by David Mayr [ 2022 Feb 16 ] |
We saw a similar/same thing when going from 5.4 to 6.0 (1 server + 16 proxies, ~4100 NVPS, MySQL, Ubuntu 20.04). We also saw another issue where the nodata() function triggered sporadically on hundreds of hosts, usually a few minutes after a zabbix-server restart: for roughly an hour nothing would come in, then the same values arrived again, then it stopped again, and so on. |
Comment by Aleksey Volodin [ 2022 Feb 17 ] |
Hello! Thank you for reporting this. Can you please run this command on your Zabbix server host: zabbix_server -R diaginfo, and share its output from the Zabbix server log file? Also, can you please share the output of this command: ps aux | grep preproc? Best regards, |
Comment by Igor Shekhanov [ 2022 Feb 17 ] |
280235:20220217:125638.998 Starting Zabbix Server. Zabbix 6.0.0 (revision 5203d2ea7d). 280235:20220217:125638.998 ****** Enabled features ****** 280235:20220217:125638.998 SNMP monitoring: YES 280235:20220217:125638.998 IPMI monitoring: YES 280235:20220217:125638.998 Web monitoring: YES 280235:20220217:125638.999 VMware monitoring: YES 280235:20220217:125638.999 SMTP authentication: YES 280235:20220217:125638.999 ODBC: YES 280235:20220217:125638.999 SSH support: YES 280235:20220217:125638.999 IPv6 support: YES 280235:20220217:125638.999 TLS support: YES 280235:20220217:125638.999 ****************************** 280235:20220217:125638.999 using configuration file: /etc/zabbix/zabbix_server.conf 280235:20220217:125639.048 TimescaleDB version: 20501 280235:20220217:125639.060 current database version (mandatory/optional): 06000000/06000000 280235:20220217:125639.060 required mandatory version: 06000000 280244:20220217:125639.137 starting HA manager 280244:20220217:125639.153 HA manager started in active mode 280235:20220217:125639.258 server #0 started [main process] 280246:20220217:125639.259 server #1 started [service manager #1] 280247:20220217:125639.260 server #2 started [configuration syncer #1] 280252:20220217:125639.943 server #3 started [alert manager #1] 280253:20220217:125639.944 server #4 started [alerter #1] 280254:20220217:125639.944 server #5 started [alerter #2] 280255:20220217:125639.945 server #6 started [alerter #3] 280257:20220217:125639.946 server #7 started [preprocessing manager #1] 280258:20220217:125639.947 server #8 started [preprocessing worker #1] 280259:20220217:125639.947 server #9 started [preprocessing worker #2] 280260:20220217:125639.948 server #10 started [preprocessing worker #3] 280261:20220217:125639.949 server #11 started [preprocessing worker #4] 280262:20220217:125639.950 server #12 started [preprocessing worker #5] 280263:20220217:125639.950 server #13 started [preprocessing worker #6] 280264:20220217:125639.951 server #14 started [preprocessing worker #7] 280265:20220217:125639.951 server #15 started [preprocessing worker #8] 280266:20220217:125639.952 server #16 started [lld manager #1] 280267:20220217:125639.952 server #17 started [lld worker #1] 280268:20220217:125639.953 server #18 started [lld worker #2] 280269:20220217:125639.954 server #19 started [lld worker #3] 280270:20220217:125639.954 server #20 started [housekeeper #1] 280273:20220217:125639.955 server #21 started [timer #1] 280274:20220217:125639.955 server #22 started [timer #2] 280276:20220217:125639.956 server #23 started [timer #3] 280277:20220217:125639.957 server #24 started [http poller #1] 280279:20220217:125639.957 server #25 started [http poller #2] 280285:20220217:125639.960 server #29 started [discoverer #3] 280282:20220217:125639.962 server #27 started [discoverer #1] 280295:20220217:125639.965 server #35 started [history syncer #6] 280299:20220217:125639.965 server #39 started [escalator #2] 280294:20220217:125639.966 server #34 started [history syncer #5] 280297:20220217:125639.966 server #37 started [history syncer #8] 280302:20220217:125639.966 server #41 started [proxy poller #1] 280300:20220217:125639.967 server #40 started [escalator #3] 280284:20220217:125639.968 server #28 started [discoverer #2] 280304:20220217:125639.970 server #42 started [proxy poller #2] 280298:20220217:125639.970 server #38 started [escalator #1] 280307:20220217:125639.972 server #44 started [proxy poller #4] 280290:20220217:125639.973 server #32 started [history syncer #3] 
280289:20220217:125639.974 server #31 started [history syncer #2] 280280:20220217:125639.974 server #26 started [http poller #3] 280315:20220217:125639.974 server #49 started [poller #1] 280296:20220217:125639.975 server #36 started [history syncer #7] 280317:20220217:125639.976 server #50 started [poller #2] 280325:20220217:125639.976 server #55 started [unreachable poller #2] 280314:20220217:125639.976 server #48 started [task manager #1] 280322:20220217:125639.977 server #53 started [poller #5] 280320:20220217:125639.978 server #52 started [poller #4] 280328:20220217:125639.979 server #56 started [unreachable poller #3] 280324:20220217:125639.980 server #54 started [unreachable poller #1] 280312:20220217:125639.981 server #47 started [self-monitoring #1] 280319:20220217:125639.981 server #51 started [poller #3] 280333:20220217:125639.981 server #59 started [trapper #3] 280334:20220217:125639.984 server #60 started [trapper #4] 280309:20220217:125639.987 server #45 started [proxy poller #5] 280341:20220217:125639.988 server #64 started [icmp pinger #3] 280346:20220217:125639.988 server #69 started [icmp pinger #8] 280292:20220217:125639.988 server #33 started [history syncer #4] 280349:20220217:125639.989 server #72 started [history poller #2] 280287:20220217:125639.991 server #30 started [history syncer #1] 280336:20220217:125639.991 server #61 started [trapper #5] 280305:20220217:125639.993 server #43 started [proxy poller #3] 280344:20220217:125639.996 server #67 started [icmp pinger #6] 280347:20220217:125639.996 server #70 started [alert syncer #1] 280331:20220217:125639.997 server #58 started [trapper #2] 280310:20220217:125639.998 server #46 started [proxy poller #6] 280340:20220217:125640.000 server #63 started [icmp pinger #2] 280338:20220217:125640.001 server #62 started [icmp pinger #1] 280343:20220217:125640.001 server #66 started [icmp pinger #5] 280350:20220217:125640.001 server #73 started [history poller #3] 280362:20220217:125640.003 server #75 started [history poller #5] 280329:20220217:125640.004 server #57 started [trapper #1] 280342:20220217:125640.004 server #65 started [icmp pinger #4] 280345:20220217:125640.007 server #68 started [icmp pinger #7] 280364:20220217:125640.007 server #77 started [trigger housekeeper #1] 280367:20220217:125640.008 server #79 started [odbc poller #2] 280361:20220217:125640.009 server #74 started [history poller #4] 280366:20220217:125640.011 server #78 started [odbc poller #1] 280363:20220217:125640.012 server #76 started [availability manager #1] 280368:20220217:125640.013 server #80 started [odbc poller #3] 280348:20220217:125640.015 server #71 started [history poller #1] 280235:20220217:131530.510 == value cache diagnostic information == 280235:20220217:131530.510 items:446 values:3048 mode:0 time:0.000320 280235:20220217:131530.510 memory: 280235:20220217:131530.510 size: free:1073483864 used:222200 280235:20220217:131530.510 chunks: free:6 used:2206 min:48 max:1073464280 280235:20220217:131530.511 buckets: 280235:20220217:131530.511 48:1 280235:20220217:131530.511 56:1 280235:20220217:131530.511 64:1 280235:20220217:131530.511 256+:3 280235:20220217:131530.511 top.values: 280235:20220217:131530.511 itemid:105961 values:7 request.values:1 280235:20220217:131530.511 itemid:105939 values:7 request.values:1 280235:20220217:131530.511 itemid:106125 values:7 request.values:1 280235:20220217:131530.511 itemid:105955 values:7 request.values:1 280235:20220217:131530.511 itemid:105903 values:7 request.values:1 280235:20220217:131530.511 
itemid:105851 values:7 request.values:1 280235:20220217:131530.511 itemid:105917 values:7 request.values:1 280235:20220217:131530.511 itemid:106048 values:7 request.values:1 280235:20220217:131530.511 itemid:105930 values:7 request.values:1 280235:20220217:131530.511 itemid:106131 values:7 request.values:1 280235:20220217:131530.511 itemid:105839 values:7 request.values:1 280235:20220217:131530.511 itemid:106130 values:7 request.values:1 280235:20220217:131530.511 itemid:105862 values:7 request.values:1 280235:20220217:131530.511 itemid:105748 values:7 request.values:1 280235:20220217:131530.511 itemid:105985 values:7 request.values:1 280235:20220217:131530.511 itemid:105951 values:7 request.values:1 280235:20220217:131530.511 itemid:105866 values:7 request.values:1 280235:20220217:131530.511 itemid:105831 values:7 request.values:1 280235:20220217:131530.511 itemid:105778 values:7 request.values:1 280235:20220217:131530.511 itemid:105803 values:7 request.values:1 280235:20220217:131530.511 itemid:105926 values:7 request.values:1 280235:20220217:131530.512 itemid:106143 values:7 request.values:1 280235:20220217:131530.512 itemid:105919 values:7 request.values:1 280235:20220217:131530.512 itemid:106100 values:7 request.values:1 280235:20220217:131530.512 itemid:106011 values:7 request.values:1 280235:20220217:131530.512 top.request.values: 280235:20220217:131530.512 itemid:98382 values:6 request.values:2 280235:20220217:131530.512 itemid:98396 values:6 request.values:2 280235:20220217:131530.512 itemid:98395 values:6 request.values:2 280235:20220217:131530.512 itemid:105961 values:7 request.values:1 280235:20220217:131530.512 itemid:105939 values:7 request.values:1 280235:20220217:131530.512 itemid:106125 values:7 request.values:1 280235:20220217:131530.512 itemid:105955 values:7 request.values:1 280235:20220217:131530.512 itemid:105903 values:7 request.values:1 280235:20220217:131530.512 itemid:105851 values:7 request.values:1 280235:20220217:131530.512 itemid:105917 values:7 request.values:1 280235:20220217:131530.512 itemid:106048 values:7 request.values:1 280235:20220217:131530.512 itemid:105930 values:7 request.values:1 280235:20220217:131530.512 itemid:106131 values:7 request.values:1 280235:20220217:131530.512 itemid:105839 values:7 request.values:1 280235:20220217:131530.512 itemid:106130 values:7 request.values:1 280235:20220217:131530.512 itemid:105862 values:7 request.values:1 280235:20220217:131530.512 itemid:105748 values:7 request.values:1 280235:20220217:131530.512 itemid:105985 values:7 request.values:1 280235:20220217:131530.512 itemid:105951 values:7 request.values:1 280235:20220217:131530.512 itemid:105866 values:7 request.values:1 280235:20220217:131530.512 itemid:105831 values:7 request.values:1 280235:20220217:131530.512 itemid:105778 values:7 request.values:1 280235:20220217:131530.513 itemid:105803 values:7 request.values:1 280235:20220217:131530.513 itemid:105926 values:7 request.values:1 280235:20220217:131530.513 itemid:106143 values:7 request.values:1 280235:20220217:131530.513 == 280235:20220217:131530.513 == LLD diagnostic information == 280235:20220217:131530.513 rules:0 values:0 time:0.000558 280235:20220217:131530.513 top.values: 280235:20220217:131530.513 == 280235:20220217:131530.513 == alerting diagnostic information == 280235:20220217:131530.513 alerts:0 time:0.000580 280235:20220217:131530.513 media.alerts: 280235:20220217:131530.513 source.alerts: 280235:20220217:131530.513 == 280235:20220217:131530.560 == history cache diagnostic information == 
280235:20220217:131530.560 items:0 values:0 time:0.000503 280235:20220217:131530.560 memory.data: 280235:20220217:131530.560 size: free:1073741440 used:0 280235:20220217:131530.560 chunks: free:1 used:0 min:1073741440 max:1073741440 280235:20220217:131530.560 buckets: 280235:20220217:131530.560 256+:1 280235:20220217:131530.560 memory.index: 280235:20220217:131530.560 size: free:1073187568 used:553744 280235:20220217:131530.560 chunks: free:3 used:5 min:8072 max:1071924400 280235:20220217:131530.560 buckets: 280235:20220217:131530.560 256+:3 280235:20220217:131530.560 top.values: 280235:20220217:131530.560 == 280235:20220217:131530.560 == preprocessing diagnostic information == 280235:20220217:131530.560 values:188375 done:95483 queued:0 processing:7741 pending:85151 time:0.045876 280235:20220217:131530.560 top.values: 280235:20220217:131530.560 itemid:98403 values:12 steps:0 280235:20220217:131530.560 itemid:98433 values:12 steps:1 280235:20220217:131530.560 itemid:98434 values:12 steps:1 280235:20220217:131530.561 itemid:98435 values:12 steps:1 280235:20220217:131530.561 itemid:98436 values:12 steps:1 280235:20220217:131530.561 itemid:98437 values:12 steps:1 280235:20220217:131530.561 itemid:98438 values:12 steps:1 280235:20220217:131530.561 itemid:98439 values:12 steps:1 280235:20220217:131530.561 itemid:98440 values:12 steps:1 280235:20220217:131530.561 itemid:98441 values:12 steps:1 280235:20220217:131530.561 itemid:98442 values:12 steps:1 280235:20220217:131530.561 itemid:98443 values:12 steps:1 280235:20220217:131530.561 itemid:98444 values:12 steps:1 280235:20220217:131530.561 itemid:98445 values:12 steps:1 280235:20220217:131530.561 itemid:98446 values:12 steps:1 280235:20220217:131530.561 itemid:98447 values:12 steps:1 280235:20220217:131530.561 itemid:98448 values:12 steps:1 280235:20220217:131530.561 itemid:98449 values:12 steps:1 280235:20220217:131530.561 itemid:98450 values:12 steps:1 280235:20220217:131530.561 itemid:98451 values:12 steps:1 280235:20220217:131530.561 itemid:98452 values:12 steps:1 280235:20220217:131530.561 itemid:98453 values:12 steps:1 280235:20220217:131530.561 itemid:98454 values:12 steps:1 280235:20220217:131530.561 itemid:98455 values:12 steps:1 280235:20220217:131530.561 itemid:98456 values:12 steps:1 280235:20220217:131530.561 top.oldest.preproc.values: 280235:20220217:131530.561 == 280235:20220217:131530.561 == locks diagnostic information == 280235:20220217:131530.561 locks: 280235:20220217:131530.561 ZBX_MUTEX_LOG:0x7f1786409000 280235:20220217:131530.561 ZBX_MUTEX_CACHE:0x7f1786409028 280235:20220217:131530.561 ZBX_MUTEX_TRENDS:0x7f1786409050 280235:20220217:131530.561 ZBX_MUTEX_CACHE_IDS:0x7f1786409078 280235:20220217:131530.562 ZBX_MUTEX_SELFMON:0x7f17864090a0 280235:20220217:131530.562 ZBX_MUTEX_CPUSTATS:0x7f17864090c8 280235:20220217:131530.562 ZBX_MUTEX_DISKSTATS:0x7f17864090f0 280235:20220217:131530.562 ZBX_MUTEX_VALUECACHE:0x7f1786409118 280235:20220217:131530.562 ZBX_MUTEX_VMWARE:0x7f1786409140 280235:20220217:131530.562 ZBX_MUTEX_SQLITE3:0x7f1786409168 280235:20220217:131530.562 ZBX_MUTEX_PROCSTAT:0x7f1786409190 280235:20220217:131530.562 ZBX_MUTEX_PROXY_HISTORY:0x7f17864091b8 280235:20220217:131530.562 ZBX_MUTEX_MODBUS:0x7f17864091e0 280235:20220217:131530.562 ZBX_MUTEX_TREND_FUNC:0x7f1786409208 280235:20220217:131530.562 ZBX_RWLOCK_CONFIG:0x7f1786409230 280235:20220217:131530.562 ZBX_RWLOCK_VALUECACHE:0x7f1786409268 280235:20220217:131530.562 == zabbix 280257 0.0 0.4 6461932 75808 ? 
S 12:56 0:01 /usr/sbin/zabbix_server: preprocessing manager #1 [queued 169, processed 0 values, idle 5.004480 sec during 5.004592 sec]
zabbix 280258 99.2 0.0 6423824 12188 ? R 12:56 21:05 /usr/sbin/zabbix_server: preprocessing worker #1 started
zabbix 280259 60.1 0.0 6420148 10444 ? R 12:56 12:47 /usr/sbin/zabbix_server: preprocessing worker #2 started
zabbix 280260 0.0 0.0 6415156 5176 ? S 12:56 0:00 /usr/sbin/zabbix_server: preprocessing worker #3 started
zabbix 280261 0.0 0.0 6414740 4036 ? S 12:56 0:00 /usr/sbin/zabbix_server: preprocessing worker #4 started
zabbix 280262 0.0 0.0 6414740 4036 ? S 12:56 0:00 /usr/sbin/zabbix_server: preprocessing worker #5 started
zabbix 280263 0.0 0.0 6414740 4036 ? S 12:56 0:00 /usr/sbin/zabbix_server: preprocessing worker #6 started
zabbix 280264 0.0 0.0 6414740 4036 ? S 12:56 0:00 /usr/sbin/zabbix_server: preprocessing worker #7 started
zabbix 280265 0.0 0.0 6414740 4036 ? S 12:56 0:00 /usr/sbin/zabbix_server: preprocessing worker #8 started |
Comment by Aleksey Volodin [ 2022 Feb 17 ] |
Thank you for the additional information. Can you please also tell us how many dependent items there are, what the dependent items' preprocessing options are, and perhaps give a short example of the data? Best regards, |
Comment by Igor Shekhanov [ 2022 Feb 17 ] |
2 hosts, first with only default "RabbitMQ cluster by HTTP" Zabbix Template, and second with only default "RabbitMQ node by HTTP" Cluster host has 8651 items (don't know how many dependent items, may be around 8000+) Some preprocessing from template $[?(@.name == "amq.direct" && @.vhost == "MarketData" && @.type =="direct")].message_stats.ack.first() Some data {"management_version":"3.9.8","rates_mode":"basic","sample_retention_policies":{"global":[600,3600,28800,86400],"basic":[600,3600],"detailed":[600]},"exchange_types":[{"name":"direct","description":"AMQP direct exchange, as per the AMQP specification","enabled":true},{"name":"fanout","description":"AMQP fanout exchange, as per the AMQP specification","enabled":true},{"name":"headers","description":"AMQP headers exchange, as per the AMQP specification","enabled":true},{"name":"topic","description":"AMQP topic exchange, as per the AMQP specification","enabled":true}],"product_version":"3.9.8","product_name":"RabbitMQ","rabbitmq_version":"3.9.8","cluster_name":"[email protected]","erlang_version":"24.0.6","erlang_full_version":"Erlang/OTP 24 [erts-12.0.4] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1]","disable_stats":false,"enable_queue_totals":false,"message_stats":{"ack":6068389,"ack_details":{"rate":12.8},"confirm":1072211,"confirm_details":{"rate":4.6},"deliver":6075147,"deliver_details":{"rate":12.8},"deliver_get":6075204,"deliver_get_details":{"rate":12.8},"deliver_no_ack":0,"deliver_no_ack_details":{"rate":0.0},"disk_reads":20773,"disk_reads_details":{"rate":0.0},"disk_writes":4887165,"disk_writes_details":{"rate":2.8},"drop_unroutable":4372,"drop_unroutable_details":{"rate":0.0},"get":55,"get_details":{"rate":0.0},"get_empty":0,"get_empty_details":{"rate":0.0},"get_no_ack":2,"get_no_ack_details":{"rate":0.0},"publish":6054309,"publish_details":{"rate":28.6},"redeliver":7770,"redeliver_details":{"rate":0.0},"return_unroutable":0,"return_unroutable_details":{"rate":0.0}},"churn_rates":{"channel_closed":856663,"channel_closed_details":{"rate":0.0},"channel_created":856820,"channel_created_details":{"rate":0.0},"connection_closed":237893,"connection_closed_details":{"rate":0.0},"connection_created":184371,"connection_created_details":{"rate":0.0},"queue_created":8,"queue_created_details":{"rate":0.0},"queue_declared":1550,"queue_declared_details":{"rate":0.0},"queue_deleted":4,"queue_deleted_details":{"rate":0.0}},"queue_totals":{"messages":59945,"messages_details":{"rate":-0.4},"messages_ready":59942,"messages_ready_details":{"rate":-0.4},"messages_unacknowledged":3,"messages_unacknowledged_details":{"rate":0.0}},"object_totals":{"channels":157,"connections":1623,"consumers":111,"exchanges":539,"queues":435},"statistics_db_event_queue":0,"node":"rabbit@rabbitmq-inf-p-01","listeners":[{"node":"rabbit@rabbitmq-inf-p-01","protocol":"amqp","ip_address":"::","port":5672,"socket_opts":{"backlog":128,"nodelay":true,"linger":[true,0],"exit_on_close":false}},{"node":"rabbit@rabbitmq-inf-p-02","protocol":"amqp","ip_address":"::","port":5672,"socket_opts":{"backlog":128,"nodelay":true,"linger":[true,0],"exit_on_close":false}},{"node":"rabbit@rabbitmq-inf-p-03","protocol":"amqp","ip_address":"::","port":5672,"socket_opts":{"backlog":128,"nodelay":true,"linger":[true,0],"exit_on_close":false}},{"node":"rabbit@rabbitmq-inf-p-01","protocol":"amqp/ssl","ip_address":"::","port":5671,"socket_opts":{"backlog":128,"nodelay":true,"linger":[true,0],"exit_on_close":false,"versions":["tlsv1.2","tlsv1.1","tlsv1"],"keyfile":"/etc/rabbitmq/key.pem
","certfile":"/etc/rabbitmq/cert.pem","cacertfile":"/etc/rabbitmq/cacert.pem","fail_if_no_peer_cert":false,"verify":"verify_none"}},{"node":"rabbit@rabbitmq-inf-p-02","protocol":"amqp/ssl","ip_address":"::","port":5671,"socket_opts":{"backlog":128,"nodelay":true,"linger":[true,0],"exit_on_close":false,"versions":["tlsv1.2","tlsv1.1","tlsv1"],"keyfile":"/etc/rabbitmq/key.pem","certfile":"/etc/rabbitmq/cert.pem","cacertfile":"/etc/rabbitmq/cacert.pem","fail_if_no_peer_cert":false,"verify":"verify_none"}},{"node":"rabbit@rabbitmq-inf-p-03","protocol":"amqp/ssl","ip_address":"::","port":5671,"socket_opts":{"backlog":128,"nodelay":true,"linger":[true,0],"exit_on_close":false,"versions":["tlsv1.2","tlsv1.1","tlsv1"],"keyfile":"/etc/rabbitmq/key.pem","certfile":"/etc/rabbitmq/cert.pem","cacertfile":"/etc/rabbitmq/cacert.pem","fail_if_no_peer_cert":false,"verify":"verify_none"}},{"node":"rabbit@rabbitmq-inf-p-01","protocol":"clustering","ip_address":"::","port":25672,"socket_opts":[]},{"node":"rabbit@rabbitmq-inf-p-02","protocol":"clustering","ip_address":"::","port":25672,"socket_opts":[]},{"node":"rabbit@rabbitmq-inf-p-03","protocol":"clustering","ip_address":"::","port":25672,"socket_opts":[]},{"node":"rabbit@rabbitmq-inf-p-01","protocol":"http","ip_address":"::","port":15672,"socket_opts":{"cowboy_opts":{"sendfile":false},"port":15672}},{"node":"rabbit@rabbitmq-inf-p-02","protocol":"http","ip_address":"::","port":15672,"socket_opts":{"cowboy_opts":{"sendfile":false},"port":15672}},{"node":"rabbit@rabbitmq-inf-p-03","protocol":"http","ip_address":"::","port":15672,"socket_opts":{"cowboy_opts":{"sendfile":false},"port":15672}},{"node":"rabbit@rabbitmq-inf-p-01","protocol":"http/prometheus","ip_address":"::","port":15692,"socket_opts":{"cowboy_opts":{"sendfile":false},"port":15692,"protocol":"http/prometheus"}},{"node":"rabbit@rabbitmq-inf-p-02","protocol":"http/prometheus","ip_address":"::","port":15692,"socket_opts":{"cowboy_opts":{"sendfile":false},"port":15692,"protocol":"http/prometheus"}},{"node":"rabbit@rabbitmq-inf-p-03","protocol":"http/prometheus","ip_address":"::","port":15692,"socket_opts":{"cowboy_opts":{"sendfile":false},"port":15692,"protocol":"http/prometheus"}}],"contexts":[{"ssl_opts":[],"node":"rabbit@rabbitmq-inf-p-01","description":"RabbitMQ Management","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15672"},{"ssl_opts":[],"node":"rabbit@rabbitmq-inf-p-02","description":"RabbitMQ Management","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15672"},{"ssl_opts":[],"node":"rabbit@rabbitmq-inf-p-03","description":"RabbitMQ Management","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15672"},{"ssl_opts":[],"node":"rabbit@rabbitmq-inf-p-01","description":"RabbitMQ Prometheus","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15692","protocol":"'http/prometheus'"},{"ssl_opts":[],"node":"rabbit@rabbitmq-inf-p-02","description":"RabbitMQ Prometheus","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15692","protocol":"'http/prometheus'"},{"ssl_opts":[],"node":"rabbit@rabbitmq-inf-p-03","description":"RabbitMQ Prometheus","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15692","protocol":"'http/prometheus'"}]} |
Comment by Igor Shekhanov [ 2022 Feb 17 ] |
Some of the source data used by the Cluster template comes from the master items "RabbitMQ: Get overview" and "RabbitMQ: Get exchanges". |
Comment by Vladislavs Sokurenko [ 2022 Feb 23 ] |
This is happening due to |
Comment by psychomoise [ 2022 Jun 02 ] |
I have the same kind of behaviour (one preprocessing worker using 100% of a core, and the others also using a lot of CPU), but I am not monitoring RabbitMQ; it is mostly network devices (switches, firewalls, Cisco WLC, plus some Windows servers and a few Linux ones at the moment). So there is a lot of preprocessing: many WMI and SNMP items, and a lot of discovery done mainly with the templates provided by Zabbix plus a few of our own, where many items have preprocessing such as "discard unchanged with heartbeat" and JSONPath extraction. I do not think the issue is linked to Prometheus; the real question is why only a very few preprocessing workers are working hard while the others receive almost none of the load, which creates a large queue. |
Comment by Andris Zeila [ 2022 Jun 02 ] |
All direct dependent items of one master item are now processed by one worker. This was done to implement Prometheus (and other preprocessing) caching and to reduce the overhead of sending the same master item value to a worker once for each dependent item. |
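Here is a minimal Python sketch (purely illustrative, not Zabbix source code) of the scheduling behaviour described above: when every dependent item is routed to the worker that owns its master item, a single master with thousands of dependents keeps one worker saturated while the rest stay almost idle. The modulo routing rule, the item counts, and the example item id are assumptions chosen to mirror the reports in this thread.

```python
from collections import Counter

NUM_WORKERS = 8   # eight preprocessing workers, as in the reporter's startup log

def worker_for(master_itemid: int) -> int:
    """Hypothetical pinning rule: every dependent item goes to the worker
    that owns its master item, so the master value is sent only once."""
    return master_itemid % NUM_WORKERS

# A host with ~8000 dependent items hanging off one master item (roughly the
# RabbitMQ cluster host reported above), plus twenty small masters with one
# dependent each; 98382 is just an example id borrowed from the diaginfo output.
dependent_masters = [98382] * 8000 + list(range(100, 120))

load = Counter(worker_for(m) for m in dependent_masters)
print(load.most_common())
# -> one worker ends up with ~8000 tasks while the rest get 2-3 each
```

This matches the ps output posted earlier in the thread, where preprocessing worker #1 runs at ~99% CPU while workers #3 to #8 sit idle.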
Comment by psychomoise [ 2022 Jun 06 ] |
Hi @Andris Zeila, I was able to solve the issue on my end. It was exactly what you described, and in my case it was linked to the "Elasticsearch Cluster by HTTP" template, which I think is not ideally designed and can create this issue.
To find the cause, I checked what the manager and the busiest preprocessing worker were doing by raising the log level of each of those processes via their PID, using the command "zabbix_proxy -c /etc/zabbix/zabbix_proxy.conf -R log_level_increase=<PID>" (replace <PID> with the PID of the process concerned, which you can find with "ps axf" on the proxy).
To conclude, I think this issue can be worked around temporarily through template changes, but it shows that something in the preprocessing management at the Zabbix process level needs to be reviewed and perhaps reworked. For the reporter of this issue, I would suggest doing the same thing I did on one of the proxies that had the problem, to find which template is the cause, and then checking whether it is the same issue as mine. |
Comment by psychomoise [ 2022 Jun 06 ] |
I think the new approach of the preprocessing manager and workers, i.e. processing all dependent items of one master item on the same preprocessing worker to minimise the workload by not replicating the master item's data across different worker processes, is good. But it means templates have to be adapted to minimise the number of dependent items per master item, so a lot of the templates provided by the Zabbix team need to be adapted to this change. |
Comment by Mads Wallin [ 2022 Jun 10 ] |
This bulk preprocessing makes the official MSSQL ODBC template virtually useless for large environments. https://www.zabbix.com/integrations/mssql#mssql_odbc We're using it for MS SQL servers with several hundred databases each; the preprocessing queue grows indefinitely, never catching up with the incoming data. |
Comment by Tuomo Kuure [ 2022 Jun 10 ] |
Noticed the same issue today as I set up a proxy (v6.0.5) for MSSQL monitoring tests: as soon as you get to 50+ databases per server instance, you start to have problems with this template because the preprocessors can't keep up with the incoming data. |
Comment by psychomoise [ 2022 Jun 10 ] |
@Mads Wallin and @Tuomo Kuure, you will need to do the same kind of thing I did with the Elasticsearch template provided by the Zabbix team: duplicate the item "MSSQL: Get performance counters" into the discovery rule where it is used, and limit its SQL query to the discovered database. Then mass-update each item prototype of that discovery rule (Database discovery) so its master item points to the one you created. After that, force your proxies to reload the configuration to pick up the new discovery configuration, and again one hour later so the configuration of the discovered items is reloaded too.
In "MSSQL by ODBC", Database discovery appears to be the only discovery rule affected. Do the math: with 50 databases you get 14*50 = 700 discovered items, plus the 63 items of the template itself, i.e. 700+63 = 763 dependent items on the same master item "MSSQL: Get performance counters"; on my side, the preprocessing processes were overwhelmed with far fewer than that.
EDIT: after modifying the template, monitor the preprocessing manager and preprocessing worker utilisation in Zabbix, just in case you need to add more preprocessing workers. |
Comment by Tuomo Kuure [ 2022 Jun 10 ] |
To clarify the issue with the preprocessors: it does not matter how many preprocessors I have, as only one is utilized. I'll probably build a custom template as you suggested anyway and cherry-pick the items, since the number of databases across our server cluster is quite a handful. |
Comment by Mads Wallin [ 2022 Jun 10 ] |
This is a workaround we'll consider, although it means we'll have several hundred ODBC connections to each database server from the proxy instead of a single big query. For now, we have downgraded this particular proxy to version 5.4, where everything works as expected, except that it can't connect to both of our Zabbix server instances for redundancy. Really hoping the preprocessing in v6+ is changed to perform similarly to v5.4. |
Comment by psychomoise [ 2022 Jun 10 ] |
@Tuomo Kuure, both issues are linked: the preprocessing worker is saturated because too many dependent items of the same master item are processed on the same worker, and that in turn seems to affect the manager, which for some reason queues the remaining items instead of dispatching them to other preprocessing workers. The queue then appears on the preprocessing manager, or at least that is how I picture what is going on.
If the Zabbix developers fix the dispatching, individual preprocessing workers can still saturate, but the manager will be able to spread items over the other workers until all of them are saturated, and only then will the queue start growing again.
@Mads Wallin, I completely agree; I had the same doubt for my Elasticsearch instances, and querying them heavily is not my preferred choice. In reality, though, any database server that cannot handle a few extra sessions per minute should not be relied on, and if your database service is already at its limit and the monitoring is the drop that tips its availability over, the service was already undersized for the need. If your concern is also about logs, because you might have login auditing enabled in SQL Server, I understand that too: it will add a lot of noise to the logs, and I do not think that can be prevented at this stage. I still encourage you to make the change and perhaps re-test the original template once preprocessing is fixed, but I am fairly sure you will go back to the modified template. Zabbix 6.0 is quite new (only a few months since release); I hope the Zabbix developers will also fix the affected templates so we can keep using the official ones. |
Comment by Alexandr Paliy [ 2022 Jul 10 ] |
Hi all. Just in case: I was considering posting a comment here, but then decided to create a new issue instead, ZBX-21317. In contrast with the OP's situation, after upgrading from 5.2 to 6.0.6 I did not see 100% preprocessing worker utilization, but 100% preprocessing manager utilization instead, causing all existing items' values to get stuck in the preprocessing queue. |
Comment by True Rox [ 2022 Jul 11 ] |
@Mads Wallin, please pleaaaase share your modified MSSQL ODBC template! |
Comment by Sergey [ 2022 Jul 23 ] |
Yep, getting this problem too with 175 databases on an MSSQL server with the MSSQL ODBC template enabled. Version 6.0.6 |
Comment by True Rox [ 2022 Jul 24 ] |
Same here (72 DBs) with the official MSSQL ODBC template! |
Comment by Mads Wallin [ 2022 Jul 25 ] |
Hello True Rox, We have not edited the MSSQL Template. For now, we're still running our big SQL setups with downgraded Zabbix Proxies (Version 5.4.12). |
Comment by Steve [ 2022 Aug 02 ] |
Commenting to follow this thread. We ran into the same issue using the official Elasticsearch template. Would be good to see if anyone has already modified (and can share) a new Elasticsearch template (for Zabbix 6), and/or whether the Zabbix developers are looking to fix this functionality. |
Comment by psychomoise [ 2022 Aug 02 ] |
You will find here my modified template for "Elasticsearch Cluster by HTTP". Before importing it, export your own and use a comparison tool such as WinMerge to view the differences between them. After that, it is up to you whether to import the new template or just reproduce the changes I made. |
Comment by Sergey [ 2022 Aug 02 ] |
Hi psychomoise, maybe you have some template for MSSQL as well? |
Comment by psychomoise [ 2022 Aug 02 ] |
Unfortunately, for SQL Server we are not using this template, as we have no need to look inside SQL Server for detailed information; the view from the outside (which I have largely reworked) is enough for us. I understand that if you have AlwaysOn enabled for some databases, you have almost no choice but to query SQL Server to retrieve the replication status, or at least what SQL Server thinks about the replication.
The idea is the same as what I did for Elasticsearch: find which item is the master of a large number of dependent items and change the logic. Create a discovered item built the same way as the existing one, but with an added filter that reduces the scope of what it retrieves, for example only the statistics of the discovered database instead of all of them, and then change the master item of the discovered dependent items to this new one.
The drawback of this approach is the number of authentications against the SQL Server infrastructure; there will be many more than before. If that becomes too much, lower the frequency of the master item, which should keep the number of authentications under control (if there are any concerns about that). |
Comment by Steve [ 2022 Aug 03 ] |
Thank you so much @psychomoise! I will take a look |
Comment by psychomoise [ 2022 Aug 03 ] |
@Sergey, I have had a quick look at the "MSSQL by ODBC" template on my Zabbix, and it seems a bit worse than the Elasticsearch one: it bundles a lot of data into the same item, including a lot of data that has nothing to do with the rest. The idea would be to split the item "MSSQL: Get performance counters" into multiple items, each containing only one kind of data: one for the databases, one for the instance performance information, and so on. That gives a smaller data set per item, meaning less work for the preprocessing workers, which is always good. For the database information, as I mentioned, you will need to add a filter so that only the information for the database targeted by the database discovery is retrieved. I will see if I have some time to do this and then propose a change after it has been tested. |
Comment by Sergey [ 2022 Aug 05 ] |
@psychomoise, that would be very nice of you |
Comment by Andrew Boling [ 2022 Nov 06 ] |
While improving the templates is a good starting point, I don't think that addresses the underlying problem. Even single-threaded implementations of multitasking are designed with the consideration that a single job has to be given "slices" of work and can't be allowed to exhaust the entire pipeline. I don't think pinning dependent items to a single worker is a bad approach, but I strongly feel that the preprocessing manager needs to be able to swap to other worker threads if one of them becomes saturated by a single item. It would make sense to log a message about the saturated worker (plus the item ID involved) and sacrifice the performance of the bad item in exchange for allowing the rest of the monitoring system to continue running. It would even be possible to create a new internal item that tracks the number of saturated workers so that the problem shows up on the dashboard. I'm aware that this is armchair code design, and it's much easier to talk about these changes than it is to rearchitect the preprocessing pipeline to make them possible. That said, queue-based systems have been around for a long time, and it's well established that queues run best when you take things out of them as quickly as possible. If a saturated worker cannot be swapped out, the new worker-pinning approach should probably be put on hold until the logic in the preprocessing manager can accommodate this. |
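As a rough sketch of the fallback dispatch idea proposed above (only an illustration of the proposal, not how Zabbix is implemented): keep the pinning, but let the manager overflow to the least-loaded worker, and log it, once the pinned worker's backlog crosses a threshold. The threshold value, the queue model, and the item counts below are assumptions.

```python
NUM_WORKERS = 8
OVERFLOW_THRESHOLD = 100          # assumed cut-off for calling a pinned worker "saturated"

queue_len = [0] * NUM_WORKERS     # pending tasks per worker
saturated_logged = set()          # workers we have already warned about

def dispatch(master_itemid: int) -> int:
    """Prefer the pinned worker; overflow to the least-loaded one when it is saturated."""
    pinned = master_itemid % NUM_WORKERS
    if queue_len[pinned] < OVERFLOW_THRESHOLD:
        target = pinned
    else:
        if pinned not in saturated_logged:
            print(f"worker #{pinned + 1} saturated by master item {master_itemid}, overflowing")
            saturated_logged.add(pinned)
        target = min(range(NUM_WORKERS), key=lambda w: queue_len[w])
    queue_len[target] += 1
    return target

# one huge master item (763 dependents, as in the MSSQL example earlier) plus background load
for _ in range(763):
    dispatch(98382)
for mid in range(100, 120):
    dispatch(mid)
print(queue_len)   # the excess work is spread out instead of piling up on one worker
```

The trade-off is the one described above: the master value (and any preprocessing cache) for the overflowing item is no longer confined to one worker, sacrificing some per-item efficiency so the rest of the pipeline keeps moving.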
Comment by Andrew Boling [ 2022 Nov 07 ] |
Here are some observations from when we were troubleshooting the issue in our environment last week:
|
Comment by Jeffrey Descan [ 2022 Nov 17 ] |
Please add 'Apache by Zabbix Agent' to the list of templates to rework. It also uses dependent items on item prototypes. |
Comment by Joseph Jaku [ 2022 Nov 17 ] |
Just to add, because I had major issues around this today: absolutely no data was being stored in the DB. To find the problem itemids, use 'zabbix_server -R diaginfo=preprocessing'. After clearing out the problem items, the server is back up and running smoothly. |
Comment by Mehmet Ali Buyukkarakas [ 2023 Jan 18 ] |
I have the same problem with the MSSQL server template. Neither the Zabbix server nor the Zabbix proxy avoids this problem of 100% utilization of the preprocessing worker/manager. |
Comment by Sergey [ 2023 Jan 18 ] |
Still waiting for the solution to be implemented |
Comment by Robert Evans [ 2023 Feb 18 ] |
We are implementing Zabbix for the first time ourselves, starting with version 6, and have run into this issue right out of the box. We are having the problem with the Kubernetes Nodes by HTTP template. The problems described in this issue are exactly what we are experiencing with the proxy deployed in Kubernetes by the Helm chart: the preprocessing queue only ever increases, only one preprocessing worker seems busy (pegging its core), and all the other workers look idle. We are unable to use Zabbix to monitor Kubernetes as a result of this. |
Comment by Evgenii Gordymov [ 2023 Feb 20 ] |
Hi robermar2, what version were you using? The Kubernetes Nodes by HTTP template has been fixed in a sub-task. |
Comment by Mehmet Ali Buyukkarakas [ 2023 Feb 22 ] |
We need this to be solved immediately. |
Comment by Robert Evans [ 2023 Feb 22 ] |
Hi @egordymov, we ended up manually updating the Kubernetes template to the latest version from the 6.0 branch (we are on 6.0). It solved the problem. Thanks! |
Comment by Alex Kalimulin [ 2023 Feb 22 ] |
Thanks to 6.4 should eliminate this problem completely as it has a brand new internal architecture of preprocessing. |
Comment by Sergey [ 2023 Feb 27 ] |
Hi all. After upgrading the server to 6.2.7 the problem is solved. I have 201 DBs in the MSSQL instance. Will keep watching for a couple of days. Thanks |