[ZBX-16096] system.cpu.util[<core>] key sometimes returns a wrong huge value (like 307343286799559360.000000) Created: 2019 May 08 Updated: 2019 Jun 01 Resolved: 2019 Jun 01 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 3.0.27, 4.0.7, 4.2.1 |
Fix Version/s: | 4.0.10rc1, 4.2.4rc1, 4.4.0alpha1, 4.4 (plan) |
Type: | Problem report | Priority: | Critical |
Reporter: | Oleksii Zagorskyi | Assignee: | Vladislavs Sokurenko |
Resolution: | Fixed | Votes: | 1 |
Labels: | average, cpu, float | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
ZBX server v4.2.0 on (CentOS Linux release 7.6.1810 (Core)) |
Attachments: | .cproject PS-571-testing_agents_P1_P2-logs.tar.gz ZBX-16096-4.0.diff ZBX-16096_1.diff ZBX-16096_2.diff system_ovn1.domain-cat_proc_stat system_ovn2.domain-cat_proc_stat system_ovn3.domain-cat_proc_stat | ||||
Issue Links: |
|
||||
Team: | Team A | ||||
Sprint: | Sprint 52 (May 2019) | ||||
Story Points: | 0.125 |
Description |
We had this issue quite long time already (on versions 3.4.11 or 4.2.1), reporting it just now as it became too annoying. We have 3 servers, having 80 CPU cores and our zabbix_server.log is filled by messages like these: 10639:20190506:111426.753 item "system_ovn3.domain:system.cpu.util[58,user]" became not supported: Value 307445734561825856.000000 is too small or too large. 10648:20190506:111439.777 item "system_ovn2.domain:system.cpu.util[51,user]" became not supported: Value 308010420332435328.000000 is too small or too large. 10639:20190506:111526.913 item "system_ovn3.domain:system.cpu.util[58,user]" became supported 10643:20190506:111539.947 item "system_ovn2.domain:system.cpu.util[51,user]" became supported 10643:20190506:111606.996 item "system_ovn1.domain:system.cpu.util[58,user]" became not supported: Value 307496984059169024.000000 is too small or too large. 10648:20190506:111706.938 item "system_ovn1.domain:system.cpu.util[58,user]" became supported Each such host has ~600 items, where 328 items are different "system.cpu.util*" keys (80*4 - idle,iowait,system,user + a 8 items for whole CPU). We have run 'system.cpu.util[58,user]' key using zabbix_get tool in a loop with 5 seconds delay, and during 5 minutes received the huge value once for one server: # while true; do echo "`date` - `zabbix_get -s system_ovn1.domain -k 'system.cpu.util[58,user]'`"; sleep 5; done Mon May 6 12:58:32 CEST 2019 - 0.083333 Mon May 6 12:58:37 CEST 2019 - 0.083333 Mon May 6 12:58:42 CEST 2019 - 0.083333 Mon May 6 12:58:47 CEST 2019 - 0.083333 Mon May 6 12:58:52 CEST 2019 - 0.083319 Mon May 6 12:58:57 CEST 2019 - 0.016661 Mon May 6 12:59:02 CEST 2019 - 0.016664 Mon May 6 12:59:07 CEST 2019 - 0.033333 Mon May 6 12:59:12 CEST 2019 - 0.016667 Mon May 6 12:59:17 CEST 2019 - 0.016664 Mon May 6 12:59:22 CEST 2019 - 0.000000 Mon May 6 12:59:27 CEST 2019 - 0.000000 Mon May 6 12:59:32 CEST 2019 - 0.000000 Mon May 6 12:59:37 CEST 2019 - 0.000000 Mon May 6 12:59:42 CEST 2019 - 0.000000 Mon May 6 12:59:47 CEST 2019 - 0.000000 Mon May 6 12:59:52 CEST 2019 - 0.000000 Mon May 6 12:59:57 CEST 2019 - 0.000000 Mon May 6 13:00:02 CEST 2019 - 0.000000 Mon May 6 13:00:07 CEST 2019 - 307343286799559360.000000 Mon May 6 13:00:12 CEST 2019 - 0.000000 Mon May 6 13:00:17 CEST 2019 - 0.000000 Mon May 6 13:00:22 CEST 2019 - 0.000000 Mon May 6 13:00:27 CEST 2019 - 0.000000 Mon May 6 13:00:32 CEST 2019 - 0.000000 Mon May 6 13:00:37 CEST 2019 - 0.066678 Mon May 6 13:00:42 CEST 2019 - 0.083319 Mon May 6 13:00:47 CEST 2019 - 0.083333 Mon May 6 13:00:52 CEST 2019 - 0.083333 Yes, we have many items, which you can say are not so useful/required, but zabbix agent should not return a wrong value.
We tried to run "system.cpu.util" key in similar loop for longer period, on all 3 servers, and did not caught the issue. |
Comments |
Comment by Vladislavs Sokurenko [ 2019 May 16 ] |
Zabbix agent takes this information from "/proc/stat" Could you please be so kind and attach it's contentes ? cat /proc/stat Were any signals sent to the Zabbix agent while it is collecting ? |
Comment by Oleksii Zagorskyi [ 2019 May 16 ] |
No reasons to expect that agent could receive any signal. No idea how to reproduce it on demand. It just randomly happens time to time. Attaching requested data from 3 hosts system_ovn1.domain-cat_proc_stat system_ovn2.domain-cat_proc_stat |
Comment by Vladislavs Sokurenko [ 2019 May 16 ] |
User time is calculated as user minus guest, might have something to do with it. |
Comment by Vladislavs Sokurenko [ 2019 May 17 ] |
Please apply one of attached patches if possible and provide logs, ZBX-16096_1.diff should write into log file only if problem is detect while ZBX-16096_2.diff logs user and guest cpu counters each second for every core. Current suspicion is that counters in /proc/stat go backwards for some reason. |
Comment by Vladislavs Sokurenko [ 2019 May 21 ] |
It looks like counters indeed go backwards 31866:20190519:195000.020 user:41233163 guest:38142874 user-guest:3090289 31866:20190519:195001.023 user:41233166 guest:38142878 user-guest:3090288 Guest time increases in this case by 4 but user time was increased only by 3 while guest time must be included in user. |
Comment by Vladislavs Sokurenko [ 2019 May 21 ] |
Not a bug in Zabbix and issue should be solved in latest kernel versions but propose to add special handling of such situation, if current value is bigger than previous then set count to 0. |
Comment by Oleksii Zagorskyi [ 2019 May 23 ] |
So far the final patch works good on 3 affected servers (with many CPU cores). Let me just add that the error message once also happened for "Zabbix server" agent itself, and the host has 8 CPU cores. The "Zabbix server" is running CentOS 7.6 with most recent system updates & kernel: Vitalization related stuff is probably not related here, key point probably is kernel version. |
Comment by Vladislavs Sokurenko [ 2019 May 23 ] |
Kernel 3.10 was released 6 years ago, I don't have exact bug report or commit but the issue is probably fixed somewhere along those years. |
Comment by Vladislavs Sokurenko [ 2019 May 31 ] |
Fixed in:
|