[ZBX-16096] system.cpu.util[<core>] key sometimes returns a wrong huge value (like 307343286799559360.000000) Created: 2019 May 08  Updated: 2019 Jun 01  Resolved: 2019 Jun 01

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 3.0.27, 4.0.7, 4.2.1
Fix Version/s: 4.0.10rc1, 4.2.4rc1, 4.4.0alpha1, 4.4 (plan)

Type: Problem report Priority: Critical
Reporter: Oleksii Zagorskyi Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 1
Labels: average, cpu, float
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ZBX server v4.2.0 on (CentOS Linux release 7.6.1810 (Core))
ZBX agent v4.2.1 (previously v3.4.11 or v3.4.14) on (CentOS Linux release 7.5.1804 (Core))


Attachments: File .cproject     File PS-571-testing_agents_P1_P2-logs.tar.gz     File ZBX-16096-4.0.diff     File ZBX-16096_1.diff     File ZBX-16096_2.diff     File system_ovn1.domain-cat_proc_stat     File system_ovn2.domain-cat_proc_stat     File system_ovn3.domain-cat_proc_stat    
Issue Links:
Duplicate
Team: Team A
Sprint: Sprint 52 (May 2019)
Story Points: 0.125

 Description   

We had this issue quite long time already (on versions 3.4.11 or 4.2.1), reporting it just now as it became too annoying.

We have 3 servers, having 80 CPU cores and our zabbix_server.log is filled by messages like these:

 10639:20190506:111426.753 item "system_ovn3.domain:system.cpu.util[58,user]" became not supported: Value 307445734561825856.000000 is too small or too large.
 10648:20190506:111439.777 item "system_ovn2.domain:system.cpu.util[51,user]" became not supported: Value 308010420332435328.000000 is too small or too large.
 10639:20190506:111526.913 item "system_ovn3.domain:system.cpu.util[58,user]" became supported
 10643:20190506:111539.947 item "system_ovn2.domain:system.cpu.util[51,user]" became supported
 10643:20190506:111606.996 item "system_ovn1.domain:system.cpu.util[58,user]" became not supported: Value 307496984059169024.000000 is too small or too large.
 10648:20190506:111706.938 item "system_ovn1.domain:system.cpu.util[58,user]" became supported

Each such host has ~600 items, where 328 items are different "system.cpu.util*" keys (80*4 - idle,iowait,system,user + a 8 items for whole CPU).
Update interval is 1m for all the "system.cpu.util" items.
Also each host has 3 "system.cpu.load[percpu,avg5]" items (for avg1 avg5 avg15).

We have run 'system.cpu.util[58,user]' key using zabbix_get tool in a loop with 5 seconds delay, and during 5 minutes received the huge value once for one server:

#  while true; do echo "`date` - `zabbix_get -s system_ovn1.domain -k 'system.cpu.util[58,user]'`"; sleep 5; done

Mon May  6 12:58:32 CEST 2019 - 0.083333
Mon May  6 12:58:37 CEST 2019 - 0.083333
Mon May  6 12:58:42 CEST 2019 - 0.083333
Mon May  6 12:58:47 CEST 2019 - 0.083333
Mon May  6 12:58:52 CEST 2019 - 0.083319
Mon May  6 12:58:57 CEST 2019 - 0.016661
Mon May  6 12:59:02 CEST 2019 - 0.016664
Mon May  6 12:59:07 CEST 2019 - 0.033333
Mon May  6 12:59:12 CEST 2019 - 0.016667
Mon May  6 12:59:17 CEST 2019 - 0.016664
Mon May  6 12:59:22 CEST 2019 - 0.000000
Mon May  6 12:59:27 CEST 2019 - 0.000000
Mon May  6 12:59:32 CEST 2019 - 0.000000
Mon May  6 12:59:37 CEST 2019 - 0.000000
Mon May  6 12:59:42 CEST 2019 - 0.000000
Mon May  6 12:59:47 CEST 2019 - 0.000000
Mon May  6 12:59:52 CEST 2019 - 0.000000
Mon May  6 12:59:57 CEST 2019 - 0.000000
Mon May  6 13:00:02 CEST 2019 - 0.000000
Mon May  6 13:00:07 CEST 2019 - 307343286799559360.000000
Mon May  6 13:00:12 CEST 2019 - 0.000000
Mon May  6 13:00:17 CEST 2019 - 0.000000
Mon May  6 13:00:22 CEST 2019 - 0.000000
Mon May  6 13:00:27 CEST 2019 - 0.000000
Mon May  6 13:00:32 CEST 2019 - 0.000000
Mon May  6 13:00:37 CEST 2019 - 0.066678
Mon May  6 13:00:42 CEST 2019 - 0.083319
Mon May  6 13:00:47 CEST 2019 - 0.083333
Mon May  6 13:00:52 CEST 2019 - 0.083333

Yes, we have many items, which you can say are not so useful/required, but zabbix agent should not return a wrong value.

 

We tried to run "system.cpu.util" key in similar loop for longer period, on all 3 servers, and did not caught the issue.
Maybe we were just not luck enough, who knows ...



 Comments   
Comment by Vladislavs Sokurenko [ 2019 May 16 ]

Zabbix agent takes this information from "/proc/stat"

Could you please be so kind and attach it's contentes ?

cat /proc/stat

Were any signals sent to the Zabbix agent while it is collecting ?
Is it possible to reproduce issue on demand ?

Comment by Oleksii Zagorskyi [ 2019 May 16 ]

No reasons to expect that agent could receive any signal.

No idea how to reproduce it on demand. It just randomly happens time to time.

Attaching requested data from 3 hosts 

system_ovn1.domain-cat_proc_stat

system_ovn2.domain-cat_proc_stat
system_ovn3.domain-cat_proc_stat

Comment by Vladislavs Sokurenko [ 2019 May 16 ]

User time is calculated as user minus guest, might have something to do with it.

Comment by Vladislavs Sokurenko [ 2019 May 17 ]

Please apply one of attached patches if possible and provide logs, ZBX-16096_1.diff should write into log file only if problem is detect while ZBX-16096_2.diff logs user and guest cpu counters each second for every core.

Current suspicion is that counters in /proc/stat go backwards for some reason.

Comment by Vladislavs Sokurenko [ 2019 May 21 ]

It looks like counters indeed go backwards

 31866:20190519:195000.020 user:41233163 guest:38142874 user-guest:3090289
 31866:20190519:195001.023 user:41233166 guest:38142878 user-guest:3090288

Guest time increases in this case by 4 but user time was increased only by 3 while guest time must be included in user.

Comment by Vladislavs Sokurenko [ 2019 May 21 ]

Not a bug in Zabbix and issue should be solved in latest kernel versions but propose to add special handling of such situation, if current value is bigger than previous then set count to 0.

Comment by Oleksii Zagorskyi [ 2019 May 23 ]

So far the final patch works good on 3 affected servers (with many CPU cores).

Let me just add that the error message once also happened for "Zabbix server" agent itself, and the host has 8 CPU cores.
It's a VM GUEST (with 8 CPU cores) running on the "system_ovn..." servers (HOST) - which are serving as RH oVirt hosts.

The "Zabbix server" is running CentOS 7.6 with most recent system updates & kernel:
"Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux"

Vitalization related stuff is probably not related here, key point probably is kernel version.

Comment by Vladislavs Sokurenko [ 2019 May 23 ]

Kernel 3.10 was released 6 years ago, I don't have exact bug report or commit but the issue is probably fixed somewhere along those years.
Lots of changes here

Comment by Vladislavs Sokurenko [ 2019 May 31 ]

Fixed in:

  • pre-4.0.9rc1 7d116034b1e
  • pre-4.2.3rc1 b653e0b050f
  • pre-4.4.0alpha1 (trunk) 0a265340271
Generated at Fri Mar 29 10:21:10 EET 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.