[ZBX-19284] Bad values in the agent2 SMART template Created: 2021 Apr 24  Updated: 2024 Apr 10  Resolved: 2021 Oct 07

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent2 plugin (G), Templates (T)
Affects Version/s: 5.2.6, 5.4.2
Fix Version/s: 5.0.17rc1, 5.4.6rc1, 6.0.0alpha4, 6.0 (plan)

Type: Problem report
Priority: Major
Reporter: Chris Stackpole
Assignee: Maxim Chudinov (Inactive)
Resolution: Fixed
Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Collected smart.png     PNG File Real smart.png    
Team: Team INT
Sprint: Sprint 80 (Sep 2021), Sprint 81 (Oct 2021)
Story Points: 0.5

 Description   

Greetings,
I found an issue and am recounting my journey in attempting to fix it. It's possible that there's a better way and I thought it might be useful to know how I got there.

I've been investigating the use of the SMART template with agent2 on several test hosts. While working on improving the SMART template today, I noticed something very odd: a bunch of disks that I got around the same time ~2 years ago were all reporting the EXACT same "Power on hours".
Last value for "SMART [sdc sat]: ID 9 Power_On_Hours" is '85'.
Last value for "SMART [sdc sat]: Power on hours" is "3h 47m 27s".
I checked the graph, which I assumed would climb as power-on hours increase, but instead I found a flat line!

Well - that's not right. And the drives shouldn't all be the same either. Are other items wrong too?

I went looking at the hard drive itself. This is straight from the smartctl program:

9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 13647 (130 84 0)

Ah. That makes sense. The template is pulling back the "VALUE" column, which is not at all what is needed here. What is needed is the "RAW_VALUE". Which means my suspicion that the other values are wrong too is probably correct.
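To make the VALUE / RAW_VALUE distinction concrete, here is a minimal sketch that splits the smartctl row quoted above into its columns. The column order (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) is the one smartctl -A prints; the helper name parseAttributeRow is hypothetical and is not part of the template or the agent2 plugin.

```javascript
// Hypothetical helper: split one `smartctl -A` attribute row into named columns.
// The first nine columns are single tokens; everything after them is RAW_VALUE,
// which may itself contain spaces, e.g. "13647 (130 84 0)".
function parseAttributeRow(row) {
  var fields = row.trim().split(/\s+/);
  return {
    id: parseInt(fields[0], 10),
    name: fields[1],
    flag: fields[2],
    value: parseInt(fields[3], 10),  // normalized VALUE, what the template was reading
    worst: parseInt(fields[4], 10),
    thresh: parseInt(fields[5], 10),
    type: fields[6],
    updated: fields[7],
    whenFailed: fields[8],
    raw: fields.slice(9).join(" ")   // RAW_VALUE, what is actually wanted here
  };
}

var attr = parseAttributeRow(
  "9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 13647 (130 84 0)"
);
```

Here attr.value is the normalized 85 that showed up in the item, while attr.raw carries the real hour count.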

What does that look like in JSON? (snipping only the relevant bit)

{
  "id": 9,
  "name": "Power_On_Hours",
  "value": 85,
  "worst": 85,
  "thresh": 0,
  "when_failed": "",
  "flags": {
    "value": 50,
    "string": "-O--CK ",
    "prefailure": false,
    "updated_online": true,
    "performance": false,
    "error_rate": false,
    "event_count": true,
    "auto_keep": true
  },
  "raw": {
    "value": 39973260637519,
    "string": "13647 (36 91 0)"
  }
},

Capturing the normalized "value" isn't as useful for monitoring as the raw value. Now to update the template template_module_smart_agent2.yaml:

53c53
<                     - '$[?(@.disk_name==''{#NAME}'')].ata_smart_attributes.table[?(@.id=={#ID})].value.first()'
---
>                     - '$[?(@.disk_name==''{#NAME}'')].ata_smart_attributes.table[?(@.id=={#ID})].raw.value.first()'

And a recheck...

Well, some values are much better. For example, my disk is NOT running at 60 degrees C! But it is at 40C. That's a plus.

190 Airflow_Temperature_Cel 0x0022 060 052 040 Old_age Always - 40 (Min/Max 33/43)

In fact, a lot of the values make more sense. Better yet, they aren't all the same across every disk!

However, Power_On_Hours is still not right. It went from "85" to "19297288074576" when it should be "13647". And I also broke other fields, as my Temperature_Celsius (which has the same temp as Airflow_Temperature_Cel) is showing "107374182440"!!!

So clearly something is busted in my fix. Investigating the SMART JSON output again, sure enough: for some attributes the raw "value" matches the "string", but for others it really is a raw, undecoded number. Which means the string might be the better field to read in. Though, I have no idea how to deal with multiple value types inside a single discovery item. And I don't really want to create a new value for every potential item here.
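As an aside on why those huge numbers appear: for the sample values in this report, the arithmetic suggests that raw.value is a packed integer whose low bits hold the human-readable number and whose upper bytes hold the extra figures shown in parentheses. The sketch below only checks that observation against the two values quoted here; it is not a general decoding rule, since smartctl chooses the raw format per attribute and vendor. The helper names lowBits and upperBytes are hypothetical.

```javascript
// Keep only the lowest `bits` bits of a non-negative integer below 2^53.
// Plain arithmetic is used because JavaScript's bitwise operators truncate
// their operands to 32 bits, and these raw values are wider than that.
function lowBits(rawValue, bits) {
  return rawValue % Math.pow(2, bits);
}

// Return the three bytes sitting directly above the low 24 bits, highest first.
function upperBytes(rawValue) {
  var upper = Math.floor(rawValue / Math.pow(2, 24));
  return [
    Math.floor(upper / 65536) % 256,
    Math.floor(upper / 256) % 256,
    upper % 256
  ];
}

// raw.value for Power_On_Hours, whose raw.string is "13647 (36 91 0)"
var powerOnHours = lowBits(39973260637519, 24);
var powerOnExtra = upperBytes(39973260637519);

// Value reported for Temperature_Celsius, whose raw string is "40 (0 25 0 0 0)"
var temperature = lowBits(107374182440, 24);
```

Since the layout differs between attributes and drives, reading raw.string avoids having to guess it.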

Here's my attempt to fix it again in template_module_smart_agent2.yaml:

53c52,63
<                     - '$[?(@.disk_name==''{#NAME}'')].ata_smart_attributes.table[?(@.id=={#ID})].value.first()'
---
>                     - '$[?(@.disk_name==''{#NAME}'')].ata_smart_attributes.table[?(@.id=={#ID})].raw.string.first()'
>                 -
>                   type: JAVASCRIPT
>                   parameters:
>                     - |
>                       var parsed = value.split(" ").filter(function(e){ return e === 0 || e })[0];
>                       parsed = parsed.split("+").filter(function(e){ return e === 0 || e })[0];
>                       return parsed
>                 -
>                   type: RTRIM
>                   parameters:
>                     - h

Is it the right fix? :shrug: but it works. I'm getting correct values.

Zabbix isn't displaying "SMART [sdc sat]: Power on hours" correctly because it reads the value in as seconds, and changing the unit to hours just gives "13.65 Kh" instead of translating it into years/months/days/hours. I just removed the unit so that the raw value is displayed, but maybe there's a better solution there too. However, everything else is looking much better.
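For illustration, the breakdown of an hour count into larger units could look like the sketch below. It assumes 24-hour days and 365-day years, and the function name formatPowerOnHours is hypothetical; this is not how Zabbix formats units internally, just the arithmetic behind the display I was hoping for.

```javascript
// Hypothetical formatter: turn a whole number of power-on hours into "Ny Nd Nh".
// Assumes 24-hour days and 365-day years (no leap years, no months).
function formatPowerOnHours(hours) {
  var HOURS_PER_DAY = 24;
  var DAYS_PER_YEAR = 365;
  var days = Math.floor(hours / HOURS_PER_DAY);
  var years = Math.floor(days / DAYS_PER_YEAR);
  return years + "y " + (days % DAYS_PER_YEAR) + "d " + (hours % HOURS_PER_DAY) + "h";
}

var readable = formatPowerOnHours(13647);  // "1y 203d 15h"
```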

Hopefully this helps others with this template issue.

 Comments   
Comment by Chris Stackpole [ 2021 Apr 24 ]

I forgot to mention: the reason for the double filter in the JavaScript is that items like the temperature return "40 (0 25 0 0 0)" while power on hours returns "13555h+05m+21.913s". Thus, I'm only grabbing the most significant part. However, without breaking out each attribute into its own item, I'm not sure of a better way of handling it.
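The combined effect of the two splits and the RTRIM step can be sketched as one standalone function. This is a simplified illustration run outside Zabbix, not the template code itself: the name leadingNumber is hypothetical, and the regular expression stands in for the RTRIM preprocessing step with parameter h.

```javascript
// Hypothetical stand-in for the preprocessing chain: keep only the leading
// token of a SMART raw string and drop a trailing "h" if there is one.
function leadingNumber(rawString) {
  // First non-empty space-separated token: "40 (0 25 0 0 0)" -> "40"
  var token = rawString.split(" ").filter(function (e) { return e !== ""; })[0];
  // First "+"-separated part: "13555h+05m+21.913s" -> "13555h"
  token = token.split("+")[0];
  // Same effect as the RTRIM step with "h": "13555h" -> "13555"
  return token.replace(/h+$/, "");
}

var samples = [
  "13647 (36 91 0)",
  "40 (0 25 0 0 0)",
  "13555h+05m+21.913s"
].map(leadingNumber);
```

The result is still a string, and an empty or whitespace-only input would make this sketch throw, so it only covers the formats quoted in this report.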

Comment by Chris Stackpole [ 2021 Apr 24 ]

Dah! This was actually reported in the forums, however, I didn't find anything when I did a bug search. And it is obviously not fixed yet. I apologize if I missed an original bug report somewhere.

https://www.zabbix.com/forum/zabbix-suggestions-and-feedback/415662-discussion-thread-for-official-zabbix-smart-disk-monitoring

Comment by Aleksey Volodin [ 2021 Jul 07 ]

Steps to reproduce:

  1. Install agent2 on host
  2. Add template SMART by Zabbix agent 2 to host
  3. Add zabbix ALL=(ALL) NOPASSWD:/usr/sbin/smartctl to /etc/sudoers on host
  4. Perform LLD
  5. Wait for metrics to be collected

Result:

SMART [sda sat]: ID 194 Temperature_Celsius 2021-07-07 11:42:09 53

Expected:

SMART [sda sat]: ID 194 Temperature_Celsius 2021-07-07 11:42:09 47

Possible reason:

Data was read from VALUE instead of RAW_VALUE.

Comment by Maxim Chudinov (Inactive) [ 2021 Sep 28 ]

Hello cstackpole. Your exploration is very helpful.

Unfortunately, not all vendors and disk models provide the raw values in the string format you've shown:

"raw": {
   "value": 39973260637519,
   "string": "13647 (36 91 0)"
 }

I have also seen ID#9 Power_On_Hours in the format

"raw": {
   "value": 21478,
   "string": "21478"
}

or

"raw": {
  "value": 13305293286956522,
  "string": "24042h+51m+37.880s"
} 

and ID#194 Temperature_Celsius in format

"raw": {
  "value": 24,
  "string": "24"
}

We don't know all possible variants, so I would suggest just adding an item that takes the raw data from

ata_smart_attributes.table[?(@.id=={#ID})].raw.string.first()

as a string value, without JavaScript preprocessing. Do you agree?

Comment by Chris Stackpole [ 2021 Sep 28 ]

Adding it as raw string is probably safer. I agree. Thanks!

Comment by Maxim Chudinov (Inactive) [ 2021 Oct 01 ]

Available in:

Generated at Fri May 02 06:59:10 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.