[#ZBXNEXT-6624] S.M.A.R.T. monitoring improvement to agent2

[ZBXNEXT-6624] S.M.A.R.T. monitoring improvement to agent2 Created: 2021 Apr 16 Updated: 2024 Apr 10 Resolved: 2022 Feb 18
Status:	Closed
Project:	ZABBIX FEATURE REQUESTS
Component/s:	Agent2 plugin (G)
Affects Version/s:	5.2.6
Fix Version/s:	6.0.1rc1, 6.2.0alpha1, 6.2 (plan)

Type:	New Feature Request
Priority:	Trivial
Reporter:	Chris Stackpole
Assignee:	Maxim Chudinov (Inactive)
Resolution:	Fixed
Votes:
Labels:	None
Remaining Estimate:	Not Specified
Time Spent:	Not Specified
Original Estimate:	Not Specified
Team:	Team INT
Sprint:	Sprint 76 (May 2021), Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022), Sprint 85 (Feb 2022)
Story Points:

Description

Greetings,
I wrote up a longer feedback comparison between the SMART monitoring tools we use now and the new plugin for agent2.

https://www.zabbix.com/forum/zabbix-suggestions-and-feedback/415662-discussion-thread-for-official-zabbix-smart-disk-monitoring

After some time to think about it with my team, the one thing that the team isn't willing to give up is the drive exit codes. From the smartctl man page:

EXIT STATUS
       The exit statuses of smartctl are defined by a bitmask.  If all is
       well with the disk, the exit status (return  value)  of
       smartctl is 0 (all bits turned off).  If a problem occurs, or an
       error, potential error, or fault is detected, then a non-
       zero status is returned.  In this case, the eight different bits in
       the exit status have the following  meanings  for  ATA
       disks; some of these values may also be returned for SCSI disks.       

       Bit 0: Command line did not parse.
       Bit 1: Device  open failed, device did not return an IDENTIFY DEVICE
              structure, or device is in a low-power mode (see '-n'
              option above).
       Bit 2: Some SMART or other ATA command to the disk failed, or there
              was a checksum error in a SMART data  structure  (see '-b'
              option above).
       Bit 3: SMART status check returned "DISK FAILING".
       Bit 4: We found prefail Attributes <= threshold.
       Bit 5: SMART  status  check returned "DISK OK" but we found that some
              (usage or prefail) Attributes have been <= threshold
              at some time in the past.
       Bit 6: The device error log contains records of errors.
       Bit 7: The device self-test log contains records of errors.  [ATA
              only] Failed self-tests outdated by a  newer  successful
              extended self-test are ignored.
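
To illustrate how a combined exit status breaks down into the bits listed above, here is a minimal sketch. The function name `decode_smartctl_status` is hypothetical, and the labels are shortened paraphrases of the man page text rather than official names.

```shell
#!/bin/sh
# Hypothetical helper: print one line per set bit in a smartctl exit status.
# Labels are shortened paraphrases of the man page, not official identifiers.
decode_smartctl_status() {
    val=$1
    if [ "$val" -eq 0 ]; then
        echo "OK: no bits set"
        return 0
    fi
    bit=0
    for label in \
        "command line did not parse" \
        "device open failed, no IDENTIFY DEVICE data, or low-power mode" \
        "a SMART/ATA command failed or checksum error" \
        "SMART status: DISK FAILING" \
        "prefail attributes <= threshold" \
        "attributes were <= threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Shift the wanted bit down to position 0 and test it.
        if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
            echo "Bit $bit: $label"
        fi
        bit=$((bit + 1))
    done
}

# Example: 68 = 64 + 4, so bits 2 and 6 are reported.
decode_smartctl_status 68
```

A single numeric value can therefore carry several independent conditions at once, which is why a plain "greater than 0" check loses information.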

This feature request is to add capturing of, and alerting on, the smartctl exit codes to the agent2 SMART plugin.
Thank you.

Comments

Comment by Chris Stackpole [ 2021 Apr 23 ]

This is a very basic fix that I've implemented on 5.2 against template_module_smart_agent2.yaml.

Here's the diff patch to check the exit status and set a basic alert when it is greater than 0.

113a114,141
>               name: 'SMART [{#NAME}]: Exit Status'
>               type: DEPENDENT
>               key: 'smart.disk.es[{#NAME}]'
>               delay: '0'
>               history: 7d
>               trends: '0'
>               value_type: CHAR
>               application_prototypes:
>                 -
>                   name: '{#DISKTYPE} {#NAME}'
>               preprocessing:
>                 -
>                   type: JSONPATH
>                   parameters:
>                     - '$[?(@.disk_name==''{#NAME}'')].smartctl.exit_status.first()'
>                 -
>                   type: DISCARD_UNCHANGED_HEARTBEAT
>                   parameters:
>                     - 6h
>               master_item:
>                 key: smart.disk.get
>               trigger_prototypes:
>                 -
>                   expression: '{last()}>0'
>                   name: 'Exit status greater than zero'
>                   priority: HIGH
>                   description: 'The exit statuses of smartctl are defined by a bitmask. If all is well with the disk, the exit status (return  value) of smartctl is 0 (all bits turned off). If a problem occurs, or an error, potential error, or fault is detected, then a non-zero status is returned. In this case, the eight different bits in the exit status have the following  meanings for ATA disks; some of these values may also be returned for SCSI disks.'
>             -

Comment by Maxim Chudinov (Inactive) [ 2022 Feb 09 ]

Hello cstackpole
Are you sure exit status 2 also indicates a problem with a disk?
As I have seen, this can be in case of not enough permissions or an old disk that doesn't have SMART.

Comment by Chris Stackpole [ 2022 Feb 09 ]

Greetings @mchudinov ,

It makes sense to me to address your comment in reverse order. I hope that's ok. I also tend to err on the side of caution to give as-complete-as-I-can bug reports so I hope it's not too much.

As I have seen, this can be in case of not enough permissions or an old disk that doesn't have SMART.

Under the "EXIT STATUS" section, the smartctl man page also provides this helpful script to figure out what each exit code maps to (which would be amazing to map for Zabbix instead of my above "anything greater than 0" approach, but it's also a touch more work):

This shell script prints all status bits:
       val=$?; mask=1
       for i in 0 1 2 3 4 5 6 7; do
         echo "Bit $i: $(((val & mask) && 1))"
         mask=$((mask << 1))
       done

Which means that:

$ val=2 ; mask=1; for i in 0 1 2 3 4 5 6 7; do
     echo "Bit $i: $(((val & mask) && 1))"
     mask=$((mask << 1))
  done
Bit 0: 0
Bit 1: 1
Bit 2: 0
Bit 3: 0
Bit 4: 0
Bit 5: 0
Bit 6: 0
Bit 7: 0

Where Bit 1 is:

       Bit 1: Device  open failed, device did not return an IDENTIFY DEVICE
              structure, or device is in a low-power mode (see '-n'
              option above).

Thus, there are multiple documented reasons why you might have seen exit 2 in other situations. Again going back to the man page, under "-n POWERMODE[,STATUS]" it says:

By default, exit status 2 is returned if the device is in one of the specified low-power modes.  This status is also returned if the device open or identification failed.

Thus, exit two can mean:

The device can't be opened - it might be missing.
The device can't be opened - it might be a permissions issue.
The device did not return an IDENTIFY DEVICE structure (aka: really old disk).
The device is in low-power mode.

Are you sure exit status 2 also indicates a problem with a disk?

I've done a lot of work with smartctl, but this is still just my experience and opinion. Overwhelmingly, my answer to your question is "yes". However, I admit the last case is a caveat. Thus, my answer is: maybe.

In the first case, if the drive goes missing - that's a problem. Generally the disk is a dead parrot (insert Monty Python joke here). Or the intern just yanked the wrong disk out of the array (actual story). Either way, bad things are going on with the disk and an error should be flagged.

In the second case, IF this is the first time I'm adding a drive to a host or the first time I'm configuring smartctl on the host - yeah. That's not a disk problem. But if the disk has been reporting correctly for a while and then changes to this? Bad things are going on with the disk. Maybe it is just a permissions problem from a change made on the host, but either way it needs to be flagged as an error so that the admin can look at it and fix it.

In the third case, I've only seen this error with really old disks (so I'm in agreement with your point). The admin then needs to either use another disk or silence this template/item/trigger for that disk, since a disk without SMART isn't going to return anything useful to a SMART monitoring template/item/trigger anyway. However, if a previously functioning disk changes to this status after reporting correctly? Then bad things are going on with the disk and it's worth throwing an error.

In the fourth case, this is a legit return code that is not an error. Of the many hundreds of servers I've been an admin for over the many years I've relied on smartctl, I have one system that is an "I don't know when it will be used so it has to stay responsive all the time with special hardware that I can't virtualize nor integrate easily into other servers". It's got 4 multi-TB disks that can all go into low-power mode, which I do spin down to save energy because the system could sit idle for days (or weeks!) and then be used heavily the day after. This is not a problem with the disks. So I added a special condition trigger for this host to ignore return code 2 for these disks. I still have plenty of other indicators reported back that warn me when one of those disks might be dying/failing. And I still have all the other return codes as viable trigger flags too.
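
As an illustration of that kind of special-case handling (a sketch only, not the actual trigger used on that host), bit 1 can be cleared from the exit status before checking whether anything else is set. The function name `ignore_low_power_bit` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: clear bit 1 (value 2) from a smartctl exit status so
# that low-power mode alone does not count as a problem.
# Caveat: bit 1 is shared with "device open failed" and "no IDENTIFY DEVICE
# structure", so those conditions are hidden as well.
ignore_low_power_bit() {
    echo $(( $1 & ~2 ))
}

ignore_low_power_bit 2    # only bit 1 set: nothing left to alert on
ignore_low_power_bit 66   # bits 1 and 6 set: bit 6 (value 64) remains
```

The remaining value can still be compared against 0 or tested bit by bit, so the other exit status bits stay usable as trigger conditions.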

Personally, I feel that low-power mode is such an oddball use case that it is pretty safe to say that, on a disk that has been running and reporting correctly, an exit code of 2 is an error. Especially as the market trends toward SSD and NVMe, spinning disks that go into low-power mode are more likely to be in the care of admins who can handle their quirks. Thus working disks that change exit status to 2 are MUCH more likely to indicate errors.

Hope that helps.
Thanks!

Comment by Maxim Chudinov (Inactive) [ 2022 Feb 10 ]

Thanks cstackpole
Well, we can do more - create a trigger for each exit status bit. This is possible with bitwise functions. For example, for bit 6 the expression could be:

bitand(last(//smart.disk.es[{#NAME}]),64)=64 and (bitand(last(//smart.disk.es[{#NAME}]),64) > bitand(last(//smart.disk.es[{#NAME}],#2),64))

What do you think?
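
The logic of the expression above can be sketched outside Zabbix as follows: the trigger condition holds only when the bit is set in the latest value and was not set in the previous one. The function name `bit_newly_set` and its argument order are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the trigger expression:
#   bitand(last, MASK) = MASK  and  bitand(last, MASK) > bitand(previous, MASK)
# Usage: bit_newly_set CURRENT PREVIOUS MASK
# Returns success (0) only when the masked bit has just become set.
bit_newly_set() {
    cur=$(( $1 & $3 ))
    prev=$(( $2 & $3 ))
    [ "$cur" -eq "$3" ] && [ "$cur" -gt "$prev" ]
}

# Bit 6 (mask 64) appears for the first time: condition holds.
if bit_newly_set 64 0 64; then echo "bit 6 newly set"; fi

# Bit 6 was already set in the previous value: condition does not hold.
if ! bit_newly_set 64 64 64; then echo "bit 6 unchanged"; fi
```

Because each bit is masked separately, a change in another bit (for example the status going from 64 to 66) does not re-fire the bit 6 condition, while the same call with mask 2 would report bit 1 as newly set.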

Comment by Chris Stackpole [ 2022 Feb 10 ]

I think that is a much better approach than my single >0 approach. Thank you @mchudinov !

Comment by Maxim Chudinov (Inactive) [ 2022 Feb 11 ]

cstackpole
Could you test the changed template https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/module/smart_agent2?at=refs%2Fheads%2Ffeature%2FZBXNEXT-6624-5.5 on your devices?

Comment by Chris Stackpole [ 2022 Feb 17 ]

Sorry for the delay. I have tested the template and things look good so far, but my systems are all healthy right now. I've got a stack of busted drives I'm going to put into a test system to get other exit codes, but it might not happen for a few days.

This is great. Thank you so much!

Comment by Maxim Chudinov (Inactive) [ 2022 Feb 17 ]

Available in:

6.0.1rc1 dafeaaf7af6
6.2.0alpha1 (master) 29af4490402

