[ZBXNEXT-6624] S.M.A.R.T. monitoring improvement to agent2 Created: 2021 Apr 16 Updated: 2024 Apr 10 Resolved: 2022 Feb 18 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Agent2 plugin (G) |
Affects Version/s: | 5.2.6 |
Fix Version/s: | 6.0.1rc1, 6.2.0alpha1, 6.2 (plan) |
Type: | New Feature Request | Priority: | Trivial |
Reporter: | Chris Stackpole | Assignee: | Maxim Chudinov (Inactive) |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Team: | |
Sprint: | Sprint 76 (May 2021), Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022), Sprint 85 (Feb 2022) |
Story Points: | 3 |
Description |
Greetings, After some time to think about it with my team, the one thing that the team isn't willing to give up is the drive exit codes. From the smartctl man page:

EXIT STATUS
The exit statuses of smartctl are defined by a bitmask. If all is well with the disk, the exit status (return value) of smartctl is 0 (all bits turned off). If a problem occurs, or an error, potential error, or fault is detected, then a non-zero status is returned. In this case, the eight different bits in the exit status have the following meanings for ATA disks; some of these values may also be returned for SCSI disks.

Bit 0: Command line did not parse.
Bit 1: Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode (see '-n' option above).
Bit 2: Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure (see '-b' option above).
Bit 3: SMART status check returned "DISK FAILING".
Bit 4: We found prefail Attributes <= threshold.
Bit 5: SMART status check returned "DISK OK" but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
Bit 6: The device error log contains records of errors.
Bit 7: The device self-test log contains records of errors. [ATA only] Failed self-tests outdated by a newer successful extended self-test are ignored.

This feature request is to add capturing and alerting to the exit codes for the agent2 SMART plugin. |
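For illustration only, the bitmask described above could be decoded as in the following Python sketch. The bit descriptions are paraphrased from the man page excerpt in this ticket; the names `SMARTCTL_EXIT_BITS` and `decode_exit_status` are hypothetical and are not part of smartctl, Zabbix, or the agent2 plugin.

```python
from typing import List

# Bit meanings paraphrased from the smartctl man page text quoted in this ticket.
# The dictionary and function names are illustrative assumptions.
SMARTCTL_EXIT_BITS = {
    0: "Command line did not parse",
    1: "Device open failed, no IDENTIFY DEVICE structure, or device in low-power mode",
    2: "A SMART or other ATA command failed, or checksum error in a SMART data structure",
    3: 'SMART status check returned "DISK FAILING"',
    4: "Prefail Attributes <= threshold",
    5: 'SMART status "DISK OK" but some Attributes were <= threshold in the past',
    6: "Device error log contains records of errors",
    7: "Device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> List[str]:
    """Return one description line for every bit set in a smartctl exit status."""
    return [
        f"Bit {bit}: {text}"
        for bit, text in SMARTCTL_EXIT_BITS.items()
        if status & (1 << bit)  # test each bit independently
    ]


if __name__ == "__main__":
    # 68 = 64 + 4, so bits 2 and 6 are set.
    for line in decode_exit_status(68):
        print(line)
```

An exit status of 0 yields an empty list, which matches the "all bits turned off" case in the man page.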
Comments |
Comment by Chris Stackpole [ 2021 Apr 23 ] |
This is a very basic fix that I've implemented on 5.2 against template_module_smart_agent2.yaml. Here's the diff patch to check the exit status and set a basic alert when it is greater than 0.
113a114,141
> name: 'SMART [{#NAME}]: Exit Status'
> type: DEPENDENT
> key: 'smart.disk.es[{#NAME}]'
> delay: '0'
> history: 7d
> trends: '0'
> value_type: CHAR
> application_prototypes:
> -
> name: '{#DISKTYPE} {#NAME}'
> preprocessing:
> -
> type: JSONPATH
> parameters:
> - '$[?(@.disk_name==''{#NAME}'')].smartctl.exit_status.first()'
> -
> type: DISCARD_UNCHANGED_HEARTBEAT
> parameters:
> - 6h
> master_item:
> key: smart.disk.get
> trigger_prototypes:
> -
> expression: '{last()}>0'
> name: 'Exit status greater than zero'
> priority: HIGH
> description: 'The exit statuses of smartctl are defined by a bitmask. If all is well with the disk, the exit status (return value) of smartctl is 0 (all bits turned off). If a problem occurs, or an error, potential error, or fault is detected, then a non-zero status is returned. In this case, the eight different bits in the exit status have the following meanings for ATA disks; some of these values may also be returned for SCSI disks.'
> - |
Comment by Maxim Chudinov (Inactive) [ 2022 Feb 09 ] |
Hello cstackpole |
Comment by Chris Stackpole [ 2022 Feb 09 ] |
Greetings @mchudinov , It makes sense to me to address your comment in reverse order. I hope that's ok. I also tend to err on the side of caution to give as-complete-as-I-can bug reports so I hope it's not too much.
Under the "EXIT STATUS" section, the smartctl man page also provides this helpful script to figure out what each exit code maps to (which would be amazing to map for Zabbix instead of my above "anything greater than 0" approach, but it's also a touch more work):

This shell script prints all status bits:

    val=$?; mask=1
    for i in 0 1 2 3 4 5 6 7; do
      echo "Bit $i: $(((val & mask) && 1))"
      mask=$((mask << 1))
    done

Which means that:

    $ val=2 ; mask=1; for i in 0 1 2 3 4 5 6 7; do
        echo "Bit $i: $(((val & mask) && 1))"
        mask=$((mask << 1))
      done
    Bit 0: 0
    Bit 1: 1
    Bit 2: 0
    Bit 3: 0
    Bit 4: 0
    Bit 5: 0
    Bit 6: 0
    Bit 7: 0

Where Bit 1 is:

    Bit 1: Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode (see '-n' option above).

Thus, there are multiple documented reasons why you might have seen exit 2 in other situations. Again going back to the man page, under "-n POWERMODE[,STATUS]" it says:
By default, exit status 2 is returned if the device is in one of the specified low-power modes. This status is also returned if the device open or identification failed. Thus, exit two can mean:
I've done a lot of work with smartctl, but these are still my experience and opinions. Overwhelmingly, my answer to your question is a "yes". However, I admit the last case is a caveat. Thus, my answer is: maybe.

In the first case, if the drive goes missing - that's a problem. Generally the disk is a dead parrot (insert Monty Python joke here). Or the intern just yanked the wrong disk out of the array (actual story). Either way, bad things are going on with the disk and an error should be flagged.

In the second case, IF this is the first time I'm adding a drive to a host or the first time I'm configuring smartctl on the host - yeah. That's not a disk problem. But if the disk has been reporting correctly for a while and then changes to this? Bad things are going on with the disk. Maybe it is just a permissions problem from a change made on the host, but it should be flagged as a problem. Either way, it's something that needs to be flagged as an error so that the admin can look at it and fix it.

In the third case, I've only seen this error with really old disks (thus, I'm in agreement with your point). The admin either needs to make the decision to use another disk or silence this template/item/trigger for this disk, as a disk without SMART isn't going to return anything useful to a SMART monitor template/item/trigger anyway. However, if a previously functioning disk changes status to this after reporting correctly? Then bad things are going on with the disk and it's worth throwing an error.

In the fourth case, this is a legit return code that is not an error. Of the many hundreds of servers I've been an admin for over the many years I've relied on smartctl, I have one system that is a "I don't know when it will be used so it has to stay responsive all the time with special hardware that I can't virtualize nor integrate easily into other servers". It's got 4 multi-TB disks that can all go into low-power mode, which I do spin down to save energy because the system could sit idle for days (or weeks!) and then be used heavily the day after. This is not a problem with the disks. So I added a special condition trigger for this host to ignore return code 2 for these disks. I still have plenty of other indicators that are reported back that give me warning that one of those disks might be dying/failing. And I still have all the other return codes as viable trigger flags too.

Personally, I feel that low-power mode is such an odd-ball use case that it is pretty safe to say that, on a disk that has been running and reporting correctly, an exit code of 2 is an error. Especially as the market trends toward SSD and NVMe - spinning disks that go into low-power mode are more likely to be in the care of admins who can handle their quirks. Thus working disks that change exit status to 2 are MUCH more likely to indicate errors. Hope that helps. |
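As a rough sketch of the per-host exception described in this comment, the check below treats any non-zero exit status as a problem, but lets a host opt out of bit 1 (exit status 2) when its disks are expected to spin down. The function name `is_problem` and the `allow_low_power` flag are hypothetical; this is not how the Zabbix template implements it.

```python
# Bit 1 of the smartctl exit status (value 2): device open failed,
# no IDENTIFY DEVICE structure, or device in a low-power mode.
BIT1_OPEN_FAILED_OR_LOW_POWER = 1 << 1


def is_problem(status: int, allow_low_power: bool = False) -> bool:
    """Return True if the exit status should raise an alert.

    allow_low_power is a hypothetical per-host switch: when True, bit 1 is
    masked out before the check, so a spun-down disk does not alert.
    """
    if allow_low_power:
        status &= ~BIT1_OPEN_FAILED_OR_LOW_POWER  # clear bit 1 only
    return status != 0


if __name__ == "__main__":
    print(is_problem(2))                         # default host
    print(is_problem(2, allow_low_power=True))   # host with spun-down disks
    print(is_problem(66, allow_low_power=True))  # bit 6 is still reported
```

Masking bit 1 also hides the "device open failed" cases on that host, so this only makes sense where other indicators still cover a missing or dying disk, as the comment notes.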
Comment by Maxim Chudinov (Inactive) [ 2022 Feb 10 ] |
Thanks cstackpole
bitand(last(//smart.disk.es[{#NAME}]),64)=64 and (bitand(last(//smart.disk.es[{#NAME}]),64) > bitand(last(//smart.disk.es[{#NAME}],#2),64))
What do you think? |
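Read literally, the trigger expression above is true only when bit 6 (value 64) is set in the latest value and was not set in the previous value, so it fires on the transition rather than on every poll. A minimal Python sketch of that same comparison follows; the function name `bit_newly_set` and its parameters are illustrative and do not come from Zabbix.

```python
def bit_newly_set(current: int, previous: int, mask: int = 64) -> bool:
    """Mirror the trigger logic: the masked bit is set now and was not set before.

    current  corresponds to last(...) in the expression,
    previous corresponds to last(...,#2),
    mask     defaults to 64, i.e. bit 6 (device error log contains errors).
    """
    now = current & mask
    before = previous & mask
    return now == mask and now > before


if __name__ == "__main__":
    print(bit_newly_set(64, 0))   # bit 6 just appeared
    print(bit_newly_set(64, 64))  # bit 6 was already set
    print(bit_newly_set(66, 2))   # other bits do not affect the check
```

Passing a different `mask` (for example 2 for bit 1) applies the same transition test to another exit-status bit.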
Comment by Chris Stackpole [ 2022 Feb 10 ] |
I think that is a much better approach than my single >0 approach. Thank you @mchudinov ! |
Comment by Maxim Chudinov (Inactive) [ 2022 Feb 11 ] |
cstackpole |
Comment by Chris Stackpole [ 2022 Feb 17 ] |
Sorry for the delay. I have tested the template and things look good so far, but my systems are all healthy right now. I've got a stack of busted drives I'm going to put into a test system to get other exit codes, but it might not happen for a few days. This is great. Thank you so much! |
Comment by Maxim Chudinov (Inactive) [ 2022 Feb 17 ] |
Available in:
|