  2. ZBX-19311

zabbix-agent2 smart monitoring fails with megaraid


    • Sprint 84 (Jan 2022), Sprint 85 (Feb 2022), Sprint 86 (Mar 2022)

      Steps to reproduce:

      1. Deploy and configure zabbix-agent2 on RHEL8
      2. Import the latest template for smart monitoring from git
      3. Create sudo rule for zabbix user and smartctl
      4. Observe that disk discovery fails


      Running the discovery manually with the agent returns the following error:

      zabbix_agent2 -v -t smart.disk.discovery


      2021/04/29 12:47:40.125137 [Smart] stopped looking for RAID devices of megaraid type, err:%!(EXTRA *errors.errorString=failed to get disk data from smartctl: Smartctl open device: /dev/bus/0 [megaraid_disk_00] failed: INQUIRY failed)



      /sbin/smartctl --scan
       /dev/sda -d scsi # /dev/sda, SCSI device
       /dev/sdb -d scsi # /dev/sdb, SCSI device
       /dev/sdc -d scsi # /dev/sdc, SCSI device
       /dev/sdd -d scsi # /dev/sdd, SCSI device
       /dev/sde -d scsi # /dev/sde, SCSI device
       /dev/sdf -d scsi # /dev/sdf, SCSI device
       /dev/sdg -d scsi # /dev/sdg, SCSI device
       /dev/sdh -d scsi # /dev/sdh, SCSI device
       /dev/sdi -d scsi # /dev/sdi, SCSI device
       /dev/sdj -d scsi # /dev/sdj, SCSI device
       /dev/sdk -d scsi # /dev/sdk, SCSI device
       /dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
       /dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
       /dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
       /dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
       /dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
       /dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
       /dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
       /dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
       /dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
       /dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
       /dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
       /dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
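
      The scan output above can be split into device path and device type with a few lines of Python. This is only a sketch for inspecting the report: `parse_scan` and `SCAN_LINE` are hypothetical names, not part of zabbix-agent2, and the sample is a shortened copy of the output above. It makes it easy to see that the scan lists megaraid IDs starting at 1, while the agent error mentions megaraid_disk_00, an ID that does not appear in the scan; whether that mismatch is the cause is not confirmed here.

```python
import re

# Hypothetical helper for illustration; not the zabbix-agent2 Smart plugin code.
# Matches lines such as: /dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
SCAN_LINE = re.compile(r"^\s*(?P<path>\S+)\s+-d\s+(?P<type>\S+)\s+#")


def parse_scan(output):
    """Return (device path, device type) pairs from `smartctl --scan` text."""
    devices = []
    for line in output.splitlines():
        match = SCAN_LINE.match(line)
        if match:
            devices.append((match.group("path"), match.group("type")))
    return devices


# Shortened copy of the scan output from this report.
sample = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
"""

devices = parse_scan(sample)
# Collect the numeric IDs of the megaraid entries only.
megaraid_ids = [
    int(dev_type.split(",")[1])
    for _, dev_type in devices
    if dev_type.startswith("megaraid,")
]
print(devices)
print(megaraid_ids)
```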

      NOTE: smartctl reports and accepts the virtual bus device /dev/bus/0, which does not exist in the filesystem. Addressing a disk through it with the -d megaraid,N option still returns its SMART status:


      smartctl -a /dev/bus/0 -d megaraid,1
       smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-240.10.1.el8_3.x86_64] (local build)
       Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
       Model Family: Intel S4510/S4610/S4500/S4600 Series SSDs
       Device Model: INTEL SSDSC2KG038T8
       Serial Number: PHYG025201RH3P8EGN
       LU WWN Device Id: 5 5cd2e4 152613993
       Firmware Version: XCV10120
       User Capacity: 3,840,755,982,336 bytes [3.84 TB]
       Sector Sizes: 512 bytes logical, 4096 bytes physical
       Rotation Rate: Solid State Device
       Form Factor: 2.5 inches
       Device is: In smartctl database [for details use: -P show]
       ATA Version is: ACS-3 T13/2161-D revision 5
       SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
       Local Time is: Thu Apr 29 12:49:44 2021 UTC
       SMART support is: Available - device has SMART capability.
       SMART support is: Enabled
       SMART Status not supported: ATA return descriptor not supported by controller firmware
       SMART overall-health self-assessment test result: PASSED
       Warning: This result is based on an Attribute check.
      General SMART Values:
       Offline data collection status: (0x00) Offline data collection activity
       was never started.
       Auto Offline Data Collection: Disabled.
       Self-test execution status: ( 0) The previous self-test routine completed
       without error or no self-test has ever 
       been run.
       Total time to complete Offline 
       data collection: ( 0) seconds.
       Offline data collection
       capabilities: (0x79) SMART execute Offline immediate.
       No Auto Offline data collection support.
       Suspend Offline collection upon new
       Offline surface scan supported.
       Self-test supported.
       Conveyance Self-test supported.
       Selective Self-test supported.
       SMART capabilities: (0x0003) Saves SMART data before entering
       power-saving mode.
       Supports SMART auto save timer.
       Error logging capability: (0x01) Error logging supported.
       General Purpose Logging supported.
       Short self-test routine 
       recommended polling time: ( 1) minutes.
       Extended self-test routine
       recommended polling time: ( 2) minutes.
       Conveyance self-test routine
       recommended polling time: ( 2) minutes.
       SCT capabilities: (0x003d) SCT Status supported.
       SCT Error Recovery Control supported.
       SCT Feature Control supported.
       SCT Data Table supported.
      SMART Attributes Data Structure revision number: 1
       Vendor Specific SMART Attributes with Thresholds:
       5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 8
       9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2575
       12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
       170 Available_Reservd_Space 0x0033 099 099 010 Pre-fail Always - 0
       171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 2
       172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
       174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 14
       175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2390 (14 65535)
       183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
       184 End-to-End_Error_Count 0x0033 100 100 090 Pre-fail Always - 0
       187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
       190 Drive_Temperature 0x0022 081 075 000 Old_age Always - 19 (Min/Max 16/27)
       192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 14
       194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 19
       197 Pending_Sector_Count 0x0012 100 100 000 Old_age Always - 0
       199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
       225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 3576929
       226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 522
       227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 25
       228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 154396
       232 Available_Reservd_Space 0x0033 099 099 010 Pre-fail Always - 0
       233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
       234 Thermal_Throttle_Status 0x0032 100 100 000 Old_age Always - 0/0
       235 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2390 (14 65535)
       241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 3576929
       242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 1226992
       243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 7374461
      SMART Error Log Version: 1
       No Errors Logged
      SMART Self-test log structure revision number 1
       No self-tests have been logged. [To run self-tests, use: smartctl -t]
      SMART Selective self-test log data structure revision number 1
       1 0 0 Not_testing
       2 0 0 Not_testing
       3 0 0 Not_testing
       4 0 0 Not_testing
       5 0 0 Not_testing
       Selective self-test flags (0x0):
       After scanning selected spans, do NOT read-scan remainder of disk.
       If Selective self-test is pending on power-up, resume after 0 minute delay.
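
      As a rough check that the manual smartctl output above is usable, the vendor attribute rows can be pulled out with a regular expression. This is a minimal sketch, assuming the whitespace-separated layout shown in this report; `parse_attributes` and `ATTR_LINE` are hypothetical names and this is not how the zabbix-agent2 Smart plugin reads the data. Rows are keyed by attribute ID because names repeat in this output (for example, Unsafe_Shutdown_Count appears under both 174 and 192).

```python
import re

# Hypothetical helper for illustration; not the zabbix-agent2 Smart plugin code.
# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ATTR_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+?)\s*$"
)


def parse_attributes(output):
    """Map attribute ID to (name, raw value) for each attribute row found."""
    attrs = {}
    for line in output.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attrs[int(match.group("id"))] = (match.group("name"), match.group("raw"))
    return attrs


# A few rows copied from the output in this report, plus one self-test log row
# that must not be mistaken for an attribute.
sample = """\
 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 8
 174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 14
 190 Drive_Temperature 0x0022 081 075 000 Old_age Always - 19 (Min/Max 16/27)
 192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 14
 1 0 0 Not_testing
"""

attrs = parse_attributes(sample)
for attr_id, (name, raw) in sorted(attrs.items()):
    print(attr_id, name, raw)
```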



