Problem report
Resolution: Unresolved
Trivial
7.2.7
Sprint candidates
1
tanderson@monit01:~$ sudo zabbix_get -s scgn03 -k nvml.device.ecc.mode["GPU-e180581d-10be-d2a0-9f45-b1c25abfe188"]
{"current":true,"pending":true}
tanderson@monit01:~$ sudo zabbix_get -s scgn03 -k nvml.device.errors.memory["GPU-e180581d-10be-d2a0-9f45-b1c25abfe188"]
ZBX_NOTSUPPORTED: Failed to execute handler: failed to receive the result: failed to get corrected memory errors: failed to get NVML memory error counter: NVML error: The requested operation is not available on target device.
tanderson@monit01:~$ sudo zabbix_get -s scgn03 -k nvml.device.errors.register["GPU-e180581d-10be-d2a0-9f45-b1c25abfe188"]
ZBX_NOTSUPPORTED: Failed to execute handler: failed to receive the result: failed to get corrected memory errors: failed to get NVML memory error counter: NVML error: The requested operation is not available on target device.
[tanderson@scgn03 ~]$ nvidia-smi -q -d ECC

==============NVSMI LOG==============

Timestamp      : Wed May 28 11:19:01 2025
Driver Version : 570.124.06
CUDA Version   : 12.8

Attached GPUs  : 3

GPU 00000000:25:00.0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable          : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable          : 0
            DRAM Uncorrectable        : 0
        Aggregate
            SRAM Correctable          : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable          : 0
            DRAM Uncorrectable        : 0
            SRAM Threshold Exceeded   : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                   : 0
            SRAM SM                   : 0
            SRAM Microcontroller      : 0
            SRAM PCIE                 : 0
            SRAM Other                : 0

GPU 00000000:81:00.0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable          : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable          : 0
            DRAM Uncorrectable        : 0
        Aggregate
            SRAM Correctable          : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable          : 0
            DRAM Uncorrectable        : 0
            SRAM Threshold Exceeded   : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                   : 0
            SRAM SM                   : 0
            SRAM Microcontroller      : 0
            SRAM PCIE                 : 0
            SRAM Other                : 0

GPU 00000000:E2:00.0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable          : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable          : 0
            DRAM Uncorrectable        : 0
        Aggregate
            SRAM Correctable          : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable          : 0
            DRAM Uncorrectable        : 0
            SRAM Threshold Exceeded   : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                   : 0
            SRAM SM                   : 0
            SRAM Microcontroller      : 0
            SRAM PCIE                 : 0
            SRAM Other                : 0
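As the nvidia-smi output shows, all three GPUs have ECC mode enabled (both current and pending), and nvidia-smi itself reports the ECC error counters without problems. The nvml.device.errors.memory and nvml.device.errors.register items nevertheless return ZBX_NOTSUPPORTED.
All other parts of the template are working fine.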
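
For anyone trying to reproduce this, the three checks above can be scripted. The following is only a rough sketch: it assumes `zabbix_get` is on the PATH of the monitoring host and can be run without `sudo`, and the helper names (`classify`, `query`, `report`) are made up for illustration. The host name, item keys and GPU UUID are the ones from the commands above.

```python
import subprocess

# Values taken from the commands in this report.
GPU_UUID = "GPU-e180581d-10be-d2a0-9f45-b1c25abfe188"
ITEM_KEYS = [
    "nvml.device.ecc.mode",
    "nvml.device.errors.memory",
    "nvml.device.errors.register",
]


def classify(output: str) -> str:
    """Return 'unsupported' if the output starts with ZBX_NOTSUPPORTED, else 'ok'."""
    if output.lstrip().startswith("ZBX_NOTSUPPORTED"):
        return "unsupported"
    return "ok"


def query(host: str, key: str, uuid: str) -> str:
    """Run zabbix_get for one item key and return its raw text output.

    Assumption: zabbix_get is installed and callable without sudo.
    """
    item = f'{key}["{uuid}"]'
    result = subprocess.run(
        ["zabbix_get", "-s", host, "-k", item],
        capture_output=True,
        text=True,
    )
    # Fall back to stderr in case the message is not written to stdout.
    return result.stdout.strip() or result.stderr.strip()


def report(host: str) -> dict:
    """Query every item key on the host and map each key to 'ok' or 'unsupported'."""
    return {key: classify(query(host, key, GPU_UUID)) for key in ITEM_KEYS}
```

Calling `report("scgn03")` on the monitoring host should, given the output shown above, mark `nvml.device.ecc.mode` as ok and the two `nvml.device.errors.*` keys as unsupported.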