[ZBX-25821] zabbix-agent2-plugin-nvidia-gpu kills zabbix-agent2 Created: 2025 Jan 02  Updated: 2025 Jan 15  Resolved: 2025 Jan 15

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent2 plugin (G)
Affects Version/s: None
Fix Version/s: 7.2.3rc1, 7.4.0alpha1

Type: Problem report Priority: Major
Reporter: Ronald Vorstenbosch Assignee: Stanislavs Jurgensons (Inactive)
Resolution: Won't fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 24.04.1 LTS noble, x86_64
zabbix 7.2.1 / MySQL
5x NVIDIA GeForce RTX 3090 / driver NVIDIA_SMI 565.57.01 / CUDA version 12.7


Issue Links:
Related
related to ZBXNEXT-9710 Zabbix agent 2 nvidia plugin better s... Closed
Team: Team INT
Sprint: Sprint candidates

 Description   

Steps to reproduce:

 

/etc/zabbix/zabbix_agent2.d/plugins.d/nvidia.conf:
Plugins.NVIDIA.System.Path=/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu

 

When zabbix_agent2 starts, syslog shows:

 

2025-01-02T15:41:43.643960+01:00 <hostname> systemd[1]: Started zabbix-agent2.service - Zabbix Agent 2.
2025-01-02T15:41:46.667783+01:00 <hostname> zabbix_agent2[680498]: 2025/01/02 15:41:46.666613 [NVIDIA] failed to kill plugin /usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu: Failed to kill plugin "/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu" process: os: process already finished.
2025-01-02T15:41:46.668043+01:00 <hostname> zabbix_agent2[680498]: zabbix_agent2 [680498]: ERROR: Cannot register plugins: failed to register metrics of plugin "NVIDIA": failed to start plugin: failed to create connection with plugin /usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu: failed to get connection within the time limit 3000000000.
2025-01-02T15:41:46.671552+01:00 <hostname> systemd[1]: zabbix-agent2.service: Main process exited, code=exited, status=1/FAILURE
2025-01-02T15:41:46.671761+01:00 <hostname> systemd[1]: zabbix-agent2.service: Failed with result 'exit-code'.
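
The `3000000000` in the "time limit" error is most likely a duration printed in nanoseconds. This is an assumption based on how Go programs commonly print raw durations; the log itself does not state the unit. Under that assumption the plugin had 3 seconds to connect. A small sketch of the conversion, with `ns_to_seconds` as a hypothetical helper name:

```shell
# Hypothetical helper: convert a nanosecond count to whole seconds.
# Assumes 64-bit shell arithmetic (true for bash and dash on x86_64).
ns_to_seconds() {
    echo $(( $1 / 1000000000 ))
}

ns_to_seconds 3000000000   # prints 3
```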

 

The agent then keeps restarting but fails every time.

Once I comment out the NVIDIA plugin line in the nvidia.conf file:

# Plugins.NVIDIA.System.Path=/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu

zabbix_agent2 starts normally. 
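
This workaround can also be applied non-interactively. The sketch below comments out the `Plugins.NVIDIA.System.Path` line with `sed`; `disable_nvidia_plugin` is a hypothetical helper name, and in-place editing with `sed -i` assumes GNU sed as shipped on Ubuntu.

```shell
# Hypothetical helper: comment out the NVIDIA plugin path in a config file.
# Usage: disable_nvidia_plugin /etc/zabbix/zabbix_agent2.d/plugins.d/nvidia.conf
disable_nvidia_plugin() {
    # Only lines that start with the setting are changed, so an already
    # commented-out line is left alone and the helper can be run repeatedly.
    sed -i 's|^Plugins\.NVIDIA\.System\.Path=|# &|' "$1"
}
```

The real file needs root permissions to edit, and the zabbix-agent2 service has to be restarted afterwards for the change to take effect.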



 Comments   
Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 06 ]

To better understand the cause of the problem, I suggest the following steps:

1. Please run the 'nvidia-smi' command and provide its output.
(Explanation: nvidia-smi is a CLI tool that uses NVML, the same library the plugin uses. If it works, the plugin should too.)
2. Please run the 'ldconfig -p | grep libnvidia-ml' command and provide its output.
(Explanation: this displays the path to the NVML library if it is registered with the dynamic linker.)
3. Please run the 'locate libnvidia-ml.so' command and provide its output.
(Explanation: this finds the library if it exists but is not registered with the dynamic linker, for example with a custom install path.)
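
The library checks in steps 2 and 3 can also be done directly against the filesystem, which avoids depending on the `locate` database being up to date. The sketch below looks for both `libnvidia-ml.so.1` and the unversioned `libnvidia-ml.so` in a list of directories. The helper name `find_nvml` and the directory list are assumptions for illustration, not something the plugin itself uses.

```shell
# Hypothetical helper: report which NVML library names exist in the given directories.
# Prints "found: <path>" for each match and "missing: <name>" for a name found nowhere.
find_nvml() {
    for name in libnvidia-ml.so.1 libnvidia-ml.so; do
        found=no
        for dir in "$@"; do
            # -e follows symlinks, so a dangling link counts as missing.
            if [ -e "$dir/$name" ]; then
                echo "found: $dir/$name"
                found=yes
            fi
        done
        [ "$found" = yes ] || echo "missing: $name"
    done
}

# Assumed common library locations; adjust for custom install paths.
find_nvml /lib/x86_64-linux-gnu /usr/lib64 /usr/local/lib
```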

Comment by Ronald Vorstenbosch [ 2025 Jan 06 ]

nvidia-smi:

Mon Jan  6 18:13:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   29C    P8              9W /  370W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:03:00.0 Off |                  N/A |
|  0%   28C    P8              7W /  370W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   28C    P8             10W /  370W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:05:00.0 Off |                  N/A |
|  0%   27C    P8             18W /  370W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   28C    P8              6W /  370W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

 

rv@myserver:~$ ldconfig -p | grep libnvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so.1
rv@myserver:~$ locate libnvidia-ml.so
rv@myserver:~$

(no output from that last command!)

 

Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 07 ]

Thank you for your response.

It looks like the issue is caused by the absence of the `libnvidia-ml.so` symbolic link. The following steps create the missing link, pointing it at the existing `libnvidia-ml.so.1` library, so that the plugin can function correctly:

1. Create the symbolic link

If `libnvidia-ml.so.1` exists but `libnvidia-ml.so` is missing, create the symbolic link:

sudo ln -s /lib/x86_64-linux-gnu/libnvidia-ml.so.1 /lib/x86_64-linux-gnu/libnvidia-ml.so

2. Refresh the Dynamic Linker Cache

After creating the symbolic link, update the system’s dynamic linker cache:

sudo ldconfig

3. Verify the symbolic link

Check if the `libnvidia-ml` libraries are now recognized by the system:

ldconfig -p | grep libnvidia-ml

Expected output:

libnvidia-ml.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so

Now the plugin should work.
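
The symlink step can be wrapped in an idempotent helper, sketched here under the assumption that the library lives in a single directory passed as an argument. `ensure_nvml_link` is a hypothetical name. It only creates the link when `libnvidia-ml.so.1` exists and `libnvidia-ml.so` does not.

```shell
# Hypothetical helper: create libnvidia-ml.so -> libnvidia-ml.so.1 in directory $1 if needed.
ensure_nvml_link() {
    lib_dir=$1
    # Nothing to link to: report the problem and fail.
    if [ ! -e "$lib_dir/libnvidia-ml.so.1" ]; then
        echo "libnvidia-ml.so.1 not found in $lib_dir" >&2
        return 1
    fi
    # Link (or real file) already there: leave it untouched.
    if [ -e "$lib_dir/libnvidia-ml.so" ]; then
        echo "libnvidia-ml.so already present"
        return 0
    fi
    ln -s "$lib_dir/libnvidia-ml.so.1" "$lib_dir/libnvidia-ml.so" &&
        echo "created $lib_dir/libnvidia-ml.so"
}
```

On the system from this ticket it would be called as root with `/lib/x86_64-linux-gnu` as the argument, followed by `ldconfig` and a restart of the agent.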

Please let me know whether this resolves the issue.

Comment by Ronald Vorstenbosch [ 2025 Jan 07 ]

I can confirm that the above resolves the Zabbix agent crashing. It now starts properly. However, I was expecting an NVIDIA dashboard to appear for my host, but I am not seeing one. Are any further manual steps needed to achieve that?

Thanks for the speedy fix!

Comment by Karl [ 2025 Jan 08 ]

Hello, will you apply the corrections in the next versions? If so, when? Thank you.

Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 08 ]

Dear rvorsten,

The dashboard should be available under Monitoring >> Hosts >> Dashboards after the official Zabbix NVIDIA template has been linked to the monitored host and configured (if needed).

The template is available in Zabbix version 7.2.1rc1 and higher, or from the official git repository.
Find the template here: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia
Note: templates are not updated together with the server and must be updated manually.

If the host is working and the template is linked, but the dashboards are not working, please open a separate issue.

Comment by Ronald Vorstenbosch [ 2025 Jan 08 ]

What's the difference between the active and the passive version of the template?

Importing either template (active or passive) fails with an error message:

  • Invalid tag "/zabbix_export/version": unsupported version number.

[Edit]: I changed the version number in the YAML file to 7.2. After that I could import the template. I had to manually assign it to the host with the GPUs, but then the dashboard was visible.
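
The version edit can be scripted. This sketch assumes the exported template has a top-level `zabbix_export:` key with a two-space-indented, single-quoted `version:` line, which is how Zabbix YAML exports are usually laid out. `set_template_version` is a hypothetical helper name, and `sed -i` assumes GNU sed. Lowering the version only helps if the template uses nothing that the target server does not support.

```shell
# Hypothetical helper: set the export version in a Zabbix YAML template file.
# Usage: set_template_version <template.yaml> <version>
set_template_version() {
    # Matches only a line of the exact form "  version: '<digits and dots>'".
    sed -i -E "s/^(  version: )'[0-9.]+'\$/\\1'$2'/" "$1"
}
```

For the case described here, the second argument would be `7.2`.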

 

Thanks,
Ronald

Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 15 ]

The issue mentioned in this ticket currently appears to be user-specific.

Please note that the issue is related to the absence of a required symbolic link, which can be resolved by following the steps outlined earlier in this ticket.

Improved logging for the NVIDIA GPU plugin when NVML is absent, so that the cause of the plugin failure is clearly indicated, will be addressed in a separate ticket, ZBXNEXT-9710.

If you are experiencing the same issue and it is resolved by the steps mentioned here, please inform us by opening a separate issue.

Generated at Sun May 18 06:39:40 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.