[ZBX-25821] zabbix-agent2-plugin-nvidia-gpu kills zabbix-agent2
Created: 2025 Jan 02 | Updated: 2025 Jan 15 | Resolved: 2025 Jan 15 |
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent2 plugin (G) |
Affects Version/s: | None |
Fix Version/s: | 7.2.3rc1, 7.4.0alpha1 |
Type: | Problem report | Priority: | Major |
Reporter: | Ronald Vorstenbosch | Assignee: | Stanislavs Jurgensons (Inactive) |
Resolution: | Won't fix | Votes: | 0 |
Labels: | None |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Environment: | Ubuntu 24.04.1 LTS noble, x86_64 |
Issue Links: |
Team: |
Sprint: | Sprint candidates |
Description |
Steps to reproduce:
/etc/zabbix/zabbix_agent2.d/plugins.d/nvidia.conf: Plugins.NVIDIA.System.Path=/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu
When zabbix_agent2 starts, syslog shows:
2025-01-02T15:41:43.643960+01:00 <hostname> systemd[1]: Started zabbix-agent2.service - Zabbix Agent 2.
2025-01-02T15:41:46.667783+01:00 <hostname> zabbix_agent2[680498]: 2025/01/02 15:41:46.666613 [NVIDIA] failed to kill plugin /usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu: Failed to kill plugin "/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu" process: os: process already finished.
2025-01-02T15:41:46.668043+01:00 <hostname> zabbix_agent2[680498]: zabbix_agent2 [680498]: ERROR: Cannot register plugins: failed to register metrics of plugin "NVIDIA": failed to start plugin: failed to create connection with plugin /usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu: failed to get connection within the time limit 3000000000.
2025-01-02T15:41:46.671552+01:00 <hostname> systemd[1]: zabbix-agent2.service: Main process exited, code=exited, status=1/FAILURE
2025-01-02T15:41:46.671761+01:00 <hostname> systemd[1]: zabbix-agent2.service: Failed with result 'exit-code'.
The agent then keeps restarting but always fails. Once I comment out the NVIDIA plugin in the nvidia.conf file:
# Plugins.NVIDIA.System.Path=/usr/libexec/zabbix/zabbix-agent2-plugin-nvidia-gpu
zabbix_agent2 starts normally.
Comments |
Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 06 ]
To better understand the cause of the problem, I suggest the following steps: 1. Please run the 'nvidia-smi' command and provide its output.
Comment by Ronald Vorstenbosch [ 2025 Jan 06 ]
nvidia-smi: Mon Jan 6 18:13:29 2025
[nvidia-smi output table not preserved]
rv@myserver:~$ ldconfig -p | grep libnvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so.1
rv@myserver:~$ locate libnvidia-ml.so
rv@myserver:~$
(no output from that last command!)
Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 07 ]
Thank you for your response. It looks like the issue is caused by the absence of the `libnvidia-ml.so` symbolic link. The following steps can resolve it:
1. Create the symbolic link
If `libnvidia-ml.so.1` exists but `libnvidia-ml.so` is missing, create the symbolic link:
sudo ln -s /lib/x86_64-linux-gnu/libnvidia-ml.so.1 /lib/x86_64-linux-gnu/libnvidia-ml.so
2. Refresh the dynamic linker cache
After creating the symbolic link, update the system’s dynamic linker cache:
sudo ldconfig
3. Verify the symbolic link
Check if the `libnvidia-ml` libraries are now recognized by the system:
ldconfig -p | grep libnvidia-ml
Expected output:
libnvidia-ml.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so
Now the plugin should work. Please let me know whether this resolves the issue.
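The verification in step 3 can also be scripted. The following is a minimal Python sketch, not part of the plugin or of Zabbix: it assumes the `ldconfig -p` line format shown in this ticket, and the helper names (`listed_libraries`, `has_unversioned_nvml`, `read_ldconfig_cache`) are hypothetical.

```python
import subprocess


def listed_libraries(ldconfig_output: str) -> set[str]:
    """Return the library names found in `ldconfig -p` style output."""
    names = set()
    for line in ldconfig_output.splitlines():
        # Entries look like:
        #   libfoo.so.1 (libc6,x86-64) => /path/to/libfoo.so.1
        # Lines without "=>" (such as the summary header) are skipped.
        if "=>" not in line:
            continue
        names.add(line.split()[0])
    return names


def has_unversioned_nvml(ldconfig_output: str) -> bool:
    """True if the unversioned name `libnvidia-ml.so` is in the cache."""
    return "libnvidia-ml.so" in listed_libraries(ldconfig_output)


def read_ldconfig_cache() -> str:
    """Run `ldconfig -p` and return its output (needs ldconfig on PATH)."""
    result = subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `has_unversioned_nvml(read_ldconfig_cache())` on the affected host should return False before the symbolic link is created and True after steps 1 and 2, matching the expected output listed above.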
Comment by Ronald Vorstenbosch [ 2025 Jan 07 ]
I can confirm that the above resolves the Zabbix agent crashing. It does now start properly. However, I was expecting an Nvidia dashboard to appear for my host, but I am not seeing that. Are any further manual steps needed to achieve that? Thanks for the speedy fix!
Comment by Karl [ 2025 Jan 08 ]
Hello, will you apply the corrections in the next versions? If so, when? Thank you.
Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 08 ]
Dear rvorsten, the dashboard should be available under Monitoring >> Hosts >> Dashboards after attaching the official Zabbix Nvidia template to the monitored host and configuring it (if needed). The template is available in Zabbix version 7.2.1rc1 and higher, or from the official git repository. If the host is working and the template is linked but the dashboards are not working, please open a separate issue.
Comment by Ronald Vorstenbosch [ 2025 Jan 08 ]
What's the difference between the active and the passive version of the template? Importing both (active & passive) templates fails with an error message:
[Edit]: I changed the version number in the yaml file to 7.2. After that I could import the template. I had to manually assign it to the host with the GPUs, but then the dashboard was visible...
Thanks,
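The version change described in the [Edit] above could be sketched as follows. This is an illustration only: it assumes the template file carries a `version:` line near the top, as Zabbix YAML exports do under `zabbix_export:`, and the function name `set_export_version` is hypothetical. The ticket only reports that this worked for this reporter's template; it does not say the approach is supported in general.

```python
import re

# Matches the first "version: '<number>'" line, with or without quotes.
# Assumption: this is the export format version near the top of the file.
_VERSION_LINE = re.compile(
    r"^([ \t]*version:[ \t]*)(['\"]?)[0-9.]+\2[ \t]*$", re.MULTILINE
)


def set_export_version(yaml_text: str, version: str) -> str:
    """Return yaml_text with its first `version:` value set to `version`."""
    new_text, count = _VERSION_LINE.subn(
        lambda m: f"{m.group(1)}'{version}'", yaml_text, count=1
    )
    if count == 0:
        raise ValueError("no 'version:' line found in the template text")
    return new_text
```

For example, `set_export_version(text, "7.2")` rewrites only the first matching `version:` line and leaves the rest of the file untouched, which mirrors the manual edit the reporter made before importing.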
Comment by Stanislavs Jurgensons (Inactive) [ 2025 Jan 15 ]
The issue mentioned in this ticket currently appears to be user-specific. Please note that the issue is related to the absence of a required symbolic link, which can be resolved by following the steps outlined earlier in this ticket. The logging improvement for the NVIDIA GPU plugin in the absence of the NVML will be addressed in a separate ticket, ZBXNEXT-9710, to clearly indicate the cause of the plugin failure. If you are experiencing the same issue and it is resolved by the steps mentioned here, please inform us by opening a separate issue.