ZABBIX FEATURE REQUESTS / ZBXNEXT-10033

I've been experimenting with the Proxmox template, and there are some things I had to fix for my environment.


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Medium
    • Fix Version/s: None
    • Affects Version/s: 7.2.7
    • Component/s: Templates (T)
    • Environment: 2 Proxmox 8.4 clusters, set up with the monitoring template from the latest release version. I also compared the 7.4 template and noted there were no alterations.
    • Summary: Fixing up the Proxmox monitoring template

      Hi Zabbix Team,

      Firstly, thank you for providing the "Proxmox VE by HTTP" template. It's a valuable starting point for monitoring Proxmox environments.

      Based on extensive use and troubleshooting with this template (including reviewing the version in the 7.4 beta to understand its current direction), I've identified a few areas where I believe some refinements could broadly improve its out-of-the-box usability and reduce common false positive alerts for many users. These suggestions aim to address issues that might not be unique to a specific environment but rather represent general robustness improvements.

      Here are my suggestions:

      1. **"Recently Restarted" Alerts (for VMs, LXCs, and Nodes):**

      • **Current Behavior:** These alerts can sometimes trigger for entities that are simply powered off (as their uptime becomes 0). If `manual_close` is set to `YES` (as it is in the current beta template), these false alerts then require manual intervention to clear, which can be quite noisy. The dependency on the "Not Running" trigger doesn't always prevent this if there's a brief status flap.
      • **Suggested Enhancement:**
      • Modify the trigger expression to confirm the entity has been consistently "running" (or "online" for nodes, using "1") for a brief period (e.g., for the last two consecutive checks, using `count(status_item, #2, "eq", "running")=2` or `count(status_item, #2, "eq", "1")=2`) before checking whether the uptime is low (e.g., `last(uptime_item) < 10m`).
      • Remove the `manual_close: 'YES'` setting and any `dependencies` from these informational "restarted" triggers. Their core logic (entity running consistently + low uptime) should be sufficient, and they should be allowed to auto-resolve. This significantly reduces false positives from temporary status glitches on powered-off machines.
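      A minimal sketch of the suggested combined expression, written in Zabbix 6.0+ trigger syntax. The item keys `vm.status[{#QEMU.ID}]` and `vm.uptime[{#QEMU.ID}]` are placeholders, not the template's actual keys:

      ```
      count(/Proxmox VE by HTTP/vm.status[{#QEMU.ID}],#2,"eq","running")=2
      and last(/Proxmox VE by HTTP/vm.uptime[{#QEMU.ID}])<10m
      ```

      The first condition prevents the trigger from firing on a single spurious "running" sample, and because the problem condition stops being true once uptime exceeds 10 minutes, the trigger auto-resolves without needing `manual_close`.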

      2. **Performance Alerts (High CPU/Memory) for Off VMs/LXCs:**

      • **Current Behavior:** The "High CPU usage" and "High memory usage" triggers for VMs/LXCs in the current beta template don't explicitly check whether the guest is actually running before evaluating performance metrics. If an off guest briefly reports an incorrect "running" status (which can occur due to polling timing or API behavior), these triggers can fire on stale performance data from when the guest was last active, leading to false warnings.
      • **Suggested Enhancement:** Add a condition to these trigger expressions to ensure the VM/LXC has been reported as "running" for at least two consecutive checks (e.g., `count(status_item, #2, "eq", "running")=2`) before evaluating the CPU or memory load conditions. This would make these alerts much more reliable and prevent them from firing for machines that are truly off.
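      A sketch of the guarded CPU trigger, assuming placeholder item keys (`vm.status[...]`, `vm.cpu.usage[...]`) and an illustrative threshold macro name:

      ```
      count(/Proxmox VE by HTTP/vm.status[{#QEMU.ID}],#2,"eq","running")=2
      and min(/Proxmox VE by HTTP/vm.cpu.usage[{#QEMU.ID}],5m)>{$PVE.VM.CPU.PUSE.MAX.WARN:"{#QEMU.ID}"}
      ```

      With `and` short-circuiting on the status guard, the CPU condition is never evaluated against stale data from a powered-off guest; the same guard applies to the memory trigger.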

      3. **"Guest Not Running" Alerts - Making Them Smarter:**

      • **Current Behavior:** The basic `last(status_item) <> "running"` check is a good foundation, but it can be noisy for VMs that are legitimately powered off for extended periods (like templates or archived systems), and it doesn't differentiate new VMs that might be failing to start.
      • **Suggested Enhancement (more advanced but very beneficial):**
      • Consider logic that differentiates based on history:
      • **New/Rarely Started VMs:** If a VM has very little uptime history (e.g., `count(uptime_item, 7d) < {$MIN_DATAPOINTS_FOR_AVG_UPTIME}`, where the macro defines a small threshold like 12-20 data points), it should alert if not running.
      • **Established VMs:** If there's sufficient history, alert only if the VM is not running AND its average uptime over a longer period (e.g., 7 days via `avg(uptime_item, 7d)`) indicates it's usually an active/running machine (e.g., `avg(uptime_item, 7d) > {$MIN_AVG_UPTIME_ALERT}`, where the macro defines a significant uptime like 1 hour in seconds).
      • This requires new macros for the "min datapoints" and "min average uptime" thresholds. It significantly improves the actionability of "Not Running" alerts by reducing noise from intentionally off machines while ensuring that new, expected-to-run machines are flagged if down.
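      The two branches above could be combined into a single expression; the item keys are placeholders, and the macro names come from the suggestion (their defaults, e.g. 12 data points and 3600 seconds, would be defined on the template):

      ```
      last(/Proxmox VE by HTTP/vm.status[{#QEMU.ID}])<>"running"
      and (
          count(/Proxmox VE by HTTP/vm.uptime[{#QEMU.ID}],7d)<{$MIN_DATAPOINTS_FOR_AVG_UPTIME}
          or avg(/Proxmox VE by HTTP/vm.uptime[{#QEMU.ID}],7d)>{$MIN_AVG_UPTIME_ALERT}
      )
      ```

      The `count(...)` branch catches new VMs with almost no stored uptime values, while the `avg(...)` branch only fires for guests whose recent history shows they are normally up.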

      4. **Contextual Macro Usability for Thresholds:**

      • **Current Behavior (7.4 beta):** Performance thresholds like `{$PVE.VM.MEMORY.PUSE.MAX.WARN:"{#QEMU.ID}"}` use the VM/LXC numerical ID for context.
      • **Suggested Enhancement:** For user-friendliness when defining host-level overrides, please ensure that `{#QEMU.NAME}` or `{#LXC.NAME}` (which are available LLD macros) are used as the context in these threshold macros (e.g., `{$PVE.VM.MEMORY.PUSE.MAX.WARN:"{#QEMU.NAME}"}`). Defining a host macro like `{$PVE.VM.MEMORY.PUSE.MAX.WARN:my-important-vm}` is often more intuitive and manageable for users than having to look up and use numerical IDs like `qemu/101` or `lxc/102`.
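      For example, with name-based context an administrator could raise the memory threshold for a single guest directly on its host (the guest name and the value 95 are illustrative):

      ```
      {$PVE.VM.MEMORY.PUSE.MAX.WARN:"my-important-vm"} = 95
      ```

      The trigger prototype would then reference `{$PVE.VM.MEMORY.PUSE.MAX.WARN:"{#QEMU.NAME}"}`, so Zabbix resolves the override by guest name, falling back to the template default for guests without an override.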

      These refinements have made a substantial difference in practical use, leading to more reliable and actionable alerting from the Proxmox API. I believe incorporating these (or similar) logical improvements into the official template would be a great benefit to the wider Zabbix community using this template.

      Thank you for your time and for considering this feedback!

      Sincerely,
      A Zabbix User

            abakaldin Alexander Bakaldin
            nbs2025 Keith Madaras