The package includes a config file: /etc/zabbix/zabbix_agentd.d/userparameter_mysql.conf
It defines a UserParameter called mysql.ping, which makes a good example of why the current handling of stderr is so broken.
If MySQL is working properly, the command "mysqladmin ping" prints "mysqld is alive", and that output is piped to "grep -c alive", which in turn prints the number "1" on stdout. This is the real output that the Zabbix server is looking for. That's lovely and it gives the sysadmin a nice warm feeling that everything is in good order.
The common template also includes a trigger to report "MySQL server is down" when an output "0" is detected.
Unfortunately, here's what happens when the MySQL server actually does go down: the command "mysqladmin ping" prints a bunch of errors to stderr... hardly surprising since there's a problem. These error messages are merged with the "0" that grep prints on stdout, and the whole lot is sent back to the Zabbix server. The combined value doesn't look like any sort of number, so the server decides no data is available. It never fires the "MySQL server is down" trigger, and hence nobody gets any notifications.
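To illustrate the failure without needing a real MySQL server, here is a rough sketch. The function fake_mysqladmin_ping and its error text are made up for this example (they are not real mysqladmin output), and the "2>&1" around the pipeline is my assumption of how the agent ends up merging the two streams.

```shell
#!/bin/sh
# Hypothetical stand-in for "mysqladmin ping" against a stopped server:
# nothing on stdout, a (made-up) error message on stderr, non-zero exit.
fake_mysqladmin_ping() {
    echo "simulated connection error" >&2
    return 1
}

# Capture the pipeline with stderr merged into stdout, which is assumed
# to be what the server ends up receiving. grep -c exits non-zero when
# it counts zero matches, hence the "|| true".
merged=$( { fake_mysqladmin_ping | grep -c alive; } 2>&1 ) || true

printf 'value returned to the server:\n%s\n' "$merged"

# A numeric item needs the whole value to consist of digits only.
case "$merged" in
    ''|*[!0-9]*) is_numeric=no ;;
    *)           is_numeric=yes ;;
esac
echo "numeric: $is_numeric"
```

The "0" from grep is still in there, but it is buried in the same value as the error text, so the numeric check fails.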
This is a fundamental flaw with every type of shell command that is defined as a UserParameter in any config. The entire purpose of a monitoring system is to cope with the times when something has gone wrong, so any command at any time should be presumed capable of writing arbitrary text to stderr. A human might understand that text, but it cannot be sent to the Zabbix server in place of the value if we want any confidence in our alert reporting. Thus, potentially all UserParameter scripts are broken (including the examples provided as part of the standard package) unless someone edits the script to redirect stderr off to /dev/null or possibly to some logfile elsewhere.
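The workaround of redirecting stderr might look something like the sketch below. Again, fake_mysqladmin_ping and the helper check are hypothetical names used only so the example runs anywhere; in the real config the same idea would presumably be "mysqladmin ping 2>/dev/null | grep -c alive".

```shell
#!/bin/sh
# Hypothetical stand-in for "mysqladmin ping"; $1 selects the simulated state.
fake_mysqladmin_ping() {
    if [ "$1" = "up" ]; then
        echo "mysqld is alive"
    else
        echo "simulated connection error" >&2
        return 1
    fi
}

# Both streams of the pipeline are still merged (assumed agent behaviour),
# but stderr of the first command is discarded before it can leak through.
check() {
    { fake_mysqladmin_ping "$1" 2>/dev/null | grep -c alive; } 2>&1
}

# grep -c exits non-zero on zero matches, hence the "|| true".
value_up=$(check up) || true
value_down=$(check down) || true

echo "server up:   $value_up"
echo "server down: $value_down"
```

With the redirect in place the down case yields a clean "0", which is the value the "MySQL server is down" trigger is waiting for.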
Steps to reproduce:
- Just use the out-of-the-box configuration if you want to test.
- Get monitoring of your MySQL server working, then shut down the MySQL server and see if you get an alert.
- No alert gets sent.
- The Zabbix "Monitoring / Latest Data" page shows that the item did not get the numeric value it is expecting.
The exact flush sequence of stderr and stdout is not guaranteed when merging the output streams, especially when we have a shell command consisting of multiple operations piped each to the next (which happens very often). Different system libraries might send stdout first in some cases or might send stderr first.
Correct behaviour should be for the Zabbix agent to send only stdout back to the Zabbix server. If you don't want to throw stderr away, just log it locally, presumably to the local agent log file. If the agent is smart about buffering the two streams separately, it might be possible to guarantee that stdout gets sent FIRST to the Zabbix server and stderr perhaps AFTER, but this more complex handling of buffers is also more likely to fail in unexpected ways.
At the very least, if you really must keep stderr, then make sure your standard example scripts and templates are tested properly.
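As a rough shell-level picture of the behaviour I am asking for (only stdout becomes the value, stderr is kept locally), consider the sketch below. It does not show how the agent is implemented; fake_mysqladmin_ping and the temporary log_file are hypothetical stand-ins for the real command and the local agent log.

```shell
#!/bin/sh
# Hypothetical stand-in for "mysqladmin ping" against a stopped server.
fake_mysqladmin_ping() {
    echo "simulated connection error" >&2
    return 1
}

# Temporary file standing in for the local agent log (an assumption).
log_file=$(mktemp)

# Only stdout of the pipeline becomes the value; stderr is appended to
# the local log instead. grep -c exits non-zero on zero matches, hence
# the "|| true".
value=$( { fake_mysqladmin_ping | grep -c alive; } 2>>"$log_file" ) || true

echo "sent to server: $value"
echo "kept locally:"
cat "$log_file"
```

The value stays a bare "0" and the error text is still available to whoever reads the log on the monitored host.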