We detected a monitoring issue on one of our machines.
While analyzing data on the server, I saw that some data had not been received by the server and that, after an agent restart, even the ping item that reports whether the Zabbix agent is up was no longer received (all our items are active checks).
In the normal Zabbix logs there was nothing unusual, and items seemed to be retrieved properly by the agent. The only suspicious thing was how long the Zabbix agent took to restart (it is usually instant).
After switching to debug logging, we see that the requested items are retrieved from the server and that the buffer fills with the data collected by the agent.
Here is what is working and what is not:
- The agent asks the server which items to retrieve => OK
- The agent fills its internal buffer with the data retrieved => OK
- If the buffer reaches X elements (50 usually, I switched to 30) it sends the data to Zabbix Server => OK
- If the buffer contains data for more than X seconds (5 usually, switched to 1) it sends the data to Zabbix Server => NOT WORKING
- Every 60 seconds the agent retrieves the item's data again => NOT WORKING
- Every 120 seconds the Zabbix agent asks the server again for the items to retrieve => NOT WORKING
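To make the expected flush rule in the list above concrete, here is a minimal sketch in Python. It is only a toy model of the behavior described in this report, not the agent's actual code: the class `SendBuffer` and the names `max_items` and `max_age` are hypothetical, and the values 30 and 1 are the ones I switched to.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SendBuffer:
    """Toy model of the flush rule described above (hypothetical, not Zabbix code)."""
    max_items: int = 30            # flush once this many values are buffered
    max_age: float = 1.0           # flush once the oldest value has waited this long (seconds)
    values: list = field(default_factory=list)
    oldest: Optional[float] = None  # time at which the oldest buffered value was added

    def add(self, value, now=None):
        now = time.monotonic() if now is None else now
        if not self.values:
            self.oldest = now      # first value in an empty buffer starts the age timer
        self.values.append(value)

    def should_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if not self.values:
            return False           # nothing to deliver
        if len(self.values) >= self.max_items:
            return True            # size rule (the one that works in this report)
        return now - self.oldest >= self.max_age  # age rule (the one that never fires)

    def flush(self):
        # Hand back everything buffered and reset the age timer.
        sent, self.values, self.oldest = self.values, [], None
        return sent
```

In this model a periodic check calls `should_flush()` every second. If that periodic check stops running, the age rule can never fire, which matches what is observed here.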
The buffer fills correctly but seems to block at some point and never does anything again. Normally an internal cron checks the state of the buffer every second, but this internal cron or polling seems to have crashed :
After that, there is no trace of this cron anymore.
Just before the last cron run, we can see that it tries to retrieve an item's data and fails (which can happen for many reasons; here it is a file access issue) :
The item should become unsupported and the agent should continue collecting data, but nothing happens afterwards.
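The behavior I would expect can be sketched as follows. This is only an illustration in Python under my own assumptions, not how the agent is implemented: `collect_once` and `read_value` are hypothetical names, and a file access problem is modeled as an `OSError`.

```python
def collect_once(items, read_value):
    """Collect one value per item; a failing item must not stop the loop.

    `items` is an iterable of item keys and `read_value` is a callable that
    returns the value for a key or raises OSError (both hypothetical).
    """
    collected, unsupported = {}, {}
    for key in items:
        try:
            collected[key] = read_value(key)
        except OSError as exc:           # e.g. a file access error
            unsupported[key] = str(exc)  # mark the item unsupported and move on
    return collected, unsupported
```

With this shape, one unreadable file only affects its own item, and the remaining items and the periodic checks keep running.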
If I disable this item and all the items that interact with a file on the server, then restart the agent, everything becomes functional again and the internal crons do what they should :
Then, every second, the internal cron checks whether it has something to deliver :
This looks like a Zabbix agent issue where the internal crons or polling crash when retrieving an item's data fails.
The agent was an old version, so I tried updating it to the latest, but the behavior is exactly the same and can be easily reproduced.