Type: Incident report
Affects Version/s: 1.8.1
Fix Version/s: None
Component/s: Server (S)
Environment: Sun C compiler
Every once in a while, a host builds up a large number of items in the queue. Investigating the issue, I found a network error for that host in zabbix_server.log:
3800:20100302:021113.207 Item [prod-app.local:perf_counter[\System\File Write Bytes/sec]] error: Get value from agent failed: Cannot connect to [10.10.0.56:10050] [Interrupted system call]
3800:20100302:021113.208 ZABBIX Host [prod-app.local]: first network error, wait for 15 seconds
Those are the only entries for the host, even with high error logging enabled. The log says it will wait 15 seconds, but the retry never happens, and the queue time for all of the host's items just keeps growing.
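To spot affected hosts without waiting for the queue to grow, the log can be scanned for a "first network error" line that is never followed by any later line naming the same host. The following is only a rough sketch: the line layout is inferred from the two log lines quoted above, the function name `find_stalled_hosts` is made up for illustration, and since I do not know what a successful retry looks like in the log, any later line that mentions the host is treated as follow-up activity.

```python
import re

# Assumed line layout, inferred only from the two log lines quoted above:
#   <pid>:<YYYYMMDD>:<HHMMSS.mmm> <message>
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3}) (.*)$")
FIRST_ERROR_RE = re.compile(r"Host \[([^\]]+)\]: first network error")


def find_stalled_hosts(lines):
    """Return {host: timestamp} for 'first network error' lines with no later line naming the host."""
    pending = {}  # host -> "YYYYMMDD HHMMSS.mmm" of the unanswered first-error line
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        _pid, date, time, message = match.groups()
        first = FIRST_ERROR_RE.search(message)
        if first:
            pending[first.group(1)] = f"{date} {time}"
            continue
        # Assumption: any later line mentioning a pending host counts as follow-up.
        for host in list(pending):
            if f"[{host}]" in message or f"[{host}:" in message:
                del pending[host]
    return pending
```

It could be run against the server log with something like `find_stalled_hosts(open("zabbix_server.log"))` (adjust the path to wherever the log lives). A host that stays in the result long after the 15 seconds have passed is a likely candidate for this issue.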
Using "zabbix_get" manually, I can retrieve data just fine:
- /usr/zabbix/bin/zabbix_get -s prod-app.local -k agent.ping
- /usr/zabbix/bin/zabbix_get -s prod-app.local -k "perf_counter[\System\File Write Bytes/sec]"
I have to disable the host and then re-enable it to get the items working again. After that, it can be hours, days, or weeks before I see the issue again, usually on a different host. The retry doesn't appear to happen.