The server logs this error constantly when polling Solaris 10 hosts with a large number (~1300+) of active processes:
zabbix_server[27363]: Zabbix agent item "proc.num[,,run]" on host "testhost" failed: first network error, wait for 15 seconds
This causes problems, such as missing data in history and consequently holes in graphs, not only for proc.num[,,run] but also for other items on these hosts (which I presume are being aborted by the above timeout along with proc.num[,,run]).
I did a little analysis of the relevant agent source code (src/libs/zbxsysinfo/solaris/proc.c) and I believe it can be made more efficient, which would avoid the problem, or at least defer it so that it only occurs with much larger process tables. I plan to write and contribute a patch for the agent to implement this.
I'm creating this issue to record the problem and my progress so far, and to serve as a focal point for future work on it.