In environments with a very high NVPS (3500+), disaster recovery can be very difficult with the way proxies currently send data to the Zabbix server. I recently had a situation where my Zabbix DB went offline for a long period of time (about 2 hours). During that time, the proxies cached data as they should. The real problem came when my database came back online:
1. I have 19 proxies. During the time that Zabbix was down, 12 of those proxies had each cached roughly 1.5 million values. The remaining 7 proxies each had somewhere around 500k values. The total number of cached values was in the neighborhood of 21.5 million.
2. When the database came back online, the Zabbix Server connected to it just fine and started acting normally.
3. When the proxies realized they could send their data, they all started sending it as fast as they possibly could (1000 values at a time).
5. The server couldn't handle the huge amount of data coming in from the proxies. The history cache filled up and all the history syncers became busy.
6. Once #4 occurred, the overall NVPS of the system fell below normal levels: it usually averages 3500-4000, but it dropped below 2000.
6. I waited a good 30 minutes or so to see if Zabbix could recover, but the proxies kept pounding the server. It became so bad that all of my proxies started collecting data faster than they could send it.
7. Because of #6, there was no way Zabbix could ever recover on its own. I was left with two options: truncate history on the proxies (not an option I could take) or shut off most of the proxies to give the remaining ones an actual chance to send their data. I went with the latter, because I couldn't do the former.
8. As proxies finished sending in their data, I would start up 2 more and let them catch up. I followed that procedure until all my proxies were caught up. The whole catch up process took about 2 hours.
9. Because Zabbix has no way to control a large influx of data, I had several thousand hosts that weren't being monitored during the whole catch-up period. That's unacceptable.
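The "two at a time" recovery procedure from steps 7-9 can be sketched roughly like this. Every name here is a hypothetical stand-in; in reality I was doing this by hand with service stops/starts and backlog checks on each proxy's DB:

```python
# Sketch of the manual staggered recovery from steps 7-9. start_proxy,
# stop_proxy, and backlog are placeholders for whatever you actually use
# (service scripts, SQL queries against proxy_history, etc.).

def staggered_recovery(proxies, backlog, start_proxy, stop_proxy, concurrency=2):
    """Let only `concurrency` proxies drain their backlog at a time.

    Returns the order in which proxies were started.
    """
    for p in proxies:
        stop_proxy(p)              # step 7: shut everything off first
    waiting = list(proxies)
    active = []
    order = []
    while waiting or active:
        # step 8: top up to `concurrency` running proxies
        while waiting and len(active) < concurrency:
            p = waiting.pop(0)
            start_proxy(p)
            order.append(p)
            active.append(p)
        # a proxy is "caught up" once its local backlog reaches zero
        active = [p for p in active if backlog(p) > 0]
    return order
```

This is exactly the kind of babysitting the server should not require: the operator is acting as a manual flow controller because the daemons can't do it themselves.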
I can understand a large quantity of data taking time to come into the Zabbix server, but at the very least Zabbix should have the intelligence to control that flow of data. If the proxy and server daemons were able to pass health statistics to one another, the server could tell a proxy to slow down the amount of data it is sending, or to speed up if the server could handle more. It could even cut off the flow of data from proxies entirely until the history cache regained a certain amount of free space.
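To make the idea concrete, here is a rough sketch of the kind of feedback loop I mean. Nothing like this exists in Zabbix today; every name, threshold, and batch size below is made up for illustration:

```python
# Hypothetical server/proxy flow-control handshake. None of these names or
# numbers are real Zabbix internals; this only illustrates the proposal.

HISTORY_CACHE_SIZE = 1_000_000   # values the server's history cache can hold (assumed)

def server_advice(cache_used: int) -> dict:
    """Server side: turn history cache pressure into throttling advice."""
    free_pct = 100 * (HISTORY_CACHE_SIZE - cache_used) // HISTORY_CACHE_SIZE
    if free_pct < 10:
        return {"action": "pause", "retry_after_s": 30}    # cut off the flow
    if free_pct < 30:
        return {"action": "throttle", "batch_size": 250}   # slow down
    return {"action": "ok", "batch_size": 1000}            # full speed

def proxy_next_batch(advice: dict, backlog: list) -> list:
    """Proxy side: honour the server's advice when draining cached values."""
    if advice["action"] == "pause":
        return []                                          # wait and retry later
    return backlog[: advice["batch_size"]]
```

With something like this in place, the proxies would have backed off on their own when the history cache filled up, instead of pounding the server with 1000-value batches until I intervened.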
In this case, the DB was not a bottleneck. As usual in my system, disk IO was normal, CPU was fine and memory was fine. The Zabbix server just starts freaking out when it gets too much data; it doesn't know how to handle it and can't communicate its problems to the source of the data.
To be fair, my load testing for Zabbix 2.0.x showed that it can only reliably handle about 12,500 NVPS, but at the same time, this was a disaster recovery situation, not normal operation. In this case, I had no way of controlling the flow of data other than shutting off proxies.
By the way, my Zabbix 2.2.x load testing showed that it could reliably handle about 30,000 NVPS, but the same problem will still occur in a disaster recovery situation.