[ZBX-7968] The Zabbix server pretty much just stops importing data from proxies Created: 2014 Mar 19 Updated: 2017 May 30 Resolved: 2014 May 26 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.2.2 |
Fix Version/s: | 2.2.4rc1, 2.3.0 |
Type: | Incident report | Priority: | Major |
Reporter: | Corey Shaw | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 3 |
Labels: | patch, proxy | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
CentOS 6.2 x64 |
Attachments: | Internal Processes.png ZBX-7968-update-lastid-2.2.2.patch Zabbix_server_stats.png zabbix_busy_procs.png zabbix_busy_procs2.png zabbix_server.conf | ||||
Issue Links: |
|
Description |
I've run into this issue a few times, but I've never been able to replicate it at will. Here's what appears to cause the issue:

1. My Zabbix proxies will suddenly start getting a massive backlog of data. I really do mean massive: I have 22 proxies and I've seen this issue occur when there is a total of 8+ million values backlogged between them all. This can be a result of various causes (lots of data sent in at once, server was offline for a while, etc.).

When this problem is occurring, the load on my database server is almost non-existent. It literally isn't doing anything. The same goes for the Zabbix server; there is nothing to indicate that the server is overly busy. The last time this happened (about 30 minutes ago), my monitoring of the various Zabbix processes/caches didn't show anything odd. The history cache usage had grown quite a bit before #4 happened, but that was expected because so much data was coming in from my proxies. At no point did the free space for the history cache reach 0% (at least according to my graphs/data).

I have attached a copy of my zabbix_server.conf and a screenshot of the cache, data gathering processes, and internal processes usage. Notice that my history syncers were maxed out, but as I mentioned earlier, they were not doing anything on the DB itself. The DB server was idle. |
Comments |
Comment by Corey Shaw [ 2014 Mar 19 ] |
I should have mentioned that the gap in data in the graphs is from when I restarted the zabbix_server process. Things began working normally after that (although you can see the recovery time after the gap). |
Comment by richlv [ 2014 Mar 19 ] |
In some cases a long config cache updating process has caused something similar - do you know how long it takes for the server to update/populate the config cache? |
Comment by Corey Shaw [ 2014 Mar 19 ] |
richlv - It takes around 15-20 seconds to update the cache |
Comment by Armacom AG [ 2014 May 05 ] |
Using Zabbix V. 2.0.9 I experience the same behavior (Debian7, MySQL 5.5.31+dfsg-0+wheezy1, StartDBSyncers=8, 2 Nodes, 6 Proxies, 104.97 VPS, 6G RAM, 4x 3.5G CPU, CacheSize=64M, HistoryCacheSize=128M, TrendCacheSize=16M, HistoryTextCacheSize=256M, kernel.shmmax = 536870912, kernel.shmall = 2097152) |
Comment by Armacom AG [ 2014 May 05 ] |
zabbix processes |
Comment by Armacom AG [ 2014 May 05 ] |
another example of the zabbix busy processes while the problem happens |
Comment by Armacom AG [ 2014 May 05 ] |
This discussion sounds pretty much like the same issue: https://www.zabbix.com/forum/showthread.php?t=43839 |
Comment by Armacom AG [ 2014 May 07 ] |
For this issue, I've found a possible solution (which at least worked for me). To get around the problem, this is what I did:
Using this method, rows in the tables listed above that are older than 31 days (per the MySQL statement) will be deleted. |
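The concrete MySQL statements did not survive in the comment above. As a hedged sketch of the kind of 31-day purge being described (table and column names taken from the Zabbix schema, modeled here against SQLite purely for illustration):

```python
import sqlite3
import time

# Illustrative sketch only: delete history rows older than 31 days.
# Zabbix stores sample timestamps as Unix time in the integer `clock` column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (itemid INTEGER, clock INTEGER, value REAL)")

now = int(time.time())
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        (1, now - 40 * 86400, 0.5),  # 40 days old -> should be purged
        (1, now - 5 * 86400, 0.7),   # 5 days old  -> should be kept
    ],
)

# The purge itself: everything with a clock before the 31-day cutoff goes.
cutoff = now - 31 * 86400
conn.execute("DELETE FROM history WHERE clock < ?", (cutoff,))

remaining = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
print(remaining)  # 1 -- only the recent row survives
```

The same `DELETE ... WHERE clock < cutoff` shape would presumably be repeated per table (history, history_uint, trends, etc.) in the actual MySQL workaround.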
Comment by Corey Shaw [ 2014 May 07 ] |
I know that MySQL 5.6 doesn't resolve the issue since I've been running Zabbix using MySQL 5.6 since the very start. I've also been using innodb_file_per_table which you allude to (that's been around at least since MySQL 5.0). Also, I don't keep more than 28 days worth of data in my database. I use partitioning to take care of that. I'm glad that the steps you took resolved the issue for you, but I don't feel that it is the actual "fix". |
Comment by Corey Shaw [ 2014 May 07 ] |
I should mention along with my previous comment that I have not seen this problem for a while now (it's been about 1.5 months). |
Comment by Corey Shaw [ 2014 May 13 ] |
I'm seeing this issue again now. This time, however, zabbix server restarts and zabbix proxy restarts don't help at all. My Zabbix server didn't fill up its history cache at all, either. I've been comparing debug logs from a healthy proxy (one that doesn't have the issue) and an unhealthy proxy (one that isn't sending historical data at all). I've been able to ascertain that the unhealthy proxy is sending availability data for hosts to the server; there is no problem connecting to the server at all. It also gets 1000 rows of data that should be sent to the server, it just never actually sends them. There are no errors in the logs anywhere. My healthy proxy's logs show it is sending data without any issues whatsoever. It has all the exact same types of log entries as the unhealthy proxy, it just also happens to send the data to the server (whereas the unhealthy proxy does not). |
Comment by Corey Shaw [ 2014 May 13 ] |
On closer inspection, it looks like no data is actually being returned by the proxy_get_history_data() function. On my unhealthy proxy I get a line like this:

End of proxy_get_history_data():0 lastid:1438673158

The interesting thing about it is that the query it runs to get the data is:

select id,itemid,clock,ns,timestamp,source,severity,value,logeventid,state from proxy_history where id>1438672158 order by id limit 1000

If I run that query manually in the proxy database, I get 1000 results as expected. In fact, notice in the "End of proxy_get_history_data" line above that the "lastid" found is 1000 higher than the id number in the select statement. To me, this means that the proxy is in fact finding data, it's just not doing anything with it. |
Comment by Corey Shaw [ 2014 May 14 ] |
Upon further inspection, most likely what is occurring is that DCconfig_get_items_by_itemids is setting errcodes[i] = FAIL for all 1000 items that should be sent to the server. |
Comment by Corey Shaw [ 2014 May 14 ] |
This if statement appears to be the cause of the FAIL in errcodes (dbconfig.c, DCconfig_get_items_by_itemids()):

if (NULL == (dc_item = zbx_hashset_search(&config->items, &itemids[i])) || NULL == (dc_host = zbx_hashset_search(&config->hosts, &dc_item->hostid)))

I am unable to figure out why all of my items would get an errcode of FAIL. Two of my three proxies that had this issue have suddenly begun working again without any intervention on my part (but they had issues for around 6 hours). The third proxy is also working, but I had just barely deleted a row from proxy_history in the database for an itemid that did not have a corresponding item in the items table on the proxy. It could be completely coincidental, but maybe not. |
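The effect of that C condition can be mirrored with a small sketch (plain Python dicts standing in for the configuration-cache hashsets; function and field names here are illustrative, not Zabbix's actual API):

```python
# Sketch of what DCconfig_get_items_by_itemids does per the comment above:
# each requested itemid is looked up in the items hashset, then its host in
# the hosts hashset; if either lookup misses, that slot gets FAIL and the
# corresponding proxy_history value is never forwarded to the server.
SUCCEED, FAIL = 0, -1

def get_items_by_itemids(config_items, config_hosts, itemids):
    errcodes = []
    for itemid in itemids:
        dc_item = config_items.get(itemid)
        if dc_item is None or config_hosts.get(dc_item["hostid"]) is None:
            errcodes.append(FAIL)     # item (or its host) left the config cache
        else:
            errcodes.append(SUCCEED)
    return errcodes

# Items 10 and 11 were moved off this proxy; only item 12 is still configured.
items = {12: {"hostid": 100}}
hosts = {100: {"host": "web01"}}
errcodes = get_items_by_itemids(items, hosts, [10, 11, 12])
print(errcodes)  # [-1, -1, 0]
```

If all 1000 itemids in a batch are stale, every slot comes back FAIL, which matches the "0 rows returned, lastid advanced by 1000" log line above.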
Comment by Corey Shaw [ 2014 May 14 ] |
I've had a chance now to review my proxy logs and the audit log to see what potentially caused this problem to occur. The issue started for me at 9:02am PDT right when a large number of hosts were moved around between proxies. In fact, I have 9 proxies that all had hosts moved around on them. Just after that change was made, my problem started on three of the nine proxies that had changes. Six hours later at about 3:02pm PDT those same nine proxies all had another change where hosts were moved around between them. Just after that change was made the three proxies that had issues began working just fine again. The other six proxies continued working normally. It appears there is some sort of corner case that I'm hitting when it comes to proxies receiving changes to their configuration. The next time I begin having issues with this, I'll pay close attention to the debug logs when a config sync happens that causes the problem and when the problem goes away. Unfortunately, I don't have any debug logs for either of those events this time around. |
Comment by Corey Shaw [ 2014 May 20 ] |
I have been experiencing this issue again since around 10:45 PDT on May 19. Again this issue occurred after a large number of hosts had been moved around on proxies. |
Comment by Corey Shaw [ 2014 May 20 ] |
I finally found what causes this issue. It appears that when hosts are moved between proxies there is the potential for old data to get "stuck" on proxies. When this happens, the proxy_history table will have data for items that the "items" table no longer has. I believe this causes the hashset searches that I mentioned in an earlier comment to constantly be looking up data for which there are no hashes, so no data ever gets sent. I worked around the issue in this case by deleting the old data from the proxy_history table. The instant I did that, all the "stuck" data on my proxy began to be sent to the Zabbix server. I know this is fairly inefficient for now, but here are the queries that I used to find the data and then delete it:

CREATE TABLE test SELECT DISTINCT itemid FROM proxy_history WHERE itemid NOT IN (SELECT itemid FROM items); |
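Only the CREATE TABLE half of the cleanup survives in the comment above; the DELETE step is not shown. Presumably it used the temporary `test` table to remove the orphaned rows, along the lines of the following sketch (run against SQLite here purely for illustration; the second statement is an assumption, not the commenter's verbatim query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (itemid INTEGER PRIMARY KEY);
CREATE TABLE proxy_history (id INTEGER PRIMARY KEY, itemid INTEGER);
INSERT INTO items VALUES (1), (2);
-- rows with itemid 99 are "stuck": no matching row in items
INSERT INTO proxy_history (itemid) VALUES (1), (2), (99), (99);
""")

# Step 1 (from the comment): collect itemids that no longer exist in items.
conn.execute("""CREATE TABLE test AS
                SELECT DISTINCT itemid FROM proxy_history
                WHERE itemid NOT IN (SELECT itemid FROM items)""")

# Step 2 (assumed): delete the stuck proxy_history rows using that list.
conn.execute(
    "DELETE FROM proxy_history WHERE itemid IN (SELECT itemid FROM test)")

left = [r[0] for r in
        conn.execute("SELECT itemid FROM proxy_history ORDER BY id")]
print(left)  # [1, 2] -- only rows with a configured item remain
```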
Comment by Corey Shaw [ 2014 May 20 ] |
If this particular problem works how I theorize it does, then pretty much any time hosts are moved between proxies, the risk of this issue occurring increases. I believe there is a failure on the part of the proxy to delete data from proxy_history for items it no longer has. I theorize that this failure is also perpetuated by the housekeeper not deleting this old data due to my ProxyOfflineBuffer being set to 24. The reason data never ends up getting sent is a result of how the data sender process works (from what I know): it grabs the next 1000 values from the proxy_history table, parses through the data, and then sends it on to the server. In my case, the 1000 values that the data sender process grabbed were all old data. As a result, the proxy would not send them to the server, but the data was never deleted. That meant that all subsequent runs of the data sender process would keep grabbing the same 1000 values over and over - hence, no valid data could ever be sent to the server. The other problem this methodology introduces is that while the proxy grabs 1000 values at a time, if 250 of those are invalid items then the actual throughput of my proxy is 750 values per server connection. It just gets worse and worse as more invalid items end up in the list of 1000. |
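The deadlock described above can be sketched as a small simulation (names and batch size are illustrative, not Zabbix's actual code): because the stored position only advances when a batch contains something sendable, a batch made up entirely of orphaned rows pins the sender in place forever.

```python
# Sketch of the proxy data-sender loop as the comment describes it: fetch the
# next batch after `lastid`, drop rows whose itemid is no longer configured,
# send the rest, and advance lastid only when something was actually sendable.
BATCH = 4  # stands in for the real batch size of 1000

def sender_pass(history, valid_itemids, lastid):
    batch = [row for row in history if row["id"] > lastid][:BATCH]
    sendable = [row for row in batch if row["itemid"] in valid_itemids]
    if sendable:
        lastid = max(row["id"] for row in batch)  # advance past this batch
    return sendable, lastid

# A full batch of orphaned rows sits in front of two valid ones.
history = [{"id": i, "itemid": 99} for i in range(1, BATCH + 1)]
history += [{"id": i, "itemid": 1} for i in range(BATCH + 1, BATCH + 3)]

lastid, sent_total = 0, 0
for _ in range(3):  # three passes, all stuck on the same orphaned batch
    sent, lastid = sender_pass(history, {1}, lastid)
    sent_total += len(sent)

print(sent_total, lastid)  # 0 0 -- nothing is ever sent, lastid never moves
```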
Comment by Corey Shaw [ 2014 May 20 ] |
One last thought, and I promise I'll shut up for at least one day. Would it be a possible solution to just make the "nextid" column in the "ids" table increment even if none of the items were able to be sent to the Zabbix server (due to not being in the items table)? This would avoid this particular problem entirely, and the invalid data would still get cleaned up by the housekeeper later on. |
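In terms of the sender loop the earlier comments describe, the suggested change amounts to advancing the stored position past every fetched batch even when nothing in it is sendable (again an illustrative sketch, not the actual ZBX-7968 patch):

```python
# Sketch of the suggested fix: advance lastid past each fetched batch
# unconditionally, so a batch of orphaned rows cannot block the sender.
BATCH = 4  # stands in for the real batch size of 1000

def sender_pass_fixed(history, valid_itemids, lastid):
    batch = [row for row in history if row["id"] > lastid][:BATCH]
    if not batch:
        return [], lastid
    sendable = [row for row in batch if row["itemid"] in valid_itemids]
    lastid = max(row["id"] for row in batch)  # advance even if sendable is empty
    return sendable, lastid

# Same layout as before: a full orphaned batch in front of two valid rows.
history = [{"id": i, "itemid": 99} for i in range(1, BATCH + 1)]
history += [{"id": i, "itemid": 1} for i in range(BATCH + 1, BATCH + 3)]

lastid, sent_total = 0, 0
for _ in range(3):
    sent, lastid = sender_pass_fixed(history, {1}, lastid)
    sent_total += len(sent)

print(sent_total, lastid)  # 2 6 -- the valid rows get through on pass two
```

The orphaned rows are simply skipped over and left for the housekeeper, which matches the behavior the comment proposes.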
Comment by Corey Shaw [ 2014 May 21 ] |
The attached patch potentially fixes this edge case that only occurs if there are 1000 contiguous values in proxy_history that are no longer valid items for the proxy. |
Comment by Corey Shaw [ 2014 May 21 ] |
I have had a chance to test the attached patch in my staging environment. I am unable to cause the issue with the patched code, but before patching I could cause the issue at will by simply adding 1000 bogus itemid history entries in the proxy_history table. It looks like this at least solves the issue. I'm not aware of any side-effects, but please let me know if this is a bad way to fix the issue. |
Comment by Andris Zeila [ 2014 May 26 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-7968 |
Comment by Andris Zeila [ 2014 May 27 ] |
Corey, thanks for debugging the issue and providing a patch! Released in: |