[ZBX-7968] The Zabbix server pretty much just stops importing data from proxies Created: 2014 Mar 19  Updated: 2017 May 30  Resolved: 2014 May 26

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.2
Fix Version/s: 2.2.4rc1, 2.3.0

Type: Incident report Priority: Major
Reporter: Corey Shaw Assignee: Unassigned
Resolution: Fixed Votes: 3
Labels: patch, proxy
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.2 x64
Percona MySQL 5.6.15
24 DBSyncers


Attachments: PNG File Internal Processes.png     File ZBX-7968-update-lastid-2.2.2.patch     PNG File Zabbix_server_stats.png     PNG File zabbix_busy_procs.png     PNG File zabbix_busy_procs2.png     File zabbix_server.conf    
Issue Links:
Duplicate

 Description   

I've run into this issue a few times, but I've never been able to replicate it at will. Here's what appears to cause the issue:

1. My Zabbix proxies will suddenly start getting a massive backlog of data. I really do mean massive. I have 22 proxies and I've seen this issue occur when there are a total of 8+ million values backlogged between them all. This can have various causes (lots of data sent in at once, the server being offline for a while, etc.).
2. When #1 occurs, the zabbix server begins accepting all the data (as expected).
3. I can see all of my history syncers inserting data as fast as they can into the database (as expected).
4. After a while (the time varies), the database load will suddenly disappear. At any given time I might see one insert query from the DBSyncers, but that's it.
5. After the load has disappeared, I can see that items polled by, or pushed directly to, the Zabbix server itself are still being updated.
6. At the same time as #5, I can see that barely any data is being retrieved from the Zabbix proxies at all. The backlog of data keeps growing larger and larger.
7. After waiting for a while without the problem going away, I'll give up and restart the zabbix_server process.
8. After restarting the zabbix_server process, the server immediately begins to process data like crazy from the proxies (like it should have been doing BEFORE the restart).

When this problem is occurring, the load on my database server is almost non-existent. It literally isn't doing anything. The same goes for the Zabbix server. There is nothing to indicate that the server is overly busy. The last time this happened (about 30 minutes ago), my monitoring on the various zabbix processes/caches didn't show anything odd. The history cache usage had grown quite a bit before #4 happened, but that was expected because so much data was coming in from my proxies. At no point did the free space for the history cache reach 0% (at least according to my graphs/data).

I have attached a copy of my zabbix_server.conf and a screenshot of the cache, data gathering processes, and internal processes usage. Notice that my history syncers were maxed out, but as I mentioned earlier, they were not doing anything on the DB itself. The DB server was idle.



 Comments   
Comment by Corey Shaw [ 2014 Mar 19 ]

I should have mentioned that the gap in data in the graphs is from when I restarted the zabbix_server process. Things began working normally after that (although you can see the recovery time after the gap).

Comment by richlv [ 2014 Mar 19 ]

in some cases a long config cache update has caused something similar - do you know how long it takes for the server to update/populate the config cache?

Comment by Corey Shaw [ 2014 Mar 19 ]

richlv - It takes around 15-20 seconds to update the cache

Comment by Armacom AG [ 2014 May 05 ]

Using Zabbix v2.0.9 I experience the same behavior (Debian 7, MySQL 5.5.31+dfsg-0+wheezy1, StartDBSyncers=8, 2 nodes, 6 proxies, 104.97 VPS, 6 GB RAM, 4x 3.5 GHz CPU, CacheSize=64M, HistoryCacheSize=128M, TrendCacheSize=16M, HistoryTextCacheSize=256M, kernel.shmmax=536870912, kernel.shmall=2097152).

Comment by Armacom AG [ 2014 May 05 ]

zabbix processes

Comment by Armacom AG [ 2014 May 05 ]

another example of the zabbix busy processes while the problem happens

Comment by Armacom AG [ 2014 May 05 ]

this discussion sounds pretty much like the same issue: https://www.zabbix.com/forum/showthread.php?t=43839

Comment by Armacom AG [ 2014 May 07 ]

For this issue, I've found a possible solution (which at least worked for me):
MySQL can hit a performance wall with databases as large as the ones Zabbix tends to produce.
Possibly this could be eliminated using MySQL >= 5.6 (you can configure MySQL so that each table, including its indexes, is stored as a separate file; that way ibdata1 will not grow as large. This is enabled by default as of MySQL 5.6).
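For reference, a quick way to check the setting described above, and to enable it at runtime, shown as a hypothetical MySQL session (not taken from this installation; note it only applies to tables created or rebuilt afterwards, so an existing ibdata1 will not shrink):

SHOW VARIABLES LIKE 'innodb_file_per_table';
SET GLOBAL innodb_file_per_table = ON; -- affects only tables created or rebuilt from now on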

To get around the problem, this is what I did:

  • stop zabbix-processes
  • go to mysql (root account)
  • switch to the zabbix database
  • do a manual housekeeping:
    delete from alerts where clock < unix_timestamp() - 31*86400;
    delete from events where clock < unix_timestamp() - 31*86400;
    delete from history where clock < unix_timestamp() - 31*86400;
    delete from history_uint where clock < unix_timestamp() - 31*86400;
    delete from history_str where clock < unix_timestamp() - 31*86400;
    delete from history_text where clock < unix_timestamp() - 31*86400;
    delete from history_log where clock < unix_timestamp() - 31*86400;
    delete from trends where clock < unix_timestamp() - 31*86400;
    delete from trends_uint where clock < unix_timestamp() - 31*86400;
  • exit from mysql
  • start the zabbix-processes
  • restart zabbix proxy-process on your proxies

Using this method, rows older than 31 days are deleted from the tables listed in the statements above.
Afterwards Zabbix performs very well again (as expected).

Comment by Corey Shaw [ 2014 May 07 ]

I know that MySQL 5.6 doesn't resolve the issue since I've been running Zabbix using MySQL 5.6 since the very start. I've also been using innodb_file_per_table which you allude to (that's been around at least since MySQL 5.0). Also, I don't keep more than 28 days worth of data in my database. I use partitioning to take care of that. I'm glad that the steps you took resolved the issue for you, but I don't feel that it is the actual "fix".
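For context, the partitioning mentioned above looks roughly like this (a sketch only; the partition names and boundaries are illustrative, not the actual setup, and the ALTER has to be done per history/trends table):

ALTER TABLE history PARTITION BY RANGE (clock) (
    PARTITION p2014_05_13 VALUES LESS THAN (UNIX_TIMESTAMP('2014-05-14 00:00:00')),
    PARTITION p2014_05_14 VALUES LESS THAN (UNIX_TIMESTAMP('2014-05-15 00:00:00'))
);

Old data is then removed by dropping whole partitions instead of running row-by-row deletes.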

Comment by Corey Shaw [ 2014 May 07 ]

I should mention along with my previous comment that I have not seen this problem for a while now (it's been about 1.5 months).

Comment by Corey Shaw [ 2014 May 13 ]

I'm seeing this issue again now. This time, however, zabbix server restarts and zabbix proxy restarts don't help at all. My Zabbix server's history cache never filled up, either. I've been comparing debug logs from a healthy proxy (one that doesn't have the issue) and an unhealthy proxy (one that isn't sending historical data at all). I've been able to ascertain that the unhealthy proxy is sending availability data for hosts to the server, and there is no problem connecting to the server at all. It also gets 1000 rows of data that should be sent to the server; it just never actually sends them. There are no errors in the logs anywhere.

My healthy proxy logs show the healthy proxy is sending data without any issues whatsoever. It has all the exact same types of log entries as the unhealthy proxy; it just also happens to send the data to the server (whereas the unhealthy proxy does not).
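For anyone comparing proxies the same way, here is a rough measure of the backlog straight from a proxy's database (assuming the 2.2 proxy schema, where the data sender's position is kept in the ids table under field_name 'history_lastid'; verify those names against your schema):

SELECT COUNT(*) AS unsent_rows
FROM proxy_history
WHERE id > (SELECT nextid
            FROM ids
            WHERE table_name = 'proxy_history'
              AND field_name = 'history_lastid');

On an unhealthy proxy this count keeps growing; on a healthy one it stays near zero.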

Comment by Corey Shaw [ 2014 May 13 ]

On closer inspection, it looks like no data is actually being returned by the proxy_get_history_data() function. On my unhealthy proxy I get a line like this:

End of proxy_get_history_data():0 lastid:1438673158

The interesting thing about it is that the query it runs to get the data is:

select id,itemid,clock,ns,timestamp,source,severity,value,logeventid,state from proxy_history where id>1438672158 order by id limit 1000

If I run that query manually in the proxy database, I get 1000 results as expected. In fact, notice in the "End of proxy_get_history_data" line that I provided above that the "lastid" found is 1000 higher than the id number in the select statement. To me, this means that the proxy is in fact finding data, it's just not doing anything with it.
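A hypothetical way to verify that the batch is being read but discarded: count how many of those 1000 rows reference itemids the proxy no longer has (using the starting id from the log line above):

SELECT COUNT(*) AS stale_rows
FROM (SELECT itemid
      FROM proxy_history
      WHERE id > 1438672158
      ORDER BY id
      LIMIT 1000) batch
LEFT JOIN items i ON i.itemid = batch.itemid
WHERE i.itemid IS NULL;

If stale_rows comes back as 1000, the entire batch is unsendable and the stored lastid never moves forward.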

Comment by Corey Shaw [ 2014 May 14 ]

Upon further inspection, most likely what is occurring is that DCconfig_get_items_by_itemids is setting errcodes[i] = FAIL for all 1000 items that should be sent to the server.

Comment by Corey Shaw [ 2014 May 14 ]

This if statement appears to be the cause of the FAIL in errcodes (dbconfig.c, DCconfig_get_items_by_itemids()):

if (NULL == (dc_item = zbx_hashset_search(&config->items, &itemids[i])) ||
        NULL == (dc_host = zbx_hashset_search(&config->hosts, &dc_item->hostid)))

I am unable to figure out why all of my items would get an errcode of FAIL. Two of my three proxies that had this issue have suddenly begun working again without any intervention on my part (but they had issues for around 6 hours). The third proxy is also working, but I had just deleted a row from proxy_history in the database for an itemid that did not have a corresponding item in the items table on the proxy. It could be completely coincidental, but maybe not.

Comment by Corey Shaw [ 2014 May 14 ]

I've had a chance now to review my proxy logs and the audit log to see what potentially caused this problem to occur. The issue started for me at 9:02am PDT right when a large number of hosts were moved around between proxies. In fact, I have 9 proxies that all had hosts moved around on them. Just after that change was made, my problem started on three of the nine proxies that had changes. Six hours later at about 3:02pm PDT those same nine proxies all had another change where hosts were moved around between them. Just after that change was made the three proxies that had issues began working just fine again. The other six proxies continued working normally.

It appears there is some sort of corner case that I'm hitting when it comes to proxies receiving changes to their configuration. The next time I begin having issues with this, I'll pay close attention to the debug logs when a config sync happens that causes the problem and when the problem goes away. Unfortunately, I don't have any debug logs for either of those events this time around.

Comment by Corey Shaw [ 2014 May 20 ]

I have been experiencing this issue again since around 10:45 PDT on May 19. Again this issue occurred after a large number of hosts had been moved around on proxies.

Comment by Corey Shaw [ 2014 May 20 ]

I finally found what causes this issue. It appears that when hosts are moved around between proxies, there is the potential for old data to get "stuck" on proxies. When this happens, the proxy_history table will have data for items that the "items" table no longer has. I believe this causes the hashset searches I mentioned in an earlier comment to keep failing, because there are no entries for those itemids, so no data ever gets sent. I worked around the issue in this case by deleting the old data from the proxy_history table. The instant I did that, all the "stuck" data on my proxy began to be sent to the zabbix server.

I know this is fairly inefficient for now, but here are the queries that I used to find the data and then delete it:

CREATE TABLE test SELECT DISTINCT itemid FROM proxy_history WHERE itemid NOT IN (SELECT itemid FROM items);
DELETE ph FROM proxy_history ph INNER JOIN test t ON t.itemid=ph.itemid;
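An equivalent single-statement variant that avoids the helper table (a sketch under the same assumptions; stop the proxy and back up the database before running it):

DELETE ph
FROM proxy_history ph
LEFT JOIN items i ON i.itemid = ph.itemid
WHERE i.itemid IS NULL;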

Comment by Corey Shaw [ 2014 May 20 ]

If this particular problem works the way I theorize it does, then pretty much any time hosts are moved around between proxies, the risk of this issue occurring increases. I believe there is a failure on the part of the proxy to delete data from proxy_history for items it no longer has. I theorize that this failure is also perpetuated by the housekeeper not deleting this old data, due to my ProxyOfflineBuffer being set to 24.

The reason data never ends up getting sent is a result of how the data sender process works (from what I know). It grabs the next 1000 values from the proxy_history table, parses through the data, and then sends it on to the server. In my case, the 1000 values that the data sender process grabbed were all old data. As a result, the proxy would not send them to the server, but the data was never deleted. That meant that all subsequent runs of the data sender process would keep grabbing the same 1000 values over and over - hence, no valid data could ever be sent to the server.

This methodology introduces another problem: while the proxy grabs 1000 values at a time, if 250 of those belong to invalid items then the actual throughput of my proxy is 750 values per server connection. It just gets worse and worse as more invalid items end up in the list of 1000.

Comment by Corey Shaw [ 2014 May 20 ]

One last thought, and I promise I'll shut up for at least one day. Would it be a possible solution to just make the "nextid" column in the "ids" table increment even if none of the items were able to be sent to the zabbix server (due to not being in the items table)? This would avoid this particular problem entirely, and the invalid data would still get cleaned up by the housekeeper later on.
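For illustration, the same unsticking could in principle be done by hand (a hypothetical intervention, assuming the data sender's position is the nextid of the ids row below; any valid values inside the skipped range would be lost for good, and the proxy should be restarted afterwards in case it caches this value):

UPDATE ids
SET nextid = nextid + 1000
WHERE table_name = 'proxy_history'
  AND field_name = 'history_lastid';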

Comment by Corey Shaw [ 2014 May 21 ]

The attached patch potentially fixes this edge case, which only occurs when there are 1000 contiguous values in proxy_history that no longer map to valid items on the proxy.

Comment by Corey Shaw [ 2014 May 21 ]

I have had a chance to test the attached patch in my staging environment. I am unable to cause the issue with the patched code, but before patching I could cause the issue at will by simply adding 1000 history entries with bogus itemids to the proxy_history table. It looks like this at least solves the issue. I'm not aware of any side effects, but please let me know if this is a bad way to fix the issue.
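A minimal sketch of that reproduction (the itemid is a placeholder that must not exist in the proxy's items table; it assumes proxy_history already holds at least 1000 rows and that the columns not listed have defaults):

-- insert 1000 history rows pointing at a non-existent item
INSERT INTO proxy_history (itemid, clock, ns, value)
SELECT 999999999, UNIX_TIMESTAMP(), 0, '0'
FROM proxy_history
LIMIT 1000;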

Comment by Andris Zeila [ 2014 May 26 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-7968

Comment by Andris Zeila [ 2014 May 27 ]

Corey, thanks for debugging the issue and providing the patch!

Released in:
pre-2.2.4rc1 r45877
pre-2.3.1 r45878
