[ZBX-323] Zabbix 1.4.4 - server suddenly stops collecting data Created: 2008 Mar 06  Updated: 2017 May 30  Resolved: 2008 Apr 02

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.4
Fix Version/s: None

Type: Incident report Priority: Blocker
Reporter: brendon Assignee: Alexei Vladishev
Resolution: Fixed Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.22-3-amd64 #1 SMP Mon Nov 12 17:53:18 UTC 2007 x86_64 GNU/Linux
All agents run in active mode


Attachments: zabbix_server.log, zabbix_server.zip

 Description   

All triggers with nodata functions change to ON after the server has been running for anywhere from 1 to 7 days. It used to be a weekly problem, but it is now a daily issue with our server. On closer inspection, this happens because the Zabbix server stops recording ALL items. I checked 3 servers, and they all stop collecting information at the same time, which causes the part of Zabbix that is still working to trigger alerts.

One thing to note is that simple checks like icmpping still collect data when this issue occurs.

After this happens, the Zabbix server continues to run, but the nodata triggers are ON and the server no longer collects any agent data.



 Comments   
Comment by brendon [ 2008 Mar 06 ]

To resolve this, I run a simple /etc/init.d/zabbix-server restart.

I have also looked through the logs and have not found anything. I'll attach logs the next time I catch them before they are overwritten.

Comment by Torsten Sorger [ 2008 Mar 19 ]

This problem appears to be caused by the active agent handling in the server code. After I changed my agent configuration from passive to active, the server started to stop collecting data at random intervals (2-48 h).

Some logfile excerpts:

zabbix_agentd.log
28311:20080319:141202 Timeout while answering request
28311:20080319:141202 Getting list of active checks failed. Will retry after 60 seconds
28311:20080319:141305 Timeout while answering request
28311:20080319:141305 Getting list of active checks failed. Will retry after 60 seconds
28311:20080319:141408 Timeout while answering request
28311:20080319:141408 Getting list of active checks failed. Will retry after 60 seconds
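
A minimal sketch of how the spacing between these failed attempts could be measured, assuming every log line follows the `PID:YYYYMMDD:HHMMSS message` layout seen in the excerpt. The helper names `parse_line` and `failure_gaps` are hypothetical and not part of Zabbix.

```python
from datetime import datetime

# Lines copied verbatim from the zabbix_agentd.log excerpt in this comment.
AGENT_LOG = """\
28311:20080319:141202 Timeout while answering request
28311:20080319:141202 Getting list of active checks failed. Will retry after 60 seconds
28311:20080319:141305 Timeout while answering request
28311:20080319:141305 Getting list of active checks failed. Will retry after 60 seconds
28311:20080319:141408 Timeout while answering request
28311:20080319:141408 Getting list of active checks failed. Will retry after 60 seconds
"""

FAILURE_TEXT = "Getting list of active checks failed"


def parse_line(line):
    """Split 'PID:YYYYMMDD:HHMMSS message' into (pid, timestamp, message)."""
    prefix, message = line.split(" ", 1)
    pid, date_part, time_part = prefix.split(":")
    timestamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return int(pid), timestamp, message


def failure_gaps(log_text):
    """Return the seconds elapsed between consecutive failed attempts."""
    times = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _, timestamp, message = parse_line(line)
        if message.startswith(FAILURE_TEXT):
            times.append(timestamp)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


print(failure_gaps(AGENT_LOG))  # [63.0, 63.0]
```

The attempts are 63 seconds apart, which would be consistent with the 60-second retry delay plus about 3 seconds spent waiting for the server on each attempt. That reading is an inference from the timestamps, not something the log states.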

zabbix_server.log
9366:20080318:200821 Active parameter [net.if.in[eth0,bytes]] is not supported by agent on host [ZABBIX-Server]
9366:20080318:200821 Active parameter [net.if.out[eth0,bytes]] is not supported by agent on host [ZABBIX-Server]
9371:20080318:203743 Executing housekeeper
9371:20080318:203746 Deleted 3631 records from history and trends
9371:20080318:213847 Executing housekeeper
9371:20080318:213850 Deleted 3652 records from history and trends
(only housekeeper messages after this)
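
A rough sketch of one way to pinpoint when the server stopped logging anything but housekeeper activity, again assuming the `PID:YYYYMMDD:HHMMSS message` layout. It treats the process that logs "Executing housekeeper" as the housekeeper, which is an assumption based on this excerpt only. The function name `last_non_housekeeper` is hypothetical.

```python
from datetime import datetime

# Lines copied verbatim from the zabbix_server.log excerpt in this comment.
SERVER_LOG = """\
9366:20080318:200821 Active parameter [net.if.in[eth0,bytes]] is not supported by agent on host [ZABBIX-Server]
9366:20080318:200821 Active parameter [net.if.out[eth0,bytes]] is not supported by agent on host [ZABBIX-Server]
9371:20080318:203743 Executing housekeeper
9371:20080318:203746 Deleted 3631 records from history and trends
9371:20080318:213847 Executing housekeeper
9371:20080318:213850 Deleted 3652 records from history and trends
"""


def parse_line(line):
    """Split 'PID:YYYYMMDD:HHMMSS message' into (pid, timestamp, message)."""
    prefix, message = line.split(" ", 1)
    pid, date_part, time_part = prefix.split(":")
    timestamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return int(pid), timestamp, message


def last_non_housekeeper(log_text):
    """Return the timestamp of the last line not written by a housekeeper PID.

    Returns None when every line comes from a housekeeper process.
    """
    entries = [parse_line(l) for l in log_text.splitlines() if l.strip()]
    # Assumption: any PID that logs this message is a housekeeper process.
    housekeeper_pids = {pid for pid, _, msg in entries
                        if msg.startswith("Executing housekeeper")}
    others = [ts for pid, ts, _ in entries if pid not in housekeeper_pids]
    return max(others) if others else None


print(last_non_housekeeper(SERVER_LOG))  # 2008-03-18 20:08:21
```

Run against a full log, the returned timestamp gives a starting point for comparing against system load or database activity at that time.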

I'll attach some server logs with debuglevel=5 in my next post.

What might be interesting is that I run Zabbix on a virtual server (Virtuozzo environment). I don't know if this is important.

Comment by Alexei Vladishev [ 2008 Mar 19 ]

This is already fixed in pre-1.4.5 code.

Alexei

Comment by Torsten Sorger [ 2008 Mar 19 ]

Logfile (debuglevel=4) of the 1.4.4 zabbix_server with active agents causing the server to stop collecting data (the error still exists in pre-1.4.5 from 17.3.2008)

Comment by brendon [ 2008 Mar 19 ]

I opened this ticket a while ago, when the server stopped accepting (or possibly recording) data from agents.

I sent Alexei my logs, and after closer inspection the only thing I can relate this to is a busy MySQL server. It happens at about 4 AM almost every night. Zabbix needs to be restarted almost every day, and I can't enable actions, because all of the high-severity nodata actions are triggered when data is no longer collected.

Comment by Torsten Sorger [ 2008 Mar 20 ]

The server stopped again last night. Actually, I doubt that the MySQL server is the problem: Zabbix is the only process that uses the database. I tried the nightly build from 19.3.2008 for the 1.4 branch.
Before 'make install' I removed all files named zabbix* from /usr, just to be sure.

I will attach new server logs, if someone is willing to look into them...

Comment by Torsten Sorger [ 2008 Mar 20 ]

zabbix_server.log from pre-1.4 (19.3.2008) with debuglevel=4
The Server died between 22:04 and 22:06 so I have uploaded only this timeslot.

Comment by brendon [ 2008 Mar 20 ]

Torsten, can you correlate your problem with high server load or high disk I/O? I'm 99% sure that when my server's disk I/O is very busy at night, it causes Zabbix to malfunction.

I installed SAR, and Zabbix stops working properly at about the time disk I/O gets very busy. From the data below, Zabbix broke at about 4 AM.

SAR busy disk from March 18:
12:00:01 AM      tps    rtps     wtps  bread/s   bwrtn/s
03:55:01 AM   466.96   14.74   452.22   206.62   5890.57
04:05:25 AM  1457.06   84.53  1372.53  1255.55  15686.59
04:15:01 AM  1331.55  591.61   739.93  8005.62   9049.32
04:25:01 AM  1657.70  613.01  1044.69  6307.10  13085.33

SAR normal disk from March 18 (same columns as the busy sample):
02:05:01 PM   306.73    3.16   303.57    43.45   3491.67
02:15:01 PM   281.48    3.44   278.04    34.28   3145.71
02:25:01 PM   286.40    4.95   281.45    45.20   3155.84
02:35:01 PM   277.97    3.60   274.37    34.95   3162.64
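
To put a number on the difference between the two samples, here is a small sketch that averages each SAR column over both windows and reports the busy-to-normal ratio. The figures are the ones quoted above. The column order is taken from the header of the busy sample and assumed to apply to the normal sample too. The function names are hypothetical.

```python
COLUMNS = ["tps", "rtps", "wtps", "bread/s", "bwrtn/s"]

# Rows copied from the two SAR samples in this comment.
BUSY = """\
03:55:01 AM 466.96 14.74 452.22 206.62 5890.57
04:05:25 AM 1457.06 84.53 1372.53 1255.55 15686.59
04:15:01 AM 1331.55 591.61 739.93 8005.62 9049.32
04:25:01 AM 1657.70 613.01 1044.69 6307.10 13085.33
"""

NORMAL = """\
02:05:01 PM 306.73 3.16 303.57 43.45 3491.67
02:15:01 PM 281.48 3.44 278.04 34.28 3145.71
02:25:01 PM 286.40 4.95 281.45 45.20 3155.84
02:35:01 PM 277.97 3.60 274.37 34.95 3162.64
"""


def column_means(sample):
    """Average each numeric column of a block of SAR rows."""
    rows = [[float(value) for value in line.split()[2:]]  # skip time and AM/PM
            for line in sample.splitlines() if line.strip()]
    return {name: sum(row[i] for row in rows) / len(rows)
            for i, name in enumerate(COLUMNS)}


def busy_to_normal_ratio(busy, normal):
    """Return, per column, how many times higher the busy average is."""
    busy_means = column_means(busy)
    normal_means = column_means(normal)
    return {name: busy_means[name] / normal_means[name] for name in COLUMNS}


for name, ratio in busy_to_normal_ratio(BUSY, NORMAL).items():
    print(f"{name:8s} {ratio:8.1f}x")
```

On these samples, average tps during the busy window comes out at roughly 4.3 times the daytime average, and the read columns rise far more sharply than the write columns.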

Comment by Torsten Sorger [ 2008 Mar 22 ]

Actually, I have problems getting these I/O statistics because the Zabbix server is running on a vserver (Virtuozzo). I'll give it another try next week.

The strange thing is, that this problem only arises when I use active agents. Is this the same situation for you?

Comment by Sylvain Coutant [ 2008 Mar 27 ]

I encounter this behavior almost once per day with 1.5 from early March. It happens when our backup process starts and puts pressure on the disk. Something breaks at this point. I have to restart the Zabbix server to recover.

Comment by Torsten Sorger [ 2008 Apr 02 ]

Seems fixed in 1.4.5. Thanks!

Comment by Alexei Vladishev [ 2008 Apr 02 ]

I am closing it.

Comment by brendon [ 2008 Apr 02 ]

Solved? I'm still experiencing this issue...

Comment by Alexei Vladishev [ 2008 Apr 03 ]

I am pretty sure the problem no longer exists in 1.4.5. I need evidence if you do not agree.

Alexei
