[ZBX-12549] Items can be stuck when host becomes reachable Created: 2017 Aug 22  Updated: 2024 Apr 10  Resolved: 2017 Aug 24

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: None
Fix Version/s: 3.4.1rc1, 3.4 (plan), 4.0.0alpha1

Type: Problem report Priority: Blocker
Reporter: Andris Zeila Assignee: Unassigned
Resolution: Fixed Votes: 15
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Dashboard.png, PNG File Host Screen.png, PNG File Internal Processes.png, PNG File lastest Data.png
Issue Links:
Duplicate
is duplicated by ZBX-12564 Agent Ping does not resume after host... Closed
is duplicated by ZBX-12579 no connection from zabbix server to a... Closed
is duplicated by ZBX-12584 monitor zabbix_agent unreachable Closed
is duplicated by ZBX-12596 SNMP itens stops collecting data Closed
is duplicated by ZBX-12767 No Data collection from Agent by Prox... Closed
is duplicated by ZBX-12575 Redeploy proxy after agent restart. Closed
is duplicated by ZBX-12624 zabbix-agent cannot get item value af... Closed
is duplicated by ZBX-12568 Some data does not update after agent... Closed
Team: Team A
Sprint: Sprint 15

 Description   

Steps to repeat:

  1. create zabbix agent item
  2. start agent & server - observe the item being polled
  3. shutdown agent
  4. wait for host to become unreachable
  5. start agent

After the agent is started, the item is polled once and the host becomes reachable. However, the item is not polled again after this until the server is restarted.
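The symptom above (a value arrives once, then nothing) can be checked mechanically. The following is only a minimal sketch: `is_item_stuck` and its `grace_intervals` parameter are hypothetical names, not part of Zabbix, and the timestamps are assumed to be Unix times copied from Latest data or the history tables.

```python
from typing import Sequence


def is_item_stuck(value_clocks: Sequence[int], now: int,
                  update_interval: int, grace_intervals: int = 3) -> bool:
    """Return True if the newest value is older than the allowed grace period.

    value_clocks    -- Unix timestamps of the values received for one item
    now             -- current Unix time
    update_interval -- the item's update interval in seconds
    grace_intervals -- how many missed intervals are tolerated (an assumption)
    """
    if not value_clocks:
        # No value was ever received, so the item is treated as stuck.
        return True
    age = now - max(value_clocks)
    return age > grace_intervals * update_interval
```

For an item with a 20 second interval, `is_item_stuck(clocks, now, 20)` reports the item as stuck once more than 60 seconds have passed since its last value.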



 Comments   
Comment by Andris Zeila [ 2017 Aug 22 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-12549

Comment by Kamil Porembinski [ 2017 Aug 23 ]

This is a very critical issue.

In my case ALL agents after start are still unreachable till server restart.

Comment by dimir [ 2017 Aug 23 ]

Yes, paszczak000, we know it's very critical. The fix is being tested and will be available in 3.4.1rc1 ASAP.

Comment by Mikhail Shepelev [ 2017 Aug 23 ]

I confirm this issue, but in my case I restart the proxy server.

Comment by sles [ 2017 Aug 23 ]

>Fixed in development branch

Is it possible to get patch for 3.4.0?

Comment by sles [ 2017 Aug 23 ]

OK, looks like dbconfig.c from svn works, after one agent restart everything works.
Looks like we need 3.4.1 or at least 3.4.0a as soon as possible.

Comment by Constantine Volodin [ 2017 Aug 23 ]

Is there any progress? Everyone is waiting very eagerly.

Comment by Andrey A. Pestretsov [ 2017 Aug 23 ]

Waiting for RPM server-mysql in repository EL6 x64

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 23 ]

We understand the severity of the issue and are working on it. As you see in status, it's currently in testing. Unfortunately, your numerous comments can't make this process faster and only distract involved people. Please be patient.

Comment by Alexey Asemov [ 2017 Aug 24 ]

No press release about fatal showstopper bug. Release not revoked till the fix is available, and no warnings. Is everything ok?

Comment by Konstantin Barinov [ 2017 Aug 24 ]

Please release hotfix. This is indeed very serious bug.

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 24 ]

Successfully tested!

Comment by Andris Zeila [ 2017 Aug 24 ]

Released in:

  • 3.4.1rc1 r71659
  • 4.0.0alpha1 r71660

Comment by Hilton Kevin de Carvalho [ 2017 Aug 24 ]

Do you have any prediction of when the .deb package will be made available with this fix?

Comment by Constantine Volodin [ 2017 Aug 24 ]

When can I expect a docker image?

Comment by Giorgio Biondi [ 2017 Aug 24 ]

Hi,

Any idea when the rpm will be in the Zabbix repo?

Thanks a lot for your work.

Comment by Hilton Kevin de Carvalho [ 2017 Aug 24 ]

When will the fix be available in the repositories?

Comment by Rodrigo Moreira [ 2017 Aug 25 ]

I upgraded to rc1 and the error still persists

Comment by Konstantin Barinov [ 2017 Aug 25 ]

Please make fixed version available in repositories. Thank you!

Comment by Rob Dekkers [ 2017 Aug 25 ]

I upgraded to rc1 and stopped one agent. After it became unavailable I started the agent and the host came back up. Nice job!

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 25 ]

Dear insanemor, can you provide more information?

Comment by dimir [ 2017 Aug 25 ]

sbr2004 unfortunately it was decided not to release rc1 packages. However we plan to release 3.4.1 packages on Monday (28.08).

Comment by Rodrigo Moreira [ 2017 Aug 25 ]

Glebs Ivanovskis .

After a host becomes unavailable, it does not come back up ...

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 25 ]

Dear insanemor, have you upgraded server, proxy or agent?

Comment by Rodrigo Moreira [ 2017 Aug 25 ]

I have upgraded the server!

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 25 ]

Is agent monitored by server or by proxy?

Comment by Rodrigo Moreira [ 2017 Aug 25 ]

by server ...

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 25 ]

Maybe we should move our questionnaire into IRC.

Comment by Rodrigo Moreira [ 2017 Aug 25 ]

how do I do that ?

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 25 ]

Check http://zabbix.org/wiki/Getting_help#IRC or go directly to https://webchat.freenode.net/?channels=#zabbix

Comment by Rodrigo Moreira [ 2017 Aug 25 ]

Glebs

tks !!!

Comment by Christian Hagemeier [ 2017 Aug 25 ]

Hi, also had this issue. How to fix it?
Installed with repo deb packages on debian.

Comment by Naxiwer Lee [ 2017 Aug 26 ]

I use the source file - 3.4.1rc1.tgz , just replace /usr/sbin/zabbix_server with the compiled.
then restart zabbix-server . It's working now~
you can try.

Comment by Christian Hagemeier [ 2017 Aug 26 ]

Thanks, i compiled with 3.4.1rc2 and replaced zabbix_server binary.
Now testing.

Comment by Andrey A. Pestretsov [ 2017 Aug 26 ]

Confirm, binary from rc2 resolved bug

Comment by Glebs Ivanovskis (Inactive) [ 2017 Aug 26 ]

Thank everyone for testing! Glad to hear the issue is fixed. Hopefully, insanemor's problem is resolved too.

Yet again, huge apologies for this mishap.

Comment by Hilton Kevin de Carvalho [ 2017 Aug 26 ]

Glebs, we thank you guys for all the work, I hope that on Monday an update will be available for debian repo.

Comment by Jiří Káša [ 2017 Aug 28 ]

Not sure if this is a related problem, but some of my items are in state "Enabled" yet have no data, while the same item on another host does. Likewise, a discovered filesystem item is not working for C: but is working for D:. Any help?

Edit: I restarted zabbix-server and now I get values. I hope it will not freeze again.

Comment by Nicki Bo Otte [ 2017 Aug 28 ]

^ Same problem here.
I will upgrade to compiled rc2 version today, to see if it works after.
Edit: We are not going to upgrade to rc2, because it's a big production system.
We expect there to be a released debian package version within a short period of time.
@Jiří Káša, After restart of the server, it seems to temporarily have solved the issue. Like Jiří Káša, i hope it doesn't freeze again until release.

Comment by Giorgio Biondi [ 2017 Aug 28 ]

Hi,

I am waiting for the package version for Red Hat systems. In the meantime I have worked around it by restarting zabbix-server every hour via crontab.
Does anybody have an idea when the new version with the patch for this issue will be released?

All the best.

Giorgio Biondi.

Comment by Giorgio Biondi [ 2017 Aug 28 ]

Hi all,

Great job. The rpm packages are now available!

All the best.

Comment by Christian Hagemeier [ 2017 Aug 28 ]

Debian packages too.
Now on 3.4.1, which seems to fix a lot.
Thanks for support.

Comment by Misak Khachatryan [ 2017 Aug 29 ]

Hi,

After the upgrade to 3.4.1 I see the same behavior. It's not fixed, at least for me.

CentOS 7.3, zabbix repo packages, postgresql on separate host.

Comment by Constantine Volodin [ 2017 Aug 31 ]

When can we expect a docker image?

Comment by Ilmar Soobik [ 2017 Sep 18 ]

After upgrade to 3.4.1 we see the same behavior as well.
Not fixed.

Comment by Andris Zeila [ 2017 Sep 18 ]

Could you please give more information about your setup (or more specifically - about the problematic host) ?

Is the host monitored by proxy or directly by server?
What item types are being monitored - Zabbix agent, SNMP agent, IPMI agent or JMX agent?
How many items are monitored on the host?

Comment by Ilmar Soobik [ 2017 Sep 18 ]

Monitored directly by server.
Zabbix agent.ping fails after roughly 10 minutes on all of them.
556 hosts, 35712 enabled items total. Roughly 200 agents being pinged.

Comment by Andris Zeila [ 2017 Sep 18 ]

Are there any more passive agent items on those hosts or only agent.ping?
'After roughly 10 minutes' - is that after server start? Or after host becomes reachable?

agent.ping failing on all hosts - does that mean that all hosts become unreachable and then reachable again? And all of them at the same time?

Comment by Ilmar Soobik [ 2017 Sep 18 ]

agent.hostname and agent.version are also enabled.
Happens after server start. The hosts are never displayed as unreachable, the ZBX status icon is always green.
Under latest data - agent.ping values just stop on all hosts at the exact same time.

Comment by Andris Zeila [ 2017 Sep 18 ]

I assume agent.hostname and agent.version stop being updated too? What is the update interval of the agent.ping check?

How many pollers are being used?
What is your 'Refresh unsupported items' value in Administration/General/Other ?
Are there any warnings reported in log file?
I understand after server restart everything is functioning again for ~10 minutes?

It's really strange behaviour, and might not even be because of internal requeueing bug (which was the cause of this problem report). Can you strace a poller after agent.ping starts to fail? It might be that pollers are getting stuck somewhere and data gathering stops when server runs out of pollers. (You could also check in process list if zabbix poller process title is being updated).
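The process-title check suggested above can be sketched as a comparison of two process-list snapshots taken some seconds apart. This is an illustration only: the helper names are hypothetical, and the title format `zabbix_server: poller #N [...]` is an assumption about how the server labels its poller processes. `snapshot()` relies on `ps -eo pid,args` as found on Linux.

```python
import re
import subprocess
from typing import Dict, List

# Matches "<pid> <args>" lines whose args contain a regular poller title.
# "unreachable poller" titles are intentionally not matched.
POLLER_RE = re.compile(r"^\s*(\d+)\s+(.*zabbix_server: poller #\d+.*)$")


def parse_poller_titles(ps_output: str) -> Dict[int, str]:
    """Map PID -> process title for every poller line in the ps output."""
    titles: Dict[int, str] = {}
    for line in ps_output.splitlines():
        match = POLLER_RE.match(line)
        if match:
            titles[int(match.group(1))] = match.group(2).strip()
    return titles


def unchanged_pollers(before: str, after: str) -> List[int]:
    """Return PIDs of pollers whose title is identical in both snapshots."""
    old = parse_poller_titles(before)
    new = parse_poller_titles(after)
    return sorted(pid for pid, title in new.items() if old.get(pid) == title)


def snapshot() -> str:
    """Take one process-list snapshot (Linux ps assumed)."""
    result = subprocess.run(["ps", "-eo", "pid,args"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A PID returned by `unchanged_pollers(first, second)` is only a hint, since an idle poller can legitimately show the same title twice. It is a reasonable candidate for `strace -p <pid>`.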

Comment by Ilmar Soobik [ 2017 Sep 18 ]

You assume correctly. Update interval is 20 seconds. (Made no difference when it was 1m)

500 pollers. 1000 ICMP pingers. (Increased for testing, no difference with 300/50)
Refresh unsupported: 3600 seconds
No warnings.
Correct.

They don't start to fail exactly - they seem to fail all at once.
But other kinds of ping checks still seem to keep running, although irregularly producing data.

Comment by Sebastien [ 2017 Sep 22 ]

after zabbix 3.2.6 to 3.4.1 upgrade, every zabbix server restart triggers nodata alarm even if data was collected by proxy.
agent.ping.nodata(600) and sysUpTime.nodata(600) so it's ZBX-12584 for us.

Issue still there in 3.4.1

Comment by Glebs Ivanovskis (Inactive) [ 2017 Sep 22 ]

Dear sfl, your issue is different. Please create a separate bug report.

Comment by Elvar [ 2017 Oct 21 ]

I see this is marked as Fixed and Closed but I am seeing this exact same behavior right now in 3.4.3. I have a number of hosts that are showing an active 'agent.ping.nodata(10m)}=1' but I can see agent.ping returning successfully despite the triggers not recovering. Did this issue resurface in 3.4.3?

Comment by Glebs Ivanovskis (Inactive) [ 2017 Oct 22 ]

Dear elvar, if you see data coming in, your issue is different. Maybe ZBX-12251.

Comment by Elvar [ 2017 Oct 23 ]

Hi Glebs, that definitely sounds similar, thanks!

Comment by Sascha Guilliard [ 2017 Oct 23 ]

I'm running version 3.4.3, and since this version some hosts alert on agent.ping.nodata. I don't see any agent ping under "Latest data" for those hosts, but when I run the command via zabbix_get I get a response.
Did this bug resurface?

Comment by Glebs Ivanovskis (Inactive) [ 2017 Oct 23 ]

Dear sguilliard, there are a number of reasons why you could see such behaviour, not necessarily related to this bug. No, as far as I know, there were no reasons for this bug to "resurface" in 3.4.3.

Comment by Merphis Ellis [ 2018 Jan 23 ]

I am having this same issue with 3.4.6 on two different systems.
I did a 3.2.6 to 3.4.6 upgrade. Now after about an hour, all agents become unreachable.
I found this as a bug but it said it should have been fixed at 3.4.1rc1. This is the second install I have seen this issue.
I have php 5.6.31, mariadb 10.0.33 and Zabbix 3.4.6.
If I restart the Zabbix Server it will clear up in 20mins but then stops working.
When I look in the logs I do not see any errors.

I have 1 server, 275 agents. 35611 items 11339 triggers
The servers config:
StartPollers=52
StartPollersUnreachable=8
StartTrappers=48
StartPingers=12
StartEscalators=26
CacheSize=42M
StartDBSyncers=6
HistoryCacheSize=64M
HistoryIndexCacheSize=18M
TrendCacheSize=24M
ValueCacheSize=36M
Timeout=30

I did not have any issues with the 3.2.6.

Comment by Glebs Ivanovskis (Inactive) [ 2018 Jan 23 ]

Dear mellis3, please describe in a bit more detail what exactly you experience. When hosts become reachable/unreachable there must be messages in the log. It would be nice to see Latest data of the affected items.

Comment by Ilmar Soobik [ 2018 Jan 24 ]

Your config lacks: StartPreprocessors=
Try some 300 preprocessors for starters.

Newer versions of zabbix will have that in the config by default, but if you've been upgrading and keeping your old config file, then it could be missing.

Comment by Merphis Ellis [ 2018 Jan 24 ]

Good Morning
Due to the issues, and because this is a production system, I wanted to start over clean. I removed Zabbix and the Zabbix database, then did a clean install using the yum repositories.

I had exported the templates and host, so I did an import.

After about 2 hours I started to see unreachable problems on systems that were available.

After a few, I started to see every host become unavailable one by one. I have adjusted the time to 15mins due to the network performance on some of the remote sites. I do not see how to post a screenshot.

At 09:37:00 host k20S01 was displayed on the dashboard.
When I look at the latest data screen I see timestamps for different items, for example cpu at 09:19:04, 09:19:05 and 09:21:02, and the agent ping at 09:24:06.

config file edits
StartPollers=20
StartPreprocessors=10
StartPollersUnreachable=8
StartTrappers=20
CacheSize=24M
HistoryCacheSize=32M
HistoryIndexCacheSize=6M
TrendCacheSize=8M
ValueCacheSize=12M
Timeout=30

Looking on the server I do find that host, but do not see any errors around that transaction. I do not find any errors at all in the server logs. I do have debug at 5.
I have debug on for MariaDB 10.0.33; I see all the selects and inserts, and no error messages.

One other thing I have noticed: if you look at the internal processing graph, the lines stop at about the same time as the hosts become unavailable. On this last restart the processes lasted about 25 mins.

It seems to me that zabbix-web is having issues getting the data, or the data writes are too slow.

This is a VM on an HP360 server, with 4 vCPUs and a 7200rpm disk. The other VM is just file storage.

Comment by Ilmar Soobik [ 2018 Jan 24 ]

If you look at my previous comment:
"556 hosts, 35712 enabled items total."

This amount of load was handled by 24 cores, 64GB RAM and 7200RPM disks.
The symptoms went away after significantly adjusting processes and caches upwards.
(Even though the Queue screen seemed normal - the preprocessors and DB processes were all out of resources.)

Our configuration:

StartPollers=150
StartPollersUnreachable=120
StartTrappers=10
StartPingers=220
StartDiscoverers=50
StartHTTPPollers=10
StartTimers=10
StartEscalators=1
StartVMwareCollectors=150
VMwareFrequency=180
VMwarePerfFrequency=30
VMwareCacheSize=512M
VMwareTimeout=15
StartSNMPTrapper=1
HousekeepingFrequency=1
MaxHousekeeperDelete=70
CacheSize=4G
CacheUpdateFrequency=30
StartDBSyncers=50
HistoryCacheSize=1G
HistoryIndexCacheSize=1G
TrendCacheSize=1G
ValueCacheSize=1G
Timeout=10
TrapperTimeout=300
UnreachablePeriod=60
UnavailableDelay=10
UnreachableDelay=15
StartPreprocessors=300

Comment by Glebs Ivanovskis (Inactive) [ 2018 Feb 10 ]

Dear mellis3, please see available ways of getting help.

Dear illukas, StartPreprocessors=300 is overkill IMHO. There is very little sense in having more preprocessor worker processes than the logical CPU cores you have available.
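The rule of thumb above, together with the earlier remark about configs missing StartPreprocessors, could be checked with a small script. This is a minimal sketch under stated assumptions: `parse_conf` and `check_preprocessors` are hypothetical helpers, not Zabbix tools, and the config is assumed to consist of plain `Key=Value` lines with `#` comments.

```python
import os
from typing import Dict, List, Optional


def parse_conf(text: str) -> Dict[str, str]:
    """Parse Key=Value lines, skipping blanks and '#' comments."""
    params: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def check_preprocessors(conf_text: str,
                        cpu_count: Optional[int] = None) -> List[str]:
    """Return warnings about StartPreprocessors for the given config text."""
    # Fall back to the local machine's core count when none is supplied.
    cores = cpu_count or os.cpu_count() or 1
    value = parse_conf(conf_text).get("StartPreprocessors")
    if value is None:
        return ["StartPreprocessors is not set; the server default applies"]
    if int(value) > cores:
        return [f"StartPreprocessors={value} exceeds {cores} logical CPU cores"]
    return []
```

Calling `check_preprocessors(open(path).read())` on the server host compares the setting against that host's own core count. Pass `cpu_count` explicitly when checking a config file from another machine.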

Generated at Tue Apr 16 12:26:58 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.