[ZBX-4156] Zabbix agent service crash/hang Created: 2011 Sep 20  Updated: 2017 May 30  Resolved: 2011 Sep 22

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 1.8.6, 1.8.7
Fix Version/s: 1.8.9, 1.9.7 (beta)

Type: Incident report Priority: Critical
Reporter: Alexandru Nica Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: agent
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows


Attachments: File zabbix_agentd.exe    

 Description   

After updating agent to 1.8.7.rc1 (revision 21392) I get the following errors in the log.

5308:20110920:130647.210 PerfCounter 'Jýþ' FAILED: invalid format
5712:20110920:130716.351 PdhLookupPerfNameByIndex failed: [0x800007D0] unable to find message text [0x0000013D]

These may occur several times. At random times the agent hangs for a few minutes, long enough to trigger a "system down" PROBLEM in Zabbix, and then resumes work as if nothing had happened (which triggers a "system down" OK in Zabbix).

I have a set of general items for monitoring CPUs like "perf_counter[\Processor(X)\% Processor Time, 300]" with 0<=X<=7. Of course, not all systems have 8 CPUs; some have just 4, as is the case with the server in question, and perf_counter instances for CPUs with X>=4 would be invalid.
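As an illustration of the item set described above, here is a minimal sketch that builds the eight item keys and splits them by whether the CPU instance exists. It assumes processor instances are numbered from 0, and the function names are hypothetical, not part of Zabbix.

```python
def cpu_counter_keys(max_cpus=8, interval=300):
    """Build one perf_counter item key per possible CPU index (0..max_cpus-1)."""
    return [
        rf"perf_counter[\Processor({x})\% Processor Time, {interval}]"
        for x in range(max_cpus)
    ]


def split_by_validity(keys, cpus_present):
    """Split keys into those whose CPU instance exists and those that do not.

    Assumption: instances are numbered 0..cpus_present-1.
    """
    return keys[:cpus_present], keys[cpus_present:]


keys = cpu_counter_keys()
valid, invalid = split_by_validity(keys, cpus_present=4)

print(len(valid), "keys should return data")
print(len(invalid), "keys are expected to be reported as not supported:")
for key in invalid:
    print("  " + key)
```

On a 4-CPU host this leaves four keys that the agent is expected to report as not supported, which is the situation in which the log errors appear.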

I understand that part of the perfcounter code was rewritten in 1.8.6.
Prior to 1.8.6 the items for non-existent CPUs (X>=4) would just return 0, even though the perf_counter instance is invalid.
With 1.8.6 and 1.8.7 I understand this has been fixed and the item should return ZBX_NOTSUPPORTED as the perf_counter is invalid. It seems to work OK: in the frontend I get ZBX_NOTSUPPORTED for those items, but the agent always logs that error, always with the same 'Jýþ' string and the same 0x800007D0. The 0x0000013D value varies.



 Comments   
Comment by Alexandru Nica [ 2011 Sep 20 ]

I can confirm that the issue only affects Windows 2008 R2. I have looked at several Windows 2003 servers and there are no error messages, even with invalid perf_counter instances.
There is another issue I have observed only on Windows 2008 R2: sometimes items with UserParameters return a script timeout even though the script is really small and should return immediately. Maybe the same hang is the cause of both issues.

Comment by richlv [ 2011 Sep 21 ]

timeouts could be a different issue - ZBX-4104

Comment by Rudolfs Kreicbergs [ 2011 Sep 21 ]

There is indeed a problem with the message formatting, but that on its own should not hang the agent.

The unknown error message is:
[0x800007D0] Unable to connect to the specified computer, or the computer is offline
Could you please try running the agent with DebugLevel=4 to capture the log file during one of these "hang" situations?

<rudolfs> REPRODUCED - it seems that I have reproduced the problem when agent hangs, will investigate that.

Comment by Alexandru Nica [ 2011 Sep 21 ]

With debug level 4 I get something of a cleaner output:

4452:20110921:123026.589 In PERF_COUNTER()
4452:20110921:123026.589 In add_perf_counter() counter:'\Processor(5)% Processor Time' interval:300
4452:20110921:123026.605 add_perf_counter(): unable to add PerfCounter '\Processor(5)% Processor Time': [0x800007D1] The specified instance is not present.
4452:20110921:123026.605 PERF_COUNTER(): unable to add PerfCounter '\Processor(5)% Processor Time': [0x800007D1] The specified instance is not present.
4452:20110921:123026.605 End of PERF_COUNTER()

I don't get the following errors anymore:
5308:20110920:130647.210 PerfCounter 'Jýþ' FAILED: invalid format
5712:20110920:130716.351 PdhLookupPerfNameByIndex failed: [0x800007D0] unable to find message text [0x0000013D]
The first one seems to me like an access violation on memory read. Could it be that the error message routine is the one performing an access violation and hanging the process? It tries to get the name of the failed perfcounter but reads invalid memory?

No hang until now, will restart the agent a few more times and wait another hour.
After that I will try with default debuglevel and see if that causes hangs.
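For reading the logs in this thread, a small lookup of the two PDH status codes quoted so far can help. This is only a sketch: the message texts are copied from the comments above, the symbolic names are the ones I know from the Windows pdhmsg.h header, and the helper names are hypothetical.

```python
import re

# Only the two codes quoted in this thread. Names as in pdhmsg.h;
# message texts copied from the comments above.
PDH_STATUS = {
    0x800007D0: (
        "PDH_CSTATUS_NO_MACHINE",
        "Unable to connect to the specified computer, or the computer is offline",
    ),
    0x800007D1: (
        "PDH_CSTATUS_NO_INSTANCE",
        "The specified instance is not present.",
    ),
}

HEX_CODE = re.compile(r"\[0x([0-9A-Fa-f]{8})\]")


def codes_in_line(line):
    """Return every bracketed 8-digit hex code found in a log line, as ints."""
    return [int(match, 16) for match in HEX_CODE.findall(line)]


def describe_pdh_status(code):
    """Format a code with its name and text, or mark it as not in the table."""
    name, text = PDH_STATUS.get(code, ("UNKNOWN", "no entry in this table"))
    return f"[0x{code:08X}] {name}: {text}"


line = ("5712:20110920:130716.351 PdhLookupPerfNameByIndex failed: "
        "[0x800007D0] unable to find message text [0x0000013D]")
for code in codes_in_line(line):
    print(describe_pdh_status(code))
```

The second code in that line is not a PDH status from this table, so the sketch reports it as unknown rather than guessing.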

Comment by Rudolfs Kreicbergs [ 2011 Sep 21 ]

That in fact is a memory violation on read. Both error messages are fixed in dev branch: svn://svn.zabbix.com/branches/dev/ZBX-4156

Could you please try to reproduce the "hanging" problem with that branch (it is based on 1.8.8rc2)? It seems that I was wrong and did NOT REPRODUCE the problem.

Comment by Alexandru Nica [ 2011 Sep 21 ]

Did not manage to hang it with DebugLevel=4.
Definitely error message + perfcounter related: managed to hang it with the default DebugLevel on the first run.

5404:20110921:152246.685 Starting Zabbix Agent [BITVMH1]. Zabbix 1.8.7rc1 (revision 21392).
356:20110921:152246.701 agent #0 started [collector]
4436:20110921:152246.716 agent #1 started [listener]
5616:20110921:152246.716 agent #2 started [listener]
5328:20110921:152246.716 agent #3 started [listener]
5400:20110921:152246.716 agent #4 started [active checks]
5400:20110921:152358.274 PerfCounter 'qýþ' FAILED: invalid format
5400:20110921:152358.274 Active check [perf_counter[\Memory\Available Bytes, 300]] is not supported. Disabled.
5400:20110921:152649.315 Active check [perf_counter[\Memory\Page Faults/sec, 300]] is not supported. Disabled.
5400:20110921:152850.497 Active check [perf_counter[\Memory\Pages/sec, 300]] is not supported. Disabled.
5400:20110921:153051.415 Active check [perf_counter[\Network Interface(Virtual Network [LAN])\Bytes Received/sec, 180]] is not supported. Disabled.
5400:20110921:153252.478 Active check [perf_counter[\Network Interface(Virtual Network [LAN])\Bytes Sent/sec, 180]] is not supported. Disabled.
5400:20110921:153553.503 Active check [perf_counter[\Network Interface(Virtual Network [WAN])\Bytes Received/sec, 180]] is not supported. Disabled.
5400:20110921:153653.532 PerfCounter 'qýþ' FAILED: invalid format
5400:20110921:153653.673 Active check [perf_counter[\Network Interface(Virtual Network [WAN])\Bytes Sent/sec, 180]] is not supported. Disabled.
5400:20110921:153753.687 PerfCounter 'qýþ' FAILED: invalid format
5400:20110921:153753.702 Active check [perf_counter[\Network Interface(undefined2)\Bytes Received/sec, 180]] is not supported. Disabled.
----> this is where it hangs.

Will try with the dev branch you mentioned. Is there any Windows SVN client you recommend? Tortoise keeps crashing on me.

Another thing I just saw is that with 1.8.7 I get "Active check [perf_counter[\Memory\Pages/sec, 300]] is not supported. Disabled." for a counter which is actually valid and should not be disabled. Will report on this after trying the dev branch.

Comment by Alexandru Nica [ 2011 Sep 21 ]

Running zabbix 1.8.8rc2 (revision 21676).

Still get the error message on debuglevel=default

5092:20110921:164042.938 PerfCounter 'qýþ' FAILED: invalid format
5092:20110921:164042.984 Active check [perf_counter[\Memory\Available Bytes, 300]] is not supported. Disabled.
5860:20110921:164053.717 PdhLookupPerfNameByIndex failed: [0x800007D0] unable to find message text [0x0000013D]
5860:20110921:164054.123 PerfCounter 'qýþ' FAILED: invalid format

Valid counters DO get disabled, but it seems that only after the error message.
Just had my first hang during this edit

4152:20110921:164358.393 agent #0 started [collector]
5852:20110921:164358.393 agent #1 started [listener]
4928:20110921:164358.393 agent #2 started [listener]
4540:20110921:164358.393 agent #3 started [listener]
1740:20110921:164358.393 agent #4 started [active checks]
1740:20110921:164538.998 PerfCounter 'qýþ' FAILED: invalid format
1740:20110921:164538.998 Active check [perf_counter[\Memory\Committed Bytes, 300]] is not supported. Disabled.
1740:20110921:164639.011 PerfCounter 'qýþ' FAILED: invalid format
1740:20110921:164639.089 Active check [perf_counter[\Memory\Page Faults/sec, 300]] is not supported. Disabled.
1740:20110921:164739.102 PerfCounter 'qýþ' FAILED: invalid format
1740:20110921:164739.118 Active check [perf_counter[\Memory\Pages/sec, 300]] is not supported. Disabled.
4152:20110921:164758.633 PdhLookupPerfNameByIndex failed: [0x800007D0] unable to find message text [0x0000013D]
4152:20110921:164759.055 PerfCounter 'qýþ' FAILED: invalid format
1740:20110921:164859.100 PerfCounter 'qýþ' FAILED: invalid format
1740:20110921:164859.100 Active check [perf_counter[\Memory\Pool Paged Bytes, 300]] is not supported. Disabled.
----> here it hangs
4568:20110921:165232.854 Zabbix Agent shutdown requested
4568:20110921:165233.868 Zabbix Agent stopped. Zabbix 1.8.8rc2 (revision 21676).
-----> I stop the service and it shuts down gracefully
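To locate such silent periods in a longer log, something like the following sketch could be used. It assumes the `PID:YYYYMMDD:HHMMSS.mmm` prefix seen in the lines above; the two-minute threshold and the function names are arbitrary choices for illustration. A quiet log at the default debug level does not by itself prove a hang, so treat the result as a hint only.

```python
from datetime import datetime, timedelta


def parse_timestamp(line):
    """Return the datetime of an agent log line, or None if it has no prefix."""
    parts = line.split(":", 2)
    if len(parts) < 3:
        return None
    try:
        # parts[1] is YYYYMMDD, parts[2] starts with HHMMSS.mmm
        return datetime.strptime(f"{parts[1]} {parts[2][:10]}", "%Y%m%d %H%M%S.%f")
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(minutes=2)):
    """Return (previous, next, length) for every silence of at least threshold."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # annotations such as '----> here it hangs'
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


sample = [
    r"1740:20110921:164859.100 Active check [perf_counter[\Memory\Pool Paged Bytes, 300]] is not supported. Disabled.",
    "----> here it hangs",
    "4568:20110921:165232.854 Zabbix Agent shutdown requested",
    "4568:20110921:165233.868 Zabbix Agent stopped. Zabbix 1.8.8rc2 (revision 21676).",
]
gaps = find_gaps(sample)
for start, end, length in gaps:
    print(f"no log output from {start.time()} to {end.time()} ({length})")
```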

Comment by Rudolfs Kreicbergs [ 2011 Sep 21 ]

Sorry, I did not compile the agent in the dev branch, will update the branch in a couple of minutes

<rudolfs> DONE in r21799 at svn://svn.zabbix.com/branches/dev/ZBX-4156

Comment by Rudolfs Kreicbergs [ 2011 Sep 21 ]

Are you using 32-bit Windows? I can compile and attach the .exe to the issue.
We use SlikSVN on our Windows test boxes:
http://www.sliksvn.com/en/download

Comment by Alexandru Nica [ 2011 Sep 21 ]

Would you please attach the x64 version also?

Comment by Rudolfs Kreicbergs [ 2011 Sep 21 ]

Fair enough, it was a 50-50 chance. Attached the 64-bit Windows agent binary.

Comment by Alexandru Nica [ 2011 Sep 21 ]

Thank you for the binary, now running Zabbix 1.8.8 (revision {ZABBIX_REVISION}).
I'll see how it goes and post back tomorrow.

Comment by Alexandru Nica [ 2011 Sep 22 ]

So far so good, no nasty error messages, just a clean "not supported, disabled". No hangs, no script timeouts, I'm really happy with this.
Can you leave the issue open for another day, just to be sure?

Comment by Rudolfs Kreicbergs [ 2011 Sep 22 ]

I'll move forward with reviewing and testing the fix, since it is likely a separate issue from the hangs. The issue will not be closed until tomorrow anyhow, and please feel free to reopen it if the problem occurs after it is closed.

Comment by Alexandru Nica [ 2011 Sep 22 ]

Sorry, still not fixed.

  • I randomly get errors for perf_counters which are actually valid.
  • It still hangs sometimes, though more rarely than with 1.8.7rc1, which was practically unusable: it kept hanging or crashing (service terminated unexpectedly) at 20-minute intervals. With this revision of 1.8.8 I got 24+ hours with no problems.

I will install this version on a win2003 box and let you know if they behave the same. I have a feeling this is 2008 specific, some sort of memory corruption that went by unnoticed in win2003.

Extract from log with debuglevel=default

4256:20110922:130637.267 Zabbix Agent stopped. Zabbix 1.8.8 (revision {ZABBIX_REVISION}).
5048:20110922:130645.785 Starting Zabbix Agent [BITVMH1]. Zabbix 1.8.8 (revision {ZABBIX_REVISION}).
3228:20110922:130645.816 agent #0 started [collector]
4228:20110922:130645.816 agent #1 started [listener]
4224:20110922:130645.816 agent #2 started [listener]
5552:20110922:130645.816 agent #3 started [listener]
5264:20110922:130645.816 agent #4 started [active checks]
5264:20110922:130826.717 PerfCounter '\Memory\Available Bytes' FAILED: invalid format
5264:20110922:130826.733 Active check [perf_counter[\Memory\Available Bytes, 300]] is not supported. Disabled.
5264:20110922:130926.747 PerfCounter '\Memory\Committed Bytes' FAILED: invalid format
5264:20110922:130926.793 Active check [perf_counter[\Memory\Committed Bytes, 300]] is not supported. Disabled.
5264:20110922:131026.807 PerfCounter '\Memory\Page Faults/sec' FAILED: invalid format
5264:20110922:131026.823 Active check [perf_counter[\Memory\Page Faults/sec, 300]] is not supported. Disabled.
3228:20110922:131046.057 PdhLookupPerfNameByIndex() failed: [0x800007D0] Unable to connect to the specified computer or the computer is offline.
3228:20110922:131046.479 PerfCounter '\UnknownPerformanceCounter(_Total)% Processor Time' FAILED: invalid format
5264:20110922:131246.677 Active check [perf_counter[\Memory\Pages/sec, 300]] is not supported. Disabled.
5264:20110922:131447.609 Active check [perf_counter[\Network Interface(Virtual Network [LAN])\Bytes Received/sec, 180]] is not supported. Disabled.
5264:20110922:131648.588 Active check [perf_counter[\Network Interface(Virtual Network [LAN])\Bytes Sent/sec, 180]] is not supported. Disabled.
5264:20110922:131849.598 Active check [perf_counter[\Network Interface(Virtual Network [WAN])\Bytes Received/sec, 180]] is not supported. Disabled.
5264:20110922:131949.612 PerfCounter '\Network Interface(Virtual Network [WAN])\Bytes Sent/sec' FAILED: invalid format
5264:20110922:131949.643 Active check [perf_counter[\Network Interface(Virtual Network [WAN])\Bytes Sent/sec, 180]] is not supported. Disabled.
5264:20110922:132049.657 PerfCounter '\Paging File(_Total)% Usage' FAILED: invalid format
5264:20110922:132049.688 Active check [perf_counter[\Paging File(_Total)\% Usage, 300]] is not supported. Disabled.
5264:20110922:132149.702 PerfCounter '\PhysicalDisk(_Total)\Avg. Disk sec/Read' FAILED: invalid format
5264:20110922:132149.780 Active check [perf_counter[\PhysicalDisk(_Total)\Avg. Disk sec/Read, 180]] is not supported. Disabled.
5264:20110922:132249.793 PerfCounter '\PhysicalDisk(_Total)\Avg. Disk sec/Write' FAILED: invalid format
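To see at a glance which active checks were disabled in an extract like the one above, the log can be filtered with a sketch along these lines. The regular expression is written against the line format shown in this thread only, and the function name is hypothetical.

```python
import re

# PID:YYYYMMDD:HHMMSS.mmm Active check [<item key>] is not supported. Disabled.
# The greedy (.+) keeps nested brackets inside the item key intact.
DISABLED = re.compile(
    r"^(\d+):\d{8}:\d{6}\.\d{3} Active check \[(.+)\] is not supported\. Disabled\.$"
)


def disabled_checks(lines):
    """Return (pid, item_key) for every 'not supported. Disabled.' line."""
    found = []
    for line in lines:
        match = DISABLED.match(line.strip())
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


sample = [
    r"5264:20110922:130826.733 Active check [perf_counter[\Memory\Available Bytes, 300]] is not supported. Disabled.",
    "3228:20110922:131046.057 PdhLookupPerfNameByIndex() failed: [0x800007D0] Unable to connect to the specified computer or the computer is offline.",
    r"5264:20110922:131447.609 Active check [perf_counter[\Network Interface(Virtual Network [LAN])\Bytes Received/sec, 180]] is not supported. Disabled.",
]
checks = disabled_checks(sample)
for pid, key in checks:
    print(pid, key)
```

Comparing the resulting keys with the counters that actually exist on the host gives the list of valid items that were disabled by mistake.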

Comment by Rudolfs Kreicbergs [ 2011 Sep 28 ]

Crash fixed in pre-1.8.6 r21973 and pre-1.9.7 r21976.

Nica, please file the hanging problem as a separate ZBX issue.

Generated at Fri Mar 29 15:46:48 EET 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.