[ZBX-3547] agent returns wrong value for perf_counter(p, X), X>60 after host restart Created: 2011 Feb 20  Updated: 2017 May 30  Resolved: 2011 May 03

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 1.8.3
Fix Version/s: 1.8.6, 1.9.4 (alpha)

Type: Incident report Priority: Major
Reporter: Alexandru Nica Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 2003 R2 x64


Attachments: File bug_4.log     PNG File bug_5.png     PNG File compare-load.png     PNG File compare-util.png     Text File outlook_closed.txt     Text File zabbix_agentd_5sec_pollig.log    

 Description   

It seems that when using an active check of type perf_counter(whatever, X) with a large X value, the agent sometimes gets stuck sending garbage values after a host restart.

I have seen this happen 2 times with these counters, monitoring different servers.
perf_counter[\MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length, 300] – it would sometimes get stuck sending 17050624
perf_counter[\MSExchangeIS\RPC Averaged Latency, 300] – it would get stuck sending 100

As far as I can tell it happens only when X is larger than 60 (I only use 30, 60 or 300 in my templates), and it goes back to normal after an agent restart. It also seems to happen just after a server restart, so it may be that the Windows perf counter API simply does not have enough data to generate an average over 300 seconds.
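For context, an interval-averaged item such as perf_counter[..., 300] is typically implemented by sampling the raw counter roughly once per second into a ring buffer and asking PDH to compute the mean over that window. The sketch below illustrates the idea only (it is not the actual Zabbix agent code; the counter path and buffer size are just examples): if samples collected while the counter's instance does not exist yet are left in the buffer, the computed mean can easily come out as garbage like the values above.

/* Illustrative sketch of interval averaging with PDH (not the actual
 * Zabbix agent implementation). Compile on Windows, link with pdh.lib. */
#include <windows.h>
#include <pdh.h>
#include <pdhmsg.h>
#include <stdio.h>

#define INTERVAL 300    /* seconds of history to average, the "300" in the item key */

int main(void)
{
    PDH_HQUERY      query;
    PDH_HCOUNTER    counter;
    PDH_RAW_COUNTER raw[INTERVAL];  /* ring buffer of raw samples, one per second */
    PDH_STATISTICS  stats;
    DWORD           type, i = 0, collected = 0;

    PdhOpenQuery(NULL, 0, &query);
    PdhAddCounter(query, TEXT("\\Processor(_Total)\\% Processor Time"), 0, &counter);

    for (;;)
    {
        if (ERROR_SUCCESS == PdhCollectQueryData(query) &&
            ERROR_SUCCESS == PdhGetRawCounterValue(counter, &type, &raw[i]))
        {
            /* If raw[i].CStatus is not PDH_CSTATUS_VALID / PDH_CSTATUS_NEW_DATA
             * (for example, the instance does not exist yet because the service
             * has not started), keeping the sample here poisons the average below. */
            i = (i + 1) % INTERVAL;
            if (collected < INTERVAL)
                collected++;
        }

        if (0 < collected)
        {
            /* index of the oldest entry in the circular buffer */
            DWORD first = (collected < INTERVAL ? 0 : i);

            if (ERROR_SUCCESS == PdhComputeCounterStatistics(counter, PDH_FMT_DOUBLE,
                    first, collected, raw, &stats) &&
                PDH_CSTATUS_VALID == stats.mean.CStatus)
            {
                printf("mean over last %lu samples: %f\n", collected, stats.mean.doubleValue);
            }
        }

        Sleep(1000);
    }
}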



 Comments   
Comment by Alexandru Nica [ 2011 Mar 10 ]

Thank you, I will test this version and post back.
It most probably IS indeed related to Zabbix requesting the perf counters before Exchange initializes them.

I only had the issues with the Exchange template I was building. I have used "normal" perf counters in a bunch of other templates and have never had problems.
I also reduced the period for the perfcounter from 300 to 60 and got the same results.

Comment by Alexandru Nica [ 2011 Mar 24 ]

I have 4 exchange servers that I have put the test version on.

However I can only restart/stop exchange and test them during weekends.
Will restart them tomorrow and see if there are any more issues.

Should I be running the agents at DebugLevel=4 to get the messages you mentioned?

Comment by Alexandru Nica [ 2011 Mar 26 ]

OK, I have restarted all 4 servers and the issue manifested itself only on one of them.
Here is the log data:

>> 2132:20110326:142112.671 In add_perf_counter() [counter:\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length] [interval:300]
>> 2132:20110326:142112.671 PerfCounter '\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length' successfully added. Interval 300 seconds
>> 2132:20110326:142112.671 Can't collect data "\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length": No data to return.
>> 2132:20110326:142112.671 For key [perf_counter[\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length, 300]] received value [ZBX_NOTSUPPORTED]
>> 2132:20110326:142112.671 In process_value() key:'NPIDC1:perf_counter[\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length, 300]' value:'ZBX_NOTSUPPORTED'

These entries kept appearing until 14:21:12, while the MSExchange Transport service had not started yet.
I looked for related messages in the event log and found:
14:23:58 - Transport Service successfully started
14:25 - Loadperf event 1000 + 1001
14:27 - Loadperf event 1000 + 1001

>> 2132:20110326:143557.828 For key [perf_counter[\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length, 300]] received value [9153761201933518800.000000]
>> 2132:20110326:143557.828 In process_value() key:'NPIDC1:perf_counter[\MSExchangeTransport Queues(_total)\Retry Mailbox Delivery Queue Length, 300]' value:'9153761201933518800.000000'

And long after the MSExchange Transport service started, the Zabbix agent performs another check, which returns 9153761201933518800.000000. When this reaches the server, it is actually recorded as the value 16439680 even though the item is stored as Numeric (float).
No sign of the error messages you mentioned.
Restarting zabbix agent fixed it.

>> 4232:20110326:150833.892 In add_perf_counter() [counter:\MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length] [interval:300]
>> 4232:20110326:150833.907 PerfCounter '\MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length' successfully added. Interval 300 seconds
>> 4232:20110326:150833.907 For key [perf_counter[\MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length, 300]] received value [1.000000]
>> 4232:20110326:150833.907 In process_value() key:'NPIDC1:perf_counter[\MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length, 300]' value:'1.000000'

Let me try manually stopping/restarting exchange transport and see what happens and I will post back.

Comment by Alexandru Nica [ 2011 Mar 26 ]

These are the steps I followed:

>>exchange transport started
>>zabbix agent started
-- ALL OK, zabbix returns valid data (0 in this case) as expected
>>exchange transport stopped
-- FAIL, zabbix starts returning -3257211866954332200.000000; the server deactivates the item as ZBX_NOTSUPPORTED because it is not a valid float value (though in several other instances, when the value returned was > 0, the server did not deactivate the item)
>>exchange transport started again and reactivated items in zabbix frontend
-- FAIL, zabbix returns -8767986893661929500.000000; the server deactivates the item as ZBX_NOTSUPPORTED because it is not a valid float value.
>>restarted zabbix
-- ALL OK, zabbix returns valid data

From what I can tell, the zabbix agent does not check whether the instance of the perf counter is actually valid.
I start Exchange Transport and perfmon, and add the counter [\MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length]. Once I stop the Exchange Transport service, perfmon catches on and disables the perf counter because the (_total) instance of the counter is no longer valid. The Zabbix agent fails to detect this and happily continues to check the counter, probably using an already invalid data buffer (do you zero out the buffers before passing them to the API?).

I noticed this when monitoring per-CPU time. Perfmon won't let me monitor an absurd perf counter like \Processor(99)\% Idle Time, because the 99th processor/core does not exist on the system. Yet the Zabbix agent happily returns 0 instead of ZBX_NOTSUPPORTED.
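As an aside, PDH itself exposes enough information to catch both situations: PdhValidatePath rejects a path whose instance does not exist, and every value PDH returns carries a CStatus field that is something other than PDH_CSTATUS_VALID / PDH_CSTATUS_NEW_DATA when the instance has disappeared or has no data. A rough sketch of such a check (illustrative only, not how the agent actually does or should do it; query_counter_once is a made-up helper name):

/* Sketch: detect an invalid counter path or a vanished instance via PDH
 * status codes (illustrative only, not the actual agent code). Link with pdh.lib. */
#include <windows.h>
#include <pdh.h>
#include <pdhmsg.h>

/* Returns 0 and writes the value on success, -1 when the counter should be
 * treated as not supported (bad path, missing instance, no data). */
static int query_counter_once(LPCTSTR path, double *value)
{
    PDH_HQUERY           query;
    PDH_HCOUNTER         counter;
    PDH_FMT_COUNTERVALUE fmt;
    int                  ret = -1;

    /* rejects paths like "\Processor(99)\% Idle Time" on a host without CPU 99 */
    if (ERROR_SUCCESS != PdhValidatePath(path))
        return -1;

    if (ERROR_SUCCESS != PdhOpenQuery(NULL, 0, &query))
        return -1;

    /* note: rate-style counters need a second PdhCollectQueryData a second later;
     * a single collection is enough for instantaneous counters such as queue lengths */
    if (ERROR_SUCCESS == PdhAddCounter(query, path, 0, &counter) &&
        ERROR_SUCCESS == PdhCollectQueryData(query) &&
        ERROR_SUCCESS == PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &fmt) &&
        (PDH_CSTATUS_VALID == fmt.CStatus || PDH_CSTATUS_NEW_DATA == fmt.CStatus))
    {
        *value = fmt.doubleValue;
        ret = 0;
    }

    PdhCloseQuery(query);
    return ret;
}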

Hope this helps

Comment by Alexandru Nica [ 2011 Mar 30 ]

Thank you very much for fixing it.
At least I nailed it that it was instance-related.

How was the behaviour changed? Does it return no data (and disable the item with ZBX_NOTSUPPORTED) when the instance no longer exists, or does it return 0s?

Comment by Rudolfs Kreicbergs [ 2011 Mar 30 ]

See ZBX-3667.

Comment by Oleksii Zagorskyi [ 2011 Apr 04 ]

Specially for Rudolfs: "outlook_closed.txt", without comment - as is.

Comment by Rudolfs Kreicbergs [ 2011 Apr 08 ]

Due to the many bugs that would require dirty fixes, the performance counter code is being rewritten completely.

Comment by Rudolfs Kreicbergs [ 2011 Apr 08 ]

Performance counter and CPU statistics collection for Windows has been rewritten in the dev branch:
svn://svn.zabbix.com/branches/dev/ZBX-3547

Comment by Oleksii Zagorskyi [ 2011 Apr 12 ]

Tested the latest binary, r18973.
These keys worked before, but now the first call (only!!!) after the agent starts returns ZBX_NOTSUPPORTED. The next call returns correct values:

Agent debuglog:
5844:20110412:010827.625 Requested [perf_counter[\Процессор(_Total)\% загруженности процессора,300]]
5844:20110412:010827.625 In PERF_COUNTER()
5844:20110412:010827.641 In add_perf_counter() [counter:\Процессор(_Total)\% загруженности процессора] [interval:300]
5844:20110412:010827.641 add_perf_counter(): PerfCounter '\Процессор(_Total)% загруженности процессора' successfully added
5844:20110412:010827.641 End of PERF_COUNTER()
5844:20110412:010827.656 Sending back [ZBX_NOTSUPPORTED]
...
5000:20110412:010827.937 Sleeping for 1 seconds
380:20110412:010828.219 In collect_perfstat()
3112:20110412:010828.719 Processing request.
3112:20110412:010828.828 Requested [perf_counter[\Процессор(_Total)\% загруженности процессора,300]]
3112:20110412:010828.844 In PERF_COUNTER()
3112:20110412:010828.844 End of PERF_COUNTER()
3112:20110412:010828.844 Sending back [0.000000]
....
5844:20110412:010829.984 Processing request.
5844:20110412:010830.141 Requested [perf_counter[\Процессор(_Total)\% загруженности процессора,300]]
5844:20110412:010830.141 In PERF_COUNTER()
5844:20110412:010830.156 End of PERF_COUNTER()
5844:20110412:010830.156 Sending back [3.906250]

and

2788:20110412:010833.359 Requested [perf_counter[\Процесс(OUTLOOK)\Прошло времени (сек),300]]
2788:20110412:010833.359 In PERF_COUNTER()
2788:20110412:010833.375 In add_perf_counter() [counter:\Процесс(OUTLOOK)\Прошло времени (сек)] [interval:300]
2788:20110412:010833.375 add_perf_counter(): PerfCounter '\Процесс(OUTLOOK)\Прошло времени (сек)' successfully added
2788:20110412:010833.391 End of PERF_COUNTER()
2788:20110412:010833.391 Sending back [ZBX_NOTSUPPORTED]
...
5000:20110412:010834.031 Sleeping for 1 seconds
380:20110412:010834.250 In collect_perfstat()
5844:20110412:010834.297 Processing request.
5844:20110412:010834.469 Requested [perf_counter[\Процесс(OUTLOOK)\Прошло времени (сек),300]]
5844:20110412:010834.469 In PERF_COUNTER()
5844:20110412:010834.469 End of PERF_COUNTER()
5844:20110412:010834.484 Sending back [55194.459546]
....
5000:20110412:010835.047 Sleeping for 1 seconds
3112:20110412:010835.234 Processing request.
380:20110412:010835.250 In collect_perfstat()
3112:20110412:010835.375 Requested [perf_counter[\Процесс(OUTLOOK)\Прошло времени (сек),300]]
3112:20110412:010835.375 In PERF_COUNTER()
3112:20110412:010835.375 End of PERF_COUNTER()
3112:20110412:010835.391 Sending back [55194.959546]

Closing, running, closing (OUTLOOK, I mean), etc., while the agent is running - the counter perf_counter[\Процесс(OUTLOOK)\Прошло времени (сек),300] works well.

And, during the first second of operation, the agent added something on its own:
380:20110412:010506.359 In add_perf_counter() [counter:\Процессор(_Total)\% загруженности процессора] [interval:900]
380:20110412:010506.359 add_perf_counter(): PerfCounter '\Процессор(_Total)% загруженности процессора' successfully added
What is that? Temporary debug output?
<rudolfs> Thanks for pointing these problems out. The additional counters you see are added for the system.cpu.load and system.cpu.util items - they were previously managed differently - though there should have been more of them. Both issues were caused by the same bug.

Comment by Alexei Vladishev [ 2011 Apr 12 ]

Perhaps we should not return NOT SUPPORTED on the first run, but instead do two sequential calls with a sleep(1) in between?
<rudolfs> FIXED. That is the intended procedure every time a NOTSUPPORTED counter is queried.
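For reference, this matches the usual PDH recipe for rate-style counters: PdhCollectQueryData must be called twice, roughly a second apart, before PdhGetFormattedCounterValue can produce a meaningful number; with only one sample it fails instead of returning a value. A minimal sketch of that pattern (illustrative, not the agent's actual code; the counter path is just an example):

/* Sketch of the "two collections with a one-second pause" pattern for
 * rate counters (illustrative only). Link with pdh.lib. */
#include <windows.h>
#include <pdh.h>
#include <pdhmsg.h>
#include <stdio.h>

int main(void)
{
    PDH_HQUERY           query;
    PDH_HCOUNTER         counter;
    PDH_FMT_COUNTERVALUE value;

    PdhOpenQuery(NULL, 0, &query);
    PdhAddCounter(query, TEXT("\\Processor(_Total)\\% Processor Time"), 0, &counter);

    /* A single sample is not enough for a rate counter ... */
    PdhCollectQueryData(query);
    Sleep(1000);
    /* ... so collect a second sample before formatting the value. */
    PdhCollectQueryData(query);

    if (ERROR_SUCCESS == PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value))
        printf("%f\n", value.doubleValue);

    PdhCloseQuery(query);
    return 0;
}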

Comment by Oleksii Zagorskyi [ 2011 Apr 12 ]

Tested the latest binary, r19006.

In tests with the key "perf_counter[\Процессор(_Total)\% загруженности процессора,5]", I periodically (every ~5 seconds) get a 1-second delay in the answer, and the next quickly requested value is 0.0000.
I think we have some problem with the calculation in collect_perfstat().
See the attached "zabbix_agentd_5sec_pollig.log". You can reproduce the problem when requesting every 7 seconds (more than the 5 sec item key parameter).
864:20110412:212432.733 PERF_COUNTER(): Can't calculate counter statistics "\Процессор(_Total)% загруженности процессора": Пропущен или неверен обязательный аргумент.
This is a NEW problem; it is not reproducible in the "stable" agent.
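It is not clear from the log alone what triggers this, but one guess (purely an assumption on my part, not confirmed in this issue) is that the statistics calculation is handed a sample window with nothing usable in it yet. A defensive sketch under that assumption - counting only samples whose CStatus is valid and refusing to average an empty window - would report "no data" explicitly instead of surfacing a PDH error (compute_mean and MAX_SAMPLES are made-up names):

/* Sketch: only hand PDH samples that are actually valid, and refuse to
 * compute statistics over an empty window (assumption only, not the
 * actual agent code). Link with pdh.lib. */
#include <windows.h>
#include <pdh.h>
#include <pdhmsg.h>

#define MAX_SAMPLES 900   /* hypothetical upper bound on the averaging window */

/* Returns 0 and fills *mean on success, -1 when there is nothing to average yet. */
static int compute_mean(PDH_HCOUNTER counter, PDH_RAW_COUNTER *samples, DWORD total, double *mean)
{
    PDH_RAW_COUNTER usable[MAX_SAMPLES];
    PDH_STATISTICS  stats;
    DWORD           i, n = 0;

    for (i = 0; i < total && n < MAX_SAMPLES; i++)
    {
        /* skip samples collected while the instance did not exist or had no data */
        if (PDH_CSTATUS_VALID == samples[i].CStatus || PDH_CSTATUS_NEW_DATA == samples[i].CStatus)
            usable[n++] = samples[i];
    }

    if (0 == n)
        return -1;   /* "No data to return." rather than a PDH error */

    if (ERROR_SUCCESS != PdhComputeCounterStatistics(counter, PDH_FMT_DOUBLE, 0, n, usable, &stats) ||
        PDH_CSTATUS_VALID != stats.mean.CStatus)
        return -1;

    *mean = stats.mean.doubleValue;
    return 0;
}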

Different tests (including those described above) with the keys:
perf_counter[\Процесс(OUTLOOK)\Прошло времени (сек),5]
perf_counter[\Процесс(OUTLOOK)\Прошло времени (сек)]
perf_counter[\Очередь печати(_Total)\Всего напечатано страниц,5]
perf_counter[\Очередь печати(_Total)\Всего напечатано страниц,]
work well (I mean starting and stopping the processes, etc.).

Heh, I'm tired of these deep tests of one ZBX issue.

Comment by Rudolfs Kreicbergs [ 2011 May 03 ]

Fixed/available in pre-1.8.6 r19321 and pre-1.9.4 r19322

Comment by Oleksii Zagorskyi [ 2012 Feb 02 ]

Note: these changes caused memory leak in agent in some cases. See ZBX-4195 for more details.
