[ZBX-15301] System.localtime on Windows monotonically lags behind Created: 2018 Dec 10 Updated: 2024 Jul 17 Resolved: 2019 Jan 20 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 4.0.2 |
Fix Version/s: | 4.0.4rc1, 4.2.0alpha3, 4.2 (plan) |
Type: | Problem report | Priority: | Major |
Reporter: | Alexander Ivanes | Assignee: | Michael Veksler |
Resolution: | Fixed | Votes: | 4 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Server: |
Attachments: | echo-time-active-item-incoming-speed.png echo-windows-time.win.conf echo-windows-time.xml system-localtime-backend-p7.PNG system-localtime-zabbix-server.PNG time-deviation-7days.png time-deviation-8.12.png time-deviation-backend-p7.PNG time-deviation-settings.PNG time-deviation.png zabbix_agent-4.0.4rc1-win-amd64-openssl.zip zabbix_agent-4.0.4rc1-win-amd64.zip zabbix_agent-4.0.4rc1-win-i386-openssl.zip zabbix_agent-4.0.4rc1-win-i386.zip | ||||||||||||||||
Issue Links: |
|
||||||||||||||||
Team: | Team A | ||||||||||||||||
Sprint: | Sprint 47, Dec 2018, Sprint 48, Jan 2019 | ||||||||||||||||
Story Points: | 1 |
Description |
After moving from 3.4 to 4.0 system.localtime start lagging behind from it's actual system value and this lag monotonically increases through time (on machines with more load it happens faster than on the other) Time deviation is a custom calculated item (difference between localtime on host and zabbix-server) |
Comments |
Comment by Andrei Gushchin (Inactive) [ 2018 Dec 11 ] | ||||||||||||||||
How would you measure that deviation? | ||||||||||||||||
Comment by Ingus Vilnis [ 2018 Dec 11 ] | ||||||||||||||||
Not 100% sure but could be because of this: | ||||||||||||||||
Comment by Alexander Ivanes [ 2018 Dec 11 ] | ||||||||||||||||
We don't use proxy if that change only related to it.
Basically it's just a subtraction of host system.localtime and a zabbix-server system.localtime: min("system.localtime", 600)-min("rainbow1-3:system.localtime", 600) It's not perfect by any means but can give you approximate picture. For trigger we using fuzzytime.
Again, i would say that problem is not with this item but with localtime values on the windows host (if you need more i can query a DB): Localtime values (zabbix-server):
Was: 3.4.11 | ||||||||||||||||
Comment by Alexander Ivanes [ 2018 Dec 11 ] | ||||||||||||||||
Didn't notice the issue report. Yes, it's really similar and yes system.localtime is an active type item.
| ||||||||||||||||
Comment by Vladislavs Sokurenko [ 2018 Dec 11 ] | ||||||||||||||||
Regarding | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 12 ] | ||||||||||||||||
Hi all! First of all, as vso correctly pointed out, changes Secondy, after Anyway, the way you calculate the time difference should be robust enough with respect to small timestamp adjustments. Could the drift be caused by the way active agent schedules checks? | ||||||||||||||||
Comment by Alexander Ivanes [ 2018 Dec 12 ] | ||||||||||||||||
Hi!
No we don't use a proxy. Switch to non active checks indeed solved this problem.
The problem is that with active check difference between timestamps increases over time and returns to normal after agent restart. On Linux, however, it stays the same all the time. Seems like some kind of leak or queue to me. | ||||||||||||||||
Comment by Ingus Vilnis [ 2018 Dec 12 ] | ||||||||||||||||
Hi guys, Gleb, according to what you are saying, example 10 here is no longer valid in case the item is active? And the high queue for active items sub 1 minute on 4.0 is also related to | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 12 ] | ||||||||||||||||
Dear ingus.vilnis, yes and (probably) no, respectively. Not sure about the latter, could you point me to the specific issue? | ||||||||||||||||
Comment by Ingus Vilnis [ 2018 Dec 12 ] | ||||||||||||||||
Gleb, about the localtime active becoming pointless you made me very sad, as active-only templates are widely used where I am now. And that is also one of the item-trigger combos. Too bad yet another good feature got broken after working flawlessly for years. About the queue we might go totally off topic from the original post and bug tracker is not the right place to troubleshoot this but in short my assumption is based on the observations that on one 3.0.12 instance @1200nvps queue under 5 minutes is about 100 active items. On a brand new 4.0.2 clean install built according to the best feng shui I know @450 nvps and growing (to replace 3.0 in parallel) this queue for active items under 5 minutes is about 11K! I have not checked if there is exact ZBX but I have spotted folks going about this in the forum too. I thought the reason was | ||||||||||||||||
Comment by Alexander Ivanes [ 2018 Dec 12 ] | ||||||||||||||||
Ingus, where did you found queue length for that? Is that zabbix[queue] internal item? We have about ~1250 values processed per second with avg queue length ~40. 99% of our items are active or trappers types. | ||||||||||||||||
Comment by Ingus Vilnis [ 2018 Dec 12 ] | ||||||||||||||||
Alexander, Administration -> Queue -> Overview page is what I am talking about, but you can also graph the changes over time, check the optional parameters for zabbix[queue,<from>,<to>] item. https://www.zabbix.com/documentation/4.0/manual/config/items/itemtypes/internal
Dom, what are you talking about? | ||||||||||||||||
Comment by Alexander Ivanes [ 2018 Dec 12 ] | ||||||||||||||||
As i thought. We doens't have that much of a queue then. | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 12 ] | ||||||||||||||||
Ingus, I'm taking this personally, because I insisted on the way On top of that, messing with value timestamps was causing a number of other difficult to troubleshoot issues. If the queue is something that bothers you, then you should definitely create a ticket and we can discuss it there. | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 12 ] | ||||||||||||||||
Back to the original issue. I think I have an explanation for this:
Actually, for the value of system.localtime Zabbix agent uses time() function based on system clock which can be adjusted by NTP service or manually by user. But for time stamping values Zabbix agent on Windows uses QueryPerformanceCounter() which according to Microsoft docs:
This explains the difference between system.localtime value and its timestamp and the fact that time drift is "re-set" on agent restart. This may also explain the queue of active checks if only Windows hosts are affected (Ingus, is it so?) since Zabbix server counts item as delayed starting with 6 seconds behind schedule. | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 12 ] | ||||||||||||||||
Dear vergil, can you show a close-up of this? Are both system.localtime items updated every 15 seconds? | ||||||||||||||||
Comment by Ingus Vilnis [ 2018 Dec 12 ] | ||||||||||||||||
Oh, Gleb, don't take this personally. Knowing you I believe you did improve the issues related to timestamps the best way possible. I agree that Zabbix should not be a replacement for NTP, but there are so many cases when Zabbix was the product able to tell about problems in other fields, including NTP. I am more about the fact how this all looks and is presented to the loyal end users of Zabbix. Upgrade notes do not mention anything what you just said. User by reading the notes would think, ok, I don't really know what the timestamp correction actually means, unlikely I make any use of it anyways, so whatever. But turns out the stuff is broken. And I was also convinced by the docs that active checks are good and triggering for drifted clock is also a good idea. Apparently not. Zabbix was doing a good job in becoming a "do-it-all-and-do-it-good" product. Just with release of 4.0 users do have actually mixed feelings whether the improvements are worth the tradeoffs. Mostly they are but the latter feel noticeable too. About the queue - yes, on that particular instance 90% are Windows hosts. But they simultaneously exist also on the old 3.0 where the queue is 100 times smaller. It does not bother me on that 4.0 as the performance is flying, just curious what was causing it and doubt whether it will not cause any real issues as the system will be growing. I will keep an eye on that and create another ticket if needed. Thank you for all! | ||||||||||||||||
Comment by Alexander Ivanes [ 2018 Dec 12 ] | ||||||||||||||||
I think you can look at GetSystemTimePreciseAsFileTime for this job (it's windows 8\2012+, yes, but at least for that versions it's pretty good candidate).
Yes, you're right. Update interval is 15s.
Here's picture for the last 7 days (it'll help with the next one) 10.12 at 6pm was the agent restart, 11.12 at 3pm i changed item type to passive: And here's this graph for 08.12: If you need more precise one, just ask. | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 12 ] | ||||||||||||||||
Thank you for the screenshots! On your hosts timestamp of system.localtime is actually ahead of its value. Let's just take the first one from here:
$ TZ=":Europe/Moscow" date -d @1544521534 --rfc-3339=seconds 2018-12-11 12:45:34+03:00 So the clock Zabbix agent uses for time stamping ticks faster than it should. It basically shifts all the data points to the right (into the future) along time axis. This brings relatively "older" values in the interval of min(..., 600) and these values are smaller. That's why you observe it as "system.localtime on Windows monotonically lags behind". So it all boils down to wrong time stamping by agent. Does this explanation make sense to you guys? | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 13 ] | ||||||||||||||||
ingus.vilnis, could you check system.localtime on these hosts? Do values and their timestamps differ? | ||||||||||||||||
Comment by Ingus Vilnis [ 2018 Dec 13 ] | ||||||||||||||||
Gleb, they do. Interestingly enough the received values are 15-70 seconds ahead of time displayed by Zabbix server (differs from host to host). Time on Zabbix server is synced by NTP and manual time check confirmed the clock is correct there. I did the comparison in Latest data -> Values -> CSV for a few randomly picked hosts. Or is there another way you'd suggest? | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 13 ] | ||||||||||||||||
Values represent system time on the host which is synchronized and should be OK. Agent uses this clock to schedule checks, so the schedule should be fine too. Timestamps represent that strange QueryPerformanceCounter() clock agent uses to stamp values. If values are ahead, then QueryPerformanceCounter() is behind and Zabbix server sees timestamps which are 15-70 seconds smaller, than it expects according to schedule. And this is the reason why you see so many items in queue. Everything boils down to the wrong time source used by active agent to stamp values on Windows. Dear vso, are you following? | ||||||||||||||||
Comment by Vladislavs Sokurenko [ 2018 Dec 13 ] | ||||||||||||||||
Thanks cyclone for looking into this, we need to find a suitable alternative to QueryPerformanceCounter() ? | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 13 ] | ||||||||||||||||
Have a look at the link I've posted above. It suggests using GetSystemTimePreciseAsFileTime(). | ||||||||||||||||
Comment by Vladislavs Sokurenko [ 2018 Dec 13 ] | ||||||||||||||||
I think it's settled then, thanks once again ! | ||||||||||||||||
Comment by Michael Veksler [ 2018 Dec 14 ] | ||||||||||||||||
It will not be easy. Because:
It is mean that when we start the agent, we have to find previous loaded library kernel.dll and than try to find pointer to function by name. https://docs.microsoft.com/en-us/windows/desktop/SysInfo/acquiring-high-resolution-time-stamps or something like this | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 14 ] | ||||||||||||||||
Maybe a lower resolution clock can be used? What's the point of knowing nanoseconds anyway if QueryPerformanceCounter() drifts by 5 seconds a day? | ||||||||||||||||
Comment by Michael Veksler [ 2018 Dec 20 ] | ||||||||||||||||
As a solution I propose to combine both methods: get the time through _ftime() and compute micro & nano as difference between current, previous call of QueryPerformanceCounter(), _ftime() and 'last_ts' _ftime(&tb); ts->sec = (int)tb.time; ts->ns = tb.millitm * 1000000; if (TRUE == (rc = QueryPerformanceFrequency(&tickPerSecond))) { if (TRUE == (rc = QueryPerformanceCounter(&tick))) { int ns = 0; if (0 < last_tick.QuadPart) { LARGE_INTEGER qpc_tick, ntp_tick = {0}; /* _ftime () returns precision in milliseconds, but 'ns' could be increased */ if (ts->sec == last_ts->sec && ts->ns < last_ts->ns) { ts->ns = last_ts->ns; } else { ntp_tick.QuadPart = tickPerSecond.QuadPart * (ts->sec - last_ts->sec) + tickPerSecond.QuadPart * (ts->ns - last_ts->ns) / 1000000000; } qpc_tick.QuadPart = tick.QuadPart - last_tick.QuadPart - ntp_tick.QuadPart; if (0 < qpc_tick.QuadPart && 0 == (qpc_tick.QuadPart / tickPerSecond.QuadPart)) ns = (int)(1000000000 * qpc_tick.QuadPart / tickPerSecond.QuadPart); } if (0 < ns && 1000000 > ns) /* value less than 1 millisecond */ { ts->ns += ns; while (ts->ns >= 1000000000) { ts->sec++; ts->ns -= 1000000000; } } last_tick = tick; } }
Available for testing in svn://svn.zabbix.com/branches/dev/ZBX-15301-40 | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 20 ] | ||||||||||||||||
Looks too complicated Wouldn't this be enough? _ftime(&tb); ts->sec = (int)tb.time; ts->ns = tb.millitm * 1000000; if (ts->sec == last_ts->sec && ts->ns / 1000000 == last_ts->ns / 1000000) /* if previous call returned time stamp in the same millisecond */ ts->ns = last_ts->ns + 1; /* ensure that issued time stamps monotonically increase */ /* (in the absence of corrections by NTP service) so that */ /* correct order of values can be restored afterwards */ | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 20 ] | ||||||||||||||||
Dear MVekslers, could you please elaborate? Both QueryPerformanceCounter() and GetSystemTimePreciseAsFileTime() are from kernel32.dll according to their docs. MVekslers Yes. QueryPerformanceCounter from w2k and QueryPerformanceCounter from w8/ws12 cyclone ...and since Zabbix still supports Windows 2000 and provides only one binary for all possible Windows versions you can't just include two pieces of code under #ifdef's. Got it. Thanks! MVekslers ... from XP | ||||||||||||||||
Comment by Michael Veksler [ 2018 Dec 20 ] | ||||||||||||||||
And we get different behavior for Win and Linux. Btw, at the end of zbx_timespec() is present the logic for monotonically time increasing: if (last_ts->ns == ts->ns && last_ts->sec == ts->sec) { ts->ns += ++corr; while (ts->ns >= 1000000000) { ts->sec++; ts->ns -= 1000000000; } } else { last_ts->sec = ts->sec; last_ts->ns = ts->ns; corr = 0; } Therefore the fix without 'ns' is easy... | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2018 Dec 20 ] | ||||||||||||||||
Depends on what you mean by "behaviour"
Looks like this function violates SRP. I would suggest making accurate timestamps (as close as possible to "absolute" clock) the only responsibility of zbx_timespec() and offload the task of precise timestamps (monotonic and with highest resolution possible for measuring small time intervals) to zbx_time(). The latter should be rewritten not to rely on zbx_timespec(), at least on Windows. | ||||||||||||||||
Comment by richlv [ 2018 Dec 20 ] | ||||||||||||||||
Win 2000 is older than XP by ~2 years, but I might be too pedantic there | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 03 ] | ||||||||||||||||
Related forum post https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/371408-zabbix-agent-sends-wrong-time-localtime-utc | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 03 ] | ||||||||||||||||
It could be that Zabbix agent reports incorrect system.localtime[] on Windows... or it could be that time on Windows is not correct. To find out we can try to collect localtime in an alternative way:
When you see that system.localtime[] has drifted away, try to compare it with time_from_shell - have both of them drifted ? Only one of them ? | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 03 ] | ||||||||||||||||
I was not able to reproduce monotonic time drift but observed large fluctuations of local time under heavy load. Here are some observations from Windows 10 Pro virtual machine, hosted on VirtualBox (Linux). With VirtualBox settings "System -> Acceleration -> Paravirtualization interface": "Default", 1 CPU, Execution cap: 44% (to make CPU closer to 100% busy as shown by Task Manager). $ VBoxManage guestproperty set "Windows 10" "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 100 Results Windows 10 virtual host OpenSSL: time_from_shell 2019-01-03 15:27:34 1546522054 "15:27:17,70" 4.39 2019-01-03 15:27:26 1546522046 "15:27:13,31" 7.49 2019-01-03 15:27:15 1546522035 "15:27:05,82" 11.87 2019-01-03 15:27:04 1546522024 "15:26:53,95" 10.58 2019-01-03 15:26:55 1546522015 "15:26:43,37" 11.45 2019-01-03 15:26:44 1546522004 "15:26:31,92" 11.42 2019-01-03 15:26:34 1546521994 "15:26:20,50" 10.95 2019-01-03 15:26:24 1546521984 "15:26:09,55" 11.27 2019-01-03 15:26:14 1546521974 "15:25:58,28" 5.83 2019-01-03 15:26:05 1546521965 "15:25:52,45" 7.40 2019-01-03 15:25:54 1546521954 "15:25:45,05" 8.43 2019-01-03 15:25:45 1546521945 "15:25:36,62" 9.22 2019-01-03 15:25:35 1546521935 "15:25:27,40" 8.43 2019-01-03 15:25:25 1546521925 "15:25:18,97" 12.75 2019-01-03 15:25:14 1546521914 "15:25:06,22" 12.45 2019-01-03 15:25:04 1546521904 "15:24:53,77" 12.25 2019-01-03 15:24:54 1546521894 "15:24:41,52" 12.64 2019-01-03 15:24:44 1546521884 "15:24:28,88" 12.41 2019-01-03 15:24:34 1546521874 "15:24:16,47" - Every 10 sec (approximately) Zabbix server does a passive check of UserParameter=time_from_shell,echo %time%, connection is PSK-based encrypted. The right-most column shows computed difference (in seconds) between adjacent results. Ideally it should be always 10.00. Surprisingly, sometimes it is as small as 4 sec or as large as 12 secs. So, in this example it is not that Windows local time is monotonically lagging behind. Sometimes clock goes faster, sometime slower - under heavy load. If load is small, the difference tends to be 10.00 as expected. Less deviations were observed with VirtualBox settings "System -> Acceleration -> Paravirtualization interface": "None" - this keeps time better. | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 03 ] | ||||||||||||||||
2 previous comments are not intended to say that system.localtime[] item on WIndows is quite good. We are working to improve algorithm for getting system.localtime. | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2019 Jan 03 ] | ||||||||||||||||
As far as I understand the issue is not about values of system.localtime being bad, but about timestamps of all active checks on Windows being inaccurate. | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 03 ] | ||||||||||||||||
Timestamps for active checks come from the same source as system.localtime - from zbx_timespec(). | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2019 Jan 03 ] | ||||||||||||||||
No, system.localtime comes from time(). | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 04 ] | ||||||||||||||||
Ok, from _ftime() on Windows. | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2019 Jan 04 ] | ||||||||||||||||
With no parameters or default "utc" time() is used regardless of the platform. | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 04 ] | ||||||||||||||||
Thanks, Gleb! You're right. | ||||||||||||||||
Comment by gofree [ 2019 Jan 07 ] | ||||||||||||||||
Hi guys, maybe a dummy question - after this will be resolved will the active agent work again with this key ? Thanks | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2019 Jan 07 ] | ||||||||||||||||
I'd guess no. | ||||||||||||||||
Comment by brendon [ 2019 Jan 08 ] | ||||||||||||||||
I installed the attached zabbix_agentd.exe.gz on server running active mode and I'm no longer getting data from the item. I ran in test mode for system.localtime and it did work. Its unfortunate. I guess I'll start looking for a powershell cmd to replace this item. Does anyone know if the unix agent is going to be converted to passive only as well? | ||||||||||||||||
Comment by brendon [ 2019 Jan 08 ] | ||||||||||||||||
This might be a workaround. [Math]::Floor((get-date ([DateTime]::UtcNow) -UFormat +%s)) | ||||||||||||||||
Comment by Glebs Ivanovskis [ 2019 Jan 09 ] | ||||||||||||||||
I'm afraid this approach will not work either. Let me elaborate. Here is the process of "monitoring" system clock using Zabbix.
To summarize:
I'm actually surprised myself that from these calculations follows that passive check and old active check were not the same! The latter was effectively checking time difference at a different time point - when agent sent the data, not when agent collected it. Math does not lie. And I have to say that in case of passive check, the time it takes agent to get the value is not accounted for. In the worst scenario it can take up to Timeout and is in my opinion the main contribution factor of fluctuations observed by andris above. My super subjective opinion is that everyone should abandon system.localtime + fuzzytime() combination for the purpose of monitoring system clock and switch to monitoring NTP service logs instead. NTP is highly sophisticated and efficient protocol which accounts for network delays and other factors. By monitoring how frequently NTP service is able to successfully sync system clock with external reference and how big are the adjustments it has to make you will get a pretty good picture of how healthy host's system clock is. I would try this template or something similar. | ||||||||||||||||
Comment by richlv [ 2019 Jan 09 ] | ||||||||||||||||
Gleb, thank you for the great analysis and the detailed writeup. Reminiscent of the "old times" when Zabbix developers would post those interesting, useful comments <dimir> Ye, where's the "Like" button? | ||||||||||||||||
Comment by Aigars Kadikis [ 2019 Jan 09 ] | ||||||||||||||||
Hello bbrendon
Please elaborate. What command did you use to test? What is the operating system?
| ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 09 ] | ||||||||||||||||
Attachments | ||||||||||||||||
Comment by brendon [ 2019 Jan 09 ] | ||||||||||||||||
Previously I tested on 2008r2 and 2012r2 on an active server. When I got a successful test it was using the -t parameter. I tested zabbix_agent-4.0.4rc1-win-amd64.zip and it appears to be working fine so far on an active server. In my environment I'm only concerned about 2 minutes of drift so it took awhile before the official release started triggering false positives. | ||||||||||||||||
Comment by Aigars Kadikis [ 2019 Jan 10 ] | ||||||||||||||||
For 24h result, you can integrate a dummy echo.time key echo-windows-time.win.conf and use template echo-windows-time.xml. The time should never be drifted an interval so big as 1 second:
Regards, | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 10 ] | ||||||||||||||||
Successfully tested. | ||||||||||||||||
Comment by Michael Veksler [ 2019 Jan 11 ] | ||||||||||||||||
Available in:
| ||||||||||||||||
Comment by Glebs Ivanovskis [ 2019 Jan 11 ] | ||||||||||||||||
Shouldn't + operator be left on the first line here? | ||||||||||||||||
Comment by Andris Mednis [ 2019 Jan 11 ] | ||||||||||||||||
Yes. Also one more tab for indentation. MVekslers Done. r88529,r88530 |