[ZBX-15301] System.localtime on Windows monotonically lags behind Created: 2018 Dec 10  Updated: 2024 Apr 10  Resolved: 2019 Jan 20

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 4.0.2
Fix Version/s: 4.0.4rc1, 4.2.0alpha3, 4.2 (plan)

Type: Problem report Priority: Major
Reporter: Alexander Ivanes Assignee: Michael Veksler
Resolution: Fixed Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Server:
Linux 4.15.0-42-generic #45~16.04.1-Ubuntu SMP Mon Nov 19 13:02:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Zabbix-server 4.0.2
PostgreSQL 9.3.25
Agent:
Zabbix-agent 4.0.2
OS: various windows versions (WS2012, WS2012R2, WS2016)


Attachments: PNG File echo-time-active-item-incoming-speed.png     File echo-windows-time.win.conf     XML File echo-windows-time.xml     PNG File system-localtime-backend-p7.PNG     PNG File system-localtime-zabbix-server.PNG     PNG File time-deviation-7days.png     PNG File time-deviation-8.12.png     PNG File time-deviation-backend-p7.PNG     PNG File time-deviation-settings.PNG     PNG File time-deviation.png     Zip Archive zabbix_agent-4.0.4rc1-win-amd64-openssl.zip     Zip Archive zabbix_agent-4.0.4rc1-win-amd64.zip     Zip Archive zabbix_agent-4.0.4rc1-win-i386-openssl.zip     Zip Archive zabbix_agent-4.0.4rc1-win-i386.zip    
Issue Links:
Causes
caused by ZBX-12957 zabbix agentd use active mode traffic... Closed
Duplicate
is duplicated by ZBX-15441 Zabbix queue is growing with items fr... Closed
Team: Team A
Sprint: Sprint 47, Dec 2018, Sprint 48, Jan 2019
Story Points: 1

 Description   

After moving from 3.4 to 4.0 system.localtime start lagging behind from it's actual system value and this lag monotonically increases through time (on machines with more load it happens faster than on the other)

Time deviation is a custom calculated item (difference between localtime on host and zabbix-server)
On november, 21 we moved to 4.0. On agent restart localtime value resets to normal.



 Comments   
Comment by Andrei Gushchin (Inactive) [ 2018 Dec 11 ]

How would you measure that deviation?
Could you please show item settings and last history as values sequence.
What zabbix agent version was and current?

Comment by Ingus Vilnis [ 2018 Dec 11 ]

Not 100% sure but could be because of this:
https://www.zabbix.com/documentation/4.0/manual/installation/upgrade_notes_400#timestamp_correction
ZBX-12957

Comment by Alexander Ivanes [ 2018 Dec 11 ]

Not 100% sure but could be because of this:
https://www.zabbix.com/documentation/4.0/manual/installation/upgrade_notes_400#timestamp_correction

We don't use proxy if that change only related to it.

How would you measure that deviation?

Basically it's just a subtraction of host system.localtime and a zabbix-server system.localtime:

min("system.localtime", 600)-min("rainbow1-3:system.localtime", 600)

It's not perfect by any means but can give you approximate picture. For trigger we using fuzzytime.

Could you please show item settings and last history as values sequence.

Item settings:

Item values:

Again, i would say that problem is not with this item but with localtime values on the windows host (if you need more i can query a DB):

Localtime values (windows):

Localtime values (zabbix-server):

What zabbix agent version was and current?

Was: 3.4.11
Current: 4.0.2 (4.0.1 also affected)

Comment by Alexander Ivanes [ 2018 Dec 11 ]

Not 100% sure but could be because of this:
https://www.zabbix.com/documentation/4.0/manual/installation/upgrade_notes_400#timestamp_correction
ZBX-12957

Didn't notice the issue report. Yes, it's really similar and yes system.localtime is an active type item.

 

Comment by Vladislavs Sokurenko [ 2018 Dec 11 ]

Regarding ZBX-12957, from the source code it looks like also related to receiving history from agent see changes in parse_history_data()

Comment by Glebs Ivanovskis [ 2018 Dec 12 ]

Hi all!

First of all, as vso correctly pointed out, changes ZBX-12957 also affect active agents, so docs are not entirely correct.

Secondy, after ZBX-12957 system.localtime item does not make any sense as active check. Since server trusts timestamps of the agent, timestamp value of system.localtime will exactly match values of system.localtime. However, in your case they differ for some reason. Is that host monitored by proxy? If yes, what is proxy's version?

Anyway, the way you calculate the time difference should be robust enough with respect to small timestamp adjustments. Could the drift be caused by the way active agent schedules checks?

Comment by Alexander Ivanes [ 2018 Dec 12 ]

Hi!

Secondy, after ZBX-12957 system.localtime item does not make any sense as active check. Since server trusts timestamps of the agent, timestamp value of system.localtime will exactly match values of system.localtime. However, in your case they differ for some reason. Is that host monitored by proxy? If yes, what is proxy's version?

No we don't use a proxy. Switch to non active checks indeed solved this problem.

Anyway, the way you calculate the time difference should be robust enough with respect to small timestamp adjustments. Could the drift be caused by the way active agent schedules checks?

The problem is that with active check difference between timestamps increases over time and returns to normal after agent restart. On Linux, however, it stays the same all the time. Seems like some kind of leak or queue to me.

Comment by Ingus Vilnis [ 2018 Dec 12 ]

Hi guys,

Gleb, according to what you are saying, example 10 here is no longer valid in case the item is active?
https://www.zabbix.com/documentation/4.0/manual/config/triggers/expression?s[]=fuzzytime

And the high queue for active items sub 1 minute on 4.0 is also related to ZBX-12957, right?

Comment by Glebs Ivanovskis [ 2018 Dec 12 ]

Dear ingus.vilnis, yes and (probably) no, respectively. Not sure about the latter, could you point me to the specific issue?

Comment by Ingus Vilnis [ 2018 Dec 12 ]

Gleb, about the localtime active becoming pointless you made me very sad, as active-only templates are widely used where I am now. And that is also one of the item-trigger combos. Too bad yet another good feature got broken after working flawlessly for years. 

About the queue we might go totally off topic from the original post and bug tracker is not the right place to troubleshoot this but in short my assumption is based on the observations that on one 3.0.12 instance @1200nvps queue under 5 minutes is about 100 active items. 

On a brand new 4.0.2 clean install built according to the best feng shui I know @450 nvps and growing (to replace 3.0 in parallel) this queue for active items under 5 minutes is about 11K! 

I have not checked if there is exact ZBX but I have spotted folks going about this in the forum too. I thought the reason was ZBX-12957 but now looks like it is not.

Comment by Alexander Ivanes [ 2018 Dec 12 ]

On a brand new 4.0.2 clean install built according to the best feng shui I know @450 nvps and growing (to replace 3.0 in parallel) this queue for active items under 5 minutes is about 11K! 

Ingus, where did you found queue length for that? Is that zabbix[queue] internal item?

We have about ~1250 values processed per second with avg queue length ~40.

99% of our items are active or trappers types.

Comment by Ingus Vilnis [ 2018 Dec 12 ]

Alexander, Administration -> Queue -> Overview page is what I am talking about, but you can also graph the changes over time, check the optional parameters for zabbix[queue,<from>,<to>] item. 

https://www.zabbix.com/documentation/4.0/manual/config/items/itemtypes/internal

 

Dom, what are you talking about?

Comment by Alexander Ivanes [ 2018 Dec 12 ]

Alexander, Administration -> Queue -> Overview page is what I am talking about, but you can also graph the changes over time, check the optional parameters for zabbix[queue,<from>,<to>] item.

https://www.zabbix.com/documentation/4.0/manual/config/items/itemtypes/internal

As i thought. We doens't have that much of a queue then.

Comment by Glebs Ivanovskis [ 2018 Dec 12 ]

Ingus, I'm taking this personally, because I insisted on the way ZBX-12957 was resolved. In my opinion it is pointless trying to be "do-it-all" product (including being NTP substitute) because eventually it ends up in "do-it-all-poorly" kind of product. I'm sure that system.localtime + fuzzytime() wasn't working "flawlessly" simply because Zabbix has never had a proper way of accounting for network delays or CPU overloads. By the way, system.localtime in passive mode is still subjected to this factor. NTP has clever algorithms to account for network delays and in my opinion you should enable NTP service and monitor its logs in order to be sure about precise system time on your hosts.

On top of that, messing with value timestamps was causing a number of other difficult to troubleshoot issues.

If the queue is something that bothers you, then you should definitely create a ticket and we can discuss it there.

Comment by Glebs Ivanovskis [ 2018 Dec 12 ]

Back to the original issue. I think I have an explanation for this:

...timestamp value of system.localtime will exactly match values of system.localtime. However, in your case they differ for some reason.

Actually, for the value of system.localtime Zabbix agent uses time() function based on system clock which can be adjusted by NTP service or manually by user. But for time stamping values Zabbix agent on Windows uses QueryPerformanceCounter() which according to Microsoft docs:

...is independent of and isn't synchronized to any external time reference.

This explains the difference between system.localtime value and its timestamp and the fact that time drift is "re-set" on agent restart. This may also explain the queue of active checks if only Windows hosts are affected (Ingus, is it so?) since Zabbix server counts item as delayed starting with 6 seconds behind schedule.

Comment by Glebs Ivanovskis [ 2018 Dec 12 ]

Dear vergil, can you show a close-up of this?

I would like to see the size of individual "jumps".

Are both system.localtime items updated every 15 seconds?

Comment by Ingus Vilnis [ 2018 Dec 12 ]

Oh, Gleb, don't take this personally. Knowing you I believe you did improve the issues related to timestamps the best way possible. 

I agree that Zabbix should not be a replacement for NTP, but there are so many cases when Zabbix was the product able to tell about problems in other fields, including NTP. 

I am more about the fact how this all looks and is presented to the loyal end users of Zabbix. Upgrade notes do not mention anything what you just said. User by reading the notes would think, ok, I don't really know what the timestamp correction actually means, unlikely I make any use of it anyways, so whatever. But turns out the stuff is broken. And I was also convinced by the docs that active checks are good and triggering for drifted clock is also a good idea. Apparently not. 

Zabbix was doing a good job in becoming a "do-it-all-and-do-it-good" product. Just with release of 4.0 users do have actually mixed feelings whether the improvements are worth the tradeoffs. Mostly they are but the latter feel noticeable too.

About the queue - yes, on that particular instance 90% are Windows hosts. But they simultaneously exist also on the old 3.0 where the queue is 100 times smaller. It does not bother me on that 4.0 as the performance is flying, just curious what was causing it and doubt whether it will not cause any real issues as the system will be growing. I will keep an eye on that and create another ticket if needed. 

Thank you for all!  

Comment by Alexander Ivanes [ 2018 Dec 12 ]

Actually, for the value of system.localtime Zabbix agent uses time() function based on system clock which can be adjusted by NTP service or manually by user. But for time stamping values Zabbix agent on Windows uses QueryPerformanceCounter()

I think you can look at GetSystemTimePreciseAsFileTime for this job (it's windows 8\2012+, yes, but at least for that versions it's pretty good candidate).

Are both system.localtime items updated every 15 seconds?

Yes, you're right. Update interval is 15s.

I would like to see the size of individual "jumps".

Here's picture for the last 7 days (it'll help with the next one) 10.12 at 6pm was the agent restart, 11.12 at 3pm i changed item type to passive:

And here's this graph for 08.12:

If you need more precise one, just ask.

Comment by Glebs Ivanovskis [ 2018 Dec 12 ]

Thank you for the screenshots! On your hosts timestamp of system.localtime is actually ahead of its value. Let's just take the first one from here:

Timestamp Value
2018-12-11 12:45:57 1544521534
$ TZ=":Europe/Moscow" date -d @1544521534 --rfc-3339=seconds
2018-12-11 12:45:34+03:00

So the clock Zabbix agent uses for time stamping ticks faster than it should. It basically shifts all the data points to the right (into the future) along time axis. This brings relatively "older" values in the interval of min(..., 600) and these values are smaller. That's why you observe it as "system.localtime on Windows monotonically lags behind". So it all boils down to wrong time stamping by agent.

Does this explanation make sense to you guys?

Comment by Glebs Ivanovskis [ 2018 Dec 13 ]

About the queue - yes, on that particular instance 90% are Windows hosts.

ingus.vilnis, could you check system.localtime on these hosts? Do values and their timestamps differ?

Comment by Ingus Vilnis [ 2018 Dec 13 ]

Gleb, they do. Interestingly enough the received values are 15-70 seconds ahead of time displayed by Zabbix server (differs from host to host). Time on Zabbix server is synced by NTP and manual time check confirmed the clock is correct there. 

I did the comparison in Latest data -> Values -> CSV for a few randomly picked hosts. Or is there another way you'd suggest?

Comment by Glebs Ivanovskis [ 2018 Dec 13 ]

Values represent system time on the host which is synchronized and should be OK. Agent uses this clock to schedule checks, so the schedule should be fine too. Timestamps represent that strange QueryPerformanceCounter() clock agent uses to stamp values. If values are ahead, then QueryPerformanceCounter() is behind and Zabbix server sees timestamps which are 15-70 seconds smaller, than it expects according to schedule. And this is the reason why you see so many items in queue.

Everything boils down to the wrong time source used by active agent to stamp values on Windows. Dear vso, are you following?

Comment by Vladislavs Sokurenko [ 2018 Dec 13 ]

Thanks cyclone for looking into this, we need to find a suitable alternative to QueryPerformanceCounter() ?

Comment by Glebs Ivanovskis [ 2018 Dec 13 ]

Have a look at the link I've posted above. It suggests using GetSystemTimePreciseAsFileTime().

Comment by Vladislavs Sokurenko [ 2018 Dec 13 ]

I think it's settled then, thanks once again !

Comment by Michael Veksler [ 2018 Dec 14 ]

It will not be easy. Because:

  1. GetSystemTimePreciseAsFileTime() start with w8.
  2. the function is part of kernel.dll (we can't repeat the trick with /DELAYLOAD)

It is mean that when we start  the agent, we have to find previous loaded library kernel.dll and than try to find pointer to function by name.

https://docs.microsoft.com/en-us/windows/desktop/SysInfo/acquiring-high-resolution-time-stamps

or something like this

Comment by Glebs Ivanovskis [ 2018 Dec 14 ]

Maybe a lower resolution clock can be used? What's the point of knowing nanoseconds anyway if QueryPerformanceCounter() drifts by 5 seconds a day?

Comment by Michael Veksler [ 2018 Dec 20 ]

As a solution I propose to combine both methods: get the time through _ftime() and compute micro & nano as difference between current, previous call of QueryPerformanceCounter(), _ftime() and 'last_ts'
If the computed difference is < 1 millisecond, we add this value to the time from _ftime().

	_ftime(&tb);

	ts->sec = (int)tb.time;
	ts->ns = tb.millitm * 1000000;

	if (TRUE == (rc = QueryPerformanceFrequency(&tickPerSecond)))
	{
		if (TRUE == (rc = QueryPerformanceCounter(&tick)))
		{
			int	ns = 0;

			if (0 < last_tick.QuadPart)
			{
				LARGE_INTEGER	qpc_tick, ntp_tick = {0};

				/* _ftime () returns precision in milliseconds, but 'ns' could be increased */
				if (ts->sec == last_ts->sec && ts->ns < last_ts->ns)
				{
					ts->ns = last_ts->ns;
				}
				else
				{
					ntp_tick.QuadPart = tickPerSecond.QuadPart * (ts->sec - last_ts->sec)
						+ tickPerSecond.QuadPart * (ts->ns - last_ts->ns) / 1000000000;
				}

				qpc_tick.QuadPart = tick.QuadPart - last_tick.QuadPart - ntp_tick.QuadPart;

				if (0 < qpc_tick.QuadPart && 0 == (qpc_tick.QuadPart / tickPerSecond.QuadPart))
					ns = (int)(1000000000 * qpc_tick.QuadPart / tickPerSecond.QuadPart);
			}

			if (0 < ns && 1000000 > ns)	/*  value less than 1 millisecond */
			{
				ts->ns += ns;

				while (ts->ns >= 1000000000)
				{
					ts->sec++;
					ts->ns -= 1000000000;
				}
			}

			last_tick = tick;
		}
	}

 

Available for testing in svn://svn.zabbix.com/branches/dev/ZBX-15301-40

Comment by Glebs Ivanovskis [ 2018 Dec 20 ]

Looks too complicated

Wouldn't this be enough?

_ftime(&tb);

ts->sec = (int)tb.time;
ts->ns = tb.millitm * 1000000;

if (ts->sec == last_ts->sec && ts->ns / 1000000 == last_ts->ns / 1000000)	/* if previous call returned time stamp in the same millisecond */
	ts->ns = last_ts->ns + 1;						/* ensure that issued time stamps monotonically increase */
										/* (in the absence of corrections by NTP service) so that */
										/* correct order of values can be restored afterwards */
Comment by Glebs Ivanovskis [ 2018 Dec 20 ]

the function is part of kernel.dll (we can't repeat the trick with /DELAYLOAD)

Dear MVekslers, could you please elaborate? Both QueryPerformanceCounter() and GetSystemTimePreciseAsFileTime() are from kernel32.dll according to their docs.

MVekslers Yes. QueryPerformanceCounter from w2k and QueryPerformanceCounter from w8/ws12

cyclone ...and since Zabbix still supports Windows 2000 and provides only one binary for all possible Windows versions you can't just include two pieces of code under #ifdef's. Got it. Thanks!

MVekslers  ... from XP

Comment by Michael Veksler [ 2018 Dec 20 ]

And we get different behavior for Win and Linux.
If we take into account only 'system.localtime' - I totally agree, but zbx_timespec() is a more general place. Therefore I am trying to recover ns for timestamp.

Btw, at the end of zbx_timespec() is present the logic for monotonically time increasing:

	if (last_ts->ns == ts->ns && last_ts->sec == ts->sec)
	{
		ts->ns += ++corr;

		while (ts->ns >= 1000000000)
		{
			ts->sec++;
			ts->ns -= 1000000000;
		}
	}
	else
	{
		last_ts->sec = ts->sec;
		last_ts->ns = ts->ns;
		corr = 0;
	}

Therefore the fix without 'ns' is easy...

Comment by Glebs Ivanovskis [ 2018 Dec 20 ]

And we get different behavior for Win and Linux.

Depends on what you mean by "behaviour"

zbx_timespec() is a more general place

Looks like this function violates SRP. I would suggest making accurate timestamps (as close as possible to "absolute" clock) the only responsibility of zbx_timespec() and offload the task of precise timestamps (monotonic and with highest resolution possible for measuring small time intervals) to zbx_time(). The latter should be rewritten not to rely on zbx_timespec(), at least on Windows.

Comment by richlv [ 2018 Dec 20 ]

Win 2000 is older than XP by ~2 years, but I might be too pedantic there

Comment by Andris Mednis [ 2019 Jan 03 ]

Related forum post https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/371408-zabbix-agent-sends-wrong-time-localtime-utc

Comment by Andris Mednis [ 2019 Jan 03 ]

It could be that Zabbix agent reports incorrect system.localtime[] on Windows... or it could be that time on Windows is not correct.

To find out we can try to collect localtime in an alternative way:

  • add a line
    UserParameter=time_from_shell,echo %time%
    

    to zabbix_agentd.conf on Windows, restart agent.

  • create item
    Name: time_from_shell
    Type: Zabbix agent
    Key: time_from_shell
    Type of information: Character
    Update interval: 10s <--- or other as you prefer

When you see that system.localtime[] has drifted away, try to compare it with time_from_shell - have both of them drifted ? Only one of them ?

Comment by Andris Mednis [ 2019 Jan 03 ]

I was not able to reproduce monotonic time drift but observed large fluctuations of local time under heavy load.

Here are some observations from Windows 10 Pro virtual machine, hosted  on VirtualBox (Linux).

With VirtualBox settings "System -> Acceleration -> Paravirtualization interface": "Default", 1 CPU, Execution cap: 44% (to make CPU closer to 100% busy as shown by Task Manager).
For better time keeping I also set

$ VBoxManage guestproperty set "Windows 10" "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 100

Results

 Windows 10 virtual host OpenSSL: time_from_shell

2019-01-03 15:27:34 1546522054 "15:27:17,70"   4.39
2019-01-03 15:27:26 1546522046 "15:27:13,31"   7.49
2019-01-03 15:27:15 1546522035 "15:27:05,82"  11.87
2019-01-03 15:27:04 1546522024 "15:26:53,95"  10.58
2019-01-03 15:26:55 1546522015 "15:26:43,37"  11.45
2019-01-03 15:26:44 1546522004 "15:26:31,92"  11.42
2019-01-03 15:26:34 1546521994 "15:26:20,50"  10.95
2019-01-03 15:26:24 1546521984 "15:26:09,55"  11.27
2019-01-03 15:26:14 1546521974 "15:25:58,28"   5.83
2019-01-03 15:26:05 1546521965 "15:25:52,45"   7.40
2019-01-03 15:25:54 1546521954 "15:25:45,05"   8.43
2019-01-03 15:25:45 1546521945 "15:25:36,62"   9.22
2019-01-03 15:25:35 1546521935 "15:25:27,40"   8.43
2019-01-03 15:25:25 1546521925 "15:25:18,97"  12.75
2019-01-03 15:25:14 1546521914 "15:25:06,22"  12.45
2019-01-03 15:25:04 1546521904 "15:24:53,77"  12.25
2019-01-03 15:24:54 1546521894 "15:24:41,52"  12.64
2019-01-03 15:24:44 1546521884 "15:24:28,88"  12.41
2019-01-03 15:24:34 1546521874 "15:24:16,47"      -

Every 10 sec (approximately) Zabbix server does a passive check of UserParameter=time_from_shell,echo %time%, connection is PSK-based encrypted. The right-most column shows computed difference (in seconds) between adjacent results. Ideally it should be always 10.00.

Surprisingly, sometimes it is as small as 4 sec or as large as 12 secs. So, in this example it is not that Windows local time is monotonically lagging behind. Sometimes clock goes faster, sometime slower - under heavy load. If load is small, the difference tends to be 10.00 as expected.

Less deviations were observed with VirtualBox settings "System -> Acceleration -> Paravirtualization interface": "None" - this keeps time better.

Comment by Andris Mednis [ 2019 Jan 03 ]

2 previous comments are not intended to say that system.localtime[] item on WIndows is quite good. We are working to improve algorithm for getting system.localtime.

Comment by Glebs Ivanovskis [ 2019 Jan 03 ]

As far as I understand the issue is not about values of system.localtime being bad, but about timestamps of all active checks on Windows being inaccurate.

Comment by Andris Mednis [ 2019 Jan 03 ]

Timestamps for active checks come from the same source as system.localtime - from zbx_timespec().

Comment by Glebs Ivanovskis [ 2019 Jan 03 ]

No, system.localtime comes from time().

Comment by Andris Mednis [ 2019 Jan 04 ]

Ok, from _ftime() on Windows.

Comment by Glebs Ivanovskis [ 2019 Jan 04 ]

With no parameters or default "utc" time() is used regardless of the platform.

Comment by Andris Mednis [ 2019 Jan 04 ]

Thanks, Gleb! You're right.

Comment by gofree [ 2019 Jan 07 ]

Hi guys, maybe a dummy question - after this will be resolved will the active agent work again with this key ?

Thanks

Comment by Glebs Ivanovskis [ 2019 Jan 07 ]

I'd guess no.

Comment by brendon [ 2019 Jan 08 ]

I installed the attached zabbix_agentd.exe.gz on server running active mode and I'm no longer getting data from the item.  I ran in test mode for system.localtime and it did work. Its unfortunate. I guess I'll start looking for a powershell cmd to replace this item. 

Does anyone know if the unix agent is going to be converted to passive only as well?

Comment by brendon [ 2019 Jan 08 ]

This might be a workaround.

[Math]::Floor((get-date ([DateTime]::UtcNow) -UFormat +%s))

Comment by Glebs Ivanovskis [ 2019 Jan 09 ]

I'm afraid this approach will not work either. Let me elaborate. Here is the process of "monitoring" system clock using Zabbix.

  1. Getting value of system clock on the host. In all cases it's done by the agent. Passive or active, system.localtime item or custom script, ... it does not matter. It is time of the host t‍agent.
  2. Time stamping the value.
    • In case of passive check this is done by the server (or proxy, but let's keep it simple) and the result is t‍server.
    • In case of active check this is done by agent itself and results in t‍agent.
      • Before the aforementioned ZBX-12957 there was an extra step done by server (or proxy, but we are simplifying). Active agent sends in bulk values and their timestamps as well as attaches it's current clock t'agent. Server takes a look it's current clock t'server and adds the difference to the timestamps assuming the difference hasn't changed since the value had been collected and neglecting all delays in the network. As a result the timestamp is t‍agent + (t'server - t'agent).
  3. Processing value in the trigger expression. When server calculates trigger expression based on a new value it counts values, takes all intervals and all time shifts of trigger functions with respect to the timestamp of the new value, because sometimes values are delayed for quite a while and it needs to "replay" the history of events as it was unfolding. (nodata() is always evaluated with respect to the current server's time, but let's forget it now.)
  4. fuzzytime() takes the time point expression is being evaluated with respect to (i.e. the timestamp), subtracts the value, discards the sign and compares to a predefined parameter.

To summarize:

  value timestamp in fuzzytime()
Passive check t‍agent t‍server |t‍agent - t‍server|‍‍
Active check (before ZBX-12957) t‍agent t‍agent + (t'server - t'agent) |t‍agent - t‍agent + (t'server - t'agent)| = |t'server - t'agent|‍‍
Active check (after ZBX-12957) t‍agent t‍agent |t‍agent - t‍agent| = 0

I'm actually surprised myself that from these calculations follows that passive check and old active check were not the same! The latter was effectively checking time difference at a different time point - when agent sent the data, not when agent collected it. Math does not lie.

And I have to say that in case of passive check, the time it takes agent to get the value is not accounted for. In the worst scenario it can take up to Timeout and is in my opinion the main contribution factor of fluctuations observed by andris above.

My super subjective opinion is that everyone should abandon system.localtime + fuzzytime() combination for the purpose of monitoring system clock and switch to monitoring NTP service logs instead. NTP is highly sophisticated and efficient protocol which accounts for network delays and other factors. By monitoring how frequently NTP service is able to successfully sync system clock with external reference and how big are the adjustments it has to make you will get a pretty good picture of how healthy host's system clock is. I would try this template or something similar.

Comment by richlv [ 2019 Jan 09 ]

Gleb, thank you for the great analysis and the detailed writeup. Reminiscent of the "old times" when Zabbix developers would post those interesting, useful comments

<dimir> Ye, where's the "Like" button? 

Comment by Aigars Kadikis [ 2019 Jan 09 ]

Hello bbrendon

I installed the attached zabbix_agentd.exe.gz on server running active mode and I'm no longer getting data from the item. I ran in test mode for system.localtime and it did work. Its unfortunate. I guess I'll start looking for a powershell cmd to replace this item.

Please elaborate. What command did you use to test? What is the operating system?

 

Comment by Andris Mednis [ 2019 Jan 09 ]

Attachments
zabbix_agent-4.0.4rc1-win-amd64-openssl.zip
zabbix_agent-4.0.4rc1-win-amd64.zip
zabbix_agent-4.0.4rc1-win-i386-openssl.zip
zabbix_agent-4.0.4rc1-win-i386.zip
are the latest fix to try - from svn://svn.zabbix.com/branches/dev/ZBX-15301-40 r88428.

Comment by brendon [ 2019 Jan 09 ]

Please elaborate. What command did you use to test? What is the operating system?

Previously I tested on 2008r2 and 2012r2 on an active server. When I got a successful test it was using the -t parameter. I tested zabbix_agent-4.0.4rc1-win-amd64.zip and it appears to be working fine so far on an active server. In my environment I'm only concerned about 2 minutes of drift so it took awhile before the official release started triggering false positives.

Comment by Aigars Kadikis [ 2019 Jan 10 ]

For 24h result, you can integrate a dummy echo.time key echo-windows-time.win.conf and use template echo-windows-time.xml.

The time should never be drifted an interval so big as 1 second:

Regards,

Comment by Andris Mednis [ 2019 Jan 10 ]

Successfully tested.

Comment by Michael Veksler [ 2019 Jan 11 ]

Available in:

  • 4.0.4rc1 r88529
  • 4.2.0alpha3 (trunk) r88530.
Comment by Glebs Ivanovskis [ 2019 Jan 11 ]

Shouldn't + operator be left on the first line here?

Comment by Andris Mednis [ 2019 Jan 11 ]

Yes. Also one more tab for indentation.

MVekslers Done. r88529,r88530

Generated at Wed Apr 24 19:47:22 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.