[ZBX-24658] Incorrect server+proxy interoperability when proxy load balancing is used Created: 2024 Jun 14  Updated: 2024 Aug 30  Resolved: 2024 Jul 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 7.0.0
Fix Version/s: 7.0.2rc2, 7.2.0alpha1

Type: Problem report Priority: Trivial
Reporter: Matt Deeds Assignee: Andris Zeila
Resolution: Fixed Votes: 4
Labels: load-balancing, proxy
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Zabbix 7.0.0, Zabbix Proxy 7.0.0, Zabbix Agent 2 7.0.0, Hosts are Windows server 2016/2019/2022


Attachments: Text File 24658_zabbix_agent2_config.txt     File HOST_Newly Assigned Proxy.csv     File HOST_Originally Assigned Proxy.csv     File PROXY_Newly Assigned Proxy.csv     File PROXY_Originally Assigned Proxy.csv     Text File zabbix_agent2-1.log     Text File zabbix_agent2.log     File zabbix_proxy_sanitized.conf    
Issue Links:
Duplicate
Team: Team A
Sprint: S24-W30/31
Story Points: 0.25

 Description   

Steps to reproduce:

  1. Update Zabbix Agent 2 to version 7.0.0
  2. Hosts are pointed at proxy groups in active mode
  3. Assign Template "Windows by Zabbix agent active" version "Zabbix, 6.4-0"
  4. Wait some random amount of time...
  5. Agent logs start repeating "no active checks on server [PROXY_ID:PORT]: host [HOSTNAME] not found"
  6. Windows service shows agent is still running but agent activity halts and no data is sent to proxy/server
  7. Restarting the agent resolves the issue 

Result:
See attached log file.
The actual proxy IP address/port and the host's hostname have been replaced with PROXY_IP:PORT and HOSTNAME.
Expected:
Zabbix agent does not randomly freeze on Windows hosts 



 Comments   
Comment by Markku Leiniö [ 2024 Jun 15 ]

This is not the community forum, but a few questions as a community member anyway.

What do you mean by "proxy groups" in plural? You should only point one host's ServerActive to member(s) of one proxy group.

At the time of "no active checks on server <proxy>", what does the Zabbix server proxy list say about the load sharing: which proxy should be monitoring that host?

Comment by Edgar Akhmetshin [ 2024 Jun 17 ]

Hello Matt

In addition to Markku's questions, please provide the agent configuration used.

Regards,
Edgar

Comment by Matt Deeds [ 2024 Jun 17 ]

"proxy groups" is a typo. We just have the one proxy group consisting of two proxies. 

I don't have an answer to Markku's second question yet. Next time this issue happens, I can provide that answer. 

I've attached the agent configuration file 24658_zabbix_agent2_config.txt

Comment by Matt Deeds [ 2024 Jun 18 ]

We've been able to capture the agent logs during the time of the issue with debug level 4 (zabbix_agent2-1.log).

 

2024/06/17 07:40:42.192649 is when we first see "no active checks on server [PROXY_B_IP:10051]"; this is around the time we stop seeing data. We see this message for about 5 minutes in the logs.

During this time, we see "In refreshActiveChecks() from PROXY_B_IP:10051,PROXY_A_IP:10051".

Then at 2024/06/17 07:45:14.187490 we see the last "no active checks on server [PROXY_B_IP:10051]" message. 

Shortly after this time we see a strange version of "In refreshActiveChecks()" but with PROXY_A duplicated: 

"2024/06/17 07:45:26.178372 [101] In refreshActiveChecks() from PROXY_A_IP:10051,PROXY_B_IP:10051,PROXY_A_IP:10051"

To me this seems strange. This duplicated PROXY_A message continues until we restart the agent at 2024/06/17 08:52:57.210613. After the restart the duplicated PROXY_A goes away.
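To find the failure window in a debug-level agent 2 log, something along these lines could be used. This is a minimal sketch: the timestamp layout is assumed from the log lines quoted above, `find_failure_window` is a hypothetical helper name, and the sample lines are constructed from the messages and timestamps in this comment rather than copied from the attached log.

```python
import re
from datetime import datetime

# Assumed agent 2 log layout: "YYYY/MM/DD HH:MM:SS.ffffff <message>"
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) .*"
    r"no active checks on server \[([^\]]+)\]"
)

def find_failure_window(lines):
    """Return (first, last, count) for 'no active checks' messages, or None."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)

# Sample lines built from the timestamps and messages quoted in this comment.
sample = [
    "2024/06/17 07:40:42.192649 no active checks on server [PROXY_B_IP:10051]: host [HOSTNAME] not found",
    "2024/06/17 07:45:14.187490 no active checks on server [PROXY_B_IP:10051]: host [HOSTNAME] not found",
    "2024/06/17 07:45:26.178372 [101] In refreshActiveChecks() from PROXY_A_IP:10051,PROXY_B_IP:10051,PROXY_A_IP:10051",
]

first, last, count = find_failure_window(sample)
print(first, last, count, last - first)
```

With the two timestamps above, the window comes out at roughly four and a half minutes.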

Thank you,
Matt

Comment by Matt Deeds [ 2024 Jun 19 ]

Searching through our firewall logs we were able to see that the host switches from PROXY_A to PROXY_B around 7:40:38.000, right when we start seeing "no active checks on server [PROXY_B_IP:10051]".

The host switches back over to PROXY_A at 7:45:20.000, around the time we stop seeing the  "no active checks on server" and start seeing the duplicated PROXY_A message. 

Comment by Markku Leiniö [ 2024 Jun 19 ]

Just to be sure: Is the presented agent configuration file the "final" configuration? Asking because you have:

Include=C:\Program Files\Server\Managed\Resources\zabbix_team.conf
Include=C:\Program Files\Zabbix Agent 2\zabbix_agent2.d\
Include=C:\Program Files\Server\Customer\zabbix_parameters.conf

Comment by Matt Deeds [ 2024 Jun 24 ]

Hello Markku, 

This is the final configuration. 

We can also confirm that this same behavior is happening on Linux agents as well. It continues until we restart the proxy service. 

Comment by Markku Leiniö [ 2024 Jun 24 ]

How are the proxies configured?

In the Zabbix community forum there was a similar problem where the root cause was an incorrect proxy configuration: both proxies had the same Hostname directive configured.

Comment by Matt Deeds [ 2024 Jun 24 ]

That doesn't look like the issue we are having. Both of our proxies have unique hostnames defined. I've attached a proxy config file zabbix_proxy_sanitized.conf

Comment by Matt Deeds [ 2024 Jun 26 ]

After some more digging, it seems that this Zabbix blog post by Markku (https://blog.zabbix.com/zabbix-7-0-proxy-load-balancing/28173/) describes the issue we are having, specifically the section "Proxy is online but unreachable from the active agent": 

"This is a non-recoverable situation (at least with the current Zabbix 7.0.0) while the reachability issue persists: The agent keeps on contacting Proxy 1, keeps receiving the redirection, and the same repeats over and over again."

This seems like a bug where our only solution is to regularly restart the proxy service. 
Is this how the proxy load balancing works by design or an issue that is being looked into? 

Thanks

Comment by Markku Leiniö [ 2024 Jun 26 ]

In the blog post testing I never got your "no active checks on server [PROXY_ID:PORT]: host [HOSTNAME] not found" error in the agent (non-2) log. Basically the proxy is responding with a redirection to the correct proxy (based on the server-induced knowledge of the proxy group state), not with an error message.

Here is the agent log from the case:

 23378:20240611:222340.599 Starting Zabbix Agent [Zabbix70-agent]. Zabbix 7.0.0 (revision 49955f1fb5c).
...

23390:20240611:222858.799 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out]
 23390:20240611:222901.804 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out]
 23390:20240611:222901.806 Active check configuration update started to fail
 23390:20240611:222905.809 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out]
 23390:20240611:222908.813 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out]
 23390:20240611:222908.814 Active check data upload started to fail
 23390:20240611:222955.873 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out]
 23390:20240611:222958.877 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out]
 23390:20240611:222958.878 Unable to send heartbeat message to [192.168.7.82]:10051 [sequential redirect responses detected]
 23390:20240611:223133.004 active check data upload to [192.168.7.81:10051] is working again
 23390:20240611:223134.008 Active check configuration update from [192.168.7.81:10051] is working again
 23390:20240611:223146.014 Successfully sent heartbeat message to [192.168.7.81]:10051

Comment by Matt Deeds [ 2024 Jun 28 ]

We were able to get some tcpdump captures showing that when a host is automatically redistributed to another proxy in the group, it appears to never tell the agent this.

In this situation the host was originally pointed at PROXY_A and was then redistributed to PROXY_B. PROXY_B was now shown as the assigned proxy in the host configuration in the Zabbix GUI. Our tcpdump captures show the Zabbix agent on the host continuing to try to connect to PROXY_A (HOST_Originally Assigned Proxy.csv). The only traffic from PROXY_B seen on the host is a single TCP port check (which is passive) going to port 3389 (HOST_Newly Assigned Proxy.csv).

So passive checks appear to work as expected with proxy groups; it is only active checks that are failing, which is consistent with what we were seeing in the agent logs. The tcpdump captures from both the proxies and the host show that the agent is still sending traffic to the originally assigned proxy (PROXY_A) and not to the new one it was shifted over to (PROXY_B).
Here are the tcpdump captures from PROXY_A (PROXY_Originally Assigned Proxy.csv) and PROXY_B (PROXY_Newly Assigned Proxy.csv). 

Comment by Markku Leiniö [ 2024 Jun 28 ]

In your agent configuration you showed that RefreshActiveChecks was at default, meaning 5 seconds in Zabbix 7.0.0 agent 2.

In your captures there is a 6-second interval between the agent requests.

Why is that?

Comment by Ali Berry [ 2024 Jun 28 ]

I believe that may have been a fluke, because we've checked a few other hosts that were having issues and they are sending requests every 5 seconds when looking at tcpdump. The content of the packets is still the same between them.

Comment by Markku Leiniö [ 2024 Jul 06 ]

I experienced this same case as well now in my test environment. Unfortunately I don't currently have any more information/captures than these:

  1. Agent was in stopped state for days, monitored by a proxy group, assigned to Proxy2, as shown in Zabbix UI
  2. Agent is started, and started logging this right away: "no active checks on server [192.168.7.81:10051]: host [Zabbix70-agent] not found" (where 192.168.7.81 = Proxy1)
  3. Agent is configured with ServerActive=192.168.7.81;192.168.7.82
  4. In the Proxy1 packet capture I see it responding to agent just:   {"response":"failed","info":"host [Zabbix70-agent] not found"}
  5. But when I restarted Proxy1, Proxy1 next time responded to the agent with the correct redirection: {"response":"failed","redirect":{"revision":54,"address":"192.168.7.82:10051"}} (and the agent started working with Proxy2)
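The difference between the two replies in steps 4 and 5 can be checked mechanically. The following is a minimal sketch using the two JSON bodies quoted above; `classify_response` is a hypothetical helper, and treating any reply that is not "failed" as "ok" is an assumption made for illustration.

```python
import json

def classify_response(raw):
    """Classify a proxy reply to an active check request.

    Returns ("redirect", address), ("error", info) or ("ok", None).
    """
    reply = json.loads(raw)
    if reply.get("response") == "failed":
        redirect = reply.get("redirect")
        if redirect is not None:
            return "redirect", redirect.get("address")
        return "error", reply.get("info")
    # Assumption: anything other than "failed" counts as a normal reply.
    return "ok", None

# The two replies from Proxy1 quoted in the list above.
before_restart = '{"response":"failed","info":"host [Zabbix70-agent] not found"}'
after_restart = '{"response":"failed","redirect":{"revision":54,"address":"192.168.7.82:10051"}}'

print(classify_response(before_restart))
print(classify_response(after_restart))
```

Only the second reply gives the agent another address to try; the first leaves it with nothing to act on.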

Unfortunately I don't have server-Proxy1 communication captured before this, but when Proxy1 was restarted, it received this in the full config from the server:

        "host_proxy": {
            "data": [
                [
                    2,
                    "Zabbix70-agent",
                    2,
                    54,
                    3,
                    "",
                    "",
                    "agent-ident",
                    "5aa5afeb13a78079d288e37b58ace825b9c8cf89cfbc403c3a34d693f581d478"
                ]
            ],
            "fields": [
                "hostproxyid",
                "host",
                "proxyid",
                "revision",
                "tls_accept",
                "tls_issuer",
                "tls_subject",
                "tls_psk_identity",
                "tls_psk"
            ]
        }, 

meaning that at this point Proxy1 got the correct "Zabbix70-agent is monitored by Proxy2" assignment in its configuration message, and Proxy1 was able to respond with a correct redirection.
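For anyone reading such captures, the "data" rows are easier to inspect once paired with the "fields" names. A minimal sketch, assuming the section has exactly the shape shown above; `rows_as_dicts` is a hypothetical helper and the PSK value is replaced by a placeholder.

```python
# The "host_proxy" section from the capture above (PSK value replaced).
host_proxy = {
    "data": [
        [2, "Zabbix70-agent", 2, 54, 3, "", "", "agent-ident", "<psk redacted>"],
    ],
    "fields": [
        "hostproxyid", "host", "proxyid", "revision", "tls_accept",
        "tls_issuer", "tls_subject", "tls_psk_identity", "tls_psk",
    ],
}

def rows_as_dicts(section):
    """Pair every data row with the field names listed in the same section."""
    return [dict(zip(section["fields"], row)) for row in section["data"]]

rows = rows_as_dicts(host_proxy)
for row in rows:
    print(row["host"], "-> proxyid", row["proxyid"], "revision", row["revision"])

# An empty "data" list yields no mappings at all for the proxy to redirect with.
print(rows_as_dicts({"data": [], "fields": host_proxy["fields"]}))
```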

It looks like Proxy1 somehow forgot that Zabbix70-agent was supposed to be monitored by Proxy2, and since the agent was not monitored by Proxy1 itself either, it had no other option than to respond "host not found". After the restart it got the full config from the server again.

 

Comment by Markku Leiniö [ 2024 Jul 07 ]

Opened ZBX-24801 about the active check interval issue (agent 2 7.0.0 requests active checks every RefreshActiveChecks+1 seconds).

Comment by Markku Leiniö [ 2024 Jul 11 ]

Experienced the issue again, this time I had packet capture running all the time. Timeline:

  • 2024-07-07 12:44: server sent to Proxy1 a "host_proxy" map: Zabbix70-agent is monitored by Proxy1 (hostmap revision increased to 62)
  • 2024-07-07 17:02: the agent (7.0.0) was intentionally shut down
  • 2024-07-08 00:45: server sent to Proxy1 a full sync with a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this looks strange) (hostmap revision is still 62)
  • 2024-07-08 03:00: server sent to Proxy1 a full sync with a "host_proxy" map: Zabbix70-agent is monitored by Proxy2 (this is because I saw the agent was assigned to Proxy1 in the GUI, and I interrupted Proxy1 connection to server for a moment, to make Zabbix server to move the agent to Proxy2 -> ok) (hostmap revision increased to 63)
  • 2024-07-09 02:45: server sent to Proxy1 a full sync with a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this looks strange) (hostmap revision is still 63)
  • 2024-07-10 04:45: server sent to Proxy1 a full sync with a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this looks strange) (hostmap revision is still 63)
  • 2024-07-11 06:45: server sent to Proxy1 a full sync with a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this looks strange) (hostmap revision is still 63)
  • 2024-07-11 16:27: The agent was intentionally started again, and it logged: no active checks on server [192.168.7.81:10051]: host [Zabbix70-agent] not found

Commentary:

  • There seems to be some kind of "full sync" timer in the server-proxy configuration messages, every 26 hours
  • There also seems to be some kind of "agent is idle" functionality that removes the agent from the proxy host map configuration messages, causing the proxy to respond with an error instead of a redirection.
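The 26-hour spacing can be verified from the Proxy1 timeline. A minimal sketch using only the four unprompted empty-map syncs listed above (the 03:00 sync on 2024-07-08 is left out because it was triggered manually); `intervals_in_hours` is a hypothetical helper.

```python
from datetime import datetime

# Unprompted full syncs with an empty "host_proxy" map, as seen on Proxy1.
empty_syncs = [
    "2024-07-08 00:45",
    "2024-07-09 02:45",
    "2024-07-10 04:45",
    "2024-07-11 06:45",
]

def intervals_in_hours(stamps):
    """Return the gaps between consecutive timestamps, in hours."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]

print(intervals_in_hours(empty_syncs))  # [26.0, 26.0, 26.0]
```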

Update: also the relevant messages from Proxy2 capture:

  • 2024-07-07 12:44: server sent to Proxy2 a "host_proxy" map: Zabbix70-agent is monitored by Proxy1 (ok)
  • 2024-07-08 00:45: server sent to Proxy2 a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this is not ok)
  • 2024-07-08 03:00: server sent to Proxy2 a "host_proxy" map: Zabbix70-agent is monitored by Proxy2 (ok)
  • 2024-07-09 01:45: server sent to Proxy2 a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this is not ok)
  • 2024-07-10 03:45: server sent to Proxy2 a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this is not ok)
  • 2024-07-11 05:45: server sent to Proxy2 a "host_proxy" map that is empty (this was not triggered by any of my manual operations) (this is not ok)

Comment by Markku Leiniö [ 2024 Jul 12 ]

One more reproduction of this issue; I'll then leave it for the Zabbix support team to comment.

  • Proxy1 was restarted at 2024-07-11 18:28 (to get it working again with the agent, from the error situation shown above)
  • Hostmap revision was still at 63
  • Agent still configured with "ServerActive=192.168.7.81;192.168.7.82" (Proxy1;Proxy2), and the agent has been assigned to Proxy2 in the proxy group
  • I restarted the agent every now and then to ensure that Proxy1 was still up-to-date about the agent assignment (= redirection from Proxy1 to Proxy2 was working every time), last working restart was at 2024-07-12 19:42
  • At 2024-07-12 19:45 (= just over 25 hours after restarting the proxy) Proxy1 got "full sync" config from the server, again with an empty data list in "host_proxy"
  • I routinely restarted the agent again at 19:47, and now the agent just logged "no active checks on server [192.168.7.81:10051]: host [Zabbix70-agent] not found" error when it communicated with Proxy1, there is no redirection anymore (corresponding error naturally in Proxy1 log as well)
  • Hostmap revision is still at 63
  • AFAIK, situation persists until the proxy is restarted (= it gets the real full config from the server and understands the agent assignment again)

So:

  • No changes in host-proxy assignments, but about 25 hours after the proxy restart the server decided to send an empty "host_proxy" list to the proxy (without "hostmap_revision" change), and the proxy immediately forgot the agent existence, thus being unable to redirect the agent to the correct proxy anymore.
  • Naturally the agent still keeps working as long as it is not restarted: Restart causes it to use ServerActive list again, and since Proxy2 is the assigned proxy but Proxy1 is the first one listed in the config (but Proxy1 does not know about the agent assignment anymore), the agent loses monitoring.

Comment by Markku Leiniö [ 2024 Jul 12 ]

Side note for anyone reproducing and investigating this: Wireshark 4.3.0rc1-447 and newer builds include the new zabbix.hostmap_revision field to assist in packet analysis.

Comment by Markku Leiniö [ 2024 Jul 14 ]

Potential workaround (instead of restarting the proxy): Disabling and then enabling the agent host causes Zabbix server to send an updated host_proxy mapping to the proxy, thus enabling the redirection to work again, to restore the agent connectivity.

I haven't tested if the server sends the full host_proxy mapping or just a subset of it, so I don't know if using a dummy host (just for triggering the mapping update) would work for all failing agents.
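If the disable/enable workaround is scripted, the two calls could be prepared roughly as follows. This is a minimal sketch that only builds the JSON-RPC request bodies for the Zabbix API method host.update (status 1 = unmonitored, 0 = monitored); sending them, authentication, and the API URL are left out, the host ID "10084" is a made-up example, and whether a pause is needed between the two calls has not been verified here.

```python
import json

def host_status_request(hostid, enabled, request_id):
    """Build a JSON-RPC body for host.update that enables or disables a host."""
    return {
        "jsonrpc": "2.0",
        "method": "host.update",
        # status: 0 = monitored (enabled), 1 = unmonitored (disabled)
        "params": {"hostid": str(hostid), "status": 0 if enabled else 1},
        "id": request_id,
    }

def toggle_requests(hostid):
    """Return the disable request followed by the enable request."""
    return [
        host_status_request(hostid, False, 1),
        host_status_request(hostid, True, 2),
    ]

# "10084" is a hypothetical host ID used only for illustration.
for body in toggle_requests("10084"):
    print(json.dumps(body))
```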

Comment by Markku Leiniö [ 2024 Jul 17 ]

Maybe edgar.akhmetshin or someone else should edit this issue to reflect the real problem: this is not an agent (or Windows) problem, it is about server+proxy interoperability when proxy load balancing is used.

Comment by dimir [ 2024 Jul 22 ]

markkul Is the new title good or would you like to add something?

Comment by Markku Leiniö [ 2024 Jul 22 ]

Thanks dimir, title looks good (according to my own observations), I was also thinking about the affected components+label (= server+proxy instead of agent).

Comment by dimir [ 2024 Jul 22 ]

Thanks for mentioning, done!

Comment by Markku Leiniö [ 2024 Jul 22 ]

FWIW, in Zabbix 7.0.1 something changed, this from the documentation (https://www.zabbix.com/documentation/current/en/manual/distributed_monitoring/proxies/ha) is not correct anymore:

The group is considered "out of balance" if the number of hosts assigned to the proxy is above/below the group average by more than 10 and a factor of 2. In this case the group is marked by the server for host reassignment after the grace period (10 x failover delay), if the balance is not restored.

With 7.0.1, I had two agents in the same proxy group (failover period = 20s), both assigned to Proxy2 because Proxy1 had been down (in order to get the testing agent onto Proxy2). When Proxy1 is started and has returned to the proxy group (without any agents assigned to it yet), the server rebalances the testing agent to Proxy1 in about 4 minutes (leaving the other agent on Proxy2). This did not happen in 7.0.0.

Update: actually the code (https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/src/zabbix_server/pgmanager/pg_cache.c#474) says:

 * Comments: Proxy group is not balanced if:
 *           - it has unassigned hosts
 *           - an online proxy has no assigned hosts
 *           - the number of hosts assigned to a proxy differs from the
 *             average (withing the group) by at least 10 hosts and factor of 2.

That is not mentioned in the documentation. But, that source file has not been changed after May. As it turns out, there is also a comment:

            /* if a proxy has no hosts and another proxy has */
            /* multiple hosts then group is not balanced     */

so this explains the above: in earlier tests I only had that one test host in the proxy group, so the group was balanced all the time. With a second test host added, rebalancing occurs (maybe in about 10*20 seconds).
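The rules from the two source comments can be written out as a small check. This is only a paraphrase of those comments, not the logic in pg_cache.c: `group_is_unbalanced` is a hypothetical function, "multiple hosts" is assumed to mean more than one, and "factor of 2" is assumed to mean at least double or at most half of the group average.

```python
def group_is_unbalanced(hosts_per_online_proxy, unassigned_hosts=0):
    """Apply the three quoted conditions to a {proxy name: host count} mapping."""
    counts = list(hosts_per_online_proxy.values())
    # Condition 1: the group has unassigned hosts.
    if unassigned_hosts > 0:
        return True
    if not counts:
        return False
    # Condition 2: an online proxy has no hosts while another has multiple hosts.
    if min(counts) == 0 and max(counts) > 1:
        return True
    # Condition 3: a proxy differs from the average by at least 10 hosts
    # and by a factor of 2 (assumed: at least double or at most half).
    average = sum(counts) / len(counts)
    for count in counts:
        differs_by_ten = abs(count - average) >= 10
        differs_by_factor = count >= 2 * average or count * 2 <= average
        if differs_by_ten and differs_by_factor:
            return True
    return False

print(group_is_unbalanced({"Proxy1": 0, "Proxy2": 1}))  # one test host only
print(group_is_unbalanced({"Proxy1": 0, "Proxy2": 2}))  # second test host added
```

Under these assumptions the single-host group counts as balanced and the two-host group does not, which matches the behaviour described above.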

Comment by Andris Zeila [ 2024 Jul 24 ]

Shortly after this time we see a strange version of "In refreshActiveChecks()" but with PROXY_A duplicated: 

"2024/06/17 07:45:26.178372 [101] In refreshActiveChecks() from PROXY_A_IP:10051,PROXY_B_IP:10051,PROXY_A_IP:10051"

This simply means that PROXY_B redirected the agent to PROXY_A according to the host-proxy mapping. When a redirection occurs, the redirected IP is inserted at the start of the IP list. When the redirect fails, the redirected address is removed from the list. When a new redirect address is received, the redirected address is replaced.
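The list handling described here can be modelled in a few lines. This is a minimal sketch of the three rules as stated in this comment, not the agent's actual code; the `AddressList` class and its method names are hypothetical.

```python
class AddressList:
    """Configured ServerActive addresses plus at most one redirect address."""

    def __init__(self, configured):
        self.configured = list(configured)
        self.redirect = None

    def on_redirect(self, address):
        # A newly received redirect replaces any earlier redirect address.
        self.redirect = address

    def on_redirect_failed(self):
        # A failed redirect is removed; only the configured list remains.
        self.redirect = None

    def current(self):
        # The redirect address, if any, is placed at the start of the list.
        if self.redirect is None:
            return list(self.configured)
        return [self.redirect] + self.configured

addresses = AddressList(["PROXY_B_IP:10051", "PROXY_A_IP:10051"])
addresses.on_redirect("PROXY_A_IP:10051")
print(",".join(addresses.current()))
```

With PROXY_B listed first and a redirect to PROXY_A, the model yields the same three-entry list that appears in the quoted log line.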

And question to mdeeds regarding

2024/06/17 07:40:42.192649 is when we first see "no active checks on server [PROXY_B_IP:10051]"; this is around the time we stop seeing data. We see this message for about 5 minutes in the logs.

During this time, we see "In refreshActiveChecks() from PROXY_B_IP:10051,PROXY_A_IP:10051".

Then at 2024/06/17 07:45:14.187490 we see the last "no active checks on server [PROXY_B_IP:10051]" message. 

Did proxy B sync its configuration with the server during this 4.5-minute interval? These log messages could appear if the host was reassigned from proxy A to proxy B and the configuration was synced with proxy A first: the agent would then be redirected to proxy B and get no active checks, since they had not yet been synced there.

Comment by Markku Leiniö [ 2024 Jul 24 ]

FWIW: Confirmed this same problem in 7.0.1 (server+proxies+agent) as well: after about 25.5 hours since the server restart, the server sent "full_sync" config to proxies, and the non-assigned proxy again forgot the host-proxy assignments, thus restarting the agent caused loss of monitoring for the agent.

Comment by Matt Deeds [ 2024 Jul 24 ]

wiper we are seeing "received configuration data from server at "SERVER_IP", datalen 491" in the zabbix_proxy.log every 10 seconds.
Is that what you are referring to with proxy B syncing its configuration?
Or is there something else we should be looking for? 

Comment by Andris Zeila [ 2024 Jul 25 ]

That was it, thank you.

Comment by Andris Zeila [ 2024 Jul 25 ]

Thank you for the investigation/logs. That was really helpful to understand the cause. The 26h (25h-25h to be precise) full sync was a bug. While by itself it was harmless, coupled with hostmap sync during forced full sync it caused host-proxy maps on proxies to be reset.

Comment by Andris Zeila [ 2024 Jul 25 ]

Implemented in development branch feature/ZBX-24658-7.0 (pull request)

Comment by Andris Zeila [ 2024 Jul 26 ]

Released ZBX-24658 in:

  • pre-7.0.2rc2 21b8e8983c8
  • pre-7.2.0alpha1 162cb2351a3

Comment by Markku Leiniö [ 2024 Jul 30 ]

LOL when seeing the fix for this bug in the source. But anyway, the proxy now hasn't received an additional full_sync config from the server, at least in the first 28 hours of testing with 7.0.2, and the agent redirection is still working --> thanks!
