[ZBX-24658] Incorrect server+proxy interoperability when proxy load balancing is used Created: 2024 Jun 14 Updated: 2024 Aug 30 Resolved: 2024 Jul 30 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 7.0.0 |
Fix Version/s: | 7.0.2rc2, 7.2.0alpha1 |
Type: | Problem report | Priority: | Trivial |
Reporter: | Matt Deeds | Assignee: | Andris Zeila |
Resolution: | Fixed | Votes: | 4 |
Labels: | load-balancing, proxy | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Zabbix 7.0.0, Zabbix Proxy 7.0.0, Zabbix Agent 2 7.0.0, Hosts are Windows server 2016/2019/2022 |
Attachments: |
||||
Issue Links: |
|
||||
Team: | |||||
Sprint: | S24-W30/31 | ||||
Story Points: | 0.25 |
Description |
Steps to reproduce:
Result: |
Comments |
Comment by Markku Leiniö [ 2024 Jun 15 ] |
This is not the community forum, but a few questions as a community member anyway. What do you mean by "proxy groups" in plural? You should only point one host's ServerActive to member(s) of one proxy group. At the time of "no active checks on server <proxy>", what does the Zabbix server proxy list say about the load sharing: which proxy should be monitoring that host? |
Comment by Edgar Akhmetshin [ 2024 Jun 17 ] |
Hello Matt, in addition to Markku's questions, please provide the agent configuration used. Regards, |
Comment by Matt Deeds [ 2024 Jun 17 ] |
"proxy groups" is a typo. We just have the one proxy group consisting of two proxies. I don't have an answer to Markku's second question yet. Next time this issue happens, I can provide that answer. I've attached the agent configuration file 24658_zabbix_agent2_config.txt |
Comment by Matt Deeds [ 2024 Jun 18 ] |
We've been able to capture the agent logs during the time of the issue with debug level 4. zabbix_agent2-1.log
2024/06/17 07:40:42.192649 is when we first see "no active checks on server [PROXY_B_IP:10051]"; this is around the time we stop seeing data. We see this message for about 5 minutes in the logs. During this time, we see "In refreshActiveChecks() from PROXY_B_IP:10051,PROXY_A_IP:10051". Then at 2024/06/17 07:45:14.187490 we see the last "no active checks on server [PROXY_B_IP:10051]" message. Shortly after this time we see a strange version of "In refreshActiveChecks()" but with PROXY_A duplicated: "2024/06/17 07:45:26.178372 [101] In refreshActiveChecks() from PROXY_A_IP:10051,PROXY_B_IP:10051,PROXY_A_IP:10051" To me this seems strange. This duplicated PROXY_A message continues until we restart the agent at 2024/06/17 08:52:57.210613. After the restart the duplicated PROXY_A goes away. Thank you, |
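To find the point in a long debug log where an address starts appearing twice, a small script along these lines may help. It is a sketch that assumes only the "In refreshActiveChecks() from ..." line format quoted in this comment; the function name `duplicated_addresses` is made up for illustration and is not part of any Zabbix tool.

```python
import re
from collections import Counter

# Captures the comma-separated address list that follows
# "In refreshActiveChecks() from " in an agent 2 debug line.
REFRESH_RE = re.compile(r"In refreshActiveChecks\(\) from (?P<addrs>\S+)")


def duplicated_addresses(line):
    """Return the addresses listed more than once in a refreshActiveChecks line."""
    match = REFRESH_RE.search(line)
    if match is None:
        return []  # not a refreshActiveChecks line
    counts = Counter(match.group("addrs").split(","))
    return sorted(addr for addr, n in counts.items() if n > 1)


if __name__ == "__main__":
    # The one complete log line quoted in the comment above.
    sample = (
        "2024/06/17 07:45:26.178372 [101] In refreshActiveChecks() from "
        "PROXY_A_IP:10051,PROXY_B_IP:10051,PROXY_A_IP:10051"
    )
    dups = duplicated_addresses(sample)
    if dups:
        print("duplicated:", ", ".join(dups))
```

Running the function over every line of the log and printing the first line with a non-empty result gives the time at which the duplication started.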
Comment by Matt Deeds [ 2024 Jun 19 ] |
Searching through our firewall logs we were able to see that the host switches from PROXY_A to PROXY_B around 7:40:38.000, right when we start seeing "no active checks on server [PROXY_B_IP:10051]". The host switches back over to PROXY_A at 7:45:20.000, around the time we stop seeing the "no active checks on server" and start seeing the duplicated PROXY_A message. |
Comment by Markku Leiniö [ 2024 Jun 19 ] |
Just to be sure: Is the presented agent configuration file the "final" configuration? Asking because you have: Include=C:\Program Files\Server\Managed\Resources\zabbix_team.conf |
Comment by Matt Deeds [ 2024 Jun 24 ] |
Hello Markku, This is the final configuration. We can also confirm that this same behavior is happening on Linux agents as well. It continues happening until we restart the proxy service. |
Comment by Markku Leiniö [ 2024 Jun 24 ] |
How are the proxies configured? In the Zabbix community forum there was a similar problem where the root cause was incorrect proxy configuration: both proxies had the same Hostname directive configured. |
Comment by Matt Deeds [ 2024 Jun 24 ] |
That doesn't look like the issue we are having. Both of our proxies have unique hostnames defined. I've attached a proxy config file zabbix_proxy_sanitized.conf |
Comment by Matt Deeds [ 2024 Jun 26 ] |
After some more digging it seems like this Zabbix blog post by Markku (https://blog.zabbix.com/zabbix-7-0-proxy-load-balancing/28173/) describes the issue we are having. Specifically the section "Proxy is online but unreachable from the active agent": "This is a non-recoverable situation (at least with the current Zabbix 7.0.0) while the reachability issue persists: The agent keeps on contacting Proxy 1, keeps receiving the redirection, and the same repeats over and over again." This seems like a bug where our only solution is to regularly restart the proxy service. Thanks |
Comment by Markku Leiniö [ 2024 Jun 26 ] |
In the blog post testing I never got your "no active checks on server [PROXY_ID:PORT]: host [HOSTNAME] not found" error in the agent (non-2) log. Basically the proxy is responding with a redirection to the correct proxy (based on the server-induced knowledge of the proxy group state), not with an error message. Here is the agent log from the case: 23378:20240611:222340.599 Starting Zabbix Agent [Zabbix70-agent]. Zabbix 7.0.0 (revision 49955f1fb5c). 23390:20240611:222858.799 Unable to connect to [192.168.7.82]:10051 [cannot connect to [[192.168.7.82]:10051]: connection timed out] |
Comment by Matt Deeds [ 2024 Jun 28 ] |
We were able to get some tcpdump captures showing that when a host is automatically redistributed to another proxy in the group, the agent never appears to be told about it. In this situation the host was originally pointed at PROXY_A and was then redistributed to PROXY_B. PROXY_B was now shown as the assigned proxy in the host configuration in the Zabbix GUI. Our tcpdump captures show the Zabbix agent on the host continuing to try to connect to PROXY_A ( HOST_Originally Assigned Proxy.csv So passive checks appear to work as expected with proxy groups; it is only active checks that are failing, which is consistent with what we were seeing in the agent logs. The tcpdump captures from both the proxies and the host show that the agent is still sending traffic to the originally assigned proxy (PROXY_A) and not the new one that it was shifted over to (PROXY_B). |
Comment by Markku Leiniö [ 2024 Jun 28 ] |
In your agent configuration you showed that RefreshActiveChecks was at default, meaning 5 seconds in Zabbix 7.0.0 agent 2. In your captures there is a 6-second interval between the agent requests. Why is that? |
Comment by Ali Berry [ 2024 Jun 28 ] |
I believe that may have been a fluke, because we've checked a few other hosts that were having issues and, looking at tcpdump, they are sending requests every 5 seconds. The content of the packets is still the same between them. |
Comment by Markku Leiniö [ 2024 Jul 06 ] |
I experienced this same case as well now in my test environment. Unfortunately I don't currently have any more information/captures than these:
Unfortunately I don't have server-Proxy1 communication captured before this, but when Proxy1 was restarted, it received this in the full config from the server: "host_proxy": { "data": [ [ 2, "Zabbix70-agent", 2, 54, 3, "", "", "agent-ident", "5aa5afeb13a78079d288e37b58ace825b9c8cf89cfbc403c3a34d693f581d478" ] ], "fields": [ "hostproxyid", "host", "proxyid", "revision", "tls_accept", "tls_issuer", "tls_subject", "tls_psk_identity", "tls_psk" ] }, meaning that at this point Proxy1 got the correct "Zabbix70-agent is monitored by Proxy2" assignment in its configuration message, and Proxy1 was able to respond with a correct redirection. It looks like Proxy1 somehow forgot that Zabbix70-agent was supposed to be monitored by Proxy2, and since the agent was not monitored by Proxy1 itself either, it had no other option than to respond "agent not found". After the restart it got the full config again from the server.
|
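For anyone reading such a capture, the "host_proxy" section pairs a "fields" header with positional rows in "data". A minimal Python sketch for turning the rows into named records follows; the JSON is the fragment quoted in the comment above, wrapped in braces so it parses on its own, and the helper name `host_proxy_rows` is hypothetical.

```python
import json

# The host_proxy fragment quoted in the comment above, wrapped in an outer
# object so that it is valid JSON by itself.
HOST_PROXY_JSON = """
{
  "host_proxy": {
    "data": [
      [2, "Zabbix70-agent", 2, 54, 3, "", "", "agent-ident",
       "5aa5afeb13a78079d288e37b58ace825b9c8cf89cfbc403c3a34d693f581d478"]
    ],
    "fields": ["hostproxyid", "host", "proxyid", "revision", "tls_accept",
               "tls_issuer", "tls_subject", "tls_psk_identity", "tls_psk"]
  }
}
"""


def host_proxy_rows(config):
    """Pair each positional row in host_proxy.data with the names in host_proxy.fields."""
    section = config["host_proxy"]
    return [dict(zip(section["fields"], row)) for row in section["data"]]


rows = host_proxy_rows(json.loads(HOST_PROXY_JSON))
for row in rows:
    print(f'{row["host"]} -> proxyid {row["proxyid"]} (revision {row["revision"]})')
```

For the quoted message this lists Zabbix70-agent with proxyid 2, which the comment reads as the assignment to Proxy2.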
Comment by Markku Leiniö [ 2024 Jul 07 ] |
Opened |
Comment by Markku Leiniö [ 2024 Jul 11 ] |
Experienced the issue again, this time I had packet capture running all the time. Timeline:
Commentary:
Update: also the relevant messages from Proxy2 capture:
|
Comment by Markku Leiniö [ 2024 Jul 12 ] |
One more reproduction of this issue; after this I'll leave it to the Zabbix support team to comment.
So:
|
Comment by Markku Leiniö [ 2024 Jul 12 ] |
Side note for anyone reproducing and investigating this: Wireshark 4.3.0rc1-447 and newer builds include the new zabbix.hostmap_revision field to assist in packet analysis. |
Comment by Markku Leiniö [ 2024 Jul 14 ] |
Potential workaround (instead of restarting the proxy): Disabling and then enabling the agent host causes Zabbix server to send an updated host_proxy mapping to the proxy, thus enabling the redirection to work again, to restore the agent connectivity. I haven't tested if the server sends the full host_proxy mapping or just a subset of it, so I don't know if using a dummy host (just for triggering the mapping update) would work for all failing agents. |
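If the disable/enable workaround has to be applied to several hosts, it could be scripted through the Zabbix API method host.update, where status 1 means unmonitored and status 0 means monitored. The following is only a sketch: the API URL, the token and the host ID are placeholders, the helper names are invented, and it has not been verified that toggling through the API triggers the same host_proxy update as doing it in the frontend. It is also an open question how long the host must stay disabled for the server to notice the change.

```python
import json
import urllib.request


def toggle_requests(hostid):
    """Build two host.update JSON-RPC bodies: disable (status 1), then enable (status 0)."""
    return [
        {
            "jsonrpc": "2.0",
            "method": "host.update",
            "params": {"hostid": str(hostid), "status": status},
            "id": request_id,
        }
        for request_id, status in enumerate((1, 0), start=1)
    ]


def send(api_url, api_token, body):
    """POST one JSON-RPC body to the Zabbix API and return the decoded reply."""
    request = urllib.request.Request(
        api_url,  # placeholder, e.g. the frontend's api_jsonrpc.php URL
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json-rpc",
            "Authorization": f"Bearer {api_token}",  # API token, placeholder
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Dry run: print the request bodies instead of sending them.
    # "10084" is a made-up host ID.
    for body in toggle_requests("10084"):
        print(json.dumps(body))
```

A real run would call `send()` for the first body, wait until the server has picked up the change, and then send the second.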
Comment by Markku Leiniö [ 2024 Jul 17 ] |
Maybe edgar.akhmetshin or someone else should edit this issue to reflect the real problem: this is not an agent (or Windows) problem, it is about server+proxy interoperability when proxy load balancing is used. |
Comment by dimir [ 2024 Jul 22 ] |
markkul Is the new title good, or would you like to add something? |
Comment by Markku Leiniö [ 2024 Jul 22 ] |
Thanks dimir, title looks good (according to my own observations), I was also thinking about the affected components+label (= server+proxy instead of agent). |
Comment by dimir [ 2024 Jul 22 ] |
Thanks for mentioning, done! |
Comment by Markku Leiniö [ 2024 Jul 22 ] |
FWIW, in Zabbix 7.0.1 something changed, this from the documentation (https://www.zabbix.com/documentation/current/en/manual/distributed_monitoring/proxies/ha) is not correct anymore:
With 7.0.1, I had two agents in the same proxy group (failover period = 20s), both assigned to Proxy2 because Proxy1 had been down (deliberately, to get the testing agent onto Proxy2). When Proxy1 is started and has returned to the proxy group (without any agents assigned to it yet), in about 4 minutes the server will rebalance the testing agent to Proxy1 (leaving the other agent on Proxy2). This did not happen in 7.0.0. Update: actually the code (https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/src/zabbix_server/pgmanager/pg_cache.c#474) says:
That is not mentioned in the documentation. But, that source file has not been changed after May. As it turns out, there is also a comment:
so this explains the above: in earlier tests I only had that one test host in the proxy group, so the group was balanced all the time. With a second test host added, rebalancing occurs (maybe in about 10*20 seconds). |
Comment by Andris Zeila [ 2024 Jul 24 ] |
This simply means that PROXY_B redirected agent to PROXY_A according to the host-proxy mapping. When redirection occurs the redirected IP is inserted at the start of the IP list. When the redirect fails the redirected address is removed from the list. When a new redirect address is received the redirected address is replaced. And question to mdeeds regarding
Did the proxy B sync its configuration with server during this 4.5m interval? The log messages could happen if host was reassigned from proxy A to B, configuration synced with proxy A, agent would be redirected to proxy B and getting no active checks, since they haven't yet been synced. |
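The redirect handling described in this comment (a redirect address goes to the front of the list, is removed when it fails, and is replaced when a newer redirect arrives) can be pictured with a toy model. This is a Python sketch for illustration only, not the agent 2 implementation; the class and method names are hypothetical.

```python
class ActiveServerList:
    """Toy model of an agent's address list for one ServerActive entry."""

    def __init__(self, configured):
        self.configured = list(configured)  # addresses from the agent config
        self.redirect = None                # at most one redirect address

    def addresses(self):
        """Addresses in the order they would be tried, redirect first."""
        head = [self.redirect] if self.redirect is not None else []
        return head + self.configured

    def on_redirect(self, address):
        """A proxy answered with a redirect: insert it, or replace the old one."""
        self.redirect = address

    def on_redirect_failed(self):
        """Connecting to the redirect address failed: drop it from the list."""
        self.redirect = None


if __name__ == "__main__":
    servers = ActiveServerList(["PROXY_B_IP:10051", "PROXY_A_IP:10051"])
    servers.on_redirect("PROXY_A_IP:10051")
    print(",".join(servers.addresses()))
```

Under this model, a redirect to PROXY_A on top of the configured list PROXY_B,PROXY_A gives PROXY_A,PROXY_B,PROXY_A, which is the same shape as the "duplicated" list reported in the agent log earlier in the thread.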
Comment by Markku Leiniö [ 2024 Jul 24 ] |
FWIW: Confirmed this same problem in 7.0.1 (server+proxies+agent) as well: after about 25.5 hours since the server restart, the server sent "full_sync" config to proxies, and the non-assigned proxy again forgot the host-proxy assignments, thus restarting the agent caused loss of monitoring for the agent. |
Comment by Matt Deeds [ 2024 Jul 24 ] |
wiper we are seeing "received configuration data from server at "SERVER_IP", datalen 491" in the zabbix_proxy.log every 10 seconds. |
Comment by Andris Zeila [ 2024 Jul 25 ] |
That was it, thank you. |
Comment by Andris Zeila [ 2024 Jul 25 ] |
Thank you for the investigation/logs. That was really helpful to understand the cause. The 26h (25h-25h to be precise) full sync was a bug. While by itself it was harmless, coupled with hostmap sync during forced full sync it caused host-proxy maps on proxies to be reset. |
Comment by Andris Zeila [ 2024 Jul 25 ] |
Implemented in development branch feature/ZBX-24658-7.0 (pull request) |
Comment by Andris Zeila [ 2024 Jul 26 ] |
Released
|
Comment by Markku Leiniö [ 2024 Jul 30 ] |
LOL when seeing the fix for this bug in the source |