[ZBXNEXT-1056] log unsuccessful active agent connection attempts to server/proxy Created: 2011 Dec 16  Updated: 2013 Sep 06  Resolved: 2013 Sep 06

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Agent (G)
Affects Version/s: None
Fix Version/s: 2.0.9, 2.1.3

Type: Change Request Priority: Major
Reporter: richlv Assignee: Unassigned
Resolution: Fixed Votes: 4
Labels: logging, trivial
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

zabbix agent currently logs if it connected to the server/proxy, but there were no active checks. if the connection failed (timeout, rejected etc), no logging happens at all. this makes problem solving much, much harder.

such errors should be logged at debuglevel 3 (probably also on lower debuglevels, but nobody uses those anyway...)



 Comments   
Comment by Oleksii Zagorskyi [ 2011 Dec 17 ]

But such errors are logged with the DebugLevel=4. No?

Comment by richlv [ 2011 Dec 19 ]

sure, but that's not good enough.
if there's a firewall or config problem (which is common), it is very hard for user to trace down the source of it. given that agent already complains whenever it gets no active items from the server, it should complain when it can not connect to the server at all.

<zalex> I absolutely agree! Voted

Comment by richlv [ 2013 Apr 02 ]

when config syncing fails, it should be logged every time (message like "cannot connect to <ip/dns> to obtain configuration" ?);
when data sending fails, it should log the first failure (message like "data connections to <ip/dns> started to fail" ?), then nothing on subsequent failures. when connection succeeds after several failures, it should log that (message like "data connections to <ip/dns> succeeding again" ?)

(messages could be improved, of course )

Comment by Oleksii Zagorskyi [ 2013 Apr 02 ]

(1) Also note undocumented feature - when agent just started it retries to get list of active checks first time every 60 seconds.
After getting successful response from server it refreshes it according to RefreshActiveChecks parameter.

<richlv> ouch. that must be documented... anything similar for active proxies ?

Comment by Oleksii Zagorskyi [ 2013 Apr 02 ]

Also just small interesting info:
Zabbix server populates a "autoreg_host" table by active agent records even if you don't have auto-registration actions configured.

So there at any time you can find all names which your active agents had/have.

Comment by Andris Zeila [ 2013 Apr 04 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-1056

Comment by richlv [ 2013 Apr 04 ]

(1) seems to work nicely :

 31304:20130404:162600.895 cannot connect to [127.0.0.1:10051] for active check configuration (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 31304:20130404:162931.157 active item data uploading to [127.0.0.1:10051] started to fail: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 31304:20130404:163224.421 active item data uploading to [127.0.0.1:10051] is working again

i'm just slightly unsure about the style/formatting being different for these two messages :

 31304:20130404:162931.157 active item data uploading to [127.0.0.1:10051] started to fail: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 31304:20130404:163200.358 cannot connect to [127.0.0.1:10051] for active check configuration (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)

zalex_ua also, a part "for active check configuration" could be replaced to "to get active checks list".
It's more clear and as we usually describe this process.

Also, could we make difference (a bit) messages for first attempt (see my comment above) and all next attempts?
I'm asking because if agent once successfully got checks list then it will process them even if it cannot connect to zabbix trappers next times.
So a bit different messages could clarify what is the state for an agent during troubleshooting.

Maybe difference could be "to get list ...." and "to update list ..." ?

wiper will it be okay if the message 'to get' is displayed if active checks array is empty and 'to update' otherwise?

zalex_ua If the array also is empty if connection was success but a host doesn't have active items specified then your suggestion is not very good because it will be a bit incorrect.

I asked because it looks to me like there in agent is some internal flag which controls when agent got at least one (very first?) success response (even with empty checks list) from zabbix server.

wiper there are no flag, that's why I was reluctant to add one just for the warning message. If you are talking about the initial refresh period being 60 seconds - it's always set to 60 seconds after failure. The initial refresh period is 0, so the agent attempts to get active checks list, fails and sets the next check to +60 seconds.

richlv it probably makes sense to keep the message the same then. as wiper noted, "cannot connect to... to get" is not a very nice construct, so it might make sense to keep current form.
the reconnecting after 60 seconds in case of failure is interesting, though. i'd suggest extending the message like this :

cannot connect to [127.0.0.1:10051] for active check configuration, will retry after 60 seconds (...)

zalex_ua ok, agree with Rich. Just ... why "active check" ? I think plural form "active check*s*" would be more correct.

<richlv> there surely is some rule about it, but in general that's "just english"

wiper reverted back to single error message for active checks list retrieval (with corrected text). Fixed in r34829

zalex_ua current dev branch tested. How it looks, there are some my comments inserted with >>>:

 18475:20130408:111658.460 Starting Zabbix Agent [it0]. Zabbix 2.0.6rc1 (revision 34829).
 18476:20130408:111658.461 agent #0 started [collector]
 18477:20130408:111658.461 agent #1 started [listener]
 18478:20130408:111658.461 agent #2 started [listener]
 18479:20130408:111658.461 agent #3 started [listener]
 18480:20130408:111658.462 agent #4 started [active checks]
 18480:20130408:111858.481 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:111958.490 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:112058.498 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> zabbix server started
>>> a bit later one active item added and it started to return data to server
>>> later server stopped 
 18480:20130408:113519.441 active item data uploading to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:113558.454 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:113658.480 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:113758.513 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> server started again
 18480:20130408:113821.528 active item data uploading to [127.0.0.1:10051] is working again

Conclusion. We have 2 records for "data uploading" - fail/restoration. But we have just single message for "checks configuration" - just a fail.
Is this consistent ?
Maybe we should also log message when "checks configuration" is working again ?

As you can see in first part of log (when I didn't have active checks) I don't see when connectivity restored.

wiper not sure. Actually is the logging pattern 'fail/restoration' used anywhere else?

richlv yes, see connections failing/working to passive agents, snmp devices etc

zalex_ua additionally , maybe would be better to not log every failed attempt ?
For example log only once message like this "... cannot connect to [] for active checks configuration, will retry every 60 seconds ..." or something like that.
And of course add "restoration" message then.
Then it will be even more consistent, i.e. single failed/restoration messages for both types.

wiper okay, so any opinions on the following messages?

  7388:20130411:141456.932 active checks data upload to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
  7388:20130411:141546.955 active checks configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
  7388:20130411:142023.189 active checks data upload to [127.0.0.1:10051] is working again
  7388:20130411:142046.203 active checks configuration update from [127.0.0.1:10051] is working again

richlv singular "check" ! btw, why does one failure message has "[connect]" prepended ?

zalex_ua it looks good for me (+ rich's oppinion). then we can just add a note to doc about hardcoded retries in 60 seconds and it will be enough, IMO.

wiper The [connect] describes failure step, (connect, send, receive). I probably will be duplicated by the tcp error message (not 100% sure though), but originally it had 3 error messages, so I left it there.

wiper reverted to singular form 'active check', fixed in r34959

richlv we could bikeshed this longer, but from my side - CLOSED =)

zalex_ua yeah, thanks Andris for your patience . CLOSED

dimir Here comes dimir. The messages look a bit messy to me. I think at warning level user does not care much what was the task attempted to perform (sending collected data or trying to refresh active checks) when connection to server/proxy was lost. So I propose to simplify the whole this way (note that the message does not differ if the reason was failed data sending or active checks refresh):

(We start with running server and 0 collected values to send.)

>>> Zabbix server gets killed (attempt to refresh active checks)
12910:20130412:041140.037 active agent connection to [127.0.0.1:10051] is lost (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> active check value is added and server started (attempt to send collected data)
12910:20130412:041209.048 active agent connection to [127.0.0.1:10051] is restored
>>> Zabbix server gets killed and active check value is added (attempt to send collected data)
12910:20130412:041237.057 active agent connection to [127.0.0.1:10051] is lost (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> Zabbix server started right before active checks refresh
12910:20130412:041253.067 active agent connection to [127.0.0.1:10051] is restored
Comment by Andris Zeila [ 2013 Apr 04 ]

Good point, fixed in r34809. I think the second one with error message in () looks better.

zalex_ua see also my added notes in the comment above.

zalex_ua CLOSED

Comment by richlv [ 2013 Apr 05 ]

(2) to be documented in whatsnew;
the retry after 60 seconds in case of failure should be documented, too

wiper added do whatsnew - https://www.zabbix.com/documentation/2.0/manual/introduction/whatsnew206?&#miscellaneous_daemon_improvements

Updated RefreshActiveChecks option documentation:
https://www.zabbix.com/documentation/2.0/manual/appendix/config/zabbix_agentd
https://www.zabbix.com/documentation/2.0/manual/appendix/config/zabbix_agentd_win
https://www.zabbix.com/documentation/2.2/manual/appendix/config/zabbix_agentd
https://www.zabbix.com/documentation/2.2/manual/appendix/config/zabbix_agentd_win

Please review.

zalex_ua It looks good to me.
But similarly to, for example, what_s_new_1.8.9 I've decided to extend whatsnew206

I did not find more pages where details about hardcoded 60 seconds could be nicely added as well

Please review.

wiper looks good.

<richlv> i believe that means CLOSED

Comment by dimir [ 2013 Apr 12 ]

Successfully tested.

Comment by Andris Zeila [ 2013 Apr 12 ]

Released in:
pre-2.0.6 r34987
pre-2.1.0 34988

Comment by Andris Zeila [ 2013 Apr 12 ]

Reopening for documentation update

Comment by Alexander Vladishev [ 2013 Apr 12 ]

REOPENED

(3) Compilation warnings should be fixed:

active.c: In function ‘refresh_active_checks’:
active.c:405:5: warning: suggest explicit braces to avoid ambiguous ‘else’ [-Wparentheses]

The source code is incorrectly formatted (no parentheses):

if (last_ret != ret)
        if (SUCCEED != ret)
                zabbix_log(LOG_LEVEL_WARNING, "active check configuration update from [%s:%u] started to fail (%s)",
                        host, port, zbx_tcp_strerror());
        else
                zabbix_log(LOG_LEVEL_WARNING, "active check configuration update from [%s:%u] is working again",
                        host, port);

wiper fixed in r35012

wiper compilation warnings and coding style fixes are fixed in pre-2.0.6 r35024 and pre-2.1.0 35025

sasha Thanks! CLOSED

Comment by richlv [ 2013 Jun 13 ]

(4) current message ordering seems to be slightly strange. can these messages be printed in a reversed order ?

23887:20130613:154159.801 No active checks on server: host [labsrv] not found
23887:20130613:154159.801 active check configuration update from [192.168.5.102:10051] is working again

zalex_ua Is this original order? I don't see anything bad in the order.
How these lines (identical timestamps) could be generated in real life ?
Maybe it's an actual problem ?

<richlv> this is a snippet from an actual logfile. i assume we got back the response and thus consider the host to be available, but the response failed to find that host. the message that config update works again should come before the message which tells us what we got from the server

wiper It's possible to shuffle the code slightly and add additional check to parse the received buffer after printing connection error messages.

wiper RESOLVED in r37931

 23597:20130819:153309.217 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 23597:20130819:153509.232 active check configuration update from [127.0.0.1:10051] is working again
 23597:20130819:153509.232 No active checks on server: host [zbw agent] not found

sasha CLOSED with small changes in warning messages. Please review.

Comment by Andris Zeila [ 2013 Aug 20 ]

Released in:
pre-2.0.9 r38022
pre-2.1.3 r37984

Comment by richlv [ 2013 Aug 31 ]

(5) i slightly clarified the changelog entry - please, review

wiper reviewed, CLOSED

Generated at Thu Apr 25 23:28:08 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.