[#ZBXNEXT-1056] log unsuccessful active agent connection attempts to server/proxy

 31304:20130404:162600.895 cannot connect to [127.0.0.1:10051] for active check configuration (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 31304:20130404:162931.157 active item data uploading to [127.0.0.1:10051] started to fail: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 31304:20130404:163224.421 active item data uploading to [127.0.0.1:10051] is working again

i'm just slightly unsure about the style/formatting being different for these two messages :

 31304:20130404:162931.157 active item data uploading to [127.0.0.1:10051] started to fail: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 31304:20130404:163200.358 cannot connect to [127.0.0.1:10051] for active check configuration (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)

zalex_ua Also, the part "for active check configuration" could be replaced with "to get active checks list".
It's clearer, and it is how we usually describe this process.

Also, could we use slightly different messages for the first attempt (see my comment above) and for all subsequent attempts?
I'm asking because once the agent has successfully received the checks list, it keeps processing those checks even if later connections to the Zabbix trappers fail.
Slightly different messages would make the agent's state clearer during troubleshooting.

wiper Would it be okay to display the 'to get' message if the active checks array is empty, and 'to update' otherwise?

zalex_ua If the array is also empty when the connection succeeded but the host has no active items configured, then your suggestion is not ideal, because the message would be slightly incorrect.

I asked because it looks to me like the agent has some internal flag that tracks whether it has received at least one (the very first?) successful response (even with an empty checks list) from the Zabbix server.

wiper There is no such flag, which is why I was reluctant to add one just for the warning message. If you are talking about the initial refresh period being 60 seconds: it is always set to 60 seconds after a failure. The initial refresh period is 0, so the agent attempts to get the active checks list, fails, and schedules the next attempt 60 seconds later.
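The scheduling wiper describes can be sketched roughly as follows. This is only an illustration, not the agent's actual code: `attempt_times`, `RETRY_DELAY` and the `refresh_interval` default of 120 seconds are names and values assumed for the example.

```python
RETRY_DELAY = 60  # seconds; the hardcoded retry interval discussed above


def attempt_times(outcomes, refresh_interval=120):
    """Return the times (in seconds) at which each refresh attempt happens.

    outcomes is a sequence of booleans: True if the attempt succeeded.
    refresh_interval is a hypothetical normal refresh period.
    """
    times = []
    next_attempt = 0  # initial refresh period is 0: the first attempt is immediate
    for ok in outcomes:
        times.append(next_attempt)
        # After a failure the next attempt is always 60 seconds later;
        # after a success the normal refresh interval applies.
        next_attempt += refresh_interval if ok else RETRY_DELAY
    return times
```

With two failures followed by two successes, the attempts fall at 0, 60, 120 and 240 seconds.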

richlv it probably makes sense to keep the message the same then. as wiper noted, "cannot connect to... to get" is not a very nice construct, so it might make sense to keep current form.
the reconnecting after 60 seconds in case of failure is interesting, though. i'd suggest extending the message like this :

cannot connect to [127.0.0.1:10051] for active check configuration, will retry after 60 seconds (...)

zalex_ua ok, agree with Rich. Just ... why "active check" ? I think plural form "active check*s*" would be more correct.

<richlv> there surely is some rule about it, but in general that's "just english"

wiper Reverted to a single error message for active checks list retrieval (with corrected text). Fixed in r34829.

zalex_ua Tested the current dev branch. Here is how it looks; my comments are inserted with >>>:

 18475:20130408:111658.460 Starting Zabbix Agent [it0]. Zabbix 2.0.6rc1 (revision 34829).
 18476:20130408:111658.461 agent #0 started [collector]
 18477:20130408:111658.461 agent #1 started [listener]
 18478:20130408:111658.461 agent #2 started [listener]
 18479:20130408:111658.461 agent #3 started [listener]
 18480:20130408:111658.462 agent #4 started [active checks]
 18480:20130408:111858.481 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:111958.490 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:112058.498 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> zabbix server started
>>> a bit later one active item added and it started to return data to server
>>> later server stopped 
 18480:20130408:113519.441 active item data uploading to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:113558.454 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:113658.480 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 18480:20130408:113758.513 cannot connect to [127.0.0.1:10051] for active checks configuration, will retry after 60 seconds (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> server started again
 18480:20130408:113821.528 active item data uploading to [127.0.0.1:10051] is working again

Conclusion: we have 2 records for "data uploading" (failure and restoration), but only a single message for "checks configuration" (failure only).
Is this consistent?
Maybe we should also log a message when "checks configuration" is working again?

As you can see in the first part of the log (when I had no active checks), there is no indication of when connectivity was restored.

wiper Not sure. Is the 'fail/restoration' logging pattern actually used anywhere else?

zalex_ua Additionally, maybe it would be better not to log every failed attempt?
For example, log a message like this only once: "... cannot connect to [] for active checks configuration, will retry every 60 seconds ...", or something like that.
And of course add a "restoration" message then.
That would be even more consistent, i.e. a single failure/restoration message pair for both types.

  7388:20130411:141456.932 active checks data upload to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
  7388:20130411:141546.955 active checks configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
  7388:20130411:142023.189 active checks data upload to [127.0.0.1:10051] is working again
  7388:20130411:142046.203 active checks configuration update from [127.0.0.1:10051] is working again
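
The fail/restoration pattern in the four lines above (log once when a task starts failing, once when it recovers, nothing in between) can be sketched as below. This is a minimal illustration, assuming one independent state per task; `TransitionLogger` and its parameters are hypothetical names, not taken from the agent source.

```python
class TransitionLogger:
    """Log only state changes for one task: first failure and first recovery."""

    def __init__(self, task, log=print):
        self.task = task      # e.g. "active checks data upload to"
        self.failing = False  # current state of this task
        self.log = log        # any callable that accepts one string

    def report(self, addr, error=None):
        """Record one attempt; error is None when the attempt succeeded."""
        if error is not None and not self.failing:
            self.failing = True
            self.log(f"{self.task} [{addr}] started to fail ({error})")
        elif error is None and self.failing:
            self.failing = False
            self.log(f"{self.task} [{addr}] is working again")
        # Repeated failures and repeated successes are deliberately silent.
```

One instance would be kept for data upload and another for configuration update, so each task logs its own pair of messages.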

richlv singular "check"! btw, why does one failure message have "[connect]" prepended?

zalex_ua It looks good to me (+ Rich's opinion). Then we can just add a note to the docs about the hardcoded 60-second retry interval, and that will be enough, IMO.

wiper The [connect] prefix describes the failure step (connect, send, receive). It will probably be duplicated by the TCP error message (not 100% sure though), but originally there were 3 error messages, so I left it there.
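
As an illustration of the step prefix, the sketch below tags an error with the step in which it occurred. It is a simplified assumption of how such tagging could work, not the agent's implementation; `exchange` and its three callables are hypothetical.

```python
def exchange(connect, send, receive):
    """Run the three steps in order.

    Each argument is a zero-argument callable. Returns None on success,
    or the error text prefixed with the step that failed, e.g. "[connect] ...".
    """
    steps = (("connect", connect), ("send", send), ("receive", receive))
    for step, action in steps:
        try:
            action()
        except OSError as exc:
            # Stop at the first failing step and tag the message with it.
            return f"[{step}] {exc}"
    return None
```

The prefix carries information only for failures whose error text does not already name the step, which matches wiper's remark that "[connect]" is probably redundant next to "cannot connect to ...".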

dimir Here comes dimir. The messages look a bit messy to me. I think that at warning level the user does not care much which task was being attempted (sending collected data or refreshing active checks) when the connection to the server/proxy was lost. So I propose simplifying the whole thing this way (note that the message is the same whether the failure happened while sending data or while refreshing active checks):

>>> Zabbix server gets killed (attempt to refresh active checks)
12910:20130412:041140.037 active agent connection to [127.0.0.1:10051] is lost (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> active check value is added and server started (attempt to send collected data)
12910:20130412:041209.048 active agent connection to [127.0.0.1:10051] is restored
>>> Zabbix server gets killed and active check value is added (attempt to send collected data)
12910:20130412:041237.057 active agent connection to [127.0.0.1:10051] is lost (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
>>> Zabbix server started right before active checks refresh
12910:20130412:041253.067 active agent connection to [127.0.0.1:10051] is restored
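
dimir's proposal amounts to a single connection state shared by both tasks. A minimal sketch, assuming both the refresh and the data-sending code report their results to the same object (`AgentLink` and `record` are hypothetical names):

```python
class AgentLink:
    """One connection state shared by every task that talks to the server."""

    def __init__(self, addr, log=print):
        self.addr = addr
        self.lost = False  # True while the connection is considered lost
        self.log = log     # any callable that accepts one string

    def record(self, error=None):
        """Call after any connection attempt, whichever task made it."""
        if error is not None and not self.lost:
            self.lost = True
            self.log(f"active agent connection to [{self.addr}] is lost ({error})")
        elif error is None and self.lost:
            self.lost = False
            self.log(f"active agent connection to [{self.addr}] is restored")
```

Because the state is shared, a failure noticed during an active checks refresh can be reported as restored by a later successful data send, as in the first two log lines above.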

[ZBXNEXT-1056] log unsuccessful active agent connection attempts to server/proxy Created: 2011 Dec 16 Updated: 2013 Sep 06 Resolved: 2013 Sep 06
Status:	Closed
Project:	ZABBIX FEATURE REQUESTS
Component/s:	Agent (G)
Affects Version/s:	None
Fix Version/s:	2.0.9, 2.1.3
