[ZBXNEXT-101] Debugging failed web monitoring steps Created: 2009 Oct 09 Updated: 2016 May 29 Resolved: 2014 Aug 26 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Frontend (F), Proxy (P), Server (S) |
Affects Version/s: | None |
Fix Version/s: | 2.3.4 |
Type: | Change Request | Priority: | Major |
Reporter: | Christoph Haas | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 14 |
Labels: | debugging, webmonitoring | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Debian. Zabbix 1.6.6 |
Attachments: |
![]() ![]() |
||||||||
Issue Links: |
|
Description |
Currently when a Web Monitoring step fails it doesn't tell why. It would be great if there were some way of showing what happened instead. Like showing the first kilobyte of the HTTP body. I increased the debug level on the server but even then had a hard time finding out why a certain step failed. |
Comments |
Comment by Hannu Ylitalo [ 2010 Jan 21 ] |
+1 Also would be nice to have item for each step and have zabbix go through all steps even if one step fails and have some macro for actions/triggers that tells which step failed. |
Comment by greg long [ 2010 Jul 01 ] |
I have created a patch that dumps web output to the zabbix log on a failed web step. Here is the code, feel free to integrate it. It works with 1.4 thru 1.8 modify /src/zabbix_server/httppoller/httptest.c around line 460 replace zabbix_log(LOG_LEVEL_DEBUG, "Page didn't match [%s]", httpstep.required); zabbix_log(LOG_LEVEL_WARNING, "For URL [%s], page didn't match [%s]. Output follows:", httpstep.url, httpstep.required); |
Comment by Arli [ 2012 May 03 ] |
And in more recent versions - in the same file around line 407 modify block to look like this: if ('\0' != *httpstep.required && This workaround helped me a lot while debugging web monitoring scenarios protected by different SSO technologies. |
Comment by Bart Verwilst [ 2012 Oct 25 ] |
Had a step fail for a few minutes, then it recovered. I have no idea why.. |
Comment by richlv [ 2013 Oct 07 ] |
two debugging patches against trunk r38935, thanks to wiper.
|
Comment by Igors Homjakovs (Inactive) [ 2014 May 20 ] |
Fixed in svn://svn.zabbix.com/branches/dev/ZBXNEXT-101 |
Comment by Bart Verwilst [ 2014 May 20 ] |
Could you enlighten us about how this was fix and what kind of goodness we can expect in the next release? Thanks! |
Comment by Bart Verwilst [ 2014 May 20 ] |
If the patches above were used, then never mind.. |
Comment by Igors Homjakovs (Inactive) [ 2014 May 21 ] |
Bart, This fix is still under testing and other modification may still be added. You will be enlightened in a short while |
Comment by Alexander Vladishev [ 2014 May 21 ] |
Related issues: |
Comment by richlv [ 2014 May 22 ] |
hmm, based on the attached patches and earlier discussions, once debug is enabled, sent and received data should be logged no matter whether it failed or not... maybe it's worth discussing and describing implementation details first |
Comment by Igors Homjakovs (Inactive) [ 2014 Jul 02 ] |
Additional modification in r46970 and r46998. Up to now the following things have been implemented:
Now it is possible to change the log level for all processes, one process(e.g. trapper #1), all processes of the same kind (e.g. all trapper processes) or a process with a certain PID.
Due to the fact that now it is possible to send signals to any process additional care should be taken with signal handling. The main problem is that in some cases undefined behaviour is possible after signal arrival. One way to fix it is to set a flag when signal arrives and then check this flag in the beginning of each process loop. Another option is to block a signal while process is in the middle of the loop and then unblock and execute signal handler at the end of the loop. Second important thing is that logging has to be avoided (or improved) in the signal handler because non-signal-safe functions are currently used there. |
Comment by richlv [ 2014 Jul 03 ] |
thanks for the update. detailed spec will still be needed
overall i like this development a lot |
Comment by Igors Homjakovs (Inactive) [ 2014 Jul 04 ] |
Some comments regarding the above questions:
|
Comment by Andris Zeila [ 2014 Jul 14 ] |
(1) In current design adding new process type will change the protocol - the ZBX_PROCESS_TYPE_ALL and ZBX_PROCESS_TYPE_PID values will be different. While old server should not send the signals to new server (and vice versa), it still would be nice to keep the same protocol for the next Zabbix versions too. Also it would be better to separate signal contents from internal constants used for command line parsing (ZBX_TASK_* defines). See the specification draft at https://www.zabbix.org/wiki/Docs/specs/ZBXNEXT-101 igorsh RESOLVED in r47333. wiper CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(2) The ZBX_TASK_START_HTTP_DEBUG, ZBX_TASK_STOP_HTTP_DEBUG defines are not used anywhere. igorsh RESOLVED in r47295. wiper CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(3) The set_log_level_task() function directly uses the zbx_optarg external variable. It would be better to pass the igorsh RESOLVED in r47333. wiper CLOSED. Please check slight changes in log level control option parsing error messages in r48017 |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(4) Instead of having the process type->name mapping defined in two places (compare_process_type_string() and get_process_type_string() functions) the compare_process_type_string() function should iterate through process types and use get_process_type_string() to test if the specified string matches process type name. Also it should be renamed to get_process_type_by_name() or similar. igorsh RESOLVED in r47333. wiper CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(5) Use zbx_error() instead of printf() to print command argument parsing error messages. igorsh RESOLVED in r47334. wiper CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(6) The inline help states that the log level options can have optional arguments '=process, N'. It should be '=process,N' without spaces (however process name itself can contain spaces). igorsh RESOLVED in r47333. wiper Runtime control options: config_cache_reload Reload configuration cache log_level_increase <option> Increase log level log_level_decrease <option> Decrease log level Log level control options: No option All processes =pid Pid number of a process =process All specified processes (e.g. all poller processes) =process,N Process name and number (e.g. poller #1, http poller #3) It could be improved, for example to something like this: Runtime control options: config_cache_reload Reload configuration cache log_level_increase[=<target>] Increase log level, affect all processes if target is not specified log_level_decrease[=<target>] Decrease log level, affect all processes if target is not specified Log level control targets: <pid> Process identifier <process type> All processes of specified type (e.g. poller) <process type>,N Process type and number (e.g. poller,3) Still need to reach consensus about the format. REOPENED asaveljevs andris changed --help format recently under wiper yes, we are waiting on his input andris igorsh RESOLVED in r48351 and r48354. asaveljevs Help for zabbix_agentd currently says: Log level control targets: <pid> Process identifier <process type> All processes of specified type (e.g. poller) <process type,N> Process type and number (e.g. poller,3) Zabbix agent does not have a poller process. In the example, it might be better to include a process type that really exists on the agent. Also, according to http://en.wiktionary.org/wiki/e.g.#Usage_notes , in American English "e.g." should be followed by a comma. However, in that case, there may be too many commas. REOPENED. igorsh RESOLVED in r48475. asaveljevs CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(7) There is no need for separate set_log_level_task() implementations for proxy and server. The both do the same thing and can be moved somewhere. igorsh RESOLVED in r47333 and r47336. wiper CLOSED, as agent,server and proxy shares the same get_process_type_by_name() function I removed the callback parameter from set_log_level_task() function. Please review r48021 igorsh Thank you. CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(8) Use THIS_SHOULD_NEVER_HAPPEN instead of assert(0) igorsh RESOLVED in r47487. wiper CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(9) Might be better to merge get_process_type_by_server_num() and get_process_num_by_server_num() functions into one function get_process_info_by_thread(int *process_type, int *process_num). igorsh RESOLVED in r47362. wiper CLOSED |
Comment by Andris Zeila [ 2014 Jul 14 ] |
(10) Log level changes should work for agent (nix only) too. igorsh RESOLVED in r47443 and r47474. wiper CLOSED |
Comment by Igors Homjakovs (Inactive) [ 2014 Jul 21 ] |
(11) Signal blocking has been added for server and proxy processes in order to prevent undefined behaviour on user signal(SIGUSR1) arrival. Signal blocking has not been added for dbconfig and proxyconfig processes in order to preserve instant configuration reloading. igorsh RESOLVED in r47486. wiper the signal block was removed, so this is not relevant anymore. CLOSED |
Comment by Andris Zeila [ 2014 Aug 14 ] |
(12) Continuing from the discussion above we should always log the received contents in web monitoring with trace log level, even if the web monitoring step succeeds. igorsh RESOLVED in r48206. wiper CLOSED |
Comment by Andris Zeila [ 2014 Aug 14 ] |
(13) Vmware collector also should log the received data when log level set to trace. igorsh RESOLVED in r48217. wiper CLOSED. Please review r48382 (slightly changed log messages) igorsh Thank you. Looks good. Please review my modifications in r48428. wiper CLOSED |
Comment by Andris Zeila [ 2014 Aug 14 ] |
(14) It would be better to use bit masks/shifts to assemble the signal data. It would make the data independent of system endianess and the code would be easier to read. And instead of reusing ZBX_TASK_* defines for message code, create separate messages, with config cache reload message having the same value 9, but the log level messages having values 1,2 (see spec). igorsh RESOLVED in r48327. wiper CLOSED |
Comment by Andris Zeila [ 2014 Aug 14 ] |
(15) The config cache reload message should be assembled with the same tools used to create log level messages to have the same byte order (as defined in spec). Currently config cache reload message is sent to the first process, which should be configuration syncer. But now we can find the right process so we should use this technique also when redirecting config cache reload message. igorsh RESOLVED in r48327. wiper CLOSED, please review changes in r48401 igorsh Thank you. CLOSED. |
Comment by Alexander Vladishev [ 2014 Aug 22 ] |
(16) man pages should be updated igorsh RESOLVED in r48415. asaveljevs Review will be done later, when we resolve conflicts with |
Comment by Andris Zeila [ 2014 Aug 26 ] |
Successfully tested |
Comment by Alexander Vladishev [ 2014 Aug 26 ] |
(17) Documentation:
asaveljevs When documenting, mention that sleeping processes will be woken up and might unexpectedly start executing. For instance, if we send "log_level_increase" command to a sleeping housekeeper, it will do housekeeping. Similarly, a listening trapper will print that it has processed data in 0 seconds. <richlv> did we just implement asaveljevs My comment above superseded by igorsh I have updated the documentation, except the manpages. Please have a look. <dimir> agent man page should mention support for -R https://www.zabbix.com/documentation/2.4/manpages/zabbix_agentd igorsh RESOLVED. andris It should be documented that changing log level for a single Zabbix process by specifing process PID is supported only if PID <= 65535. On systems with large PIDs <process type,N> can be used to change log level for a single process. igorsh RESOLVED in https://www.zabbix.com/documentation/2.4/manual/concepts/server andris CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 27 ] |
(19) [minor] In log.h, we have the following set of functions: void zabbix_set_log_level(int level); int set_debug_level_up(); int set_debug_level_down(); const char *get_debug_level_string(); int zabbix_check_log_level(int level); In some of them, we say "log level", while in the others we say "debug level". I would suggest to (a) have them all refer to "log level", (b) prefix them with "zabbix_" to avoid issues like igorsh RESOLVED in r48489. asaveljevs Functions that do not take any parameters should have "void" in their parameter list. Otherwise, it is possible to call these functions with any number of arguments. REOPENED. igorsh RESOLVED in r48524. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 27 ] |
(20) This looks a bit wrong, because if proxy and agent are built together without server, then "zbxself" will be processed two times: Index: src/libs/Makefile.am =================================================================== --- src/libs/Makefile.am (revision 48471) +++ src/libs/Makefile.am (working copy) @@ -50,7 +50,11 @@ zbxicmpping \ zbxself endif +if AGENT +AGENT_SUBDIRS = \ + zbxself endif +endif This is exactly what we wanted to avoid in igorsh RESOLVED in r48492. asaveljevs Good, except formatting got broken. REOPENED. igorsh RESOLVED in r48527. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 27 ] |
(21) zabbix_set_log_level() was such a useful debugging function! Very sad to see it being removed, I'd rather keep it. If you would still like to remove it, however, please remove the prototype from include/log.h. igorsh RESOLVED in r48494. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 27 ] |
(22) As noted by sasha, the following logging code (for instance, in src/zabbix_proxy/proxyconfig/proxyconfig.c) is incorrect: zabbix_log(LOG_LEVEL_INFORMATION, "server #%d started [%s #%d]", server_num, get_process_type_string(process_type), process_num); It says hardcoded "server #%d started", but it should be "server" for zabbix_server and "proxy" for zabbix_proxy. asaveljevs Similarly, in src/zabbix_agent/stats.c, there is: zabbix_log(LOG_LEVEL_INFORMATION, "agent #%d started [collector]", server_num); The "collector" part should come from get_process_type_string() function, instead of being hardcoded. asaveljevs Similarly, too, the following code distinguished between server and proxy before: #ifdef HAVE_SNMP init_snmp("zabbix_server"); #endif wiper Working on it, solving libsysperf dependency on collector does not look simple asaveljevs Looks good, but please see r48534. It changes get_daemon_type_string() function to accept unsigned char, same as get_process_type_string(). RESOLVED. wiper Thanks, CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 27 ] |
(23) We now have the following structure: typedef struct { int server_num; int process_num; unsigned char process_type; void *args; } zbx_thread_args_t; This has "server", "process" and "thread", words that almost contradict each other. There is probably little we can do regarding "process" and "thread" - those are already rooted in our code. However, the "server_num" field is new and makes no sense on the agent. Perhaps, "global_num" would be better? Also, perhaps we could replace the following repetitive code with a single call to, say, zbx_thread_unpack_args(args, &global_num, &process_num, &process_type)? process_type = ((zbx_thread_args_t *)args)->process_type; server_num = ((zbx_thread_args_t *)args)->server_num; process_num = ((zbx_thread_args_t *)args)->process_num; wiper I was thinking about creating a single entry function for all process (worker) threads, which would handle args, log the '<binary> started> message and then call the specific entry function (also passed with args). I even started to implement it, but then ran into dependency problems and decided to back off. wiper It was decided to leave zbx_thread_agrs_t possible refactoring in ZBXNEXT-2427 |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(24) In function get_log_level_message(), variable "proc_name" is not always freed. igorsh RESOLVED in r48532. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(25) Not related to this development, but compiling zabbix_sender.dll gives a warning: src\zabbix_sender\win32\zabbix_sender.c(66) : warning C4002: too many actual parameters for macro 'zbx_tcp_recv' wiper Yes, the windows sender dll was not updated when the socket API changed slightly asaveljevs Variable "answer" became unused. Removed it in r48536. RESOLVED. wiper Right, CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(26) The following was added for use with siginfo_t.si_uid field: #define SIG_CHECKED_FIELD_UINT(siginfo, field) (NULL == siginfo ? 0 : siginfo->field) Unfortunately, the value of 0 does not allow to distinguish between root user and NULL siginfo. Perhaps we should live with just SIG_CHECKED_FIELD, as before, despite si_uid being unsigned - it might not be common to have user IDs over 31 bits. wiper RESOLVED in asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(27) A notion of "Zabbix worker process" was introduced in daemon.c: in common_worker_sigusr_handler() function and message "process pid:%d is not a Zabbix worker process". We might wish to avoid introducing a new term and just stick with "Zabbix process". wiper RESOLVED in r48531 asaveljevs Looks better, but not quite: zabbix_log(LOG_LEVEL_ERR, "failed to redirect signal: process pid:%d is not a Zabbix" " process", ZBX_RTC_GET_DATA(flags)); If we try to send SIGUSR1 to the main process, it will say that "pid:12345 is not a Zabbix process", which is not entirely correct. Perhaps, "Zabbix child process" would be better. REOPENED. wiper That's why I wanted to find some definition for Zabbix 'worker' processes. For example if the server invokes a script - technically it's also a Zabbix child process. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(28) In zabbix_agentd.c, the following change was made: static char shortopts[] = - "c:hVpt:" + "c:hVR:pt:" #ifdef _WINDOWS "idsxm" #endif However, "R" should only be present on Unix, in the same way as "usage_message" above contains "R" only for Windows. wiper I was thinking it would be better to give explicit error message about runtime control not being supported on Windows, rather than user thinking that he made an error and trying to find how to change the log level on Windows wiper RESOLVED in r48533 (removed runtime option control from Windows agent options) asaveljevs Please see r48548. Among other things, it removes double break that r48533 introduced. RESOLVED. |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(29) Daemons do not compile on platforms without sigqueue() function (e.g., FreeBSD 4.2): ../../src/libs/zbxnix/libzbxnix.a(daemon.o): In function `common_worker_sigusr_handler': src/libs/zbxnix/daemon.c:78: undefined reference to `zbx_sigusr_handler' igorsh RESOLVED in r48541. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(30) Previously, main process used to say that it started: 14488:20140826:152618.787 server #0 started [main process] Unfortunately, it no longer does. igorsh RESOLVED in r48539. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(31) When we receive signal, we write: 4544:20140828:145453.425 Got signal [signal:10(SIGUSR1),sender_pid:4543,sender_uid:1000,value_int:33025]. Since we are using packed integers for signal messages, it might be better to dump value_int as hexadecimal. wiper RESOLVED in r48529 asaveljevs Changed the hexadecimal dump to be 8 digits long in r48546. RESOLVED. |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(32) When we change log level, we write: 4544:20140828:145449.938 log level has been increased to "4 - debug" As discussed, let's change that to the following instead: 4544:20140828:145449.938 log level has been increased to 4 (debug) igorsh RESOLVED in r48543. asaveljevs There should not quotes around the printed value. REOPENED. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(33) Old bug, but anyway: if a PID file is empty, we quit without an error message: $ touch /tmp/zabbix_server.pid $ sbin/zabbix_server -R log_level_increase zabbix_server [4907]: This is because in function read_pid_file() we do not set an error message if fscanf() fails. igorsh RESOLVED in r48545 and r48547. asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(34) In zabbix_agentd.c, we currently have: case ZBX_TASK_RUNTIME_CONTROL: zbx_load_config(ZBX_CFG_FILE_OPTIONAL); exit(SUCCEED == zbx_sigusr_send(t.flags) ? EXIT_SUCCESS : EXIT_FAILURE); break; Config file loading should not be optional, because if we fail at loading the config, whatever PID file we are thinking of as the default is probably wrong. wiper RESOLVED in r48555 asaveljevs CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Aug 28 ] |
(35) Currently, we allow the following, which sends signal to all our processes. We should not. $ sbin/zabbix_server -R log_level_increase=0 wiper RESOLVED in asaveljevs CLOSED |
Comment by Igors Homjakovs (Inactive) [ 2014 Aug 29 ] |
Available in pre-2.3.4 (trunk) r48604. |
Comment by richlv [ 2014 Aug 30 ] |
some review asked in subissues 7 & 15 |
Comment by richlv [ 2014 Sep 07 ] |
|
Comment by Andris Mednis [ 2014 Sep 08 ] |
(36) get_process_info_by_thread() has no prototype: zabbix_agentd.c:215:5: warning: no previous prototype for 'get_process_info_by_thread' [-Wmissing-prototypes] int get_process_info_by_thread(int server_num, unsigned char *process_type, int *process_num) |
Comment by Andris Mednis [ 2014 Sep 08 ] |
(37) Confusing messages from validation of runtime control option:
$ sbin/zabbix_proxy -c ~/ZBX-8391/zabbix_proxy.conf -R log_level_increaseAAA
zabbix_proxy [21649]: unknown log level control option: AAA
$ sbin/zabbix_proxy -c ~/ZBX-8391/zabbix_proxy.conf -R log_level_increase=poller,
zabbix_proxy [21808]: invalid log level control option: process number must be unsigned short non-zero value
igorsh RESOLVED ir r48933. andris Please review proposed changes in r49134. igorsh Additional changes in r49137. andris Please review proposed change in r49140. igorsh Thank you. Moved to igorsh Agent crashes if only process name is specified (e.g -Rlog_level_increase=poller) andris RESOLVED in r49317. igorsh Thank you. CLOSED |
Comment by Andris Mednis [ 2014 Sep 08 ] |
(38) get_process_info_by_thread() arguments have the same names as global variables in src/zabbix_agent/zabbix_agentd.c: zabbix_agentd.c: In function 'get_process_info_by_thread': zabbix_agentd.c:215:36: warning: declaration of 'server_num' shadows a global declaration [-Wshadow] int get_process_info_by_thread(int server_num, unsigned char *process_type, int *process_num) ^ zabbix_agentd.c:180:22: warning: shadowed declaration is here [-Wshadow] ZBX_THREAD_LOCAL int server_num = 0; ^ zabbix_agentd.c:215:63: warning: declaration of 'process_type' shadows a global declaration [-Wshadow] int get_process_info_by_thread(int server_num, unsigned char *process_type, int *process_num) ^ zabbix_agentd.c:178:32: warning: shadowed declaration is here [-Wshadow] ZBX_THREAD_LOCAL unsigned char process_type = 255; /* ZBX_PROCESS_TYPE_UNKNOWN */ ^ zabbix_agentd.c:215:82: warning: declaration of 'process_num' shadows a global declaration [-Wshadow] int get_process_info_by_thread(int server_num, unsigned char *process_type, int *process_num) ^ zabbix_agentd.c:179:22: warning: shadowed declaration is here [-Wshadow] ZBX_THREAD_LOCAL int process_num; igorsh RESOLVED in r48897 and moved to |
Comment by Aleksandrs Saveljevs [ 2014 Sep 10 ] |
This caused IPMI pollers not working on the server: |
Comment by Andris Zeila [ 2014 Sep 11 ] |
(39) There is a potential lockup if signal is received during message logging - as it will try to lock the mutex already locked by log function. Possible solution would be to block signals during log writes, but need to check performance impact and how it would work from signal handler itself. igorsh Moved to |
Comment by Igors Homjakovs (Inactive) [ 2014 Sep 15 ] |
(40) Debug level 5 messages are not printed if syslog is used. |
Comment by Oleksii Zagorskyi [ 2016 May 19 ] |
Take headers into consideration (including for logging) has been requested in |