[ZBXNEXT-3386] IPMI connection to a single device is kept open by multiple processes Created: 2016 Aug 18 Updated: 2024 Apr 10 Resolved: 2017 Feb 09 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | None |
Fix Version/s: | 3.4.0alpha1, 3.4 (plan) |
Type: | Change Request | Priority: | Minor |
Reporter: | Andris Zeila | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Issue Links: |
|
||||||||||||||||||||||||
Team: | |||||||||||||||||||||||||
Sprint: | Sprint 1 |
Description |
When IPMI value is requested the connection is kept open by the process. This means that after a while each IPMI poller (and possible each unreachable poller) will have an open IPMI connection to all monitored devices, often overloading the BMCs. To avoid it one device (host) should be queried only by one process. This process will:
Zabbix should support multiple dedicated IPMI pollers, each poller monitoring a subset of hosts. |
Comments |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 17 ] |
(1) Errors in IPMI commands cause dead loop in zbx_perform_openipmi_ops that blocks execution of execute_commands so no further command execution is possible. Steps to reproduce: 1. Create item 2. Create trigger 3. Create action 3.1. Create action operation as IPMI command 3.2. Use non-existing command as value (for example, "kolbaski on") 4. Make trigger go into problem state. As far as I can tell, problem was present before current task, so this can be addressed in different issue. wiper RESOLVED in r65388 vjaceslavs I think that there should also be a check for port value "0" as it will cause the same dead loop. wiper Good point, RESOLVED in r65391 vjaceslavs Thanks, but just to make things a bit more consistent, there should also be check for port value "0" when performing IPMI value request (or even better when host is being cached ipmi_manager_cache_host) as poller will stay in the same dead loop. wiper zbx_ipmi_convert_port() is used for both - value and command requests. vjaceslavs Sorry, missed that. Thanks! CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 18 ] |
(2) IPC relies on non-blocking sockets and assumes that blocking state is a successful result for write operation. This can cause problems when write would block on non-blocking socket as event assumes that write failed at the end of data buffer. For example, consider following scenario: ipc_socket_write_message does not check for result of each ipc_write_data call so in cases when header and data is being sent in two consecutive ipc_write_data calls (large data chunk) and first call fails with EAGAIN, data transmission continues with sending data, corrupting the header data as a result. To illustrate this situation we can use dummy packet: HHHHThis is data Two consecutive ipc_write_data calls happen: If first call ends with EAGAIN and no data sent, data part is being transferred, sending "This is data" (let's assume, that there are no errors and all data is sent). This is datadata This problem can happen only on high load when sending of header (small data chunk) can fail and when data exceeds ZBX_IPC_SOCKET_BUFFER_SIZE so data is being sent in two consecutive ipc_write_data calls. wiper Aded check for the actual bytes written after sending header for large messages. vjaceslavs Thanks! CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 19 ] |
(3) zbx_ipmi_execute_command have hardcoded timeout (10 sec) for a start of IPC communication (zbx_ipc_socket_open) this can cause lag in command execution on socket error. wiper Failure to connect to IPMI manager should have been treated as fatal error. So we decided to increase timetout to 60 seconds and exit process if failed. vjaceslavs CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 19 ] |
(4) Current IPC implementation relies on the fact that IPC service (thread or process that executes zbx_ipc_service_start) will respond and that it will respond in a small period of time. IPC clients use zbx_ipc_socket_read, that on blocking socket can cause block in execution flow for uncontrolled time period. I suggest to introduce timeouts for IPC (not only for zbx_ipc_socket_open) to provide more flexible use of IPC. This is related to situation described in (1) wiper Currently failure to get answer from IPC service is fatal error. Even in case of IPMI command execution the command will timeout on IPMI pollers and error result will be returned. In future we will add timeout parameter to zbx_ipc_socket_read function when it will be required. CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 20 ] |
(5) zbx_deserialize_str uses malloc (not zbx_malloc), but result of malloc is not checked so out of memory error will cause crash. wiper RESOLVED in r65279 vjaceslavs CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 25 ] |
(6) ipmi_poller_send_request can cause sending of corrupted messages. ipmi_poller_send_request checks result of zbx_ipc_client_send and stops if zbx_ipc_client_send fails nevertheless this does not ensure that full message is transferred. Consider the following case: Now there are only two possible outcomes: If we add the fact that deserialize operations like zbx_deserialize_str does not take into account total size of message this can lead to buffer overrun. wiper In the IPMI manager/poller context failure to send message to poller is a critical error and Zabbix should exit. However in generic context it might not be a critical failure. So for now it will be enough to log error and remove client. vjaceslavs Current fix can cause pollers to become disconnected (as ipc_service_remove_client closes the connection) from IPMI manager without any option to reconnect. Is this isolation (inability to communicate with manager) of poller intended? wiper Yes, if write failed then the connection must have been closed. Either by poller crashing or by external tampering. In the second case reading from service socket should fail and poller will exit with critical error. vjaceslavs Ok, thanks! CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 26 ] |
(7) [D] Specification is missing a note about the fact that IPC socket files should stay intact while Zabbix is running. 12678:20170126:094124.013 cannot connect to IPMI service: Cannot connect to service "ipmi": [2] No such file or directory. 12655:20170126:094124.016 One child process died (PID:12678,exitcode/signal:1). Exiting ... 12655:20170126:094126.029 syncing history data... 12655:20170126:094126.029 syncing history data done 12655:20170126:094126.029 syncing trend data... 12655:20170126:094126.038 syncing trend data done 12655:20170126:094126.038 Zabbix Server stopped. Zabbix 3.3.0 (revision {ZABBIX_REVISION}). I am not really sure if it is ok to just stop if socket is gone (I think it needs approval). wiper Loosing an internal service is a critical failure, which usually is handled by exiting Zabbix. vjaceslavs I agree that this is a critical failure, but I am not sure if silent death of monitoring solution is the best way to react on that. I would suggest to:
Without that monitoring solution dies if socket file is absent and user needs one more monitoring solution just to monitor the Zabbix as there are no recover strategies / notifications. But all I am saying is that it is a suggestion and not a bug in implementation of current task. Nevertheless there should be a warning in documentation about the fact that death of socket file will cause death of Zabbix server, so socket file location should be chosen wisely. wiper Of course ideally we should try to recover from critical errors - that includes restarting crashed processes (probably not more that X times per last Y minutes to avoid crash loops). wiper Socket creation was fixed to use correct permissions for the socket file, reducing probability of accidental socket file removal. |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 31 ] |
(8) Some warnings are missing from log. For example if IPMI host becomes disabled there was a warning level record in log: temporarily disabling IPMI agent checks on host "localhost linux": host unavailable The same is for IPMI item error: IPMI agent item "IPMIItem1" on host "localhost linux" failed: first network error, wait for 15 seconds Now there are no such record in log. vjaceslavs Sorry, my bad. Looks like host was already disabled. CLOSED |
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 31 ] |
Successfully tested |
Comment by Andris Zeila [ 2017 Feb 01 ] |
(9) [I] The new SocketDir parameter must be added to default server and proxy configuration files (conf/zabbix_*.conf) RESOLVED in r65416 vjaceslavs CLOSED |
Comment by Andris Zeila [ 2017 Feb 01 ] |
(9) [D] Documentation:
RESOLVED sasha Wonderful! But please add information about "IPMI manager" process. REOPENED wiper RESOLVED sasha CLOSED |
Comment by Andris Zeila [ 2017 Feb 02 ] |
(10) [I] IPMI manager process monitoring item must be added to Zabbix App Proxy template. Also IPMI manager process monitoring items must be added to internal process monitoring graphs (server and proxy). RESOLVED in r65474 vjaceslavs CLOSED |
Comment by Andris Zeila [ 2017 Feb 02 ] |
(11) [S] Defects reported by coverity scan: 155916, 155917, 155918, 155921, 155922 RESOLVED in r65481, r65482, r65484, r65485, r65486 vjaceslavs Result of inner fcntl call still remains unchecked: if (-1 == fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK)) wiper RESOLVED in r65497 vjaceslavs CLOSED glebs.ivanovskis Coverity agrees. |
Comment by Andris Zeila [ 2017 Feb 03 ] |
(12) There are potential conflicts when multiple zabbix instances are using the same socket file. To fix it:
RESOLVED in r65501 vjaceslavs CLOSED |
Comment by Alexander Vladishev [ 2017 Feb 06 ] |
(13) After upgrade to the latest 3.3.0 r65534 power control return 0 (off) instead of 1 (on) wiper RESOLVED in r65556 vjaceslavs CLOSED |
Comment by Alexander Vladishev [ 2017 Feb 06 ] |
(14) Cannot compile agent $ ./configure --enable-agent $ make clean > /dev/null $ make install -s Making install in src Making install in libs Making install in zbxcrypto Making install in zbxcommon Making install in zbxlog Making install in zbxalgo Making install in zbxnix ipcservice.c:8:19: error: event.h: No such file or directory ipcservice.c: In function ‘event_new’: ipcservice.c:73: error: invalid application of ‘sizeof’ to incomplete type ‘struct event’ ipcservice.c: In function ‘ipc_service_add_client’: ipcservice.c:763: error: ‘EV_READ’ undeclared (first use in this function) ipcservice.c:763: error: (Each undeclared identifier is reported only once ipcservice.c:763: error: for each function it appears in.) ipcservice.c:763: error: ‘EV_PERSIST’ undeclared (first use in this function) ipcservice.c:764: error: ‘EV_WRITE’ undeclared (first use in this function) ipcservice.c: In function ‘ipc_service_event_log_cb’: ipcservice.c:918: error: ‘_EVENT_LOG_DEBUG’ undeclared (first use in this function) ipcservice.c:921: error: ‘_EVENT_LOG_MSG’ undeclared (first use in this function) ipcservice.c:924: error: ‘_EVENT_LOG_WARN’ undeclared (first use in this function) ipcservice.c:927: error: ‘_EVENT_LOG_ERR’ undeclared (first use in this function) ipcservice.c: In function ‘zbx_ipc_service_start’: ipcservice.c:1412: warning: assignment makes pointer from integer without a cast ipcservice.c:1413: error: ‘EV_READ’ undeclared (first use in this function) ipcservice.c:1413: error: ‘EV_PERSIST’ undeclared (first use in this function) ipcservice.c: In function ‘zbx_ipc_service_recv’: ipcservice.c:1499: error: ‘EVLOOP_ONCE’ undeclared (first use in this function) ipcservice.c:1502: error: ‘EVLOOP_NONBLOCK’ undeclared (first use in this function) vjaceslavs Just checked current version of dev branch. Agent compiles without any problems for me. wiper Looks like there are 2 problems:
wiper Configure script did check for libevent, but only if server/proxy are being built. I moved IPC service related code under HAVE_IPCSERVICE define, which currently is defined if ipmi support is enabled (--with-openipmi). vjaceslavs Agent compiles with / without libevent installed. CLOSED |
Comment by Andris Zeila [ 2017 Feb 08 ] |
(15) [S] Trunk does not compile with OpenIPMI. Fix (provided by glebs.ivanovskis): Index: src/zabbix_server/ipmi/ipmi_manager.c =================================================================== --- src/zabbix_server/ipmi/ipmi_manager.c (revision 65567) +++ src/zabbix_server/ipmi/ipmi_manager.c (working copy) @@ -747,7 +747,7 @@ init_result(&result); SET_TEXT_RESULT(&result, value); value = NULL; - dc_add_history(itemid, ITEM_VALUE_TYPE_TEXT, 0, &result, &ts, state, NULL); + dc_add_history(itemid, 0, &result, &ts, state, NULL); free_result(&result); } break; @@ -756,7 +756,7 @@ case AGENT_ERROR: case CONFIG_ERROR: state = ITEM_STATE_NOTSUPPORTED; - dc_add_history(itemid, ITEM_VALUE_TYPE_TEXT, 0, NULL, &ts, state, value); + dc_add_history(itemid, 0, NULL, &ts, state, value); } dc_flush_history(); @@ -847,7 +847,7 @@ int errcode = CONFIG_ERROR; zbx_timespec(&ts); - dc_add_history(items[i].itemid, ITEM_VALUE_TYPE_TEXT, 0, NULL, &ts, state, error); + dc_add_history(items[i].itemid, 0, NULL, &ts, state, error); DCrequeue_items(&items[i].itemid, &state, &ts.sec, NULL, NULL, &errcode, 1); zbx_free(error); continue; RESOLVED in 65569 vjaceslavs CLOSED |
Comment by Andris Zeila [ 2017 Feb 08 ] |
(16) [S] Suppressed multiple unused parameter warnings in libevent callback functions. RESOLVED in r65588 vjaceslavs CLOSED |
Comment by Andris Zeila [ 2017 Feb 09 ] |
The latest fixes released in:
|