[ZBXNEXT-3386] IPMI connection to a single device is kept open by multiple processes Created: 2016 Aug 18  Updated: 2024 Apr 10  Resolved: 2017 Feb 09

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Proxy (P), Server (S)
Affects Version/s: None
Fix Version/s: 3.4.0alpha1, 3.4 (plan)

Type: Change Request Priority: Minor
Reporter: Andris Zeila Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Causes
causes ZBX-14764 Proxy crashes when StartIPMIPollers=2 Closed
Duplicate
is duplicated by ZBX-3243 Network error while retrieving IPMI data Closed
Sub-task
depends on ZBX-4823 unreachable pollers may "hang" when a... Closed
Team: Team A
Sprint: Sprint 1

 Description   

When IPMI value is requested the connection is kept open by the process. This means that after a while each IPMI poller (and possible each unreachable poller) will have an open IPMI connection to all monitored devices, often overloading the BMCs.

To avoid it one device (host) should be queried only by one process. This process will:

  • accept IPMI keys from normal and unreachable pollers
  • query the device for requested data
  • return the data to the poller

Zabbix should support multiple dedicated IPMI pollers, each poller monitoring a subset of hosts.



 Comments   
Comment by Vjaceslavs Bogdanovs [ 2017 Jan 17 ]

(1) Errors in IPMI commands cause dead loop in zbx_perform_openipmi_ops that blocks execution of execute_commands so no further command execution is possible.

Steps to reproduce:

1. Create item
2. Create trigger
3. Create action
    3.1. Create action operation as IPMI command
    3.2. Use non-existing command as value (for example, "kolbaski on")
4. Make trigger go into problem state.

As far as I can tell, problem was present before current task, so this can be addressed in different issue.

wiper RESOLVED in r65388

vjaceslavs I think that there should also be a check for port value "0" as it will cause the same dead loop.

wiper Good point, RESOLVED in r65391

vjaceslavs Thanks, but just to make things a bit more consistent, there should also be check for port value "0" when performing IPMI value request (or even better when host is being cached ipmi_manager_cache_host) as poller will stay in the same dead loop.
I also suggest to introduce "port 0" validation in frontend.

wiper zbx_ipmi_convert_port() is used for both - value and command requests.

vjaceslavs Sorry, missed that. Thanks! CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 18 ]

(2) IPC relies on non-blocking sockets and assumes that blocking state is a successful result for write operation. This can cause problems when write would block on non-blocking socket as event assumes that write failed at the end of data buffer.

For example, consider following scenario:
1. ipmi_poller_send_request is called
2. zbx_ipc_client_send is called from ipmi_poller_send_request
3. ipc_socket_write_message is called from zbx_ipc_client_send and this is where the fun part begins

ipc_socket_write_message does not check for result of each ipc_write_data call so in cases when header and data is being sent in two consecutive ipc_write_data calls (large data chunk) and first call fails with EAGAIN, data transmission continues with sending data, corrupting the header data as a result.

To illustrate this situation we can use dummy packet:

HHHHThis is data

Two consecutive ipc_write_data calls happen:
1. Sending "HHHH" (header)
2. Sending "This is data" (data)

If first call ends with EAGAIN and no data sent, data part is being transferred, sending "This is data" (let's assume, that there are no errors and all data is sent).
Then check occurs and creates event listener to retry data transmission with offset being set as 12 (size of data sent) and in a short time period event listener is being called sending "data". As a result, packet gets corrupted:

This is datadata

This problem can happen only on high load when sending of header (small data chunk) can fail and when data exceeds ZBX_IPC_SOCKET_BUFFER_SIZE so data is being sent in two consecutive ipc_write_data calls.

wiper Aded check for the actual bytes written after sending header for large messages.
This should fix the issue when ipc_socket_write_message is used by zbx_ipc_client_send function. The zbx_ipc_socket_write function is used by clients and is operating on blocking sockets - so there should be no problems.
RESOLVED in r65168

vjaceslavs Thanks! CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 19 ]

(3) zbx_ipmi_execute_command have hardcoded timeout (10 sec) for a start of IPC communication (zbx_ipc_socket_open) this can cause lag in command execution on socket error.
Decision should be made on how IPC timeouts should be implemented: if timeout should not be configurable by user, then smaller values should be used as 10s delay on IPC can result noticeable impact on execution flow.

wiper Failure to connect to IPMI manager should have been treated as fatal error. So we decided to increase timetout to 60 seconds and exit process if failed.
RESOLVED in r65278

vjaceslavs CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 19 ]

(4) Current IPC implementation relies on the fact that IPC service (thread or process that executes zbx_ipc_service_start) will respond and that it will respond in a small period of time. IPC clients use zbx_ipc_socket_read, that on blocking socket can cause block in execution flow for uncontrolled time period. I suggest to introduce timeouts for IPC (not only for zbx_ipc_socket_open) to provide more flexible use of IPC. This is related to situation described in (1)

wiper Currently failure to get answer from IPC service is fatal error. Even in case of IPMI command execution the command will timeout on IPMI pollers and error result will be returned. In future we will add timeout parameter to zbx_ipc_socket_read function when it will be required.

CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 20 ]

(5) zbx_deserialize_str uses malloc (not zbx_malloc), but result of malloc is not checked so out of memory error will cause crash.

wiper RESOLVED in r65279

vjaceslavs CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 25 ]

(6) ipmi_poller_send_request can cause sending of corrupted messages. ipmi_poller_send_request checks result of zbx_ipc_client_send and stops if zbx_ipc_client_send fails nevertheless this does not ensure that full message is transferred. Consider the following case:
1. zbx_ipc_client_send sends only part of data and add an event to queue for asynchronous write operation (ipc_client_write_event_cb)
2. Asynchronous write (ipc_client_write) fails, events get terminated by ipc_client_free_events

Now there are only two possible outcomes:
1. poller gets stuck waiting for data in zbx_ipc_socket_read
2. poller receives part of data from first zbx_ipc_client_send call and rest of the data from another zbx_ipc_client_send call

If we add the fact that deserialize operations like zbx_deserialize_str does not take into account total size of message this can lead to buffer overrun.

wiper In the IPMI manager/poller context failure to send message to poller is a critical error and Zabbix should exit. However in generic context it might not be a critical failure. So for now it will be enough to log error and remove client.
RESOLVED in r65347

vjaceslavs Current fix can cause pollers to become disconnected (as ipc_service_remove_client closes the connection) from IPMI manager without any option to reconnect. Is this isolation (inability to communicate with manager) of poller intended?

wiper Yes, if write failed then the connection must have been closed. Either by poller crashing or by external tampering. In the second case reading from service socket should fail and poller will exit with critical error.

vjaceslavs Ok, thanks! CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 26 ]

(7) [D] Specification is missing a note about the fact that IPC socket files should stay intact while Zabbix is running.
If I set /tmp as SocketDir configuration value and socket file (currently zabbix_ipmi.sock) gets deleted while Zabbix is running, then Zabbix continues to work without any problems except for IPMI commands. When trying to execute IPMI command, Zabbix stops:

 12678:20170126:094124.013 cannot connect to IPMI service: Cannot connect to service "ipmi": [2] No such file or directory.
 12655:20170126:094124.016 One child process died (PID:12678,exitcode/signal:1). Exiting ...
 12655:20170126:094126.029 syncing history data...
 12655:20170126:094126.029 syncing history data done
 12655:20170126:094126.029 syncing trend data...
 12655:20170126:094126.038 syncing trend data done
 12655:20170126:094126.038 Zabbix Server stopped. Zabbix 3.3.0 (revision {ZABBIX_REVISION}).

I am not really sure if it is ok to just stop if socket is gone (I think it needs approval).
It is not really a bug of Zabbix, but it is a weird behavior (stopping server) and should be mentioned in documentation (that socket files should remain intact). Just in case.

wiper Loosing an internal service is a critical failure, which usually is handled by exiting Zabbix.
Regarding documentation - we don't state that shared memory, semaphores must not be removed by other processes, so I don't think we should write something about sockets.

vjaceslavs I agree that this is a critical failure, but I am not sure if silent death of monitoring solution is the best way to react on that. I would suggest to:

  1. Try to reinitialize the communication channel
  2. Send notification of any kind so death is not silent

Without that monitoring solution dies if socket file is absent and user needs one more monitoring solution just to monitor the Zabbix as there are no recover strategies / notifications. But all I am saying is that it is a suggestion and not a bug in implementation of current task.

Nevertheless there should be a warning in documentation about the fact that death of socket file will cause death of Zabbix server, so socket file location should be chosen wisely.

wiper Of course ideally we should try to recover from critical errors - that includes restarting crashed processes (probably not more that X times per last Y minutes to avoid crash loops).

wiper Socket creation was fixed to use correct permissions for the socket file, reducing probability of accidental socket file removal.
CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 31 ]

(8) Some warnings are missing from log. For example if IPMI host becomes disabled there was a warning level record in log:

temporarily disabling IPMI agent checks on host "localhost linux": host unavailable

The same is for IPMI item error:

IPMI agent item "IPMIItem1" on host "localhost linux" failed: first network error, wait for 15 seconds

Now there are no such record in log.

vjaceslavs Sorry, my bad. Looks like host was already disabled. CLOSED

Comment by Vjaceslavs Bogdanovs [ 2017 Jan 31 ]

Successfully tested

Comment by Andris Zeila [ 2017 Feb 01 ]

(9) [I] The new SocketDir parameter must be added to default server and proxy configuration files (conf/zabbix_*.conf)

RESOLVED in r65416

vjaceslavs CLOSED

Comment by Andris Zeila [ 2017 Feb 01 ]

(9) [D] Documentation:

RESOLVED

sasha Wonderful! But please add information about "IPMI manager" process.

REOPENED

wiper RESOLVED

sasha CLOSED

Comment by Andris Zeila [ 2017 Feb 02 ]

(10) [I] IPMI manager process monitoring item must be added to Zabbix App Proxy template.

Also IPMI manager process monitoring items must be added to internal process monitoring graphs (server and proxy).

RESOLVED in r65474

vjaceslavs CLOSED

Comment by Andris Zeila [ 2017 Feb 02 ]

(11) [S] Defects reported by coverity scan: 155916, 155917, 155918, 155921, 155922

RESOLVED in r65481, r65482, r65484, r65485, r65486

vjaceslavs Result of inner fcntl call still remains unchecked:

if (-1 == fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK))

wiper RESOLVED in r65497

vjaceslavs CLOSED

glebs.ivanovskis Coverity agrees. Thank you!

Comment by Andris Zeila [ 2017 Feb 03 ]

(12) There are potential conflicts when multiple zabbix instances are using the same socket file. To fix it:

  • IPMI manager will drop connections from foreign processes
  • Programm type (server, proxy, agent) will be added to IPC service socket file name, so server and proxy can run on the same system with default settings.

RESOLVED in r65501

vjaceslavs CLOSED

Comment by Alexander Vladishev [ 2017 Feb 06 ]

(13) After upgrade to the latest 3.3.0 r65534 power control return 0 (off) instead of 1 (on)

wiper RESOLVED in r65556

vjaceslavs CLOSED

Comment by Alexander Vladishev [ 2017 Feb 06 ]

(14) Cannot compile agent

$ ./configure --enable-agent
$ make clean > /dev/null
$ make install -s
Making install in src
Making install in libs
Making install in zbxcrypto
Making install in zbxcommon
Making install in zbxlog
Making install in zbxalgo
Making install in zbxnix
ipcservice.c:8:19: error: event.h: No such file or directory
ipcservice.c: In function ‘event_new’:
ipcservice.c:73: error: invalid application of ‘sizeof’ to incomplete type ‘struct event’ 
ipcservice.c: In function ‘ipc_service_add_client’:
ipcservice.c:763: error: ‘EV_READ’ undeclared (first use in this function)
ipcservice.c:763: error: (Each undeclared identifier is reported only once
ipcservice.c:763: error: for each function it appears in.)
ipcservice.c:763: error: ‘EV_PERSIST’ undeclared (first use in this function)
ipcservice.c:764: error: ‘EV_WRITE’ undeclared (first use in this function)
ipcservice.c: In function ‘ipc_service_event_log_cb’:
ipcservice.c:918: error: ‘_EVENT_LOG_DEBUG’ undeclared (first use in this function)
ipcservice.c:921: error: ‘_EVENT_LOG_MSG’ undeclared (first use in this function)
ipcservice.c:924: error: ‘_EVENT_LOG_WARN’ undeclared (first use in this function)
ipcservice.c:927: error: ‘_EVENT_LOG_ERR’ undeclared (first use in this function)
ipcservice.c: In function ‘zbx_ipc_service_start’:
ipcservice.c:1412: warning: assignment makes pointer from integer without a cast
ipcservice.c:1413: error: ‘EV_READ’ undeclared (first use in this function)
ipcservice.c:1413: error: ‘EV_PERSIST’ undeclared (first use in this function)
ipcservice.c: In function ‘zbx_ipc_service_recv’:
ipcservice.c:1499: error: ‘EVLOOP_ONCE’ undeclared (first use in this function)
ipcservice.c:1502: error: ‘EVLOOP_NONBLOCK’ undeclared (first use in this function)

vjaceslavs Just checked current version of dev branch. Agent compiles without any problems for me.

wiper Looks like there are 2 problems:

  1. configure script does not work, it should have failed libevent check (did you try running bootstrap?)
  2. currently ipcservice is not used by agent, we probably should drop libevent requirement for agent. Maybe introduce new define, something like HAVE_IPCSERVICE

sasha

  1. bootstrap.sh has been executed
  2. libevent-dev is not installed on the system

wiper Configure script did check for libevent, but only if server/proxy are being built. I moved IPC service related code under HAVE_IPCSERVICE define, which currently is defined if ipmi support is enabled (--with-openipmi).
RESOLVED in r65562

vjaceslavs Agent compiles with / without libevent installed. CLOSED

Comment by Andris Zeila [ 2017 Feb 08 ]

(15) [S] Trunk does not compile with OpenIPMI. Fix (provided by glebs.ivanovskis):

Index: src/zabbix_server/ipmi/ipmi_manager.c
===================================================================
--- src/zabbix_server/ipmi/ipmi_manager.c	(revision 65567)
+++ src/zabbix_server/ipmi/ipmi_manager.c	(working copy)
@@ -747,7 +747,7 @@
 				init_result(&result);
 				SET_TEXT_RESULT(&result, value);
 				value = NULL;
-				dc_add_history(itemid, ITEM_VALUE_TYPE_TEXT, 0, &result, &ts, state, NULL);
+				dc_add_history(itemid, 0, &result, &ts, state, NULL);
 				free_result(&result);
 			}
 			break;
@@ -756,7 +756,7 @@
 		case AGENT_ERROR:
 		case CONFIG_ERROR:
 			state = ITEM_STATE_NOTSUPPORTED;
-			dc_add_history(itemid, ITEM_VALUE_TYPE_TEXT, 0, NULL, &ts, state, value);
+			dc_add_history(itemid, 0, NULL, &ts, state, value);
 	}
 
 	dc_flush_history();
@@ -847,7 +847,7 @@
 			int		errcode = CONFIG_ERROR;
 
 			zbx_timespec(&ts);
-			dc_add_history(items[i].itemid, ITEM_VALUE_TYPE_TEXT, 0, NULL, &ts, state, error);
+			dc_add_history(items[i].itemid, 0, NULL, &ts, state, error);
 			DCrequeue_items(&items[i].itemid, &state, &ts.sec, NULL, NULL, &errcode, 1);
 			zbx_free(error);
 			continue;

RESOLVED in 65569

vjaceslavs CLOSED

Comment by Andris Zeila [ 2017 Feb 08 ]

(16) [S] Suppressed multiple unused parameter warnings in libevent callback functions.

RESOLVED in r65588

vjaceslavs CLOSED

Comment by Andris Zeila [ 2017 Feb 09 ]

The latest fixes released in:

  • pre-3.3.0 r65604
Generated at Tue May 27 21:38:02 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.