-
Problem report
-
Resolution: Fixed
-
Major
-
None
-
Sprint 51 (Apr 2019)
-
1
We have such number of rows in corresponding tables:
hosts - 423722
interface - 620132
dservices - 360533
dchecks - 18123
drules - 18123
Yes, 18K network discovery rules, each assigned to own proxy (yes, 18K proxies).
Each rule scan this ranges each 3 hours: 192.168.140.1-254,192.168.150.1-39 (~300 IPs).
The problem that we always have many heavy queries from this function:
select h.hostid,h.status from hosts h,interface i,dservices ds,dchecks dc,drules dr static zbx_uint64_t select_discovered_host(const DB_EVENT *event) { const char *__function_name = "select_discovered_host"; DB_RESULT result; DB_ROW row; zbx_uint64_t hostid = 0; char *sql = NULL; zabbix_log(LOG_LEVEL_DEBUG, "In %s() eventid:" ZBX_FS_UI64, __function_name, event->eventid); switch (event->object) { case EVENT_OBJECT_DHOST: sql = zbx_dsprintf(sql, "select h.hostid,h.status" " from hosts h,interface i,dservices ds,dchecks dc,drules dr" " where h.hostid=i.hostid" " and i.ip=ds.ip" " and ds.dcheckid=dc.dcheckid" " and dc.druleid=dr.druleid" " and h.status in (%d,%d)" " and " ZBX_SQL_NULLCMP("dr.proxy_hostid", "h.proxy_hostid") " and i.useip=1" " and ds.dhostid=" ZBX_FS_UI64 " order by i.hostid", HOST_STATUS_MONITORED, HOST_STATUS_NOT_MONITORED, event->objectid); break; case EVENT_OBJECT_DSERVICE: sql = zbx_dsprintf(sql, "select h.hostid,h.status" " from hosts h,interface i,dservices ds,dchecks dc,drules dr" " where h.hostid=i.hostid" " and i.ip=ds.ip" " and ds.dcheckid=dc.dcheckid" " and dc.druleid=dr.druleid" " and h.status in (%d,%d)" " and " ZBX_SQL_NULLCMP("dr.proxy_hostid", "h.proxy_hostid") " and i.useip=1" " and ds.dserviceid=" ZBX_FS_UI64 " order by i.hostid", HOST_STATUS_MONITORED, HOST_STATUS_NOT_MONITORED, event->objectid); break; default: goto exit;
namely 2nd query - with filtering by ds.dserviceid we could spot in waiting transaction.
On average these queries take 15 seconds and cause high CPU usage on DB server.
As I see in my debug test each query from the function is repeated once (per check and per rule). I.e. two very similar heavy queries.
I've figure out that this SQL comes from operation to delete a host, when zabbix server needs to check host existence.
The operation (and the function call) is performed for IPs when host is deleted already and IP is still not responsive (check is failed).
Maybe the SQL could be improved to be faster and avoid high CPU usage?