Incident report
Resolution: Won't fix
Major
None
1.8.18
I have about 600 monitored boxes, and only 150 of them are behind proxies (temporarily I cannot move more).
Number of hosts (monitored/not monitored/templates): 1299 (620 / 263 / 416)
Number of items (monitored/disabled/not supported): 63869 (56129 / 2612 / 5128)
Number of triggers (enabled/disabled)[problem/unknown/ok]: 27957 (22347 / 5610 [125 / 2199 / 20023])
Number of users (online): 79 (3)
Required server performance, new values per second: 250.84
I'm trying to solve a problem with the queue of delayed item data.
A few hours after starting Zabbix, I can observe that telnet to the server port no longer opens a TCP session instantly. After that, all active checks show longer and longer delays. Sometimes it recovers, but sometimes it does not, and so far only a restart of the server helps.
Looking at the netstat output on the Zabbix server, I see:
# netstat -an | grep <IP.of.zbx.srv>:10051 | awk '{ print $6 }' | sort | uniq -c
     38 CLOSE_WAIT
     99 ESTABLISHED
     26 SYN_RECV
      1 SYN_SENT
    396 TIME_WAIT
so the numbers are not large.
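For repeated sampling, the per-state tally above can be wrapped in a small helper. This is only a sketch: the function name `count_states` is made up for illustration, and the column positions (`$4` for the local address, `$6` for the state) assume the Linux `netstat -an` layout. Matching on the local address column, rather than grepping the whole line, keeps connections whose remote port happens to be 10051 out of the count.

```shell
# Hypothetical helper: tally TCP states for one local port.
# Reads `netstat -an` style lines on stdin; assumes Linux column layout
# (proto, recv-q, send-q, local address, foreign address, state).
count_states() {
    port="$1"
    awk -v port="$port" '
        # keep only TCP rows whose LOCAL address ends in ":<port>"
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { n[$6]++ }
        END { for (s in n) print n[s], s }
    ' | sort -k2
}
```

It could be used as `netstat -an | count_states 10051`, for example from a loop that runs once a minute, to see whether SYN_RECV and CLOSE_WAIT grow before the server stops accepting connections.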
Because it was not possible to open new connections to the Zabbix server, I increased StartTrappers to 40. A few minutes later the server log filled with error messages like:
31964:20131013:143013.211 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update ids set nextid=nextid+1 where nodeid=0 and table_name='events' and field_name='eventid']
zabbix_server [31964]: ERROR [file:db.c,line:1582] Something impossible has just happened.
With StartTrappers=40 the Zabbix server becomes unusable almost instantly.
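To show how quickly the lock timeouts pile up after a StartTrappers change, the log can be summarized per minute. The following is a rough sketch, not a Zabbix tool: the function name `lock_timeouts_per_minute` is hypothetical, and it assumes every relevant line starts with the `PID:YYYYMMDD:HHMMSS.mmm` prefix seen in the log excerpt above.

```shell
# Hypothetical helper: count "[1205] Lock wait timeout" errors per minute.
# Reads zabbix_server log lines on stdin; assumes the
# "PID:YYYYMMDD:HHMMSS.mmm" prefix shown in the excerpt above.
lock_timeouts_per_minute() {
    awk -F: '
        # $2 is the date, $3 starts with HHMMSS; keep only HHMM
        /\[1205\]/ { n[$2 " " substr($3, 1, 4)]++ }
        END { for (k in n) print k, n[k] }
    ' | sort
}
```

Each output line has the form `YYYYMMDD HHMM count`. It would be fed from the server log, for example `lock_timeouts_per_minute < zabbix_server.log`, where the log path depends on the installation.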