[ZBX-4275] zabbix needs to wait a bit on databases instead of just closing down Created: 2011 Oct 25  Updated: 2017 May 30  Resolved: 2011 Nov 07

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.9
Fix Version/s: 1.8.9, 1.9.8 (beta)

Type: Incident report
Priority: Major
Reporter: Trever L. Adams
Assignee: Unassigned
Resolution: Fixed
Votes: 2
Labels: database, problems, startup
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Fedora 16

 Description   

On any system where the database is not guaranteed to be up before Zabbix server tries to start, Zabbix server will fail to start. An example of this is Fedora 16 with PostgreSQL and systemd.

I believe the fix is fairly simple. Zabbix should wait for a while if it cannot connect to the database. Maybe a loop of sorts with 250 ms between connection attempts and a maximum of 10 seconds of waiting?
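
The retry loop proposed here could look roughly like the following Python sketch. Zabbix itself is written in C and this is not its actual code: `wait_for_database`, `connect`, and the parameter names are hypothetical, and `connect` is assumed to raise `ConnectionError` while the database is unavailable.

```python
import time


def wait_for_database(connect, interval=0.25, max_wait=10.0):
    """Call connect() until it succeeds or max_wait seconds have passed.

    All names here are hypothetical. connect is any callable that returns
    a connection object or raises ConnectionError while the database is
    still unavailable.
    """
    deadline = time.monotonic() + max_wait
    while True:
        try:
            return connect()
        except ConnectionError:
            # Give up if another sleep would take us past the deadline;
            # the last connection error is re-raised to the caller.
            if time.monotonic() + interval > deadline:
                raise
            time.sleep(interval)
```

With the defaults this makes a connection attempt roughly every 250 ms and gives up after about 10 seconds, which are the values suggested in the description.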

Also, if the database goes down, Zabbix should try to reattach. (It may do this already.)

https://bugzilla.redhat.com/show_bug.cgi?id=729753 is the corresponding Red Hat Bugzilla report; it hasn't seen any traffic.



 Comments   
Comment by Oleksii Zagorskyi [ 2011 Oct 26 ]

A loop with a 250 ms interval is too short; 1 second would be OK (IMO). 10 seconds of total waiting is OK.

The same problem occurs under FreeBSD (zabbix installed from the Ports Collection), even when the zabbix_server init script contains "# REQUIRE: DAEMON mysql" and MySQL starts before zabbix_server. MySQL needs some additional time to create its socket, and in the end zabbix_server exits.

I added "sleep 5" before "run_rc_command "$1"" in the init script.
It's not a very good solution, but I don't see anything better.

I vote for this issue.

Comment by Aleksandrs Saveljevs [ 2011 Nov 07 ]

Available in development branch svn://svn.zabbix.com/branches/dev/ZBX-4275 .

Comment by dimir [ 2011 Nov 07 ]

Successfully tested.

Comment by Aleksandrs Saveljevs [ 2011 Nov 08 ]

Available in pre-1.8.9 in r23039 and in pre-1.9.8 in r23040.

Comment by dimir [ 2011 Nov 08 ]

I propose reopening this. As rich pointed out, when we start zabbix_server while the db is down, the process just hangs there, useless. The init script will report that the server is running (and it will be right), so the only way for a user to identify the problem is to look at the log. The problem would be more obvious if the process stopped after a timeout. So I suggest introducing a timeout, but a configurable one, as asaveljevs suggested: an administrator who knows that his db startup can take 1 minute would simply set that in the config file. The default could be 10 seconds.

Opinions?

Comment by Oleksii Zagorskyi [ 2011 Nov 08 ]

I vote for a timeout. Even 10 seconds hard-coded in the sources would be better than trying indefinitely.
A configurable option would be the best solution.

Comment by richlv [ 2011 Nov 08 ]

agreed. my biggest concern would be something like a failover solution, which would never spot such a server being down if it is using lsb initscripts

a silly question, does this affect proxy as well?

Comment by Aleksandrs Saveljevs [ 2011 Nov 09 ]

Note that if the database is down, then in the vast majority of cases users will not be able to use GUI, too. Since Zabbix without GUI is not very useful either, users will be able to figure out that something is wrong.

A less likely scenario that achieves the same problematic behavior at startup is also possible: Zabbix server connects to the database successfully, then database goes down when Zabbix server performs initial queries, before spawning any processes. Zabbix server would be no more useful in that case.

Yes, the change affected proxy, too. But note that for proxy there is no watchdog - if a proxy loses a connection to the database at any time, it is just as useless as if it does not have a connection to the database in the beginning.

The same logic applies for server, except that server has a watchdog. But, as mentioned above, users will be able to notice the unavailability of the database because their GUI will not work.

Comment by richlv [ 2011 Nov 09 ]

looking at the example of the failover again, users would not be looking at the gui - they would expect clustering solution to detect daemon as being down. and i guess that would be impossible...

i'm getting more and more convinced that there should be a timeout on waiting for the db - not just upon startup, but in general (users could set it to 0, if desired). that should also solve the issue of being able to connect to the db initially upon startup, but not afterwards

Comment by dimir [ 2011 Nov 09 ]

I like the idea of having a global timeout, so that all db connection issues are treated the same way, no matter whether they occur during server startup or during normal operation. With 0 as the timeout parameter, users instruct the server to wait for the db forever.
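
A minimal Python sketch of the semantics discussed here, where a timeout of 0 means "wait forever" and any positive value makes the caller give up. This is an illustration only, not Zabbix code: `connect_or_give_up` and its parameters are hypothetical names, and `connect` is assumed to raise `ConnectionError` while the db is down.

```python
import time


def connect_or_give_up(connect, timeout, retry_interval=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Retry connect() until it succeeds or `timeout` seconds have passed.

    Hypothetical helper. timeout == 0 means retry indefinitely. sleep and
    clock are injectable so the function can be exercised without real
    waiting.
    """
    start = clock()
    while True:
        try:
            return connect()
        except ConnectionError as exc:
            elapsed = clock() - start
            # A positive timeout bounds the total wait; 0 disables the bound.
            if timeout > 0 and elapsed + retry_interval > timeout:
                raise TimeoutError(
                    f"database unavailable after {elapsed:.1f}s"
                ) from exc
            sleep(retry_interval)
```

The same helper could be called both at startup and when a connection is lost later, so that both cases are handled identically. A caller that catches `TimeoutError` could then log the failure and exit, which would let an init script or cluster manager see the daemon as down.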

Comment by richlv [ 2011 Nov 14 ]

having thought over this during the weekend (of course), some summary :

1. cases when users are less likely to spot that zabbix is down :
1.1. cluster/ha/failover setup, when frontend is connected to a central database and thus they can access all data;
1.2. when frontend uses different access details for security reasons;
1.3. when zabbix is used more for alerting or during non-working hours

2. thus when/which users are more likely to be hit by this :
2.1. bigger users (who have ha or other availability solution set up);
2.2. when they least expect it - during the night and such

the problem is that detecting this in an automated fashion is extremely hard, if not impossible. we basically have to parse the logfile to detect that the daemon is not working, which isn't feasible in the initscript. currently i don't see how users could work around this issue

Comment by Oleksii Zagorskyi [ 2012 Jan 29 ]

This discussion is not finished, btw.
Currently we have at least 3 votes for global timeout.

Additionally, suppose I'm using a very simple shell script (a watchdog) which checks for the presence of a running "zabbix_server" process.
Currently my script will falsely conclude that the zabbix server daemon is working when it really isn't.
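
To illustrate the false positive, here is a small Python sketch (POSIX only) of a process-presence check next to a check that also asks whether the database is reachable. Both function names and the `db_reachable` callable are hypothetical and are not part of Zabbix.

```python
import os


def process_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def server_is_healthy(pid, db_reachable):
    """Hypothetical stricter check: process exists AND the db answers.

    db_reachable is a caller-supplied callable returning True or False.
    """
    return process_is_running(pid) and db_reachable()
```

A daemon that is stuck waiting for the database passes `process_is_running` but fails `server_is_healthy`, which is the gap described in this comment.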

I suggest reconsidering this issue and maybe filing a new ZBX(NEXT) report.

Comment by dimir [ 2012 Feb 03 ]

<rich> for example, in this training session two people misconfigured
password info in the server config file and started up server. they
saw db access error message, fixed it, then tried to start server
again (as they expected server to shutdown automatically). that
resulted in pidfile error messages, of course, which confused them
even more. we should really consider implementing a configurable
timeout for such cases.

Comment by dimir [ 2012 Feb 03 ]

Oleksiy, the one who knows the issue best should be the one to describe it: I vote for you.

Comment by Oleksii Zagorskyi [ 2012 Feb 06 ]

The problem discussed here has been reported as ZBX-4611.

Comment by Oleksii Zagorskyi [ 2012 Feb 06 ]

Just for the record:
quote from the summary -> " to wait a bit"
a result of this issue -> "to wait eternally"
