[ZBX-15602] SystemD "TimeoutSec=infinity" is bad without units dependency order Created: 2019 Feb 06 Updated: 2024 Apr 10 Resolved: 2021 Jul 21 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Packages (C) |
Affects Version/s: | 4.0.4 |
Fix Version/s: | 4.0.17rc1, 4.4.5rc1, 4.4 (plan), 5.0.0alpha1, 5.0 (plan) |
Type: | Problem report | Priority: | Trivial |
Reporter: | Tim White | Assignee: | Jurijs Klopovskis |
Resolution: | Fixed | Votes: | 11 |
Labels: | reboot, systemd, timeout | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Ubuntu 18.04.1 |
Sprint: | Sprint 56 (Sep 2019), Sprint 55 (Aug 2019), Sprint 54 (Jul 2019), Sprint 57 (Oct 2019), Sprint 58 (Nov 2019), Sprint 59 (Dec 2019), Sprint 60 (Jan 2020) | ||||||||||||||||||||
Story Points: | 0 |
Description |
Steps to reproduce:
Changing TimeoutSec in /lib/systemd/system/zabbix-server.service to something more sane than infinity would ensure that if the shutdown of zabbix-server does hang, systemd can kill it after a reasonable length of time, say 5 minutes. |
Comments |
Comment by Edgar Akhmetshin [ 2019 Feb 06 ] |
Hello Tim, Could you attach a log file that shows the Zabbix server shutdown in progress? Regards, |
Comment by Tim White [ 2019 Feb 06 ] |
I've attached the logs from where I believe the issue was (snipped the redundant 30 minutes). It seems that MySQL (MariaDB) was shut down before Zabbix-server, and so Zabbix-server keeps trying to reconnect for a while. That kind of makes this two issues. Firstly, we don't have our dependencies correct in the systemd file (we should rely on MySQL/MariaDB), so systemd doesn't know in which order to shut down and start up the services. Secondly, we should set a suitable timeout for when things do go wrong, as infinity is not a good default for any service. Also, I noticed at boot time that zabbix-server started before the database was ready. This really highlights the need for the dependencies in systemd to be correct.
Logs from startup showing we are just too quick trying to connect to the database:
1336:20190206:125535.289 using configuration file: /etc/zabbix/zabbix_server.conf
1336:20190206:125535.344 [Z3001] connection to database 'zabbix' failed: [2002] Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
1336:20190206:125535.344 database is down: reconnecting in 10 seconds
1336:20190206:125545.349 database connection re-established
1336:20190206:125545.352 current database version (mandatory/optional): 04000000/04000003
1336:20190206:125545.352 required mandatory version: 04000000
Proposed SystemD after changing dependencies and Timeout:
[Unit]
Description=Zabbix Server
After=syslog.target
After=network.target
After=mysql.service

[Service]
Environment="CONFFILE=/etc/zabbix/zabbix_server.conf"
EnvironmentFile=-/etc/default/zabbix-server
Type=forking
Restart=on-failure
PIDFile=/run/zabbix/zabbix_server.pid
KillMode=control-group
ExecStart=/usr/sbin/zabbix_server -c $CONFFILE
ExecStop=/bin/kill -SIGTERM $MAINPID
RestartSec=10s
TimeoutSec=300s

[Install]
WantedBy=multi-user.target
|
Comment by Edgar Akhmetshin [ 2019 Feb 06 ] |
Tim, What operating system is used? How were Zabbix and the database installed, and from which repository? Please show the output of the following command: sudo systemctl list-unit-files --type service --state enabled,generated; Regards, |
Comment by Tim White [ 2019 Feb 06 ] |
Ubuntu 18.04, installed from Zabbix repository deb packages.
$ apt-cache policy zabbix-server-mysql
zabbix-server-mysql:
  Installed: 1:4.0.4-1+bionic
  Candidate: 1:4.0.4-1+bionic
  Version table:
 *** 1:4.0.4-1+bionic 500
        500 http://repo.zabbix.com/zabbix/4.0/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status
     1:3.0.12+dfsg-1 500
        500 http://au.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
$ sudo systemctl list-unit-files --type service --state enabled,generated;
UNIT FILE                               STATE
accounts-daemon.service                 enabled
apache2.service                         enabled
apparmor.service                        enabled
apport.service                          generated
atd.service                             enabled
[email protected]                       enabled
avahi-daemon.service                    enabled
blk-availability.service                enabled
chrony.service                          enabled
chronyd.service                         enabled
console-setup.service                   enabled
cron.service                            enabled
dbus-org.freedesktop.Avahi.service      enabled
dbus-org.freedesktop.resolve1.service   enabled
ebtables.service                        enabled
gammu-smsd.service                      enabled
[email protected]                       enabled
grub-common.service                     generated
irqbalance.service                      enabled
iscsi.service                           enabled
keyboard-setup.service                  enabled
lvm2-monitor.service                    enabled
lxcfs.service                           enabled
lxd-containers.service                  enabled
mariadb.service                         enabled
mysql.service                           enabled
mysqld.service                          enabled
netfilter-persistent.service            enabled
networkd-dispatcher.service             enabled
ondemand.service                        enabled
open-iscsi.service                      enabled
open-vm-tools.service                   enabled
pollinate.service                       enabled
postfix.service                         enabled
rsync.service                           enabled
rsyslog.service                         enabled
salt-minion.service                     enabled
setvtrgb.service                        enabled
ssh.service                             enabled
sshd.service                            enabled
syslog.service                          enabled
systemd-resolved.service                enabled
systemd-timesyncd.service               enabled
ufw.service                             enabled
unattended-upgrades.service             enabled
ureadahead.service                      enabled
veeamservice.service                    generated
vgauth.service                          enabled
vnstat.service                          enabled
vnstatd.service                         enabled
zabbix-agent.service                    enabled
zabbix-server.service                   enabled
52 unit files listed. |
Comment by dimir [ 2019 Feb 06 ] |
This has already been discussed. Let me share some quotes from IRC:
<Richlv> dimir, what would be the desired thing to do when reaching the timeout ?
<Richlv> it seems like in case of zabbix just killing it wouldn't be a good idea anyway
<Richlv> think db upgrade
<Richlv> so you have to think about the longest db upgrade expected
<Richlv> which can easily be hours. so is it worth setting a timeout of, let's say 10 hours ? in old versions, people sometimes had to wait for days, so... maybe even a week ?
<Richlv> so at that point the timeout value becomes quite arbitrary
<Richlv> easier to set it to infinity and document that :)
<volter> dimir: The "starting in upgrade situations" is probably only relevant, if you don't run the server in the foreground, which I do in Fedora.
<volter> I wonder what the implicit defaults are anyway!
<volter> Defaults to DefaultTimeoutStartSec= from the manager configuration file
<volter> That's 90 seconds in my case.
<volter> Let's think about how bad it could be if you killed Zabbix on an upgrade: Probably not very bad.
<volter> The worst thing that can happen with the history syncer (which creates trends, if I'm not wrong), is: Not much
<volter> And if you need to shut it down, you'll lose some data anyway, unless it's buffered elsewhere.
<volter> I see no compelling reason.
<volter> Why might want to compare this to PG, for instance.
<volter> Where transactions could remain open for hours.
<volter> What is more: https://bugzilla.redhat.com/show_bug.cgi?id=1446015
<volter> "Configures the time to wait for stop. If a service is asked to stop, but does not terminate in the specified time, it will be terminated forcibly via SIGTERM, and after another timeout of equal duration with SIGKILL" even
<volter> (For the stopping part, of course)
<volter> I suggest to don't touch this at all.
<volter> I can't see a big problem.
<volter> I guess the only feasible problem is dataloss, when it comes to the "Stop" part.
<volter> And 90 seconds is a lot, plus, there are 90 more seconds.
<volter> And as far as the startup goes: I saw foreground as the solution, which is also easier for systemd to track and you are getting rid of that "duplicate" pidfile specification.
<volter> However, this has consequences on the logging.
<volter> The systemd journal will capture anything that's emmited through syslog, stdout and stderr.
<volter> See bugzilla ticket!
<volter> Experience tells me, if there is no good reason to change something: Don't.
<dimir> So what do you suggest for those that find 90 seconds not enough?
<volter> I would try to figure out why it's taking so long and if shot down, whether anything critical is happening.
<volter> Furthermore, everybody can easily put their own unit file in /etc/systemd/system to override what the vendor ships.
but I remember that some earlier RHEL7 does not support TimeoutSec=infinity
it depends on systemd version, process does not start with TimeoutSec=infinity
then, someone gave me an advice TimeoutSec=0 is same as infinity
so, I think TimeoutSec=infinity on debian/ubuntu, TimeoutSec=0 on RHEL
|
Comment by dimir [ 2019 Feb 06 ] |
There is a reason why there is TimeoutSec=Infinity, but we should document why it is so and how to overcome it. |
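For readers who want a bounded stop time anyway, one way to "overcome" the packaged value is a drop-in file rather than an edit to the vendor unit. The fragment below is only a sketch: it assumes the unit is named zabbix-server.service and that the drop-in sits at the path `systemctl edit zabbix-server` normally creates, and the 5-minute value is simply the one suggested in the description, not a tested recommendation.

```ini
# /etc/systemd/system/zabbix-server.service.d/override.conf  (assumed path)
[Service]
# Bound only the stop phase. The start phase keeps the packaged
# TimeoutSec=infinity, so a long database upgrade at start-up is not killed.
TimeoutStopSec=5min
```

If the file is written by hand instead of through `systemctl edit`, a `systemctl daemon-reload` is needed before the new value takes effect.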
Comment by Tim White [ 2019 Feb 08 ] |
I can see from the IRC logs that TimeoutSec=infinity is intentional, but possibly still misunderstood. SystemD won't wait forever: in the case of a shutdown on Ubuntu, it will force kill the service at 30 minutes even with TimeoutSec=infinity. However, regardless of that, we can actually fix this issue of a long shutdown by fixing the dependencies.
[Unit]
Description=Zabbix Server
After=syslog.target
After=network.target
After=mysql.service
The problem of the long shutdown was that MySQL was shut down before Zabbix, which is why Zabbix didn't exit. Other than documenting why TimeoutSec=infinity is used (ideally as a comment in the SystemD file too), we should fix the dependencies of the SystemD file. |
Comment by Denis Pantsyrev [ 2019 Jun 27 ] |
Still unresolved. Updating from 4.0.9 to 4.0.10 takes about 20 minutes. I fix it manually after each update, which is tedious. Please add the fixes from Tim White's comment. It's easy to fix, and it improves the product! Regards, |
Comment by Benoît Locher [ 2019 Jul 02 ] |
I had the same problem when rebooting my Debian server (Stretch 9.9): Zabbix service v4.0.10 would hang forever. Following Tim's advice, adding the line After=postgresql.service to the [Unit] section solved the problem. |
Comment by Tim White [ 2019 Jul 05 ] |
I don't think changing this to "Status: Needs documenting" is the right fix. As explained in my earlier comments, the fix is to ensure the dependencies are correct. Yes, documenting why we have a long timeout (even infinity if you guys still want it) is needed, but we really need to fix the dependency order. |
Comment by dimir [ 2019 Jul 15 ] |
timw_suqld, we can't depend on mysql service for 2 reasons:
Looks like documenting is the only thing we can do. |
Comment by Tim White [ 2019 Jul 16 ] |
We can use After with optional dependencies (https://unix.stackexchange.com/questions/423722/systemd-service-file-with-optional-dependency). Also, mariadb often provides an alias, so mysql.service is enough to catch MySQL and MariaDB. So something like the following will fix the dependencies without forcing users to use a particular SQL server, or to run it on the same server:
[Unit]
Description=Zabbix Server
After=syslog.target
After=network.target
Wants=mysql.service
After=mysql.service
Wants=postgresql.service
After=postgresql.service
I still think that TimeoutSec=infinity should be fixed (it really doesn't do what people think it does), but at least if you fix the dependencies, it's less likely to bite people trying to shut down or reboot servers. |
Comment by Marek Krolikowski [ 2019 Aug 12 ] |
Hey guys! I got the same problem with Zabbix 4.0.11 on Debian 10, and Tim got the fix right. timw_suqld, thanks!
Wants=mysql.service
After=mysql.service
Wants=postgresql.service
After=postgresql.service |
Comment by dimir [ 2019 Aug 13 ] |
This is all good, and in theory we could list all available MySQL flavors in the zabbix-server-mysql package:
Wants=mysql.service
Wants=mariadb.service
Wants=percona.service
But there's no way to detect which database (local or remote) the Zabbix server uses. So, even if you have MySQL running locally, there could be a situation where Zabbix server does not depend on it, simply because it uses a remote database. Sorry, there's no clear way that works for all situations to change anything here. I guess the best option for you currently is to use systemctl edit zabbix-server (https://askubuntu.com/questions/659267/how-do-i-override-or-configure-systemd-services). That is something we could document. |
Comment by richlv [ 2019 Aug 13 ] |
Having local MySQL but using a remote is an edge case, document it. |
Comment by Vladislavs Sokurenko [ 2019 Aug 13 ] |
related issue |
Comment by Glebs Ivanovskis [ 2019 Aug 13 ] |
I totally agree with richlv. In my understanding the majority of Zabbix installations will have Zabbix server/proxy and DB server on the same box. Among the rest, who use a dedicated DB server, Zabbix will likely run on a dedicated box as well. And the proposed change should not affect the very unlikely use case mentioned by dimir in any detrimental way. |
Comment by dimir [ 2019 Aug 13 ] |
Imagine you have a broken local installation of MySQL that you sometimes used for testing and that is not working anymore. Adding the proposed changes becomes a regression for such a setup. |
Comment by Glebs Ivanovskis [ 2019 Aug 13 ] |
Sorry, I can't push my imagination that far. This sounds like even more of an edge case, almost like "what if dinosaur comments on this ticket". |
Comment by richlv [ 2019 Aug 13 ] |
The said dinosaur could have hacked into the box and deliberately installed MySQL there to mess with the user. |
Comment by dimir [ 2019 Aug 13 ] |
Not many of us have big experience in packaging, very complex and interesting area. Not many imagine all the aspects of it, how tiny little change can break things for some users out there far away, yes, with the OS versions from dinosaur times, how different the installations are... Not many of us know and not many of us care. In my opinion the worst thing in packaging is regression. And I'm not interested in any details: if I have everything working for years and this upgrade breaks my installation - I become very desperate and I will not think of the software as stable any more. |
Comment by dimir [ 2019 Aug 13 ] |
Additional things to check the behavior if we are to modify something (thanks, kodai!):
|
Comment by Jackie Hunt [ 2019 Aug 22 ] |
I ran into this issue with postgresql. Please include it in any fix and/or documentation. |
Comment by Tim White [ 2019 Aug 23 ] |
This should prevent regressions. The main issue is startup/shutdown order. Currently, if the database engine shuts down before Zabbix, we end up with Zabbix unable to shut down correctly, and so the timeout is an issue. With Wants, if the service fails to start/stop, we get the same behavior as currently: we still try to start/stop Zabbix. This is the advantage of Wants over Requires in this situation. And with the After= tags, the order is defined. Regarding systemd version, I can't find a changelog entry for when it was added, but I see references more than 3 years old about using it, so I get the feeling it's been around a long time. |
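The distinction drawn here between ordering and requirement can be sketched as a unit fragment. This illustrates systemd's documented semantics only and is not the packaged unit file; mysql.service stands in for whichever database unit is actually installed.

```ini
[Unit]
# Ordering only: start Zabbix after the database. At shutdown the order is
# reversed, so Zabbix is stopped before the database.
After=mysql.service

# Weak dependency: starting Zabbix also requests a start of mysql.service,
# but Zabbix is still started if that unit fails to start.
Wants=mysql.service

# Strong dependency, commented out for contrast: combined with After=,
# Zabbix would not be started if mysql.service failed to start.
#Requires=mysql.service
```

After= on its own never pulls the database unit in; it only takes effect when both units are part of the same start or stop transaction, which is the case during a reboot.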
Comment by Vladislavs Sokurenko [ 2019 Sep 30 ] |
Zabbix server requires a running database; if the database is not available, the server cannot be shut down properly without losing collected history. That is why Zabbix server waits for the database to be up again. |
Comment by Tim White [ 2019 Oct 01 ] |
@Vladislavs, this is why the dependencies need to be fixed. If the dependencies are fixed, it'll ensure at shutdown that Zabbix Server shuts down BEFORE the database shuts down. When the server is shutting down, the database isn't going to come back up to allow Zabbix to shut down. Also, at some point you need to declare that data as lost: if you've not had a database available for, say, 10 minutes, it's probably not going to come back, so loss of data will occur. Given that Zabbix is trying to shut down anyway, it shouldn't be collecting new data, and so some loss of data at shutdown is acceptable. |
Comment by Adam Garrett [ 2019 Oct 02 ] |
I just started experiencing this issue as well today. Adding the line After=mysql.service resolved this issue. Thanks, Tim. |
Comment by Krasherwares [ 2019 Oct 09 ] |
When installing Zabbix using the debian-9.5.0-i386-xfce-CD-1.iso image, I got an error: The hang problem on reboot is gone. |
Comment by Marcel Wiechmann [ 2019 Oct 09 ] |
I'm also suffering from the same issue here, and editing the zabbix-server.service file fixed the problem. I only want to add that the documentation should mention the different possible values of After and Wants for a MySQL installation (mysql.service or mariadb.service). |
Comment by dimir [ 2019 Oct 14 ] |
Krasherwares, this issue tracker is international, please use only English language. |
Comment by Krasherwares [ 2019 Oct 16 ] |
dimir, no problem (the same in English): The devil is in the details. Just like Tim White said, you need to add After=mysql.service to the [Unit] section of the zabbix-server.service file. The hang problem on reboot is gone. |
Comment by tbsky [ 2019 Oct 22 ] |
The current packaging is broken for 99% of users. If you want to document something, document it for the 1% edge case, but please make the package work by default for the other 99% of users. |
Comment by Oleksii Zagorskyi [ 2019 Dec 08 ] |
I came here after my own frustration ... It was sad to discover that we don't have dependencies and that on OS reboot I had to wait ~30 minutes until systemd finally killed zabbix_server, which had lost its connection to mariadb and kept retrying to connect while ignoring the received SIGTERM entirely. As we see, while the issue summary is about TimeoutSec=infinity (or 0, for systemd version < 229), from the discussion it's clear that 99% of complaints would be resolved by just defining the order (not a dependency) of services on start and, most importantly, on termination! I did test this, as there were concerns, and can say for sure that adding just one line "After=mysql.service" resolves the issue. Interestingly, when I add "After=mysql.service", it actually controls "mariadb.service":
# systemctl show zabbix-server | grep "^After"
After=systemd-journald.socket mariadb.service fakee.service sysinit.target network.target syslog.target basic.target system.slice
That's because of the symlinks added. If "After=mysqld.service" is added too, then the command still shows only "mariadb.service" actually added. On Debian/Ubuntu, MariaDB creates these symlinks, so any unit file name may be used (mysql/mysqld/mariadb):
# systemctl enable mariadb
Created symlink /etc/systemd/system/mysql.service → /lib/systemd/system/mariadb.service.
Created symlink /etc/systemd/system/mysqld.service → /lib/systemd/system/mariadb.service.
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb.service → /lib/systemd/system/mariadb.service.
Because the mariadb unit has these sections:
[Install]
WantedBy=multi-user.target
Alias=mysql.service
Alias=mysqld.service
I've also checked Percona packages for different OS:
MySQL (not MariaDB):
Looks like "mysql" is compatible in most cases, except for RHEL8/MySQL v8.0, which uses "mysqld" only. For PostgreSQL it's much simpler: everywhere it's "postgresql.service" without aliases.
[Unit]
After=mysql.service
After=mysqld.service
After=postgresql.service
So, dimir, I ask you to do that. I'll update this issue's properties, as I'm pretty sure that's the proper thing to do. |
Comment by dimir [ 2019 Dec 09 ] |
It was decided to add After. |
Comment by Jurijs Klopovskis [ 2019 Dec 20 ] |
Fixed in 3.0.29, 4.0.16 & 4.4.4 releases. |
Comment by Oleksii Zagorskyi [ 2019 Dec 21 ] |
Just for the record: for %mysql% packages, all 3 services were added to "After": mysql, mysqld, mariadb. Anyway, THANK YOU ! |
Comment by Glebs Ivanovskis [ 2020 Mar 26 ] |
Similar issue: |
Comment by Glebs Ivanovskis [ 2020 Jul 12 ] |
|
Comment by dimir [ 2020 Jul 13 ] |
yurii, could you confirm the same logic was applied to PostgreSQL in 4.0.16, 4.4.4 and 5.0.0? As far as I can tell this issue was fixed for both MySQL*/PostgreSQL in packages. |
Comment by Jurijs Klopovskis [ 2020 Jul 13 ] |
We have
After=syslog.target
After=network.target
After=mysql.service
After=mysqld.service
After=mariadb.service
After=postgresql.service
in the service file. Debian-based distros typically include versions in systemd service names, thus presumably a simple After=postgresql.service directive will not cut it. Must investigate. |
Comment by Glebs Ivanovskis [ 2020 Jul 13 ] |
dimir, yurii, thank you for looking into it! Reporter of |
Comment by Александр Иванович Шабуров [ 2021 May 04 ] |
Hi! Here are the logs from running "systemctl stop zabbix-server" when rebooting the computer. Does zabbix-server ask postgres for a SIGHUP and smart shutdown? If so, why do the errors at 15:11:20.867 occur? This hangs zabbix-server when the computer shuts down. If zabbix-server is not the source of the SIGHUP and smart shutdown, how can I find out where this signal comes from? Thanks
"journalctl -u zabbix-server"
May 04 15:11:20 db-mon-wtc.microsoft.platina.ru systemd[1]: Stopping Zabbix Server...
.................................................
May 04 15:11:50 db-mon-wtc.microsoft.platina.ru systemd[1]: zabbix-server.service: Killing process 2097 (zabbix_server) with signal SIGKILL.
"journalctl -u pgpro"
May 04 15:11:50 db-mon-wtc.microsoft.platina.ru systemd[1]: Stopping PostgreSQL database server...
postgres log
2021-05-04 15:11:20.474 MSK [1611] LOG: received SIGHUP, reloading configuration files
..............................
2021-05-04 15:11:20.499 MSK [2121] FATAL: terminating connection due to administrator command
zabbix-server log at the same time
2042:20210504:151120.541 Got signal [signal:15(SIGTERM),sender_pid:2771,sender_uid:0,reason:0]. Exiting ...
2072:20210504:151120.868 database is down: reconnecting in 10 seconds
2066:20210504:151120.869 database is down: reconnecting in 10 seconds
2097:20210504:151120.878 database is down: reconnecting in 10 seconds
2072:20210504:151130.869 database is down: reconnecting in 10 seconds
2066:20210504:151130.869 database is down: reconnecting in 10 seconds
2097:20210504:151130.878 database is down: reconnecting in 10 seconds
2072:20210504:151140.869 database is down: reconnecting in 10 seconds
2066:20210504:151140.870 database is down: reconnecting in 10 seconds
2097:20210504:151140.878 database is down: reconnecting in 10 seconds
|
Comment by Jurijs Klopovskis [ 2021 May 05 ] |
Hi shab2, The issue is with the database being shut down before Zabbix server had time to sync data. To mitigate this problem we have added several After statements to the server and proxy systemd unit files.
[Unit]
Description=Zabbix Server
After=syslog.target
After=network.target
After=mysql.service
After=mysqld.service
After=mariadb.service
After=postgresql.service
After=pgbouncer.service
After=postgresql-9.4.service
After=postgresql-9.5.service
After=postgresql-9.6.service
After=postgresql-10.service
After=postgresql-11.service
After=postgresql-12.service
After=postgresql-13.service
This should cover most people. If this does not work for you, it is always possible to add a similar directive for the database server unit on your own using systemctl edit command. |
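As a sketch of that last suggestion, assuming the database unit is called pgpro.service (the name is taken from the `journalctl -u pgpro` call in the comment above; substitute whatever unit your database actually runs under), the drop-in opened by `systemctl edit zabbix-server` could contain:

```ini
# Drop-in for zabbix-server.service, assumed to be saved by systemctl edit as
# /etc/systemd/system/zabbix-server.service.d/override.conf
[Unit]
# Ordering only: start Zabbix after the database and, at shutdown,
# stop Zabbix before the database is stopped.
After=pgpro.service
```

The result can be checked with the command used earlier in this thread, `systemctl show zabbix-server | grep "^After"`, which should then list the database unit.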