Re: systemctl daemon-reexec forgets running services and starts everything new

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On 4 Apr 2023, at 09:33, Wasser, Erik <ewasser@xxxxxxxxxxxxx> wrote:
> 
> Hello List,
> 
> I'm posting this here because the official bugtracker at
> https://github.com/systemd/systemd/issues accepts only reports for new
> versions of systemd. This is a bug report for version 249.
> 
> # Our problem #
> 
> During a regular update of our container environment, `systemd` (and
> the related packages libpam-systemd, libsystemd0, libudev1,
> systemd-sysv and udev) were updated from `249.11-0ubuntu3.6` to
> `249.11-0ubuntu3.7`. We're talking only about Ubuntu 22.04. Our Ubuntu
> 20.04 is working fine with `systemctl daemon-reexec`.
> 
> In my opinion, the update was not the problem because we've tried
> downgrading and tried these versions: (current) `249.11-0ubuntu3.7`,
> `249.11-0ubuntu3.6`, `249.11-0ubuntu3.4` and `249.11-0ubuntu3.3`. The
> symptoms were the same. The last working version is version
> `245.4-4ubuntu3.19`.
> 
> # Symptoms #
> 
> The `/var/lib/dpkg/info/systemd.postinst` executes a `systemctl
> daemon-reexec` and that ended in a disaster.

Seems it is a packaging issue that ubuntu or debian needs to help you with not systemd?

Barry


> It seems that `systemd`
> is forgetting all it started children and tries to start nearly every
> configured service again. Naturally, the old services are still
> running, and the ports can't be opened twice and `systemd` won't give
> up. Here are some(!) of the logfiles:
> 
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Starting Create Volatile
> Files and Directories...
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: systemd-udevd.service: Found
> left-over process 130 (systemd-udevd) in control group while starting
> unit. Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: systemd-udevd.service: Found
> left-over process 31475 (systemd-udevd) in control group while
> starting unit. Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: systemd-udevd.service: Found
> left-over process 31476 (systemd-udevd) in control group while
> starting unit. Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> 
> And...
> 
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Reached target System Initialization.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily apt download activities.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily apt upgrade
> and clean activities.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily dpkg database
> backup timer.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Periodic ext4 Online
> Metadata Check for All Filesystems.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Condition check resulted in
> Discard unused blocks once a week being skipped.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily rotation of log files.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily man-db regeneration.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Message of the Day.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Clean PHP session
> files every 30 mins.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Update the plocate
> database daily.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Daily Cleanup of
> Temporary Directories.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Reached target Basic System.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: System is tainted: cgroupsv1
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Reached target Timer Units.
> 
> And...
> 
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Found left-over
> process 206 (atd) in control group while starting unit. Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Starting Deferred execution
> scheduler...
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: cron.service: Found
> left-over process 164 (cron) in control group while starting unit.
> Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started Regular background
> program processing daemon.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: dbus.service: Found
> left-over process 177 (dbus-daemon) in control group while starting
> unit. Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Started D-Bus System Message Bus.
> 
> And...
> 
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: rsyslog.service: Found
> left-over process 204 (rsyslogd) in control group while starting unit.
> Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Failed with
> result 'exit-code'.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Unit process
> 206 (atd) remains running after unit stopped.
> 
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found
> left-over process 382 (apache2) in control group while starting unit.
> Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found
> left-over process 392 (apache2) in control group while starting unit.
> Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found
> left-over process 397 (apache2) in control group while starting unit.
> Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: apache2.service: Found
> left-over process 3052 (apache2) in control group while starting unit.
> Ignoring.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: This usually indicates
> unclean termination of a previous run, or service implementation
> deficiencies.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Starting The Apache HTTP Server...
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: Stopped Deferred execution scheduler.
> Mar 31 12:51:39 FQDN_REDACTED systemd[1]: atd.service: Found left-over
> process 206 (atd) in control group while starting unit. Ignoring.
> 
> And...
> 
> Mar 31 12:51:40 FQDN_REDACTED sshd[31772]: error: Bind to port 22 on
> 0.0.0.0 failed: Address already in use.
> 
> And...
> 
> Mar 31 12:52:06 FQDN_REDACTED systemd[1]: Started The Salt Minion.
> Mar 31 12:52:06 FQDN_REDACTED salt-minion[32339]: The Salt Minion is shutdown.
> Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Main
> process exited, code=exited, status=1/FAILURE
> Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Failed
> with result 'exit-code'.
> Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Unit
> process 2808 (/opt/saltstack/) remains running after unit stopped.
> Mar 31 12:52:11 FQDN_REDACTED systemd[1]: salt-minion.service: Unit
> process 2848 (/opt/saltstack/) remains running after unit stopped.
> 
> Other internal `systemd` processes were started also again:
> 
> root 1 0.0 0.1 101204 12444 ? Ss 10:19 0:03 /lib/systemd/systemd -z
> --system --deserialize 16
> root 75 0.0 0.1 31440 13484 ? Ss 10:19 0:00 /lib/systemd/systemd-journald
> systemd+ 159 0.0 0.0 16124 8004 ? Ss 10:19 0:00 /lib/systemd/systemd-networkd
> message+ 177 0.0 0.0 8252 4440 ? Ss 10:19 0:00 @dbus-daemon --system
> --address=systemd: --nofork --nopidfile --systemd-activation
> --syslog-only
> root 205 0.0 0.0 14908 6464 ? Ss 10:19 0:00 /lib/systemd/systemd-logind
> systemd+ 223 0.0 0.1 25268 12592 ? Ss 10:19 0:00 /lib/systemd/systemd-resolved
> root 31424 0.0 0.1 31424 13636 ? Ss 12:51 0:00 /lib/systemd/systemd-journald
> systemd+ 31636 0.0 0.0 16124 6588 ? Ss 12:51 0:00 /lib/systemd/systemd-networkd
> message+ 31639 0.0 0.0 8124 3804 ? Ss 12:51 0:00 @dbus-daemon --system
> --address=systemd: --nofork --nopidfile --systemd-activation
> --syslog-only
> root 31682 0.0 0.0 14908 6480 ? Ss 12:51 0:00 /lib/systemd/systemd-logind
> systemd+ 31686 0.0 0.1 25268 12580 ? Ss 12:51 0:00 /lib/systemd/systemd-resolved
> root 32087 0.0 0.0 21436 5252 ? Ss 12:51 0:00 /lib/systemd/systemd-udevd
> 
> You can either kill all the old processes and restart them, and then
> everything is fine. Or you can reboot the container. Besides that
> `systemctl daemon-reexec` the `systemd` version is running fine.
> `systemctl daemon-reload` is working like a charm.
> 
> # Normal case #
> 
> In the normal case, a `systemctl daemon-reexec` just prints only a few lines:
> 
> Mar 31 14:21:58 FQDN_REDACTED systemd[1]: Reexecuting.
> Mar 31 14:21:58 FQDN_REDACTED systemd[1]: systemd 249.11-0ubuntu3.7
> running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK
> +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2
> +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY
> -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP
> +SYSVINIT default-hierarchy=unified)
> Mar 31 14:21:58 FQDN_REDACTED systemd[1]: Detected architecture x86-64.
> 
> # Testcase #
> 
> Doing a `systemctl daemon-reexec` and `ssh localhost` shows the
> problem. `systemctl` removes the directory `/run/sshd` during the
> reexec and `ssh` will refuse further connects because the directory is
> missing.
> 
> $ systemctl daemon-reexec
> $ ssh root@localhost
> kex_exchange_identification: read: Connection reset by peer
> Connection reset by 127.0.0.1 port 22
> $
> 
> Killing the old instance of SSH and restarting it will work.
> 
> # Some details to the hardware #
> 
> Our metal runs OpenVZ/Virtuozzo with this kernel (without any problems):
> 
>> Linux FQDN_REDACTED 3.10.0-1127.18.2.vz7.163.46 #1 SMP Fri Nov 20 21:47:55 MSK 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
> The container with the `systemctl daemon-reexec` problem reports the
> following kernel:
> 
> Linux FQDN_REDACTED 5.4.0 #1 SMP Thu Apr 22 16:18:59 MSK 2021 x86_64
> x86_64 x86_64 GNU/Linux
> 
> # Upshot #
> 
> * Can somebody help me with this issue?
> * Why is `systemctl` losing its internal state about the running
> processes/services?
> * Why is `systemctl` restarting everything?
> 
> -- 
> Erik Wasser
> 





[Index of Archives]     [LARTC]     [Bugtraq]     [Yosemite Forum]     [Photo]

  Powered by Linux