Re: systemctl stop going through timeout even though all processes have exited

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I recall it was a bit complex for systemd to get notified when the cgroup actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies a helper executable that the kernel runs... I wonder if that mechanism is broken on your system.

On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <martin@xxxxxxxxxx> wrote:
I'm seeing "systemctl stop <service>" for several services taking a
long time because it goes through the timeout process, even though all
relevant processes have exited.

I'll give 2 examples.  Both examples are running inside a privileged
Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
systemd version, reported by "systemctl --version" in the container
is:

  systemd 239 (239-74.el8_8.5)

Here is ctdb.system:

  [Unit]
  Description=CTDB
  Documentation=man:ctdbd(1) man:ctdb(7)
  After=network-online.target time-sync.target
  ConditionFileNotEmpty=/etc/ctdb/nodes

  [Service]
  Type=forking
  LimitCORE=infinity
  LimitNOFILE=1048576
  TasksMax=4096
  PIDFile=/var/run/ctdb/ctdbd.pid
  ExecStart=/usr/sbin/ctdbd
  ExecStop=/usr/bin/ctdb shutdown
  KillMode=control-group
  Restart=no

  [Install]
  WantedBy=multi-user.target

"/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
starting and then stopping using systemctl works fine.  However, many
times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
value then the duration reduces accordingly.  I can confirm using both
"ps auxfww" and "systemd-cgls" that within the container there are no
relevant processes a moment after "systemctl stop ctdb" is run.  In
particular, in systemd-cgls ctdb.service is gone but "systemctl stop
ctdb" is still waiting.

Before attempting to stop, the service is successfully started:

  Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing: /usr/sbin/ctdbd
  Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location file:/var/log/log.ctdb
  Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741 (ctdbd).
  Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to ctdb.service.
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited, code=exited status=0
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for state start.
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742 belongs to service, we are happy.
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running
  Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start finished, result=done
  Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
  -- Subject: Unit ctdb.service has finished start-up
  -- Defined-By: systemd
  -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
  --
  -- Unit ctdb.service has finished starting up.
  --
  -- The start-up result is done.

The relevant part of the log while stopping seems to be:

  Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743 (ctdb-eventd).
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited, code=exited, status=0/SUCCESS
  Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
  Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received EPOLLHUP on stored fd 18 (stored), closing.
  Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860 (ctdb).
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process exited, code=exited status=0
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for state stop.
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop -> stop-sigterm
  Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
  Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
  Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834 (ctdb_lock_helpe).
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835 (ctdb_lock_helpe).
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)
  Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839 belongs to ctdb.service.
  Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832 (n/a).
  Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got notification message from PID 840 (WATCHDOG=1)
  Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm' timed out. Killing.
  Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result 'timeout'.
  -- Subject: Unit failed
  -- Defined-By: systemd
  -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
  --
  -- The unit ctdb.service has entered the 'failed' state with result 'timeout'.
  Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed stop-sigterm -> failed
  Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
  Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job ctdb.service/stop finished, result=done
  Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB.
  -- Subject: Unit ctdb.service has finished shutting down
  -- Defined-By: systemd
  -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
  --
  -- Unit ctdb.service has finished shutting down.
  Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=59 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a
  Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed state.

It would be very useful if systemd could log what it is still waiting
for when it times out.

Note that during start and stop, CTDB runs a lot of subprocesses,
including some that use systemctl to start and stop various services
that it, in turn, manages.

The full debug level log, after running:

  systemd-analyze log-level debug

is uploaded it to:

  https://meltin.net/uploads/systemd/ctdb-stop.log

I'm happy to reply and attach it, but it is 48KB.

The only theory I can come up with is some sort of race where
processes are created during shutdown and systemd gets confused.

I see a similar thing for a much simpler service, winbind:

Here is winbind.service:

  [Unit]
  Description=Samba Winbind Daemon
  Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5)
  After=network.target nmb.service

  [Service]
  Type=notify
  PIDFile=/var/run/winbindd.pid
  EnvironmentFile=-/etc/sysconfig/samba
  ExecStart=/usr/sbin/winbindd --foreground --no-process-group $WINBINDOPTIONS
  ExecReload=/bin/kill -HUP $MAINPID
  LimitCORE=infinity

  [Install]
  WantedBy=multi-user.target

Yesterday I watched it do the same thing as CTDB.  I could start the
service by hand but it would time out during stop, nearly every time,
even though there were no relevant processes running anymore.
winbindd sends a READY=1 notification after successfully starting.  It
does not send STOPPING=1.  winbindd is much simpler during shutdown.
I can get logs for this one too if necessary.

Thanks for any help.

peace & happiness,
martin


--
Mantas Mikulėnas

[Index of Archives]     [LARTC]     [Bugtraq]     [Yosemite Forum]     [Photo]

  Powered by Linux