What value do you have in /sys/fs/cgroup/systemd/release_agent, as seen by systemd? Does it point to the correct executable?
Does e.g. forkstat (or execsnoop or similar) show that the executable is being run when the cgroup empties?
On Fri, Oct 13, 2023, 04:20 Martin Schwenke <martin@xxxxxxxxxx> wrote:
Hi Mantas,
Yes, it looks like cgroups v1.
Would this be a kernel bug? systemd bug?
Thanks...
peace & happiness,
martin
On Wed, 11 Oct 2023 08:19:59 +0300, Mantas Mikulėnas
<grawity@xxxxxxxxx> wrote:
> Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I
> recall it was a bit complex for systemd to get notified when the cgroup
> actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies
> a helper executable that the kernel runs... I wonder if that mechanism is
> broken on your system.
>
> On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <martin@xxxxxxxxxx> wrote:
>
> > I'm seeing "systemctl stop <service>" for several services taking a
> > long time because it goes through the timeout process, even though all
> > relevant processes have exited.
> >
> > I'll give 2 examples. Both examples are running inside a privileged
> > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host. The
> > systemd version, reported by "systemctl --version" in the container
> > is:
> >
> > systemd 239 (239-74.el8_8.5)
> >
> > Here is ctdb.system:
> >
> > [Unit]
> > Description=CTDB
> > Documentation=man:ctdbd(1) man:ctdb(7)
> > After=network-online.target time-sync.target
> > ConditionFileNotEmpty=/etc/ctdb/nodes
> >
> > [Service]
> > Type=forking
> > LimitCORE=infinity
> > LimitNOFILE=1048576
> > TasksMax=4096
> > PIDFile=/var/run/ctdb/ctdbd.pid
> > ExecStart=/usr/sbin/ctdbd
> > ExecStop=/usr/bin/ctdb shutdown
> > KillMode=control-group
> > Restart=no
> >
> > [Install]
> > WantedBy=multi-user.target
> >
> > "/usr/bin/ctdb shutdown" causes a controlled shutdown. In many cases,
> > starting and then stopping using systemctl works fine. However, many
> > times it takes >90s to stop, as per TimeoutStopSec. If I reduce that
> > value then the duration reduces accordingly. I can confirm using both
> > "ps auxfww" and "systemd-cgls" that within the container there are no
> > relevant processes a moment after "systemctl stop ctdb" is run. In
> > particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> > ctdb" is still waiting.
> >
> > Before attempting to stop, the service is successfully started:
> >
> > Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> > /usr/sbin/ctdbd
> > Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> > file:/var/log/log.ctdb
> > Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> > (ctdbd).
> > Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to
> > ctdb.service.
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited,
> > code=exited status=0
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for
> > state start.
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> > belongs to service, we are happy.
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start
> > finished, result=done
> > Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
> > -- Subject: Unit ctdb.service has finished start-up
> > -- Defined-By: systemd
> > -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
> > --
> > -- Unit ctdb.service has finished starting up.
> > --
> > -- The start-up result is done.
> >
> > The relevant part of the log while stopping seems to be:
> >
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743
> > (ctdb-eventd).
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file
> > /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,
> > code=exited, status=0/SUCCESS
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received
> > EPOLLHUP on stored fd 18 (stored), closing.
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860
> > (ctdb).
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process exited,
> > code=exited status=0
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for
> > state stop.
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop ->
> > stop-sigterm
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834
> > (ctdb_lock_helpe).
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835
> > (ctdb_lock_helpe).
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe) died
> > (code=exited, status=0/SUCCESS)
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839 belongs to
> > ctdb.service.
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832
> > (n/a).
> > Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got
> > notification message from PID 840 (WATCHDOG=1)
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm'
> > timed out. Killing.
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result
> > 'timeout'.
> > -- Subject: Unit failed
> > -- Defined-By: systemd
> > -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
> > --
> > -- The unit ctdb.service has entered the 'failed' state with result
> > 'timeout'.
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed stop-sigterm ->
> > failed
> > Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties
> > member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as
> > error-name=n/a error-message=n/a
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job ctdb.service/stop
> > finished, result=done
> > Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB.
> > -- Subject: Unit ctdb.service has finished shutting down
> > -- Defined-By: systemd
> > -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
> > --
> > -- Unit ctdb.service has finished shutting down.
> > Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager
> > member=JobRemoved cookie=59 reply_cookie=0 signature=uoss error-name=n/a
> > error-message=n/a
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed
> > state.
> >
> > It would be very useful if systemd could log what it is still waiting
> > for when it times out.
> >
> > Note that during start and stop, CTDB runs a lot of subprocesses,
> > including some that use systemctl to start and stop various services
> > that it, in turn, manages.
> >
> > The full debug level log, after running:
> >
> > systemd-analyze log-level debug
> >
> > is uploaded it to:
> >
> > https://meltin.net/uploads/systemd/ctdb-stop.log
> >
> > I'm happy to reply and attach it, but it is 48KB.
> >
> > The only theory I can come up with is some sort of race where
> > processes are created during shutdown and systemd gets confused.
> >
> > I see a similar thing for a much simpler service, winbind:
> >
> > Here is winbind.service:
> >
> > [Unit]
> > Description=Samba Winbind Daemon
> > Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5)
> > After=network.target nmb.service
> >
> > [Service]
> > Type=notify
> > PIDFile=/var/run/winbindd.pid
> > EnvironmentFile=-/etc/sysconfig/samba
> > ExecStart=/usr/sbin/winbindd --foreground --no-process-group
> > $WINBINDOPTIONS
> > ExecReload=/bin/kill -HUP $MAINPID
> > LimitCORE=infinity
> >
> > [Install]
> > WantedBy=multi-user.target
> >
> > Yesterday I watched it do the same thing as CTDB. I could start the
> > service by hand but it would time out during stop, nearly every time,
> > even though there were no relevant processes running anymore.
> > winbindd sends a READY=1 notification after successfully starting. It
> > does not send STOPPING=1. winbindd is much simpler during shutdown.
> > I can get logs for this one too if necessary.
> >
> > Thanks for any help.
> >
> > peace & happiness,
> > martin
> >
>
>
> --
> Mantas Mikulėnas