Re: systemctl stop going through timeout even though all processes have exited

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Mantas,

/sys/fs/cgroup/systemd/release_agent contains
/usr/lib/systemd/systemd-cgroups-agent.  This is inside the container.
I'm not sure if systemd (inside the container) would see anything
different.  I'm also not sure if this is correct.

I used:

  bpftrace -e 'tracepoint:syscalls:sys_enter_exec*{ printf("pid: %d, comm: %s, args: ", pid, comm); join(args->argv); }'

to trace exec()s.

Although I see other executables being run, though nothing related to
systemd, I don't see systemd-cgroups-agent being run.

Thanks again for any ideas...

peace & happiness,
martin

On Fri, 13 Oct 2023 12:01:33 +0300, Mantas Mikulėnas
<grawity@xxxxxxxxx> wrote:

> What value do you have in /sys/fs/cgroup/systemd/release_agent, as seen by
> systemd? Does it point to the correct executable?
> 
> Does e.g. forkstat (or execsnoop or similar) show that the executable is
> being run when the cgroup empties?
> 
> On Fri, Oct 13, 2023, 04:20 Martin Schwenke <martin@xxxxxxxxxx> wrote:
> 
> > Hi Mantas,
> >
> > Yes, it looks like cgroups v1.
> >
> > Would this be a kernel bug?  systemd bug?
> >
> > Thanks...
> >
> > peace & happiness,
> > martin
> >
> > On Wed, 11 Oct 2023 08:19:59 +0300, Mantas Mikulėnas
> > <grawity@xxxxxxxxx> wrote:
> >  
> > > Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks  
> > Docker), I  
> > > recall it was a bit complex for systemd to get notified when the cgroup
> > > actually empties – via /sys/fs/cgroup/systemd/release_agent that  
> > specifies  
> > > a helper executable that the kernel runs... I wonder if that mechanism is
> > > broken on your system.
> > >
> > > On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <martin@xxxxxxxxxx>  
> > wrote:  
> > >  
> > > > I'm seeing "systemctl stop <service>" for several services taking a
> > > > long time because it goes through the timeout process, even though all
> > > > relevant processes have exited.
> > > >
> > > > I'll give 2 examples.  Both examples are running inside a privileged
> > > > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
> > > > systemd version, reported by "systemctl --version" in the container
> > > > is:
> > > >
> > > >   systemd 239 (239-74.el8_8.5)
> > > >
> > > > Here is ctdb.system:
> > > >
> > > >   [Unit]
> > > >   Description=CTDB
> > > >   Documentation=man:ctdbd(1) man:ctdb(7)
> > > >   After=network-online.target time-sync.target
> > > >   ConditionFileNotEmpty=/etc/ctdb/nodes
> > > >
> > > >   [Service]
> > > >   Type=forking
> > > >   LimitCORE=infinity
> > > >   LimitNOFILE=1048576
> > > >   TasksMax=4096
> > > >   PIDFile=/var/run/ctdb/ctdbd.pid
> > > >   ExecStart=/usr/sbin/ctdbd
> > > >   ExecStop=/usr/bin/ctdb shutdown
> > > >   KillMode=control-group
> > > >   Restart=no
> > > >
> > > >   [Install]
> > > >   WantedBy=multi-user.target
> > > >
> > > > "/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
> > > > starting and then stopping using systemctl works fine.  However, many
> > > > times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
> > > > value then the duration reduces accordingly.  I can confirm using both
> > > > "ps auxfww" and "systemd-cgls" that within the container there are no
> > > > relevant processes a moment after "systemctl stop ctdb" is run.  In
> > > > particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> > > > ctdb" is still waiting.
> > > >
> > > > Before attempting to stop, the service is successfully started:
> > > >
> > > >   Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> > > > /usr/sbin/ctdbd
> > > >   Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> > > > file:/var/log/log.ctdb
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> > > > (ctdbd).
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process  
> > exited,  
> > > > code=exited status=0
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD  
> > for  
> > > > state start.
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> > > > belongs to service, we are happy.
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded:  
> > 710742  
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start ->  
> > running  
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job  
> > ctdb.service/start  
> > > > finished, result=done
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
> > > >   -- Subject: Unit ctdb.service has finished start-up
> > > >   -- Defined-By: systemd
> > > >   -- Support:  
> > https://lists.freedesktop.org/mailman/listinfo/systemd-devel  
> > > >   --
> > > >   -- Unit ctdb.service has finished starting up.
> > > >   --
> > > >   -- The start-up result is done.
> > > >
> > > > The relevant part of the log while stopping seems to be:
> > > >
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743
> > > > (ctdb-eventd).
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file
> > > > /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,
> > > > code=exited, status=0/SUCCESS
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > > > sender=org.freedesktop.systemd1 destination=n/a
> > > > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > > > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > > > cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a  
> > error-message=n/a  
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > > > sender=org.freedesktop.systemd1 destination=n/a
> > > > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > > > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > > > cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a  
> > error-message=n/a  
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received
> > > > EPOLLHUP on stored fd 18 (stored), closing.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860
> > > > (ctdb).
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process  
> > exited,  
> > > > code=exited status=0
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD  
> > for  
> > > > state stop.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop ->
> > > > stop-sigterm
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > > > sender=org.freedesktop.systemd1 destination=n/a
> > > > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > > > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > > > cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a  
> > error-message=n/a  
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > > > sender=org.freedesktop.systemd1 destination=n/a
> > > > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > > > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > > > cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a  
> > error-message=n/a  
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834
> > > > (ctdb_lock_helpe).
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835
> > > > (ctdb_lock_helpe).
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe)  
> > died  
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832
> > > > (n/a).
> > > >   Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got
> > > > notification message from PID 840 (WATCHDOG=1)
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm'
> > > > timed out. Killing.
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result
> > > > 'timeout'.
> > > >   -- Subject: Unit failed
> > > >   -- Defined-By: systemd
> > > >   -- Support:  
> > https://lists.freedesktop.org/mailman/listinfo/systemd-devel  
> > > >   --
> > > >   -- The unit ctdb.service has entered the 'failed' state with result
> > > > 'timeout'.
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed  
> > stop-sigterm ->  
> > > > failed
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal
> > > > sender=org.freedesktop.systemd1 destination=n/a
> > > > path=/org/freedesktop/systemd1  
> > interface=org.freedesktop.DBus.Properties  
> > > > member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as
> > > > error-name=n/a error-message=n/a
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job  
> > ctdb.service/stop  
> > > > finished, result=done
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB.
> > > >   -- Subject: Unit ctdb.service has finished shutting down
> > > >   -- Defined-By: systemd
> > > >   -- Support:  
> > https://lists.freedesktop.org/mailman/listinfo/systemd-devel  
> > > >   --
> > > >   -- Unit ctdb.service has finished shutting down.
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal
> > > > sender=org.freedesktop.systemd1 destination=n/a
> > > > path=/org/freedesktop/systemd1  
> > interface=org.freedesktop.systemd1.Manager  
> > > > member=JobRemoved cookie=59 reply_cookie=0 signature=uoss  
> > error-name=n/a  
> > > > error-message=n/a
> > > >   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed
> > > > state.
> > > >
> > > > It would be very useful if systemd could log what it is still waiting
> > > > for when it times out.
> > > >
> > > > Note that during start and stop, CTDB runs a lot of subprocesses,
> > > > including some that use systemctl to start and stop various services
> > > > that it, in turn, manages.
> > > >
> > > > The full debug level log, after running:
> > > >
> > > >   systemd-analyze log-level debug
> > > >
> > > > is uploaded it to:
> > > >
> > > >   https://meltin.net/uploads/systemd/ctdb-stop.log
> > > >
> > > > I'm happy to reply and attach it, but it is 48KB.
> > > >
> > > > The only theory I can come up with is some sort of race where
> > > > processes are created during shutdown and systemd gets confused.
> > > >
> > > > I see a similar thing for a much simpler service, winbind:
> > > >
> > > > Here is winbind.service:
> > > >
> > > >   [Unit]
> > > >   Description=Samba Winbind Daemon
> > > >   Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5)
> > > >   After=network.target nmb.service
> > > >
> > > >   [Service]
> > > >   Type=notify
> > > >   PIDFile=/var/run/winbindd.pid
> > > >   EnvironmentFile=-/etc/sysconfig/samba
> > > >   ExecStart=/usr/sbin/winbindd --foreground --no-process-group
> > > > $WINBINDOPTIONS
> > > >   ExecReload=/bin/kill -HUP $MAINPID
> > > >   LimitCORE=infinity
> > > >
> > > >   [Install]
> > > >   WantedBy=multi-user.target
> > > >
> > > > Yesterday I watched it do the same thing as CTDB.  I could start the
> > > > service by hand but it would time out during stop, nearly every time,
> > > > even though there were no relevant processes running anymore.
> > > > winbindd sends a READY=1 notification after successfully starting.  It
> > > > does not send STOPPING=1.  winbindd is much simpler during shutdown.
> > > > I can get logs for this one too if necessary.
> > > >
> > > > Thanks for any help.
> > > >
> > > > peace & happiness,
> > > > martin
> > > >  
> > >
> > >
> > > --
> > > Mantas Mikulėnas  
> >
> >
> >  





[Index of Archives]     [LARTC]     [Bugtraq]     [Yosemite Forum]     [Photo]

  Powered by Linux