Re: [RFC PATCH] cgroup: add cgroup.signal

Michal Koutný <mkoutny@xxxxxxxx> · Mon, 26 Apr 2021 16:42:45 +0200

Hello Christian,
I have some questions to understand the motivation here.

On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> - Signals are specified by writing the signal number into cgroup.signal.
>   An alternative would be to allow writing the signal name but I don't
>   think that's worth it. Callers like systemd can easily do a snprintf()
>   with the signal's define/enum.
> - Since signaling is a one-time event and we're holding cgroup_mutex()
>   as we do for freezer we don't need to worry about tasks joining the
>   cgroup while we're signaling the cgroup. Freezer needed to care about
>   this because a task could join or leave frozen/non-frozen cgroups.
>   Since we only support SIGKILL currently and SIGKILL works for frozen
>   tasks there's also not significant interaction with frozen cgroups.
> - Since signaling leads to an event and not a state change the
>   cgroup.signal file is write-only.
Have you considered accepting a cgroup fd to pidfd_send_signal and
realize this operation through this syscall? (Just asking as it may
prevent some of these consequences whereas bring other unclarities.)

> - Since we currently only support SIGKILL we don't need to generate a
>   separate notification and can rely on the unpopulated notification
>   meachnism. If we support more signals we can introduce a separate
>   notification in cgroup.events.
What kind of notification do you have in mind here?

> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
>   you have two tasks in different pid namespaces the cgroup would still
>   be frozen.
>   The cgroup.signal mechanism should consequently behave the same way,
>   i.e.  signal all processes and ignore in which pid namespace they
>   exist. This would obviously mean that if you e.g. had a task from an
>   ancestor pid namespace join a delegated cgroup of a container in a
>   child pid namespace the container can kill that task. But I think this
>   is fine and actually the semantics we want since the cgroup has been
>   delegated.
What do you mean by a delegated cgroup in this context?

> - We're holding the read-side of tasklist lock while we're signaling
>   tasks. That seems fine since kill(-1, SIGKILL) holds the read-side
>   of tasklist lock walking all processes and is a way for unprivileged
>   users to trigger tasklist lock being held for a long time. In contrast
>   it would require a delegated cgroup with lots of processes and a deep
>   hierarchy to allow for something similar with this interface.
I'd better not proliferate tasklist_lock users if it's avoidable (such
as freezer does).

> Fwiw, in addition to the system manager and container use-cases I think
> this has the potential to be picked up by the "kill" tool. In the future
> I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated
(OT: FTR, there's `systemctl kill` already ;-))

Michal
Attachment:
signature.asc

Description: Digital signature