Hello Christian, I have some questions to understand the motivation here. On Fri, Apr 23, 2021 at 07:13:51PM +0200, Christian Brauner <brauner@xxxxxxxxxx> wrote: > - Signals are specified by writing the signal number into cgroup.signal. > An alternative would be to allow writing the signal name but I don't > think that's worth it. Callers like systemd can easily do a snprintf() > with the signal's define/enum. > - Since signaling is a one-time event and we're holding cgroup_mutex() > as we do for freezer we don't need to worry about tasks joining the > cgroup while we're signaling the cgroup. Freezer needed to care about > this because a task could join or leave frozen/non-frozen cgroups. > Since we only support SIGKILL currently and SIGKILL works for frozen > tasks there's also not significant interaction with frozen cgroups. > - Since signaling leads to an event and not a state change the > cgroup.signal file is write-only. Have you considered accepting a cgroup fd to pidfd_send_signal and realize this operation through this syscall? (Just asking as it may prevent some of these consequences whereas bring other unclarities.) > - Since we currently only support SIGKILL we don't need to generate a > separate notification and can rely on the unpopulated notification > meachnism. If we support more signals we can introduce a separate > notification in cgroup.events. What kind of notification do you have in mind here? > - Freezer doesn't care about tasks in different pid namespaces, i.e. if > you have two tasks in different pid namespaces the cgroup would still > be frozen. > The cgroup.signal mechanism should consequently behave the same way, > i.e. signal all processes and ignore in which pid namespace they > exist. This would obviously mean that if you e.g. had a task from an > ancestor pid namespace join a delegated cgroup of a container in a > child pid namespace the container can kill that task. But I think this > is fine and actually the semantics we want since the cgroup has been > delegated. What do you mean by a delegated cgroup in this context? > - We're holding the read-side of tasklist lock while we're signaling > tasks. That seems fine since kill(-1, SIGKILL) holds the read-side > of tasklist lock walking all processes and is a way for unprivileged > users to trigger tasklist lock being held for a long time. In contrast > it would require a delegated cgroup with lots of processes and a deep > hierarchy to allow for something similar with this interface. I'd better not proliferate tasklist_lock users if it's avoidable (such as freezer does). > Fwiw, in addition to the system manager and container use-cases I think > this has the potential to be picked up by the "kill" tool. In the future > I'd hope we can do: kill -9 --cgroup /sys/fs/cgroup/delegated (OT: FTR, there's `systemctl kill` already ;-)) Michal
Attachment:
signature.asc
Description: Digital signature