Re: [RFC PATCH] cgroup: add cgroup.signal

Christian Brauner <christian.brauner@xxxxxxxxxx> · Wed, 28 Apr 2021 16:37:46 +0200

On Tue, Apr 27, 2021 at 10:29:00AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 27, 2021 at 11:36:06AM +0200, Christian Brauner wrote:
> > I thought about this optimization but (see below) given that it should
> > work with threaded cgroups we can't only walk thread-group leaders,
> > afaiu.
> 
> CSS_TASK_ITER_PROCS|CSS_TASK_ITER_THREADED iterates all thread group leaders
> in the threaded domain and is used to implement cgroup.procs. This should
> work, right?

Yes, I switched to that.

> 
> > > > @@ -4846,6 +4916,11 @@ static struct cftype cgroup_base_files[] = {
> > > > +	{
> > > > +		.name = "cgroup.signal",
> > > > +		.flags = CFTYPE_NOT_ON_ROOT,
> > > > +		.write = cgroup_signal_write,
> > > 
> > > I think this shouldn't be visible in threaded cgroups (or return an
> > > error when attempting to kill those).
> > 
> > I've been wondering about this too but then decided to follow freezer in
> > that regard. I think it should also be fine because a kill to a thread
> > will cause the whole thread-group to be taken down which arguably is the
> > semantics we want anyway.
> 
> I'd align it with cgroup.procs. Killing is a process-level operation (unlike
> arbitrary signal delivery which I think is another reason to confine this to
> killing) and threaded cgroups should be invisible to process-level
> operations.

Ok, so we make write to cgroup.kill in threaded cgroups EOPNOTSUPP which
is equivalent what a read on cgroup.procs would yield.

Tejun, Roman, Michal, I've been thinking a bit about the escaping
children during fork() when killing a cgroup and I would like to propose
we simply take the write-side of threadgroup_rwsem during cgroup.kill.

This would give us robust protection against escaping children during
fork() since fork()ing takes the read-side already in cgroup_can_fork().
And cgroup.kill should be sufficiently rare that this isn't an
additional burden.

Other ways seems more fragile where the child can potentially escape
being killed. The most obvious case is when CLONE_INTO_CGROUP is not
used. If a cgroup.kill is initiated after cgroup_can_fork() and after
the parent's fatal_signal_pending() check we will wait for the parent to
release the siglock in cgroup_kill(). Once it does we deliver the fatal
signal to the parent. But if we haven't passed cgroup_post_fork() fast
enough the child can be placed into that cgroup right after the kill.
That's not super bad per se since the child isn't technically visible in
the target cgroup anyway but it feels a bit cleaner if it would die
right away. We could minimize the window by raising a flag say CGRP_KILL
say:

static void __cgroup_kill(struct cgroup *cgrp)
{
       spin_lock_irq(&css_set_lock);
       set_bit(CGRP_KILL, &cgrp->flags);
       spin_unlock_irq(&css_set_lock);

       css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
       while ((task = css_task_iter_next(&it)))
               /* KILL IT */

       spin_lock_irq(&css_set_lock);
       clear_bit(CGRP_KILL, &cgrp->flags);
       spin_unlock_irq(&css_set_lock);

but that's also not clean enough for my taste. So I think the
threadgroup_rwsem might be quite clean and acceptable?

static void cgroup_kill(struct cgroup *cgrp)
{
       struct cgroup_subsys_state *css;
       struct cgroup *dsct;

       lockdep_assert_held(&cgroup_mutex);

       percpu_down_write(&cgroup_threadgroup_rwsem);
       cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
               __cgroup_kill(dsct);
       percpu_up_write(&cgroup_threadgroup_rwsem);
}