Re: Maybe a race window in cgroup.kill?

Shakeel Butt <shakeel.butt@xxxxxxxxx> · Wed, 29 Jan 2025 11:08:41 -0800

On Fri, Jan 24, 2025 at 11:33:07AM -1000, Tejun Heo wrote:
> Hello, Christian.
> 
> I was looking at cgroup.kill implementation and wondering whether there
> could be a race window. So, __cgroup_kill() does the following:
> 
>  k1. Set CGRP_KILL.
>  k2. Iterate tasks and deliver SIGKILL.
>  k3. Clear CGRP_KILL.
> 
> The copy_process() does the following:
> 
>  c1. Copy a bunch of stuff.
>  c2. Grab siglock.
>  c3. Check fatal_signal_pending().
>  c4. Commit to forking.
>  c5. Release siglock.
>  c6. Call cgroup_post_fork() which puts the task on the css_set and tests
>      CGRP_KILL.
> 
> The intention seems to be that either a forking task gets SIGKILL and
> terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
> don't see what guarantees that k3 can't happen before c6. ie. After a
> forking task passes c5, k2 can take place and then before the forking task
> reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
> What am I missing?
> 
> Thanks.

I think this is indeed the race though small. One way to fix this is by
taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the
fork side takes it in read mode from cgroup_can_fork() to
cgroup_post_fork(). Though I think we should avoid that as this adds
one more potential stall scenario for cgroup.kill which is usually
triggered under extreme situation (memory pressure). I have prototyped a
sequence number based approach below. If that is acceptable then I can
proposed the patch with detailed commit message.