On Fri, Jan 24, 2025 at 11:33:07AM -1000, Tejun Heo wrote: > Hello, Christian. > > I was looking at cgroup.kill implementation and wondering whether there > could be a race window. So, __cgroup_kill() does the following: > > k1. Set CGRP_KILL. > k2. Iterate tasks and deliver SIGKILL. > k3. Clear CGRP_KILL. > > The copy_process() does the following: > > c1. Copy a bunch of stuff. > c2. Grab siglock. > c3. Check fatal_signal_pending(). > c4. Commit to forking. > c5. Release siglock. > c6. Call cgroup_post_fork() which puts the task on the css_set and tests > CGRP_KILL. > > The intention seems to be that either a forking task gets SIGKILL and > terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I > don't see what guarantees that k3 can't happen before c6. ie. After a > forking task passes c5, k2 can take place and then before the forking task > reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. > What am I missing? > > Thanks. I think this is indeed the race though small. One way to fix this is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork side takes it in read mode from cgroup_can_fork() to cgroup_post_fork(). Though I think we should avoid that as this adds one more potential stall scenario for cgroup.kill which is usually triggered under extreme situation (memory pressure). I have prototyped a sequence number based approach below. If that is acceptable then I can proposed the patch with detailed commit message.