Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes: > On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote: >> Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes: >> >> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote: >> >> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx> >> >> >> >> Currently ptrace_stop() / do_signal_stop() rely on the special states >> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this >> >> state exists only in task->__state and nowhere else. >> >> >> >> There's two spots of bother with this: >> >> >> >> - PREEMPT_RT has task->saved_state which complicates matters, >> >> meaning task_is_{traced,stopped}() needs to check an additional >> >> variable. >> >> >> >> - An alternative freezer implementation that itself relies on a >> >> special TASK state would loose TASK_TRACED/TASK_STOPPED and will >> >> result in misbehaviour. >> >> >> >> As such, add additional state to task->jobctl to track this state >> >> outside of task->__state. >> >> >> >> NOTE: this doesn't actually fix anything yet, just adds extra state. >> >> >> >> --EWB >> >> * didn't add a unnecessary newline in signal.h >> >> * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up >> >> instead of in signal_wake_up_state. This prevents the clearing >> >> of TASK_STOPPED and TASK_TRACED from getting lost. >> >> * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared >> > >> > Hi Eric, Peter, >> > >> > On s390 this patch triggers warning at kernel/ptrace.c:272 when >> > kill_child testcase from strace tool is repeatedly used (the source >> > is attached for reference): >> > >> > while :; do >> > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child >> > done >> > >> > It normally takes few minutes to cause the warning in -rc3, but FWIW >> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of >> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace. >> > >> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't >> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we >> > observe a child in __TASK_TRACED state. Could you please comment here? >> > >> >> For clarity the warning is that the child is not in __TASK_TRACED state. >> >> The code is waiting for the code to stop in the scheduler in the >> __TASK_TRACED state so that it can safely read and change the >> processes state. Some of that state is not even saved until the >> process is scheduled out so we have to wait until the process >> is stopped in the scheduler. > > So I assume (checked actually) the return 0 below from kernel/sched/core.c: > wait_task_inactive() is where it bails out: > > 3303 while (task_running(rq, p)) { > 3304 if (match_state && unlikely(READ_ONCE(p->__state) != match_state)) > 3305 return 0; > 3306 cpu_relax(); > 3307 } > > Yet, the child task is always found in __TASK_TRACED state (as seen > in crash dumps): > >> 101447 11342 13 ce3a8100 RU 0.0 10040 4412 strace > 101450 101447 0 bb04b200 TR 0.0 2272 1136 kill_child > 108261 101447 2 d0b10100 TR 0.0 2272 532 kill_child > crash> task bb04b200 __state > PID: 101450 TASK: bb04b200 CPU: 0 COMMAND: "kill_child" > __state = 8, > > crash> task d0b10100 __state > PID: 108261 TASK: d0b10100 CPU: 2 COMMAND: "kill_child" > __state = 8, That is weird. >> At least on s390 it looks like there is a race between SIGKILL and >> ptrace_check_attach. That isn't good. >> >> Reading the code below there is something missing because I don't see >> anything making ptrace calls, and ptrace_check_attach (which contains >> the warning) only happens in the ptrace syscall. > > That is what I believe strace does when calling that code: > > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child Thank you. That was my braino. I will have to see if it reproduces for me on x86 (I don't have an s390). Perhaps if I can reproduce it I can guess what is going wrong. So far it appears WARN_ON_ONCE has nothing to warn about yet it is warning. Eric