On Sat, Jun 25, 2022 at 11:34:46AM -0500, Eric W. Biederman wrote: > I haven't gotten as far as reproducing this but I have started giving > this issue some thought. > > This entire thing smells like a memory barrier is missing somewhere. > However by definition the lock implementations in linux provide all the > needed memory barriers, and in the ptrace_stop and ptrace_check_attach > path I don't see cases where these values are sampled outside of a lock > except in wait_task_inactive. Does doing that perhaps require a > barrier? > > The two things I can think of that could shed light on what is going on > is enabling lockdep, to enable the debug check in signal_wake_up_state > and verifying bits of state that should be constant while the task > is frozen for ptrace are indeed constant when task is frozen for ptrace. > Something like my patch below. > > If you could test that when you have a chance that would help narrow > down what is going on. > > Thank you, > Eric > > diff --git a/kernel/ptrace.c b/kernel/ptrace.c > index 156a99283b11..6467a2b1c3bc 100644 > --- a/kernel/ptrace.c > +++ b/kernel/ptrace.c > @@ -268,9 +268,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) > } > read_unlock(&tasklist_lock); > > - if (!ret && !ignore_state && > - WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED))) > + if (!ret && !ignore_state) { > + WARN_ON_ONCE(!(child->jobctl & JOBCTL_PTRACE_FROZEN)); > + WARN_ON_ONCE(!(child->joctctl & JOBCTL_TRACED)); > + WARN_ON_ONCE(READ_ONCE(child->__state) != __TASK_TRACED); > + WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)); > ret = -ESRCH; > + } > > return ret; > } I modified your chunk a bit - hope that is what you had in mind: diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 156a99283b11..f0e9a9a4d63c 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -268,9 +268,19 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) } read_unlock(&tasklist_lock); - if (!ret && !ignore_state && - WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED))) - ret = -ESRCH; + if (!ret && !ignore_state) { + unsigned int __state; + + WARN_ON_ONCE(!(child->jobctl & JOBCTL_PTRACE_FROZEN)); + WARN_ON_ONCE(!(child->jobctl & JOBCTL_TRACED)); + __state = READ_ONCE(child->__state); + if (__state != __TASK_TRACED) { + pr_err("%s(%d) __state %x", __FUNCTION__, __LINE__, __state); + WARN_ON_ONCE(1); + } + if (WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED))) + ret = -ESRCH; + } return ret; } When WARN_ON_ONCE(1) hits the child __state is always zero/TASK_RUNNING, as reported by the preceding pr_err(). Yet, in the resulting core dump it is always __TASK_TRACED. Removing WARN_ON_ONCE(1) while looping until (__state != __TASK_TRACED) confirms the unexpected __state is always TASK_RUNNING. It never observed more than one iteration and gets printed once in 30-60 mins. So probably when the condition is entered __state is TASK_RUNNING more often, but gets overwritten with __TASK_TRACED pretty quickly. Which kind of consistent with my previous observation that kernel/sched/core.c:3305 is where return 0 makes wait_task_inactive() fail. No other WARN_ON_ONCE() hit ever.