On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote: > Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes: > > > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote: > >> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > >> > >> Currently ptrace_stop() / do_signal_stop() rely on the special states > >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this > >> state exists only in task->__state and nowhere else. > >> > >> There's two spots of bother with this: > >> > >> - PREEMPT_RT has task->saved_state which complicates matters, > >> meaning task_is_{traced,stopped}() needs to check an additional > >> variable. > >> > >> - An alternative freezer implementation that itself relies on a > >> special TASK state would loose TASK_TRACED/TASK_STOPPED and will > >> result in misbehaviour. > >> > >> As such, add additional state to task->jobctl to track this state > >> outside of task->__state. > >> > >> NOTE: this doesn't actually fix anything yet, just adds extra state. > >> > >> --EWB > >> * didn't add a unnecessary newline in signal.h > >> * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up > >> instead of in signal_wake_up_state. This prevents the clearing > >> of TASK_STOPPED and TASK_TRACED from getting lost. > >> * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared > > > > Hi Eric, Peter, > > > > On s390 this patch triggers warning at kernel/ptrace.c:272 when > > kill_child testcase from strace tool is repeatedly used (the source > > is attached for reference): > > > > while :; do > > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child > > done > > > > It normally takes few minutes to cause the warning in -rc3, but FWIW > > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of > > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace. > > > > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't > > fail") suggests this WARN_ON_ONCE() is not really expected, yet we > > observe a child in __TASK_TRACED state. Could you please comment here? > > > > For clarity the warning is that the child is not in __TASK_TRACED state. > > The code is waiting for the code to stop in the scheduler in the > __TASK_TRACED state so that it can safely read and change the > processes state. Some of that state is not even saved until the > process is scheduled out so we have to wait until the process > is stopped in the scheduler. So I assume (checked actually) the return 0 below from kernel/sched/core.c: wait_task_inactive() is where it bails out: 3303 while (task_running(rq, p)) { 3304 if (match_state && unlikely(READ_ONCE(p->__state) != match_state)) 3305 return 0; 3306 cpu_relax(); 3307 } Yet, the child task is always found in __TASK_TRACED state (as seen in crash dumps): > 101447 11342 13 ce3a8100 RU 0.0 10040 4412 strace 101450 101447 0 bb04b200 TR 0.0 2272 1136 kill_child 108261 101447 2 d0b10100 TR 0.0 2272 532 kill_child crash> task bb04b200 __state PID: 101450 TASK: bb04b200 CPU: 0 COMMAND: "kill_child" __state = 8, crash> task d0b10100 __state PID: 108261 TASK: d0b10100 CPU: 2 COMMAND: "kill_child" __state = 8, > At least on s390 it looks like there is a race between SIGKILL and > ptrace_check_attach. That isn't good. > > Reading the code below there is something missing because I don't see > anything making ptrace calls, and ptrace_check_attach (which contains > the warning) only happens in the ptrace syscall. That is what I believe strace does when calling that code: strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child > Eric > > > > > Thanks! > > > > /* > > * Check for the corner case that previously lead to segfault > > * due to an attempt to access unitialised tcp->s_ent. > > * > > * 13994 ????( <unfinished ...> > > * ... > > * 13994 <... ???? resumed>) = ? > > * > > * Copyright (c) 2019 The strace developers. > > * All rights reserved. > > * > > * SPDX-License-Identifier: GPL-2.0-or-later > > */ > > > > #include "tests.h" > > > > #include <sched.h> > > #include <signal.h> > > #include <unistd.h> > > #include <sys/mman.h> > > #include <sys/wait.h> > > > > #define ITERS 10000 > > #define SC_ITERS 10000 > > > > int > > main(void) > > { > > volatile sig_atomic_t *const mem = > > mmap(NULL, get_page_size(), PROT_READ | PROT_WRITE, > > MAP_SHARED | MAP_ANONYMOUS, -1, 0); > > if (mem == MAP_FAILED) > > perror_msg_and_fail("mmap"); > > > > for (unsigned int i = 0; i < ITERS; ++i) { > > mem[0] = mem[1] = 0; > > > > const pid_t pid = fork(); > > if (pid < 0) > > perror_msg_and_fail("fork"); > > > > if (!pid) { > > /* wait for the parent */ > > while (!mem[0]) > > ; > > /* let the parent know we are running */ > > mem[1] = 1; > > > > for (unsigned int j = 0; j < SC_ITERS; j++) > > sched_yield(); > > > > pause(); > > return 0; > > } > > > > /* let the child know we are running */ > > mem[0] = 1; > > /* wait for the child */ > > while (!mem[1]) > > ; > > > > if (kill(pid, SIGKILL)) > > perror_msg_and_fail("kill"); > > if (wait(NULL) != pid) > > perror_msg_and_fail("wait"); > > } > > > > return 0; > > }