On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:

<snip>

> 2. Weird strace behaviour across pidns boundary
>
> When strace'ing (with -ff) lxc-start, I get a proper strace for the
> directly spawned process and the container init. However, any processes
> spawned by the container's init are not straced properly (I get two
> empty files, named <foo>.<pid-in-root-ns> and <foo>.2 -- presumably pid
> inside the container). The container also seems to malfunction under
> strace (looks like exec() failing as lxc-ps shows two "init" processes).
>
> This is quite painful as it prevents strace'ing processes in containers
> even after startup. Here's a snippet of strace'ing a bash (pid 179
> inside, pid 2959 outside) trying to run 'ls'. The shell hangs until I
> kill the strace process.
>
> pipe([3, 4]) = 0
> clone(Process 197 attached
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7859708) = 197
> Process 2999 attached (waiting for parent)
> [pid 2959] setpgid(197, 197) = 0
> [pid 2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid 2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid 2959] close(3) = 0
> [pid 2959] close(4) = 0
> [pid 2959] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
> [pid 2959] ioctl(255, TIOCSPGRP, [197]) = 0
> [pid 2959] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
> [pid 2959] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid 2959] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid 2959] waitpid(-1, Process 2959 suspended
> ^C <unfinished ...>
> Process 2959 detached
> Process 197 detached
> Process 2999 detached
>
> 'strace ls' ran completely inside the container works as expected.

I'm surprised strace of ls works across pid namespaces. I've been looking
at strace and it seemed to me that one kernel change and a bunch of strace
changes are needed to make strace'ing in child pid namespaces work. Eric
Biederman's setns() patches also might help.
Can you get a little farther with the kernel fix below?

Fix incorrect pid namespace used by ptrace during fork/vfork/clone

Pid namespaces are not used properly by ptrace in do_fork(). When tracing,
parent != real_parent because parent is the tracing task. Yet the pid in
the real_parent's namespace is being used in do_fork():

	nr = task_pid_vnr(p); /* uses real_parent's pid namespace */

	if (clone_flags & CLONE_PARENT_SETTID)
		put_user(nr, parent_tidptr); /* "real_parent_tidptr" */
	...
	tracehook_report_clone_complete(trace, regs,
					clone_flags, nr, p); /* ptrace broken */

	if (clone_flags & CLONE_VFORK) {
		freezer_do_not_count();
		wait_for_completion(&vfork);
		freezer_count();
		tracehook_report_vfork_done(p, nr); /* ptrace broken */

In this case re-using the value in nr is wrong.

This bug can be seen by attaching to an already-running task in a
descendent namespace with strace -f. When the traced task forks, strace
won't attach to the new task properly because it sees the incorrect pid.

For example, if root is running on two VTs and root@VTN# indicates
switching to VT N:

	root@VT1# ns_exec -cp /bin/bash
	root@VT1# echo $$
	1
	root@VT2# strace -f -e fork,vfork,clone -p <pid of bash>
	Process 14518 attached - interrupt to quit
	root@VT1# /bin/bash
	<stops -- new bash shell does not respond to input>
	root@VT2# clone(Process 15 attached ... ) = 15
	Process 15044 attached (waiting for parent)
	Process 14518 suspended
	<no more output>
	<hit ctrl-c>
	root@VT1# echo $$
	15

strace sees the pid of the new process to attach to as 15 when it should
really be attaching to pid 15044. Interestingly enough, it does also attach
to 15044 later, but since the initial attach failed it does not properly
resume the traced task. (I assume wait() helped here -- it reported 15044
and hence strace is aware that 15044 exists -- I haven't read the strace
code to confirm this.)
Miscellaneous Notes re: ptrace and pid namespaces (Documentation/* fodder?):

Note that if the tracer detaches and a tracer from a different ancestor
pid namespace attaches we'll have the wrong pid number again. The only
way to fix that is to have ptrace hold a reference to a struct pid so long
as it may be needed for PTRACE_GETEVENTMSG.

The only way it's possible to ptrace a task outside the tracer's pid
namespace is if the already-tracing task enters a new descendent pid
namespace:

	tracer      tracer does                 .
	   \     => clone(CLONE_NEWPID) =>     / \
	  tracee                          tracer  tracee

In this case the pids returned by PTRACE_GETEVENTMSG will be 0. Since
attaching to tasks that aren't in descendent namespaces is not possible,
this is a very unlikely problem to encounter.

Signed-off-by: Matt Helsley <matthltc@xxxxxxxxxx>
Cc: Roland McGrath <roland@xxxxxxxxxx> (MAINTAINERS: ptrace)
Cc: Oleg Nesterov <oleg@xxxxxxxxxx> (MAINTAINERS: ptrace)
Cc: <utrace folks>
Cc: Sukadev Bhattiprolu <sukadev@xxxxxxxxxx> (pid ns)
Cc: containers@xxxxxxxxxxxxxxxxxxxxxxxxxx (pid ns)
Cc: linux-kernel@xxxxxxxxxxxxxxx

diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
 	 */
 	if (!IS_ERR(p)) {
 		struct completion vfork;
+		int ptrace_pid_vnr;
 
 		trace_sched_process_fork(current, p);
 
@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
 			wake_up_new_task(p, clone_flags);
 		}
 
+		ptrace_pid_vnr = nr;
+		if (unlikely(p->parent != p->real_parent)) {
+			rcu_read_lock();
+			ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+			rcu_read_unlock();
+		}
 		tracehook_report_clone_complete(trace, regs,
-						clone_flags, nr, p);
+						clone_flags,
+						ptrace_pid_vnr, p);
 
 		if (clone_flags & CLONE_VFORK) {
 			freezer_do_not_count();
 			wait_for_completion(&vfork);
 			freezer_count();
-			tracehook_report_vfork_done(p, nr);
+			tracehook_report_vfork_done(p, ptrace_pid_vnr);
 		}
 	} else {
 		nr = PTR_ERR(p);
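
For anyone who wants to poke at the idea outside the kernel, here is a
rough userspace sketch of the pid selection the patch changes. It is only
a toy model, not kernel code: struct toy_pid, struct toy_task and the
toy_* functions are made-up names, and it assumes a single chain of nested
pid namespaces identified by depth, so a task's pid is just an array
indexed by level. toy_pid_nr_ns() is only a loose analogue of
task_pid_nr_ns().

```c
#include <assert.h>

#define TOY_MAX_LEVEL 4

/*
 * Toy stand-in for struct pid (hypothetical, not the kernel layout):
 * one pid number per nesting level, numbers[0] being the pid as seen
 * from the root pid namespace.
 */
struct toy_pid {
	int level;			/* deepest level the task is visible in */
	int numbers[TOY_MAX_LEVEL];
};

struct toy_task {
	struct toy_pid pid;
	int ns_level;				/* level of the task's own pid namespace */
	const struct toy_task *parent;		/* the tracer while the task is traced */
	const struct toy_task *real_parent;	/* the task that actually forked it */
};

/*
 * Loose analogue of task_pid_nr_ns(): the pid of t as seen from the
 * namespace at the given level, or 0 if t is not visible from there.
 */
int toy_pid_nr_ns(const struct toy_task *t, int level)
{
	assert(level >= 0 && level < TOY_MAX_LEVEL);
	if (level > t->pid.level)
		return 0;
	return t->pid.numbers[level];
}

/* Pre-patch behaviour: always report the pid in real_parent's namespace. */
int toy_event_pid_old(const struct toy_task *child)
{
	return toy_pid_nr_ns(child, child->real_parent->ns_level);
}

/*
 * Patched behaviour: if the task is traced (parent != real_parent),
 * report the pid as seen from the tracer's namespace instead.
 */
int toy_event_pid_new(const struct toy_task *child)
{
	int nr = toy_event_pid_old(child);

	if (child->parent != child->real_parent)
		nr = toy_pid_nr_ns(child, child->parent->ns_level);
	return nr;
}
```

Plugging in the numbers from the session above (a child that is pid 15
inside the container and pid 15044 in the root namespace, forked by the
shell at level 1 and traced from level 0), toy_event_pid_old() gives 15
and toy_event_pid_new() gives 15044. With no tracer attached, parent and
real_parent are the same task and both functions give 15.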