Re: [PATCH] procfs: Do not release pid_ns->proc_mnt too early

Louis Rilling <Louis.Rilling@xxxxxxxxxxx> · Mon, 21 Jun 2010 16:15:18 +0200

On 21/06/10  5:58 -0700, Eric W. Biederman wrote:
> Louis Rilling <Louis.Rilling@xxxxxxxxxxx> writes:
> 
> > On 18/06/10 18:27 +0200, Oleg Nesterov wrote:
> >> On 06/18, Louis Rilling wrote:
> >> >
> >> > On 17/06/10 23:36 +0200, Oleg Nesterov wrote:
> >> > > On 06/17, Eric W. Biederman wrote:
> >> > > >
> >> > > > The task->children isn't changed until __unhash_process() which runs
> >> > > > after flush_proc_task().
> >> > >
> >> > > Yes. But this is only the current implementation detail.
> >> > > It would be nice to cleanup the code so that EXIT_DEAD tasks are
> >> > > never sit in ->children list.
> >> > >
> >> > > > So we should be able to come up with
> >> > > > a variant of do_wait() that zap_pid_ns_processes can use that does
> >> > > > what we need.
> >> > >
> >> > > See above...
> >> > >
> >> > > Even if we modify do_wait() or add the new variant, how the caller
> >> > > can wait for EXIT_DEAD tasks? I don't think we want to modify
> >> > > release_task() to do __wake_up_parent() or something similar.
> >> >
> >> > Indeed, I was thinking about calling __wake_up_parent() from release_task()
> >> > once parent->children becomes empty.
> >> >
> >> > Not sure about the performance impact though. Maybe some WAIT_NO_CHILDREN flag
> >> > in parent->signal could limit it. But if EXIT_DEAD children are removed from
> >> > ->children before release_task(), I'm afraid that this becomes impossible.
> >> 
> >> Thinking more, even the current do_wait() from zap_pid_ns_processes()
> >> is not really good. Suppose that some none-init thread is ptraced, then
> >> zap_pid_ns_processes() will hange until the tracer does do_wait() or
> >> exits.
> >
> > Is this really a bad thing? If somebody ptraces a task in a pid namespace, that
> > sounds reasonable to have this namespace (and it's init task) pinned.
> 
> Louis.  Have you seen this problem hit without my setns patch?

Yes. I hit it with Kerrighed patches. I also have an ugly reproducer on
2.6.35-rc3 (see attachments). Ugly because I introduced artifical delays
in release_task(). I couldn't trigger the bug without it, probably because the
scheduler is too kind :)

I'm using memory poisoining (SLAB and DEBUG_SLAB) to make it easy to observe the
bug.

Example:
# ./proc_flush_task-bug-reproducer 1

> 
> I'm pretty certain that this hits because there are processes do_wait
> does not wait for, in particular processes in a disjoint process tree.

Indeed do_wait() misses EXIT_DEAD children.

> 
> So at this point I am really favoring killing the do_wait and making
> this all asynchronous.

Any idea about how to do it?

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <signal.h>
#include <linux/sched.h>

int pipefd[2];

int init(void *arg)
{
	int nr, i, err;
	sighandler_t sigret;
	char c;

	close(pipefd[0]);

	err = setsid();
	if (err < 0) {
		perror("setsid");
		abort();
	}

	sigret = signal(SIGCHLD, SIG_IGN);
	if (sigret == SIG_ERR) {
		fprintf(stderr, "signal\n");
		abort();
	}
	
	nr = atoi(arg);
	for (i = 0; i < nr; i++) {
		err = fork();
		if (err < 0) {
			perror("fork");
			abort();
		} else if (err == 0) {
			printf("%d before\n", getpid());
			fflush(stdout);
			pause();
			printf("%d after\n", getpid());
			fflush(stdout);
			return 0;
		}
	}

	err = write(pipefd[1], &c, 1);
	if (err != 1) {
		perror("write");
		abort();
	}

	pause();

	return 0;
}

int main(int argc, char *argv[])
{
	long stack_size = sysconf(_SC_PAGESIZE);
	void *stack = alloca(stack_size) + stack_size;
	pid_t pid;
	char c;
	int ret;

	ret = pipe(pipefd);
	if (ret) {
		perror("pipe");
		abort();
	}

	ret = clone(init, stack, CLONE_NEWPID | SIGCHLD, argv[1]);
	if (ret < 0) {
		perror("clone");
		abort();
	}
	pid = ret;
	printf("%d\n", pid);
	fflush(stdout);

	close(pipefd[1]);
	ret = read(pipefd[0], &c, 1);
	if (ret != 1) {
		if (ret) {
			perror("read");
			abort();
		} else {
			sleep(5);
		}
	}

	printf("killing %d\n", pid);
	fflush(stdout);
	ret = kill(-pid, SIGKILL);
	if (ret) {
		perror("kill");
		abort();
	}

	return 0;
}
commit 7b7cae6ae5c543b8e9cc84fc041d9bce36e7b674
Author: Louis Rilling <louis.rilling@xxxxxxxxxxx>
Date:   Wed Jun 16 16:20:02 2010 +0200

    proc_flush_task() debug

diff --git a/kernel/exit.c b/kernel/exit.c
index ceffc67..be8cdb0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -169,6 +169,14 @@ repeat:
 	atomic_dec(&__task_cred(p)->user->processes);
 	rcu_read_unlock();
 
+	if (task_pid(p)->level > 0) {
+		if (!thread_group_leader(p) || !is_container_init(p)) {
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_timeout(10 * HZ);
+		}
+		printk("release_task: %d/%d\n", p->pid, task_pid(p)->numbers[1].nr);
+	}
+
 	proc_flush_task(p);
 
 	write_lock_irq(&tasklist_lock);
Attachment:
signature.asc

Description: Digital signature
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers