Andrea, We have been running a leak in child pid namespaces and some early debugging points to the following commit: >> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821 >> Author: Andrea Arcangeli <andrea@xxxxxxx> >> Date: Mon Feb 4 22:29:21 2008 -0800 >> Reverting the commit seems to fix the leak but we need to do some more analysis (like the lstat() question Daniel has). However I have a basic question regarding the commit - the log mentions: > do_exit->release_task->mark_inode_dirty_sync->schedule() (will never > come back to run journal_stop) But release_task() calls shrink_dcache_parent() for a _procfs_ dentry. Does journal_stop() apply to procfs also ? Thanks, Sukadev Daniel Lezcano [dlezcano@xxxxxxxxxx] wrote: > Sukadev Bhattiprolu wrote: >> Daniel Lezcano [dlezcano@xxxxxxxxxx] wrote: >>> Sukadev Bhattiprolu wrote: >>>> Still digging through some traces, but below I have some questions >>>> that I am still trying to answer. >>>> >>>>> I am not sure what you mean by 'struct pids' but what I observed is: >>>> Ok, I see that too. If pids leak, then pid-namespace will leak too. >>>> Do you see any leaks in proc_inode_cache ? >>> Yes, right. It leaks too. >> >> Ok, some progress... >> >> Can you please verify these observations: >> >> - If the container exits normally, the leak does not seem to happen. >> (i.e reduce your sleep 3600 to say sleep 3 and remove the lxc-stop). >> >> - Revert the following commit and check if the leak happens: >> >> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821 >> Author: Andrea Arcangeli <andrea@xxxxxxx> >> Date: Mon Feb 4 22:29:21 2008 -0800 >> >> (this commit added the check for PF_EXITING in proc_flush_task_mnt >> loosely explained below). > > > >> Incomplete analysis :-) >> >> If the container-init is terminated (by the lxc-stop), the container zaps >> other processes in the container and waits for them. The leak happens in >> this case. >> >> Following sequence of events occur: >> >> - container-init calls do_exit and sets PF_EXITING (in exit_signals()) >> >> - container init calls zaps_pid_ns_processes() (exit_notify / >> forget_orignal_parent() / find_new_reaper()) >> >> - In zap_pid_ns_processes() container-init sends SIGKILL to >> descendants and calls sys_wait(). >> >> - The sys_wait() is expected to call release_task() which calls >> proc_flush_task_mnt(). >> >> - proc_flush_task_mnt() looks up the dentry for the pid (2 in >> our example) and finds the dentry. >> >> But since container-init is itself exiting (i.e PF_EXITING is >> set) it does NOT call the shrink_dcache_parent(), but, >> interestingly calls d_drop() and dput(). >> >> Now the d_drop() unhashes the dentry for the pid 2. >> >> - proc_flush_task_mnt() then tries to find the dentry for the >> tgid of the process. In our case, the tgid == pid == 2 and >> we just unhashed the dentry for "2". >> >> So, we don't find the dentry for the leader either (and hence >> don't make the second shrink_dcache_parent() call in >> proc_flush_task_mnt() either). >> >> Without a call to shrink_dcache_parent(), the proc inode >> for the process that was terminated by container init is >> not deleted (i.e we don't call proc_delete_inode() or >> the put_pid() inside it) causing us to leak proc_inodes, >> struct pid and hence struct pid_namespace. > > Ouch ! > > Nice analysis :) > > Following your explanation I was able to reproduce a simple program > added in attachment. But there is something I do not understand is why > the leak does not appear if I do the 'lstat' (cf. test program) in the > pid 2 context. > > >> There should be a better fix, but first please confirm if reverting the >> above commit fixes the leak for you also. > > I confirm the leak does no longer appear when reverting this patch. > > Thanks > -- Daniel | #include <stdio.h> | #include <unistd.h> | #include <stdlib.h> | #include <sys/prctl.h> | #include <sys/param.h> | #include <sys/stat.h> | #include <sys/poll.h> | #include <signal.h> | #include <sched.h> | | #ifndef CLONE_NEWPID | # define CLONE_NEWPID 0x20000000 | #endif | | int child(void *arg) | { | pid_t pid; | struct stat s; | | if (mount("proc", "/proc", "proc", 0, NULL)) { | perror("mount"); | return -1; | } | | pid = fork(); | if (pid < 0) { | perror("fork"); | return -1; | } | | if (!pid) { | poll(0, 0 , -1); | exit(-1); | } | | poll(0, 0, -1); | | return 0; | } | | pid_t clonens(int (*fn)(void *), void *arg, int flags) | { | long stack_size = sysconf(_SC_PAGESIZE); | void *stack = alloca(stack_size) + stack_size; | return clone(fn, stack, flags | SIGCHLD, arg); | } | | int main(int argc, char *argv[]) | { | pid_t pid; | struct stat s; | char path[MAXPATHLEN]; | | pid = clonens(child, NULL, CLONE_NEWNS|CLONE_NEWPID); | if (pid < 0) { | perror("clone"); | return -1; | } | | /* yes ugly.*/ | sleep(1); | | /* !! assumption : child of my child is pid + 1 | * any reliable simple solution is welcome :) */ | snprintf(path, sizeof(path), "/proc/%d/exe", pid + 1); | | if (lstat(path, &s)) { | perror("lstat"); | exit(-1); | } | | if (kill(pid, SIGKILL)) { | perror("kill"); | return -1; | } | | return 0; | } _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers