On Tue, Apr 07, 2020 at 12:29:34PM +0900, Yun Levi wrote: > sock_prot_inuse_add+0x10/0x20 > __sock_release+0x44/0xc0 > sock_close+0x14/0x20 > __fput+0x8c/0x1b8 > ____fput+0xc/0x18 > task_work_run+0xa8/0xd8 > do_exit+0x2e4/0xa50 > do_group_exit+0x34/0xc8 > get_signal+0xd4/0x600 > do_signal+0x174/0x268 > do_notify_resume+0xcc/0x110 > work_pending+0x8/0x10 > Code: b940c821 f940c000 d538d083 8b010800 (b8606861) ---[ end trace > 0b98c9ccbfd9f6fd ]--- > Date/Time : 02-14-0120 06:35:47 Kernel panic - not syncing: Fatal > exception in interrupt SMP: stopping secondary CPUs Kernel Offset: > disabled CPU features: 0x0,21806000 Memory Limit: none > > What I saw is when I try to bind on some mount point to > /proc/{pid}/ns/net which made by child process, That's doesn't > increase the netns' refcnt. Why would it? Increase of netns refcount should happen when you follow /proc/*/ns/net and stay for as long as nsfs inode is alive, not by cloning that mount. And while we are at it, you don't need to bind it anywhere in order to call setns() - just open the sucker and then feed the resulting descriptor to setns(2). No mount --bind involved and if child exiting between open() and setns() would free netns, we would have exact same problem. > And when the child process's going to exit, it frees the netns But > Still bind mount point's inode's private data point to netns which was > freed by child when it exits. Look: we call ns_get_path(), which calls ns_get_path_cb(), which calls the callback passed to it (ns_get_path_task(), in this case), which grabs a reference to ns. Then we pass that reference to __ns_get_path(). __ns_get_path() looks for dentry stashed in ns. If there is one and it's still alive, we grab a reference to that dentry and drop the reference to ns - no new inodes had been created, so no new namespace references have appeared. Existing inode is pinned by dentry and dentry is pinned by _dentry_ reference we've got. If dentry is not there or is already in the middle of being destroyed, we allocate a new inode, stash our namespace reference into it, create a dentry referencing that new inode, stash it into namespace and return that dentry. Without dropping namespace reference we'd obtained in ns_get_path_task() - it went into new inode. If inode or dentry creation fails (out of memory), we drop what we'd obtained (namespace if inode creation fails, just the inode if dentry creation fails - namespace reference that went into inode will be dropped by inode destructor in the latter case) and return an error. If somebody else manages to get through the entire thing while we'd been allocating stuff and we see _their_ dentry already stashed into namespace, we just drop our dentry (its destructor will drop inode reference, which will lead to inode destructor, which will drop namespace reference) and bugger off with -EAGAIN. The caller (ns_get_path_cb()) will retry the entire thing. The invariant to be preserved here is "each of those inodes holds a reference to namespace for as long as the inode exists". ns_get_path() increments namespace refcount if and only if it has allocated an nsfs inode. If it grabs an existing dentry instead, the namespace refcount remains unchanged. Anyway, I would really like to see .config and C (or shell) reproducer. Ideally - .config that could be run in a KVM.