Krister Johansen <kjlx@xxxxxxxxxxxxxxxxxx> writes:

> Gents,
> This is the follow-up e-mail I referenced in our discussion about the
> put_mountpoint locking problem.
>
> The problem manifested itself as a situation where our container
> provisioner would sometimes fail to restart a container to which it had
> made configuration changes. The IP address chosen by the provisioner
> was still in use in another container. This meant that the system had a
> network namespace with an IP address that was still in use, despite the
> provisioner having torn down the container as part of the reconfig
> operation.
>
> In order to keep the network namespace in use while the container is
> alive, the software bind mounts the net and user namespaces out of
> /proc/<pid>/ns/ into a directory that's used as the top level for the
> container instance.
>
> After forcing a crash dump and looking through the results, I was able
> to confirm that the only reference keeping the net namespace alive was
> the one held by the dentry on the mountpoint for the nsfs mount of the
> network namespace. The problem was that the container software had
> unmounted this mountpoint, so it wasn't even in the host container's
> mount namespace.
>
> Since the software was using shared mounts, the nsfs bind mount was
> getting copied into the mount namespaces of any container that was
> created after the nsfs bind mount was established. However, this isn't
> obvious, because each new namespace executes a pivot_root(2), followed
> by an immediate umount2(MNT_DETACH) on the old part of the root
> filesystem that is no longer in use. These mounts of the nsfs bind
> mount weren't visible in the kernel debugger, because they'd been
> detached from the mount namespace's mount tree.
>
> After looking at how iproute handles net namespaces, I ran a test where
> every unmount of the net nsfs bind mount was followed by an rm of that
> mountpoint. That always resulted in the mountpoint getting freed and
> the refcount on the dentry going to zero. It was enough for me to make
> forward progress on the other tasks at hand.
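
For concreteness, the bind-mount scheme and the umount-plus-rm workaround
described above look roughly like the sketch below; the /run/container path
and the sleep process are placeholders for illustration, not the
provisioner's actual layout:

    # Keep a net namespace alive by bind-mounting its nsfs entry out of /proc.
    unshare --net sleep 1000 &       # stand-in for the container's init process
    NSPID=$!
    mkdir -p /run/container/demo
    touch /run/container/demo/net
    mount --bind /proc/$NSPID/ns/net /run/container/demo/net

    # Teardown.  Per the report, when a MNT_LOCKED copy of this mount lingers
    # in a detached tree elsewhere, umount alone does not release the nsfs
    # inode; removing the mountpoint afterwards (the same umount-then-unlink
    # pattern iproute2 uses for "ip netns delete") forces the cleanup.
    umount /run/container/demo/net
    rm /run/container/demo/net
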
> I was able to verify that the nsfs refcount was getting dropped, and we
> were going through the __detach_mounts() cleanup path:
>
> rm 14633 [013] 29947.047071: probe:nsfs_evict: (ffffffff81254fb0)
>         7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
>         7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
>         7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
>         7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
>         7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
>         7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
>         7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
>         7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
>         7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
>         7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
>         7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
>         7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
>         7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
>         7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
>         7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
>         7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
>         7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
>                e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
>
> Over the holiday, I had some more time to debug this and was able to
> narrow it down to the following case.
>
> 1. The mount namespace that gets a copy of the nsfs bind mount must be
>    created in a different user namespace than the host container. This
>    causes MNT_LOCKED to get set on the cloned mounts.
>
> 2. In the container, pivot_root(2) and then umount2(MNT_DETACH) the old
>    part of the tree from pivot_root. Ensure that the nsfs mount is
>    beneath the root of this tree.
>
> 3. Umount the nsfs mount in the host container. If the mount wasn't
>    locked in the other container, you'll see a kprobe on nsfs_evict
>    trigger immediately. If it was MNT_LOCKED, then you'll need to rm
>    the mountpoint in the host to trigger the nsfs_evict.
>
> For a nsfs mount, it's not particularly problematic to have to rm the
> mount to clean it up, but the other mounts in the tree that are
> detached and locked are often on mountpoints that can't be easily rm'd
> from the host. These are harder to clean up, and essentially orphaned
> until the container's mount ns goes away.
>
> It would be ideal if we could release these mounts sooner, but I'm
> unsure of the best approach here.
>
> Debugging further, I was able to see that:
>
> a) The reason the nsfs mount isn't considered as part of
>    propagate_mount_unlock() is that the 'mnt' passed to that function
>    is the top of the mount tree, and it appears to only consider mounts
>    directly related to 'mnt'.
>
> b) The change_mnt_propagation(MS_PRIVATE) at the end of the while loop
>    in umount_tree() is what ends up hiding these mounts from the host
>    container. Once they're no longer slaved or shared, we never again
>    consider them as candidates for unlocking.
>
> c) Also note that these detached mounts that aren't freed aren't
>    charged against a container's ns->mounts limit, so it may be
>    possible for a mount ns to be using more mounts than it has
>    officially accounted for.
>
> I wondered if a naive solution could re-walk the list of mounts
> processed in umount_tree() and, if all of the detached but locked
> mounts had a refcount indicating they're unused, unlock and unmount
> them.
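
The probe:nsfs_evict call chain quoted above looks like perf script output
from a dynamic kprobe; a capture along the following lines (the exact
commands are an assumption, not taken from the report) would produce that
kind of trace:

    # Arm a kprobe on nsfs_evict, record call graphs system-wide while the
    # umount/rm sequence runs, then dump the samples.
    perf probe --add nsfs_evict
    perf record -e probe:nsfs_evict -a -g -- sleep 60
    perf script
    perf probe --del nsfs_evict
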
> At least in the case of the containers I'm dealing with, the container
> software should be ensuring that nothing in the container has a
> reference on anything that's under the detached portion of the tree.
> However, there's probably a better way to do this.
>
> Thoughts?

Any chance you have a trivial reproducer script?

From your description I don't quite see the problem. I know where to
look, but if you could give a script that reproduces the conditions you
see, that would make it easier for me to dig into, and it would
certainly remove ambiguity. Ideally such a script would be runnable
under unshare -Urm for easy repeated testing.

Eric
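
Based on the three conditions in the quoted report, a reproducer skeleton
might look roughly like the following; this is an untested sketch with
placeholder paths, run as root, not a script from the original thread:

    #!/bin/sh
    set -e
    WORK=/tmp/nsfs-repro
    mkdir -p $WORK/root/old_root

    # Net namespace kept alive by an nsfs bind mount in the "host" container.
    unshare --net sleep 1000 &
    NSPID=$!
    touch $WORK/ns-net
    mount --make-rshared /                  # shared mounts, as in the report
    mount --bind /proc/$NSPID/ns/net $WORK/ns-net

    # "Container": a new user+mount namespace, so the copied mounts are
    # MNT_LOCKED there; pivot_root, then detach the old tree, which still
    # holds a copy of the nsfs bind mount.
    unshare -Urm sh -c "
        mount --bind $WORK/root $WORK/root
        cd $WORK/root
        pivot_root . old_root
        umount -l /old_root
        sleep 1000
    " &
    CONTPID=$!
    sleep 1                                 # crude synchronization

    # Host side: unmount the nsfs bind mount.  With a kprobe on nsfs_evict
    # armed, the report says the eviction only fires once the mountpoint
    # itself is removed.
    umount $WORK/ns-net
    rm $WORK/ns-net

    kill $NSPID $CONTPID 2>/dev/null || true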