On Thu, Jan 08, 2015 at 12:22:27AM +0000, Al Viro wrote: > * external references countribute > * mnt_parent contributes unless it points to ourselves *or* mnt_ns is > NULL > * reachable from mount hash => add 1 > * in addition to the wart around namespace_unlock(), we have a similar > wart between mntput_no_expire() and cleanup_mnt(), only there we thread the > suckers on mnt_child instead of mnt_list and use slightly different logics to > prevent shutdown of parent fs before the dput(mountpoint). BTW, there's something I wanted to do for a while - these games with delaying dput(mountpoint) look a whole lot like a job for mnt_pin_kill(). Look: even now we have (at the exit from umount_tree()) a bunch of remnant mountpoints associated with ex-parents. And we want to have them taken out and shot before said ex-parents shut down. The tricky part here is ordering - at the beginning of namespace_unlock() we have a bunch of mounts about to be dropped; quite a few might be keeping references to the ex-mountpoints. We want mntput() on all of those *and* we want dput() on said ex-mountpoints. We'd already dropped the references to ex-parents (in umount_tree()). The trouble being, some of those ex-mountpoints might be on other victim mounts and we really do NOT want to drop such a victim until all dput() within it had been done. That's why we play those games with bumping ex-parent mount counts first, so that dput() is immediately followed with matching mntput(). But we don't have to do it that way. Alternative, and IMO a cleaner one, would be to embed struct fs_pin into struct mount and, in umount_tree() simply add that fs_pin->m_list into the ->mnt_pins of parent and ->s_list into a global list (replacement of 'unmounted'). With ->kill() callback dropping references to dentry of mountpoint *and* containing mount. Then we don't need to care about the ordering at all - we just go through the global list of fs_pin and do ->kill() on all of them, in whatever order they happen to be. If mount gets dropped while there still are ex-mountpoints on it waiting to be dropped - fine, we'll just drop them there and then (from mnt_kill_pins()). No games with refcounts that way, no transiently busy mounts, which we do have right now. And that scheme actually covers your "delayed detaching" just fine - if we want detaching to be delayed until the parent really buys it, all we need to do is to skip dropping the child from mnt_mounts of parent and putting fs_pin of child on the global list. And detach_mount() on all remaining children at mntput_no_expire() time, as you do, only with removal from child list. No changes needed in cleanup_mnt() that way - remaining fs_pin of children will play out, yielding the effect you want. The problem with that approach (and the reason why it hadn't been done yet) is that we'll need some careful tweaking of locking in fs/fs_pin.c. Back in 3.17 I really wanted to get the kernel/acct.c crap out of the way, so that umount-on-rmdir stuff could be done safely. So the minimal solution went in, with plans for using it to avoid the headache with ordering left for later. I had completely missed the possibility to do less eager dissolving of detached trees that way (and it was a long-standing wishlist thing) back then, or I probably would've bitten the bullet and gone for it. What you are proposing is a) easily expressed with that scheme (behaviour is almost identical to what your series does, except that we'd empty mnt_child right in mntput_no_expire(), rather than keeping it until cleanup_mnt()) and b) easily generalized to MNT_DETACH with lazy dissolving; in fact, the only difference would be to treat *all* submounts as "don't put on global list, don't remove from child list", not just the MNT_LOCKED ones. I'll play around with that today and tomorrow; hopefully, I'll have a postable variant by the weekend... _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers