On Wed, May 24, 2017 at 04:54:34PM -0500, Eric W. Biederman wrote: > Ram Pai <linuxram@xxxxxxxxxx> writes: > > > On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote: > >> > >> While investigating some poor umount performance I realized that in > >> the case of overlapping mount trees where some of the mounts are locked > >> the code has been failing to unmount all of the mounts it should > >> have been unmounting. > >> > >> This failure to unmount all of the necessary > >> mounts can be reproduced with: > >> > >> $ cat locked_mounts_test.sh > >> > >> mount -t tmpfs test-base /mnt > >> mount --make-shared /mnt > >> mkdir -p /mnt/b > >> > >> mount -t tmpfs test1 /mnt/b > >> mount --make-shared /mnt/b > >> mkdir -p /mnt/b/10 > >> > >> mount -t tmpfs test2 /mnt/b/10 > >> mount --make-shared /mnt/b/10 > >> mkdir -p /mnt/b/10/20 > >> > >> mount --rbind /mnt/b /mnt/b/10/20 > >> > >> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' > >> sleep 1 > >> umount -l /mnt/b > >> wait %% > >> > >> $ unshare -Urm ./locked_mounts_test.sh > >> > >> This failure is corrected by removing the prepass that marks mounts > >> that may be umounted. > >> > >> A first pass is added that umounts mounts if possible and if not sets > >> mount mark if they could be unmounted if they weren't locked and adds > >> them to a list to umount possibilities. This first pass reconsiders > >> the mounts parent if it is on the list of umount possibilities, ensuring > >> that information of umoutability will pass from child to mount parent. > >> > >> A second pass then walks through all mounts that are umounted and processes > >> their children unmounting them or marking them for reparenting. > >> > >> A last pass cleans up the state on the mounts that could not be umounted > >> and if applicable reparents them to their first parent that remained > >> mounted. > >> > >> While a bit longer than the old code this code is much more robust > >> as it allows information to flow up from the leaves and down > >> from the trunk making the order in which mounts are encountered > >> in the umount propgation tree irrelevant. > > > > Eric, > > I think we can accomplish what you want in a much simpler way. > > Would the patch below; UNTESTED BUT COMPILED, resolve your > > issue? > > The reason I came up with an algorithm where the information flows > both directions is that especially in the case of umount -l > but even in some rare cases of a simple umount, the ordering > of the mount propagation tree can result in parent mounts being > visited before the child mounts. > > This case shows in in the case of a mount or a set of mounts > being mounted below itself. > > So no. Irregardless of tucked mount state we can't do this. Ok. I thought I had taken care, regardles of the order in which the mounts were encountered. I need to understand your patch better. Will relook at it later tonight. RP > > I see this also doesn't have the change to move mnt_change_mountpoint > into another pass. That one is quite important from a practical > point of view as that means the way the mount tree changes in umount > is the same irrespective of the number of times a mount shows > up in the mount propagation trees. Which is a very important > property to have for optimizing umount -l. Which in > the worst case allows reduces umount from O(N^2+) to roughly O(N). > > All of what I am doing should have not effect on an implementation of > MNT_TUCKED. > > That said your code to deal with MNT_TUCKED seems reasonable. > > Eric > > > > > Its a two pass unmount. First pass marks mounts that can > > be unmounted, and second pass does the neccessary unlinks. > > It does mark TUCKED mounts, and uses that information > > to peel off the correct mounts. Key points are > > > > a) a tucked mount never entertain any unmount propagation > > on its root dentry. > > > > b) when the child on the root dentry of a tucked mount is > > unmounted, the tucked mount is not a tucked mount anymore. > > > > c) if the child is a tucked mount, than its child is reparented > > to the parent. > > > > > > Signed-off-by: "Ram Pai" <linuxram@xxxxxxxxxx> > > > > fs/namespace.c | 4 ++- > > fs/pnode.c | 53 +++++++++++++++++++++++++++++++++++++------------- > > fs/pnode.h | 3 ++ > > include/linux/mount.h | 1 > > > > diff --git a/fs/namespace.c b/fs/namespace.c > > index cc1375ef..ff3ec90 100644 > > --- a/fs/namespace.c > > +++ b/fs/namespace.c > > @@ -2050,8 +2050,10 @@ static int attach_recursive_mnt(struct mount *source_mnt, > > hlist_del_init(&child->mnt_hash); > > q = __lookup_mnt(&child->mnt_parent->mnt, > > child->mnt_mountpoint); > > - if (q) > > + if (q) { > > mnt_change_mountpoint(child, smp, q); > > + SET_MNT_TUCKED(child); > > + } > > commit_tree(child); > > } > > put_mountpoint(smp); > > diff --git a/fs/pnode.c b/fs/pnode.c > > index 5bc7896..b44a544 100644 > > --- a/fs/pnode.c > > +++ b/fs/pnode.c > > @@ -448,31 +448,58 @@ static void __propagate_umount(struct mount *mnt) > > > > for (m = propagation_next(parent, parent); m; > > m = propagation_next(m, parent)) { > > - struct mount *topper; > > - struct mount *child = __lookup_mnt(&m->mnt, > > - mnt->mnt_mountpoint); > > - /* > > - * umount the child only if the child has no children > > - * and the child is marked safe to unmount. > > + struct mount *topper, *child; > > + > > + /* Tucked mount must drop umount propagation events on > > + * its **root dentry**. > > + * The tucked mount did not exist when that child came > > + * into existence. It never received that mount propagation. > > + * Hence it should never entertain the umount propagation > > + * aswell. > > */ > > + if (IS_MNT_TUCKED(m) && list_is_singular(&mnt->mnt_mounts)) > > + continue; > > + > > + > > + child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint); > > + > > if (!child || !IS_MNT_MARKED(child)) > > continue; > > + > > CLEAR_MNT_MARK(child); > > > > - /* If there is exactly one mount covering all of child > > - * replace child with that mount. > > - */ > > - topper = find_topper(child); > > - if (topper) > > - mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, > > - topper); > > + if (IS_MNT_TUCKED(child) && > > + (list_is_singular(&child->mnt_mounts))) { > > + topper = find_topper(child); > > + if (topper) { > > + mnt_change_mountpoint(child->mnt_parent, > > + child->mnt_mp, topper); > > + CLEAR_MNT_TUCKED(child); /*lets be precise*/ > > + } > > + } > > > > if (list_empty(&child->mnt_mounts)) { > > list_del_init(&child->mnt_child); > > child->mnt.mnt_flags |= MNT_UMOUNT; > > list_move_tail(&child->mnt_list, &mnt->mnt_list); > > } > > +#if 0 > > + else { > > + mntput(child); /* mark it for deletion. It will > > + be deleted whenever it looses > > + all its remaining references. > > + TODO: some more thought > > + needed, please validate */ > > + } > > +#endif > > } > > + > > + /* > > + * This explicit umount operation is exposing the parent. > > + * In case the parent was a 'tucked' mount, it cannot be so > > + * anymore. > > + */ > > + CLEAR_MNT_TUCKED(parent); > > } > > > > /* > > diff --git a/fs/pnode.h b/fs/pnode.h > > index dc87e65..9ebd1a8 100644 > > --- a/fs/pnode.h > > +++ b/fs/pnode.h > > @@ -18,8 +18,11 @@ > > #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE) > > #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED) > > #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED) > > +#define SET_MNT_TUCKED(m) ((m)->mnt.mnt_flags |= MNT_TUCKED) > > #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED) > > +#define CLEAR_MNT_TUCKED(m) ((m)->mnt.mnt_flags &= ~MNT_TUCKED) > > #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED) > > +#define IS_MNT_TUCKED(m) ((m)->mnt.mnt_flags & MNT_TUCKED) > > > > #define CL_EXPIRE 0x01 > > #define CL_SLAVE 0x02 > > diff --git a/include/linux/mount.h b/include/linux/mount.h > > index 8e0352a..41674e7 100644 > > --- a/include/linux/mount.h > > +++ b/include/linux/mount.h > > @@ -62,6 +62,7 @@ > > #define MNT_SYNC_UMOUNT 0x2000000 > > #define MNT_MARKED 0x4000000 > > #define MNT_UMOUNT 0x8000000 > > +#define MNT_TUCKED 0x10000000 > > > > struct vfsmount { > > struct dentry *mnt_root; /* root of the mounted tree */ -- Ram Pai