On Tue, Oct 25, 2016 at 04:45:44PM -0500, Eric W. Biederman wrote: > Andrei Vagin <avagin@xxxxxxxxxxxxx> writes: > > > On Sat, Oct 22, 2016 at 02:42:03PM -0500, Eric W. Biederman wrote: > >> > >> Andrei, > >> > >> This fixes the issue you have reported and through a refactoring > >> makes the code simpler and easier to verify. That said I find your > >> last test case very interesting. While looking at it in detail > >> I have realized I don't fully understand why we have both lookup_mnt and > >> lookup_mnt_last, so I can't say that this change is fully correct. > >> > >> Outside of propogate_umount I am don't have concerns but I am not 100% > >> convinced that my change to lookup_mnt_last does the right thing > >> in the case of propagate_umount. > >> > >> I do see why your last test case scales badly. Long chains of shared > >> mounts that we can't skip. At the same time I don't really understand > >> that case. Part of it has to do with multiple child mounts of the same > >> mount on the same mountpoint. > >> > >> So I am working through my concerns. In the mean time I figured it > >> would be useful to post this version. As this version is clearly better > >> than the version of this change that have come before it. > > > > Hi Eric, > > > > I have tested this version and it works fine. > > > > As for the the last test case, could you look at the attached patch? > > The idea is that we can skip all mounts from a shared group, if one > > of them already marked. > > > >> > >> Eric > >> > >> From: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > >> Date: Thu, 13 Oct 2016 13:27:19 -0500 > >> > >> Adrei Vagin pointed out that time to executue propagate_umount can go > >> non-linear (and take a ludicrious amount of time) when the mount > >> propogation trees of the mounts to be unmunted by a lazy unmount > >> overlap. > >> > >> While investigating the horrible performance I realized that in > >> the case overlapping mount trees since the addition of locked > >> mount support the code has been failing to unmount all of the > >> mounts it should have been unmounting. > >> > >> Make the walk of the mount propagation trees nearly linear by using > >> MNT_MARK to mark pieces of the mount propagation trees that have > >> already been visited, allowing subsequent walks to skip over > >> subtrees. > >> > >> Make the processing of mounts order independent by adding a list of > >> mount entries that need to be unmounted, and simply adding a mount to > >> that list when it becomes apparent the mount can safely be unmounted. > >> For mounts that are locked on other mounts but otherwise could be > >> unmounted move them from their parnets mnt_mounts to mnt_umounts so > >> that if and when their parent becomes unmounted these mounts can be > >> added to the list of mounts to unmount. > >> > >> Add a final pass to clear MNT_MARK and to restore mnt_mounts > >> from mnt_umounts for anything that did not get unmounted. > >> > >> Add the functions propagation_visit_next and propagation_revisit_next > >> to coordinate walking of the mount tree and setting and clearing the > >> mount mark. > >> > >> The skipping of already unmounted mounts has been moved from > >> __lookup_mnt_last to mark_umount_candidates, so that the new > >> propagation functions can notice when the propagation tree passes > >> through the initial set of unmounted mounts. Except in umount_tree as > >> part of the unmounting process the only place where unmounted mounts > >> should be found are in unmounted subtrees. All of the other callers > >> of __lookup_mnt_last are from mounted subtrees so the not checking for > >> unmounted mounts should not affect them. > >> > >> A script to generate overlapping mount propagation trees: > >> $ cat run.sh > >> mount -t tmpfs test-mount /mnt > >> mount --make-shared /mnt > >> for i in `seq $1`; do > >> mkdir /mnt/test.$i > >> mount --bind /mnt /mnt/test.$i > >> done > >> cat /proc/mounts | grep test-mount | wc -l > >> time umount -l /mnt > >> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done > >> > >> Here are the performance numbers with and without the patch: > >> > >> mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | 104857 > >> mounts | before | after | after (sys) | after | after (sys) | after | after (sys) > >> ------------------------------------------------------------------------------------- > >> 1024 | 0.071s | 0.020s | 0.000s | 0.022s | 0.004s | 0.020s | 0.004s > >> 2048 | 0.184s | 0.022s | 0.004s | 0.023s | 0.004s | 0.022s | 0.008s > >> 4096 | 0.604s | 0.025s | 0.020s | 0.029s | 0.008s | 0.026s | 0.004s > >> 8912 | 4.471s | 0.053s | 0.020s | 0.051s | 0.024s | 0.047s | 0.016s > >> 16384 | 34.826s | 0.088s | 0.060s | 0.081s | 0.048s | 0.082s | 0.052s > >> 32768 | | 0.216s | 0.172s | 0.160s | 0.124s | 0.160s | 0.096s > >> 65536 | | 0.819s | 0.726s | 0.330s | 0.260s | 0.338s | 0.256s > >> 131072 | | 4.502s | 4.168s | 0.707s | 0.580s | 0.709s | 0.592s > >> > >> Andrei Vagin reports fixing the performance problem is part of the > >> work to fix CVE-2016-6213. > >> > >> A script for a pathlogical set of mounts: > >> > >> $ cat pathological.sh > >> > >> mount -t tmpfs base /mnt > >> mount --make-shared /mnt > >> mkdir -p /mnt/b > >> > >> mount -t tmpfs test1 /mnt/b > >> mount --make-shared /mnt/b > >> mkdir -p /mnt/b/10 > >> > >> mount -t tmpfs test2 /mnt/b/10 > >> mount --make-shared /mnt/b/10 > >> mkdir -p /mnt/b/10/20 > >> > >> mount --rbind /mnt/b /mnt/b/10/20 > >> > >> unshare -Urm sleep 2 > >> umount -l /mnt/b > >> wait %% > >> > >> $ unshare -Urm pathlogical.sh > >> > >> Cc: stable@xxxxxxxxxxxxxxx > >> Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount") > >> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts") > >> Reported-by: Andrei Vagin <avagin@xxxxxxxxxx> > >> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > >> --- > >> fs/mount.h | 1 + > >> fs/namespace.c | 7 +-- > >> fs/pnode.c | 179 +++++++++++++++++++++++++++++++++++++++++---------------- > >> fs/pnode.h | 2 +- > >> 4 files changed, 133 insertions(+), 56 deletions(-) > >> > >> diff --git a/fs/mount.h b/fs/mount.h > >> index d2e25d7b64b3..00fe0d1d6ba7 100644 > >> --- a/fs/mount.h > >> +++ b/fs/mount.h > >> @@ -58,6 +58,7 @@ struct mount { > >> struct mnt_namespace *mnt_ns; /* containing namespace */ > >> struct mountpoint *mnt_mp; /* where is it mounted */ > >> struct hlist_node mnt_mp_list; /* list mounts with the same mountpoint */ > >> + struct list_head mnt_umounts; /* list of children that are being unmounted */ > >> #ifdef CONFIG_FSNOTIFY > >> struct hlist_head mnt_fsnotify_marks; > >> __u32 mnt_fsnotify_mask; > >> diff --git a/fs/namespace.c b/fs/namespace.c > >> index e6c234b1a645..73801391bb00 100644 > >> --- a/fs/namespace.c > >> +++ b/fs/namespace.c > >> @@ -237,6 +237,7 @@ static struct mount *alloc_vfsmnt(const char *name) > >> INIT_LIST_HEAD(&mnt->mnt_slave_list); > >> INIT_LIST_HEAD(&mnt->mnt_slave); > >> INIT_HLIST_NODE(&mnt->mnt_mp_list); > >> + INIT_LIST_HEAD(&mnt->mnt_umounts); > >> #ifdef CONFIG_FSNOTIFY > >> INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks); > >> #endif > >> @@ -650,13 +651,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry) > >> p = __lookup_mnt(mnt, dentry); > >> if (!p) > >> goto out; > >> - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) > >> - res = p; > >> + res = p; > >> hlist_for_each_entry_continue(p, mnt_hash) { > >> if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry) > >> break; > >> - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) > >> - res = p; > >> + res = p; > >> } > >> out: > >> return res; > >> diff --git a/fs/pnode.c b/fs/pnode.c > >> index 234a9ac49958..8fd1a3fb420c 100644 > >> --- a/fs/pnode.c > >> +++ b/fs/pnode.c > >> @@ -134,7 +134,8 @@ void change_mnt_propagation(struct mount *mnt, int type) > >> } > >> > >> /* > >> - * get the next mount in the propagation tree. > >> + * get the next mount that is not a slave of the current mount in the > >> + * propagation tree. > >> * @m: the mount seen last > >> * @origin: the original mount from where the tree walk initiated > >> * > >> @@ -143,13 +144,9 @@ void change_mnt_propagation(struct mount *mnt, int type) > >> * vfsmount found while iterating with propagation_next() is > >> * a peer of one we'd found earlier. > >> */ > >> -static struct mount *propagation_next(struct mount *m, > >> - struct mount *origin) > >> +static struct mount *propagation_next_sib(struct mount *m, > >> + struct mount *origin) > >> { > >> - /* are there any slaves of this mount? */ > >> - if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list)) > >> - return first_slave(m); > >> - > >> while (1) { > >> struct mount *master = m->mnt_master; > >> > >> @@ -164,6 +161,26 @@ static struct mount *propagation_next(struct mount *m, > >> } > >> } > >> > >> +/* > >> + * get the next mount in the propagation tree. > >> + * @m: the mount seen last > >> + * @origin: the original mount from where the tree walk initiated > >> + * > >> + * Note that peer groups form contiguous segments of slave lists. > >> + * We rely on that in get_source() to be able to find out if > >> + * vfsmount found while iterating with propagation_next() is > >> + * a peer of one we'd found earlier. > >> + */ > >> +static struct mount *propagation_next(struct mount *m, > >> + struct mount *origin) > >> +{ > >> + /* are there any slaves of this mount? */ > >> + if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list)) > >> + return first_slave(m); > >> + > >> + return propagation_next_sib(m, origin); > >> +} > >> + > >> static struct mount *next_group(struct mount *m, struct mount *origin) > >> { > >> while (1) { > >> @@ -389,57 +406,92 @@ void propagate_mount_unlock(struct mount *mnt) > >> } > >> } > >> > >> -/* > >> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted. > >> - */ > >> -static void mark_umount_candidates(struct mount *mnt) > >> +static struct mount *propagation_visit_child(struct mount *last_child, > >> + struct mount *origin_child) > >> { > >> - struct mount *parent = mnt->mnt_parent; > >> - struct mount *m; > >> + struct mount *m = last_child->mnt_parent; > >> + struct mount *origin = origin_child->mnt_parent; > >> + struct dentry *mountpoint = origin_child->mnt_mountpoint; > >> + struct mount *child; > >> > >> - BUG_ON(parent == mnt); > >> + /* Has this part of the propgation tree already been visited? */ > >> + if (IS_MNT_MARKED(last_child)) > >> + return NULL; > >> > >> - for (m = propagation_next(parent, parent); m; > >> - m = propagation_next(m, parent)) { > >> - struct mount *child = __lookup_mnt_last(&m->mnt, > >> - mnt->mnt_mountpoint); > >> - if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) { > >> - SET_MNT_MARK(child); > >> - } > >> + SET_MNT_MARK(last_child); > >> + > >> + m = propagation_next(m, origin); > >> + while (m) { > >> + child = __lookup_mnt_last(&m->mnt, mountpoint); > >> + if (child && !IS_MNT_MARKED(child)) > >> + return child; > >> + > >> + if (!child) > >> + m = propagation_next(m, origin); > >> + else > >> + m = propagation_next_sib(m, origin); > >> } > >> + return NULL; > >> } > >> > >> -/* > >> - * NOTE: unmounting 'mnt' naturally propagates to all other mounts its > >> - * parent propagates to. > >> - */ > >> -static void __propagate_umount(struct mount *mnt) > >> +static struct mount *propagation_revisit_child(struct mount *last_child, > >> + struct mount *origin_child) > >> { > >> - struct mount *parent = mnt->mnt_parent; > >> - struct mount *m; > >> + struct mount *m = last_child->mnt_parent; > >> + struct mount *origin = origin_child->mnt_parent; > >> + struct dentry *mountpoint = origin_child->mnt_mountpoint; > >> + struct mount *child; > >> > >> - BUG_ON(parent == mnt); > >> + /* Has this part of the propgation tree already been revisited? */ > >> + if (!IS_MNT_MARKED(last_child)) > >> + return NULL; > >> > >> - for (m = propagation_next(parent, parent); m; > >> - m = propagation_next(m, parent)) { > >> + CLEAR_MNT_MARK(last_child); > >> > >> - struct mount *child = __lookup_mnt_last(&m->mnt, > >> - mnt->mnt_mountpoint); > >> - /* > >> - * umount the child only if the child has no children > >> - * and the child is marked safe to unmount. > >> - */ > >> - if (!child || !IS_MNT_MARKED(child)) > >> - continue; > >> - CLEAR_MNT_MARK(child); > >> - if (list_empty(&child->mnt_mounts)) { > >> - list_del_init(&child->mnt_child); > >> - child->mnt.mnt_flags |= MNT_UMOUNT; > >> - list_move_tail(&child->mnt_list, &mnt->mnt_list); > >> - } > >> + m = propagation_next(m, origin); > >> + while (m) { > >> + child = __lookup_mnt_last(&m->mnt, mountpoint); > >> + if (child && IS_MNT_MARKED(child)) > >> + return child; > >> + > >> + if (!child) > >> + m = propagation_next(m, origin); > >> + else > >> + m = propagation_next_sib(m, origin); > >> } > >> + return NULL; > >> } > >> > >> +static void start_umount_propagation(struct mount *child, > >> + struct list_head *to_umount) > >> +{ > >> + do { > >> + struct mount *parent = child->mnt_parent; > >> + > >> + if ((child->mnt.mnt_flags & MNT_UMOUNT) || > >> + !list_empty(&child->mnt_mounts)) > >> + return; > >> + > >> + if (!IS_MNT_LOCKED(child)) > >> + list_move_tail(&child->mnt_child, to_umount); > >> + else > >> + list_move_tail(&child->mnt_child, &parent->mnt_umounts); > >> + > >> + child = NULL; > >> + if (IS_MNT_MARKED(parent)) > >> + child = parent; > >> + } while (child); > >> +} > >> + > >> +static void end_umount_propagation(struct mount *child) > >> +{ > >> + struct mount *parent = child->mnt_parent; > >> + > >> + if (!list_empty(&parent->mnt_umounts)) > >> + list_splice_tail_init(&parent->mnt_umounts, &parent->mnt_mounts); > >> +} > >> + > >> + > >> /* > >> * collect all mounts that receive propagation from the mount in @list, > >> * and return these additional mounts in the same list. > >> @@ -447,14 +499,39 @@ static void __propagate_umount(struct mount *mnt) > >> * > >> * vfsmount lock must be held for write > >> */ > >> -int propagate_umount(struct list_head *list) > >> +void propagate_umount(struct list_head *list) > >> { > >> struct mount *mnt; > >> + LIST_HEAD(to_umount); > >> + LIST_HEAD(tmp_list); > >> + > >> + /* Find candidates for unmounting */ > >> + list_for_each_entry(mnt, list, mnt_list) { > >> + struct mount *child; > >> + for (child = propagation_visit_child(mnt, mnt); child; > >> + child = propagation_visit_child(child, mnt)) > >> + start_umount_propagation(child, &to_umount); > >> + } > >> > >> - list_for_each_entry_reverse(mnt, list, mnt_list) > >> - mark_umount_candidates(mnt); > >> + /* Begin unmounting */ > >> + while (!list_empty(&to_umount)) { > >> + mnt = list_first_entry(&to_umount, struct mount, mnt_child); > >> > >> - list_for_each_entry(mnt, list, mnt_list) > >> - __propagate_umount(mnt); > >> - return 0; > >> + list_del_init(&mnt->mnt_child); > >> + mnt->mnt.mnt_flags |= MNT_UMOUNT; > >> + list_move_tail(&mnt->mnt_list, &tmp_list); > >> + > >> + if (!list_empty(&mnt->mnt_umounts)) > >> + list_splice_tail_init(&mnt->mnt_umounts, &to_umount); > >> + } > >> + > >> + /* Cleanup the mount propagation tree */ > >> + list_for_each_entry(mnt, list, mnt_list) { > >> + struct mount *child; > >> + for (child = propagation_revisit_child(mnt, mnt); child; > >> + child = propagation_revisit_child(child, mnt)) > >> + end_umount_propagation(child); > >> + } > >> + > >> + list_splice_tail(&tmp_list, list); > >> } > >> diff --git a/fs/pnode.h b/fs/pnode.h > >> index 550f5a8b4fcf..38c6cdb96b34 100644 > >> --- a/fs/pnode.h > >> +++ b/fs/pnode.h > >> @@ -41,7 +41,7 @@ static inline void set_mnt_shared(struct mount *mnt) > >> void change_mnt_propagation(struct mount *, int); > >> int propagate_mnt(struct mount *, struct mountpoint *, struct mount *, > >> struct hlist_head *); > >> -int propagate_umount(struct list_head *); > >> +void propagate_umount(struct list_head *); > >> int propagate_mount_busy(struct mount *, int); > >> void propagate_mount_unlock(struct mount *); > >> void mnt_release_group_id(struct mount *); > >> -- > >> 2.10.1 > >> > > > > From 8e0f45c0272aa1f789d1657a0acc98c58919dcc3 Mon Sep 17 00:00:00 2001 > > From: Andrei Vagin <avagin@xxxxxxxxxx> > > Date: Tue, 25 Oct 2016 13:57:31 -0700 > > Subject: [PATCH] mount: skip all mounts from a shared group if one is marked > > > > If we meet a marked mount, it means that all mounts from > > its group have been already revised. > > > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxxx> > > --- > > fs/pnode.c | 18 +++++++++++++++--- > > 1 file changed, 15 insertions(+), 3 deletions(-) > > > > diff --git a/fs/pnode.c b/fs/pnode.c > > index 8fd1a3f..ebb7134 100644 > > --- a/fs/pnode.c > > +++ b/fs/pnode.c > > @@ -426,10 +426,16 @@ static struct mount *propagation_visit_child(struct mount *last_child, > > if (child && !IS_MNT_MARKED(child)) > > return child; > > > > - if (!child) > > + if (!child) { > > m = propagation_next(m, origin); > > - else > > + } else { > > + if (IS_MNT_MARKED(child)) { > > + if (m->mnt_group_id == origin->mnt_group_id) > > + return NULL; > > + m = m->mnt_master; > > + } > > m = propagation_next_sib(m, origin); > > + } > > } > > return NULL; > > } > > @@ -456,8 +462,14 @@ static struct mount *propagation_revisit_child(struct mount *last_child, > > > > if (!child) > > m = propagation_next(m, origin); > > - else > > + else { > > + if (!IS_MNT_MARKED(child)) { > > + if (m->mnt_group_id == origin->mnt_group_id) > > + return NULL; > > + m = m->mnt_master; > > + } > > m = propagation_next_sib(m, origin); > > + } > > } > > return NULL; > > } > > That is certainly interesting. The problem is that the reason we were > going slow is that there were in fact mounts that had not been traversed > in the share group. You are right. > > And in fact the entire idea of visiting a vfsmount mountpoint pair > exactly once is wrong in the face of shadow mounts. For a vfsmount > mountpoint pair that has shadow mounts the number of shadow mounts needs > to be descreased by one each time the propgation tree is traversed > during unmount. Which means that as far as I can see we have to kill > shadow mounts to correctly optimize this code. Once shadow mounts are > gone I don't know of a case where need your optimization. Without shadow mounts, it will be hard to save predictable behaviour for cases like this: $ unshare --propagation private -m sh test.sh + mount -t tmpfs --make-shared zzzz A + mkdir A/a + mount -t tmpfs zzzz A/a + mount --bind A B + mount -t tmpfs zzzz B/a + grep zzzz + cat /proc/self/mountinfo 155 123 0:44 / /root/tmp/A rw,relatime shared:70 - tmpfs zzzz rw 156 155 0:45 / /root/tmp/A/a rw,relatime shared:71 - tmpfs zzzz rw 157 123 0:44 / /root/tmp/B rw,relatime shared:70 - tmpfs zzzz rw 158 157 0:46 / /root/tmp/B/a rw,relatime shared:72 - tmpfs zzzz rw 159 155 0:46 / /root/tmp/A/a rw,relatime shared:72 - tmpfs zzzz rw + umount B/a + grep zzzz + cat /proc/self/mountinfo 155 123 0:44 / /root/tmp/A rw,relatime shared:70 - tmpfs zzzz rw 156 155 0:45 / /root/tmp/A/a rw,relatime shared:71 - tmpfs zzzz rw 157 123 0:44 / /root/tmp/B rw,relatime shared:70 - tmpfs zzzz rw X + a - a = X Maybe we need to add another ID for propagated mounts and when we do umount, we will detach only mounts with the same propagation id. I support the idea to kill shadow mounts. I guess it will help us to simplify algorithm of dumping and restoring a mount tree in CRIU. Currently it is a big pain for us. > > I am busily verifying my patch to kill shadow mounts but the following > patch is the minimal version. As far as I can see propagate_one > is the only place we create shadow mounts, and holding the > namespace_lock over attach_recursive_mnt, propagate_mnt, and > propgate_one is sufficient for that __lookup_mnt to be competely safe. > > diff --git a/fs/pnode.c b/fs/pnode.c > index 234a9ac49958..b14119b370d4 100644 > --- a/fs/pnode.c > +++ b/fs/pnode.c > @@ -217,6 +217,9 @@ static int propagate_one(struct mount *m) > /* skip if mountpoint isn't covered by it */ > if (!is_subdir(mp->m_dentry, m->mnt.mnt_root)) > return 0; > + /* skip if mountpoint already has a mount on it */ > + if (__lookup_mnt(&m->mnt, mp->m_dentry)) > + return 0; > if (peers(m, last_dest)) { > type = CL_MAKE_SHARED; > } else { > > If you run with that patch you will see that there are go faster stripes. > > Eric > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html