On Tue, Oct 18, 2016 at 10:46:40PM -0500, Eric W. Biederman wrote:
>
> Andrei Vagin pointed out that the time to execute propagate_umount can
> go non-linear (and take a ludicrous amount of time) when the mount
> propagation trees of the mounts to be unmounted by a lazy unmount
> overlap.
>
> While investigating the horrible performance I realized that in the
> case of overlapping mount trees, since the addition of locked mount
> support, the code has been failing to unmount all of the mounts it
> should have been unmounting.
>
> Make the walk of the mount propagation trees nearly linear by using
> MNT_MARK to mark pieces of the mount propagation trees that have
> already been visited, allowing subsequent walks to skip over
> subtrees.
>
> Make the processing of mounts order independent by adding a list of
> mount entries that need to be unmounted, and simply adding a mount to
> that list when it becomes apparent the mount can safely be unmounted.
> For mounts that are locked on other mounts but otherwise could be
> unmounted, move them from their parent's mnt_mounts to mnt_umounts so
> that if and when their parent becomes unmounted these mounts can be
> added to the list of mounts to unmount.
>
> Add a final pass to clear MNT_MARK and to restore mnt_mounts
> from mnt_umounts for anything that did not get unmounted.
>
> Add the functions propagation_visit_next and propagation_revisit_next
> to coordinate walking of the mount tree and setting and clearing the
> mount mark.
>
> The skipping of already unmounted mounts has been moved from
> __lookup_mnt_last to mark_umount_candidates, so that the new
> propagation functions can notice when the propagation tree passes
> through the initial set of unmounted mounts. Except in umount_tree as
> part of the unmounting process, the only place where unmounted mounts
> should be found is in unmounted subtrees.
> All of the other callers of __lookup_mnt_last are from mounted
> subtrees, so not checking for unmounted mounts should not affect them.
>
> A script to generate overlapping mount propagation trees:
>
> $ cat run.sh
> mount -t tmpfs test-mount /mnt
> mount --make-shared /mnt
> for i in `seq $1`; do
>         mkdir /mnt/test.$i
>         mount --bind /mnt /mnt/test.$i
> done
> cat /proc/mounts | grep test-mount | wc -l
> time umount -l /mnt
>
> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
>
> Here are the performance numbers with and without the patch:
>
>  mhash  |  8192   |  8192  | 8192        | 131072 | 131072      | 104857 | 104857
>  mounts | before  | after  | after (sys) | after  | after (sys) | after  | after (sys)
> -------------------------------------------------------------------------------------
>    1024 |  0.071s | 0.020s | 0.000s      | 0.022s | 0.004s      | 0.020s | 0.004s
>    2048 |  0.184s | 0.022s | 0.004s      | 0.023s | 0.004s      | 0.022s | 0.008s
>    4096 |  0.604s | 0.025s | 0.020s      | 0.029s | 0.008s      | 0.026s | 0.004s
>    8192 |  4.471s | 0.053s | 0.020s      | 0.051s | 0.024s      | 0.047s | 0.016s
>   16384 | 34.826s | 0.088s | 0.060s      | 0.081s | 0.048s      | 0.082s | 0.052s
>   32768 |         | 0.216s | 0.172s      | 0.160s | 0.124s      | 0.160s | 0.096s
>   65536 |         | 0.819s | 0.726s      | 0.330s | 0.260s      | 0.338s | 0.256s
>  131072 |         | 4.502s | 4.168s      | 0.707s | 0.580s      | 0.709s | 0.592s
>
> Andrei Vagin reports that fixing the performance problem is part of
> the work to fix CVE-2016-6213.
>
> A script for a pathological set of mounts:
>
> $ cat pathological.sh
>
> mount -t tmpfs base /mnt
> mount --make-shared /mnt
> mkdir -p /mnt/b
>
> mount -t tmpfs test1 /mnt/b
> mount --make-shared /mnt/b
> mkdir -p /mnt/b/10
>
> mount -t tmpfs test2 /mnt/b/10
> mount --make-shared /mnt/b/10
> mkdir -p /mnt/b/10/20
>
> mount --rbind /mnt/b /mnt/b/10/20
>
> unshare -Urm sleep 2
> umount -l /mnt/b
> wait %%
>
> $ unshare -Urm pathological.sh
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
> Reported-by: Andrei Vagin <avagin@xxxxxxxxxx>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> ---
>
> Barring some stupid mistake this looks like it fixes both the
> performance and the correctness issues I was able to spot earlier.
> Andrei, if you could give this version a look over I would appreciate
> it.

Eric, could you try out this script:

[root@fc24 mounts]# cat run.sh
set -e -m

mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
mkdir /mnt/1/1

iteration=30
if [ -n "$1" ]; then
	iteration=$1
fi

for i in `seq $iteration`; do
	mount --bind /mnt/1/1 /mnt/1/1 &
done

ret=0
for i in `seq $iteration`; do
	wait -n || ret=1
done

[ "$ret" -ne 0 ] && {
	time umount -l /mnt/1
	exit 0
}

mount --rbind /mnt/1 /mnt/2
mount --make-slave /mnt/2
mount -t tmpfs zzz /mnt/2/1

nr=`cat /proc/self/mountinfo | grep zdtm | wc -l`
echo -n "umount -l /mnt/1 -> $nr "
/usr/bin/time -f '%E' umount -l /mnt/1

nr=`cat /proc/self/mountinfo | grep zdtm | wc -l`
echo -n "umount -l /mnt/2 -> $nr "
/usr/bin/time -f '%E' umount -l /mnt/2

[root@fc24 mounts]# unshare -Urm sh run.sh 4

It hangs up on my host with this patch:

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[umount:789]
Modules linked in: nfsv3 nfs fscache bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel virtio_balloon i2c_piix4 parport_pc parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc virtio_net virtio_blk virtio_console crc32c_intel serio_raw virtio_pci virtio_ring virtio ata_generic pata_acpi
CPU: 0 PID: 789 Comm: umount Not tainted 4.9.0-rc1+ #137
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
task: ffff88007c11c100 task.stack: ffffc900007c0000
RIP: 0010:[<ffffffff8125bd87>]  [<ffffffff8125bd87>] __lookup_mnt_last+0x67/0x80
RSP: 0018:ffffc900007c3db0  EFLAGS: 00000286
RAX: ffff88007a5f0900 RBX: ffff88007b136620 RCX: ffff88007a3e2900
RDX: ffff88007a3e2900 RSI: ffff88007b136600 RDI: ffff88007b136600
RBP: ffffc900007c3dc0 R08: ffff880036df5850 R09: ffffffff81249664
R10: ffff88007bd84c38 R11: 0000000100000000 R12: ffff88007bce3f00
R13: ffffc900007c3e00 R14: ffff88007bce3f00 R15: 00007ffe54245328
FS:  00007ff465de0840(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f86b328128 CR3: 000000007ba23000 CR4: 00000000003406f0
Stack:
 ffff88007b136600 ffff88007b136480 ffffc900007c3df0 ffffffff8126a70a
 ffffc900007c3e58 ffffc900007c3e10 ffffc900007c3e00 ffff880079f99980
 ffffc900007c3e48 ffffffff8126b0d5 ffff88007a3e2660 ffff88007a3e24e0
Call Trace:
 [<ffffffff8126a70a>] propagation_visit_child.isra.8+0x5a/0xd0
 [<ffffffff8126b0d5>] propagate_umount+0x65/0x2e0
 [<ffffffff8125a76e>] umount_tree+0x2be/0x2d0
 [<ffffffff8125b75f>] do_umount+0x13f/0x340
 [<ffffffff8125c3ce>] SyS_umount+0x10e/0x120
 [<ffffffff817ba837>] entry_SYSCALL_64_fastpath+0x1a/0xa9

Thanks,
Andrei

>
> Unless we can find a problem I am going to call this the final version.
>
>  fs/mount.h     |   1 +
>  fs/namespace.c |   7 +-
>  fs/pnode.c     | 198 ++++++++++++++++++++++++++++++++++++++++++++++-----------
>  fs/pnode.h     |   2 +-
>  4 files changed, 165 insertions(+), 43 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index d2e25d7b64b3..00fe0d1d6ba7 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -58,6 +58,7 @@ struct mount {
>  	struct mnt_namespace *mnt_ns;	/* containing namespace */
>  	struct mountpoint *mnt_mp;	/* where is it mounted */
>  	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
> +	struct list_head mnt_umounts;	/* list of children that are being unmounted */
>  #ifdef CONFIG_FSNOTIFY
>  	struct hlist_head mnt_fsnotify_marks;
>  	__u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e6c234b1a645..73801391bb00 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -237,6 +237,7 @@ static struct mount *alloc_vfsmnt(const char *name)
>  		INIT_LIST_HEAD(&mnt->mnt_slave_list);
>  		INIT_LIST_HEAD(&mnt->mnt_slave);
>  		INIT_HLIST_NODE(&mnt->mnt_mp_list);
> +		INIT_LIST_HEAD(&mnt->mnt_umounts);
>  #ifdef CONFIG_FSNOTIFY
>  		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
>  #endif
> @@ -650,13 +651,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
>  	p = __lookup_mnt(mnt, dentry);
>  	if (!p)
>  		goto out;
> -	if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> -		res = p;
> +	res = p;
>  	hlist_for_each_entry_continue(p, mnt_hash) {
>  		if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
>  			break;
> -		if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> -			res = p;
> +		res = p;
>  	}
> out:
>  	return res;
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 234a9ac49958..15e30e861a14 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -390,56 +390,153 @@ void propagate_mount_unlock(struct mount *mnt)
>  }
>
>  /*
> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
> + * get the next mount in the propagation tree (that has not been visited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
>   */
> -static void mark_umount_candidates(struct mount *mnt)
> +static struct mount *propagation_visit_child(struct mount *last_child,
> +					     struct mount *origin_child)
>  {
> -	struct mount *parent = mnt->mnt_parent;
> -	struct mount *m;
> +	struct mount *m = last_child->mnt_parent;
> +	struct mount *origin = origin_child->mnt_parent;
> +	struct dentry *mountpoint = origin_child->mnt_mountpoint;
> +	struct mount *child;
>
> -	BUG_ON(parent == mnt);
> +	/* Has this part of the propagation tree already been visited? */
> +	if (IS_MNT_MARKED(last_child))
> +		return NULL;
>
> -	for (m = propagation_next(parent, parent); m;
> -			m = propagation_next(m, parent)) {
> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> -						mnt->mnt_mountpoint);
> -		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
> -			SET_MNT_MARK(child);
> +	SET_MNT_MARK(last_child);
> +
> +	/* are there any slaves of this mount? */
> +	if (!list_empty(&m->mnt_slave_list)) {
> +		m = first_slave(m);
> +		goto check_slave;
> +	}
> +	while (1) {
> +		struct mount *master = m->mnt_master;
> +
> +		if (master == origin->mnt_master) {
> +			struct mount *next = next_peer(m);
> +			while (1) {
> +				if (next == origin)
> +					return NULL;
> +				child = __lookup_mnt_last(&next->mnt, mountpoint);
> +				if (child && !IS_MNT_MARKED(child))
> +					return child;
> +				next = next_peer(next);
> +			}
> +		} else {
> +			while (1) {
> +				if (m->mnt_slave.next == &master->mnt_slave_list)
> +					break;
> +				m = next_slave(m);
> +check_slave:
> +				child = __lookup_mnt_last(&m->mnt, mountpoint);
> +				if (child && !IS_MNT_MARKED(child))
> +					return child;
> +			}
>  		}
> +
> +		/* back at master */
> +		m = master;
>  	}
>  }
>
>  /*
> - * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
> - * parent propagates to.
> + * get the next mount in the propagation tree (that has not been revisited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
>   */
> -static void __propagate_umount(struct mount *mnt)
> +static struct mount *propagation_revisit_child(struct mount *last_child,
> +					       struct mount *origin_child)
>  {
> -	struct mount *parent = mnt->mnt_parent;
> -	struct mount *m;
> +	struct mount *m = last_child->mnt_parent;
> +	struct mount *origin = origin_child->mnt_parent;
> +	struct dentry *mountpoint = origin_child->mnt_mountpoint;
> +	struct mount *child;
>
> -	BUG_ON(parent == mnt);
> +	/* Has this part of the propagation tree already been revisited? */
> +	if (!IS_MNT_MARKED(last_child))
> +		return NULL;
>
> -	for (m = propagation_next(parent, parent); m;
> -			m = propagation_next(m, parent)) {
> +	CLEAR_MNT_MARK(last_child);
>
> -		struct mount *child = __lookup_mnt_last(&m->mnt,
> -						mnt->mnt_mountpoint);
> -		/*
> -		 * umount the child only if the child has no children
> -		 * and the child is marked safe to unmount.
> -		 */
> -		if (!child || !IS_MNT_MARKED(child))
> -			continue;
> -		CLEAR_MNT_MARK(child);
> -		if (list_empty(&child->mnt_mounts)) {
> -			list_del_init(&child->mnt_child);
> -			child->mnt.mnt_flags |= MNT_UMOUNT;
> -			list_move_tail(&child->mnt_list, &mnt->mnt_list);
> +	/* are there any slaves of this mount? */
> +	if (!list_empty(&m->mnt_slave_list)) {
> +		m = first_slave(m);
> +		goto check_slave;
> +	}
> +	while (1) {
> +		struct mount *master = m->mnt_master;
> +
> +		if (master == origin->mnt_master) {
> +			struct mount *next = next_peer(m);
> +			while (1) {
> +				if (next == origin)
> +					return NULL;
> +				child = __lookup_mnt_last(&next->mnt, mountpoint);
> +				if (child && IS_MNT_MARKED(child))
> +					return child;
> +				next = next_peer(next);
> +			}
> +		} else {
> +			while (1) {
> +				if (m->mnt_slave.next == &master->mnt_slave_list)
> +					break;
> +				m = next_slave(m);
> +check_slave:
> +				child = __lookup_mnt_last(&m->mnt, mountpoint);
> +				if (child && IS_MNT_MARKED(child))
> +					return child;
> +			}
>  		}
> +
> +		/* back at master */
> +		m = master;
>  	}
>  }
>
> +static void start_umount_propagation(struct mount *child,
> +				     struct list_head *to_umount)
> +{
> +	do {
> +		struct mount *parent = child->mnt_parent;
> +
> +		if ((child->mnt.mnt_flags & MNT_UMOUNT) ||
> +		    !list_empty(&child->mnt_mounts))
> +			return;
> +
> +		if (!IS_MNT_LOCKED(child))
> +			list_move_tail(&child->mnt_child, to_umount);
> +		else
> +			list_move_tail(&child->mnt_child, &parent->mnt_umounts);
> +
> +		child = NULL;
> +		if (IS_MNT_MARKED(parent))
> +			child = parent;
> +	} while (child);
> +}
> +
> +static void end_umount_propagation(struct mount *child)
> +{
> +	struct mount *parent = child->mnt_parent;
> +
> +	if (!list_empty(&parent->mnt_umounts))
> +		list_splice_tail_init(&parent->mnt_umounts, &parent->mnt_mounts);
> +}
> +
> +
>  /*
>   * collect all mounts that receive propagation from the mount in @list,
>   * and return these additional mounts in the same list.
> @@ -447,14 +544,39 @@ static void __propagate_umount(struct mount *mnt)
>   *
>   * vfsmount lock must be held for write
>   */
> -int propagate_umount(struct list_head *list)
> +void propagate_umount(struct list_head *list)
>  {
>  	struct mount *mnt;
> +	LIST_HEAD(to_umount);
> +	LIST_HEAD(tmp_list);
> +
> +	/* Find candidates for unmounting */
> +	list_for_each_entry(mnt, list, mnt_list) {
> +		struct mount *child;
> +		for (child = propagation_visit_child(mnt, mnt); child;
> +		     child = propagation_visit_child(child, mnt))
> +			start_umount_propagation(child, &to_umount);
> +	}
>
> -	list_for_each_entry_reverse(mnt, list, mnt_list)
> -		mark_umount_candidates(mnt);
> +	/* Begin unmounting */
> +	while (!list_empty(&to_umount)) {
> +		mnt = list_first_entry(&to_umount, struct mount, mnt_child);
>
> -	list_for_each_entry(mnt, list, mnt_list)
> -		__propagate_umount(mnt);
> -	return 0;
> +		list_del_init(&mnt->mnt_child);
> +		mnt->mnt.mnt_flags |= MNT_UMOUNT;
> +		list_move_tail(&mnt->mnt_list, &tmp_list);
> +
> +		if (!list_empty(&mnt->mnt_umounts))
> +			list_splice_tail_init(&mnt->mnt_umounts, &to_umount);
> +	}
> +
> +	/* Cleanup the mount propagation tree */
> +	list_for_each_entry(mnt, list, mnt_list) {
> +		struct mount *child;
> +		for (child = propagation_revisit_child(mnt, mnt); child;
> +		     child = propagation_revisit_child(child, mnt))
> +			end_umount_propagation(child);
> +	}
> +
> +	list_splice_tail(&tmp_list, list);
>  }
> diff --git a/fs/pnode.h b/fs/pnode.h
> index 550f5a8b4fcf..38c6cdb96b34 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -41,7 +41,7 @@ static inline void set_mnt_shared(struct mount *mnt)
>  void change_mnt_propagation(struct mount *, int);
>  int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
>  		struct hlist_head *);
> -int propagate_umount(struct list_head *);
> +void propagate_umount(struct list_head *);
>  int propagate_mount_busy(struct mount *, int);
>  void propagate_mount_unlock(struct mount *);
>  void mnt_release_group_id(struct mount *);
> --
> 2.10.1
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html