Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> On Thu, Jan 05, 2017 at 10:04:14AM +1300, Eric W. Biederman wrote:
>
>> - attach_shadows to be renamed __attach_mnt and the it's shadow
>>   handling to be removed.
>
> Er... s/the it's/its/, presumably? Or am I misparsing that?
>
>> v2: Updated to mnt_change_mountpoint to not call dput or mntput
>> and instead to decrement the counts directly. It is guaranteed
>> that there will be other references when mnt_change_mountpoint is
>> called so this is safe.
>
>> +void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
>>  {
>
> Too generic name, IMO, and I really wonder if "mount" (== interpose) and
> "umount" (== excise?) cases would be better off separately.
>
>> +	struct mountpoint *old_mp = mnt->mnt_mp;
>> +	struct dentry *old_mountpoint = mnt->mnt_mountpoint;
>> +	struct mount *old_parent = mnt->mnt_parent;
>> +
>> +	list_del_init(&mnt->mnt_child);
>> +	hlist_del_init(&mnt->mnt_mp_list);
>> +	hlist_del_init_rcu(&mnt->mnt_hash);
>> +
>> +	attach_mnt(mnt, parent, mp);
>> +
>> +	put_mountpoint(old_mp);
>
>> +	 *
>> +	 * During mounting, another mount will continue to use the old
>> +	 * mountpoint and during unmounting, the old mountpoint will
>> +	 * continue to exist until namespace_unlock which happens well
>> +	 * after mnt_change_mountpoint.
>> +	 */
>
> Umm... AFAICS, in the former case "another mount" is simply parent,
> right?

Yes, it is the new parent mountpoint. I was looking at it from a
different perspective.

>> +	spin_lock(&old_mountpoint->d_lock);
>> +	old_mountpoint->d_lockref.count--;
>> +	spin_unlock(&old_mountpoint->d_lock);
>> +	mnt_add_count(old_parent, -1);
>
>> +		if (child->mnt.mnt_root == smp->m_dentry) {
>
> Explain, please. In which case is that condition _not_ satisfied, and
> what should happen i

When a tree is grafted in, that condition does not apply to the lower
leaves of the tree. At the same time nothing needs to be done for those
leaves. Only the primary mountpoint needs to worry about tucking.

>> +			struct mount *q;
>> +			q = __lookup_mnt(&child->mnt_parent->mnt,
>> +					 child->mnt_mountpoint);
>> +			if (q)
>> +				mnt_change_mountpoint(child, smp, q);
>
>>  	unlock_mount_hash();
>> +	put_mountpoint(smp);
>
> Wrong order...
>
>>  		ns->pending_mounts = 0;
>> +	put_mountpoint(smp);
>
> ... and worse yet here.

Definitely. I totally spaced on propagating the locking changes to this
patch when I rebased it.

>>  static inline int do_refcount_check(struct mount *mnt, int count)
>>  {
>> +	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
>> +	if (topper)
>> +		count++;
>>  	return mnt_get_count(mnt) > count;
>>  }
>
>> @@ -359,7 +362,7 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>>
>>  	for (m = propagation_next(parent, parent); m;
>>  			m = propagation_next(m, parent)) {
>> -		child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint);
>> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
>>  		if (child && list_empty(&child->mnt_mounts) &&
>>  		    (ret = do_refcount_check(child, 1)))
>>  			break;
>
> Er... You do realize that you can end up with more that one such
> propagation, right? IOW, there might be more than one thing slipped
> in.

So I have stared at this a lot and I don't see what you seem to be
seeing here.

I do however see that propagate_mount_busy has been buggy since the
beginning, as it only fails in the propagation case if the list of
children is empty. I also see that my modification to the code is buggy,
since the list_empty() check precludes my changes to do_refcount_check
from being effective.
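To spell the interaction out (this is just the two hunks above put side
by side with comments, not a proposed fix):

static inline int do_refcount_check(struct mount *mnt, int count)
{
	struct mount *topper = __lookup_mnt(&mnt->mnt, mnt->mnt.mnt_root);
	if (topper)
		count++;	/* allow for the mount tucked on top of mnt */
	return mnt_get_count(mnt) > count;
}

	/* ... while in propagate_mount_busy() ... */
	for (m = propagation_next(parent, parent); m;
	     m = propagation_next(m, parent)) {
		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
		/*
		 * When the only thing mounted on child is the tucked
		 * topper, list_empty() below is false, so the adjusted
		 * do_refcount_check() above never runs and child is not
		 * reported as busy.
		 */
		if (child && list_empty(&child->mnt_mounts) &&
		    (ret = do_refcount_check(child, 1)))
			break;
	}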
I have looked hard and your point about multiple propagations eludes
me. I am going to add a patch to fix propagate_mount_busy, then rebase
this patch on top of that, and post it all for review.

>> +		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>>  			SET_MNT_MARK(child);
>
> Reread the condition, please... And yes, I realize that original is
> also rather odd; at a guess it was meant to be !(, not (!, but that's
> just a guess - it's your code, IIRC.

The intent is to find all trees that we can unmount where the point at
which the tree meets the rest of the mounts is not locked. The mark is
later used to see if it is ok to unmount a mount, or if we would reveal
information to userspace (by breaking a lock).

Therefore the mark needs to be set if the mount is unlocked, and,
recursively, the mark needs to be set for every child of a mount that
has the mark set (the second condition). Which makes the code
essentially correct.

Unfortunately the code does not handle multiple unmounts from the same
parent mount point, which means both the shadow/side mount support and
the untucking of mounts fail to handle multiple unmounts from the same
parent mount. The way the mark is used fundamentally assumes only one
operation on each mountpoint, and that is broken.

I intend to work on propagate_umount some more and fix that brokenness,
and hopefully fix the performance issues as well. But I am leaving that
work for another change, as it is going to require stripping out and
replacing algorithms, and so far I don't have good solutions.

>> @@ -420,8 +425,8 @@ static void __propagate_umount(struct mount *mnt)
>>
>>  	for (m = propagation_next(parent, parent); m;
>>  			m = propagation_next(m, parent)) {
>> -
>> -		struct mount *child = __lookup_mnt_last(&m->mnt,
>> +		struct mount *topper;
>> +		struct mount *child = __lookup_mnt(&m->mnt,
>>  						mnt->mnt_mountpoint);
>>  		/*
>>  		 * umount the child only if the child has no children
>> @@ -430,6 +435,16 @@ static void __propagate_umount(struct mount *mnt)
>>  		if (!child || !IS_MNT_MARKED(child))
>>  			continue;
>>  		CLEAR_MNT_MARK(child);
>> +
>> +		/* If there is exactly one mount covering all of child
>> +		 * replace child with that mount.
>> +		 */
>> +		topper = __lookup_mnt(&child->mnt, child->mnt.mnt_root);
>> +		if (topper &&
>> +		    (child->mnt_mounts.next == &topper->mnt_child) &&
>> +		    (topper->mnt_child.next == &child->mnt_mounts))
>
> Weird way to spell child->mnt_mounts.next == child->mnt_mounts.prev, that...
> Or, perhaps, the entire thing ought to be

Except it is clearer than that. It verifies not just that there is one
list item but that topper is that one list item. Further, it is the
exact same logic as is used in do_refcount_check, and at this stage in
development I figured it was more important to have a recognizable
pattern than to have the absolute most performant code. Especially as
the basic complexity of the code is the same either way.

> 	if (list_is_singular(&child->mnt_mounts)) {
> 		topper = list_first_entry(&child->mnt_mounts,
> 					  struct mount, mnt_child);
> 		if (topper->mnt_parent == child &&
> 		    topped->mnt_mountpoint == child->mnt.mnt_root)
>
> to avoid hash lookups.

That it would, and now that I see that list_is_singular exists it looks
like a reasonable option.
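I presume it would end up looking roughly like this (an untested sketch
only, using the same struct mount fields as the hunks above):

		/* Look for a single mount covering all of child,
		 * without a hash lookup, as suggested above.
		 */
		topper = NULL;
		if (list_is_singular(&child->mnt_mounts)) {
			struct mount *only;

			only = list_first_entry(&child->mnt_mounts,
						struct mount, mnt_child);
			/* the single child covers all of child only if
			 * it sits on child's root
			 */
			if (only->mnt_parent == child &&
			    only->mnt_mountpoint == child->mnt.mnt_root)
				topper = only;
		}
		if (topper)
			mnt_change_mountpoint(child->mnt_parent,
					      child->mnt_mp, topper);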
>> +			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, topper);
>
> FWIW, the my main worry here is your handling of the umount. For
> example, what happens if
> 	* something is mounted on A (m1)
> 	* something else is mounted on A/bar (m2)
> 	* D is a slave of C
> 	* something has been mounted on D/foo (n)
> 	* you do mount --rbind A C/foo (m1' on C/foo, m2' on m1'/bar,
> 	  m1'' interposed on D/foo under n,
> 	  m2'' on m1''/bar,
> 	  m1'' slave of m1', m2'' slave of m2)
> 	* you make C/foo and C/foo/bar private (m1'' and m2'' are not getting
> 	  propagation from m1' and m2' anymore)
> 	* you umount C/foo/bar (m2' is unmounted)
> 	* you umount C/foo
> m1' gets unmounted, all right, but what of m1''? D is a slave of C, so we
> get propagation of umount from C/foo to D/foo; m1'' still has m2'' attached
> to it. AFAICS, your logics will happily slip m1'' from under n (assuming
> that n itself is not busy), and leak both m1'' and m2''.

Yes. This is exactly the same behavior we have today without my patch.
The only difference is which mount is the parent.

$ cat > viro1.sh << EOF
#!/bin/sh
set -e
set -x
mount -t tmpfs base /mnt
mkdir -p /mnt/A
mount -t tmpfs m1 /mnt/A
mkdir -p /mnt/A/bar
mount -t tmpfs m2 /mnt/A/bar
mkdir -p /mnt/D
mkdir -p /mnt/C
mount -t tmpfs mC /mnt/C
mkdir -p /mnt/C/foo
mount --make-shared /mnt/C
mount --bind /mnt/C /mnt/D
mount --make-slave /mnt/D
mount -t tmpfs n /mnt/D/foo
mount --rbind /mnt/A /mnt/C/foo
echo
cat /proc/self/mountinfo
mount --make-private /mnt/C/foo
mount --make-private /mnt/C/foo/bar
echo
cat /proc/self/mountinfo
umount /mnt/C/foo/bar
echo
cat /proc/self/mountinfo
umount /mnt/C/foo
echo
cat /proc/self/mountinfo
EOF
$ chmod +x ./viro1.sh
$ unshare -Urm ./viro1.sh

At least when I run the above on a kernel with and without my patch
under discussion, the only difference I see is the mnt_id of the parent,
which is exactly what we should expect from this change.

Did I make a mistake in creating my script? Or are you referring to the
fact that propagate_mount_busy is just plain buggy?

> OTOH, the case of multiple slip-under (Z is slave of Y, which is a slave of X,
> mount on Z, then mount of Y, then mount on X) the check for being busy would
> do very odd things.

I don't see what you are referring to.

Reposting shortly.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html