Re: [RFC] umount/__detach_mounts() race

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Mon, 21 Feb 2022 11:04:32 -0600

Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> 	umount_tree() is very definitely not supposed to be called
> on MNT_UMOUNT subtrees (== stuck-together fragments that got
> unmounted, but not split into individual mount nodes).  Refcounting
> rules are different there and umount_tree() assumes that we start with
> the normal ones.
>
> 	do_umount() appears to be checking for that:
>
> 	if (flags & MNT_DETACH) {
> 		if (!list_empty(&mnt->mnt_list))
> 			umount_tree(mnt, UMOUNT_PROPAGATE);
> 		retval = 0;
> 	} else {
> 		shrink_submounts(mnt);
> 		retval = -EBUSY;
> 		if (!propagate_mount_busy(mnt, 2)) {
> 			if (!list_empty(&mnt->mnt_list))
> 				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> 			retval = 0;
> 		}
> 	}
>
> which would prevent umount_tree() on those - mnt_list eviction happens
> for the same nodes that get MNT_UMOUNT.  However, shrink_submounts()
> will call umount_tree() for e.g. nfs automounts it finds on victim
> mount, and if ours happens to be already unmounted, with automounts
> stuck to it, we have trouble.
>
> It looks like something that should be impossible to hit, but...
>
> A: umount(2) looks the sucker up
> B: rmdir(2) in another namespace (where nothing is mounted on that mountpoint)
> does __detach_mounts(), which grabs namespace_sem, sees the mount A is about
> to try and kill and calls umount_tree(mnt, UMOUNT_CONNECTED).  Which detaches
> our mount (and its children, automounts included) from the namespace it's in,
> modifies their refcounts accordingly and keeps the entire thing in one
> piece.
> A: in do_umount() blocks on namespace_sem
> B: drops namespace_sem
> A: gets to the quoted code.  mnt is already MNT_UMOUNT (and has empty
> ->mnt_list), but it does have (equally MNT_UMOUNT) automounts under it,
> etc.  So shrink_submounts() finds something to umount and calls umount_tree().
> Buggered refcounts happen.
>
> Does anybody see a problem with the following patch?

I think it looks like 2 patches.
One patch to remove some list_empty checks.
> -		if (!list_empty(&mnt->mnt_list))
> -			umount_tree(mnt, UMOUNT_PROPAGATE);
> +		umount_tree(mnt, UMOUNT_PROPAGATE);

Another patch to add a MNT_UMOUNT condition.

The MNT_UMOUNT condition should probably be checked optimistically along
with MNT_FORCE in can_umount as well to be explicit.  I think it is
redundant with check_mnt but it would add clarity to what we are looking
for, and keep people thinking about the MNT_UMOUNT races.

Testing MNT_UMOUNT after the locks have been grabbed seem very
reasonable.  Alternately the code could call check_mnt again and lean on
that test.  But testing MNT_UMOUNT seems sufficient.

> diff --git a/fs/namespace.c b/fs/namespace.c
> index 42d4fc21263b2..1604b9d7a69d9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1654,21 +1654,20 @@ static int do_umount(struct mount *mnt, int flags)
>  	lock_mount_hash();
>  
>  	/* Recheck MNT_LOCKED with the locks held */
> +	/* ... and don't go there if we'd raced and it's already unmounted */
>  	retval = -EINVAL;
> -	if (mnt->mnt.mnt_flags & MNT_LOCKED)
> +	if (mnt->mnt.mnt_flags & (MNT_LOCKED|MNT_UMOUNT))
>  		goto out;
>  
>  	event++;
>  	if (flags & MNT_DETACH) {
> -		if (!list_empty(&mnt->mnt_list))
> -			umount_tree(mnt, UMOUNT_PROPAGATE);
> +		umount_tree(mnt, UMOUNT_PROPAGATE);
>  		retval = 0;
>  	} else {
>  		shrink_submounts(mnt);
>  		retval = -EBUSY;
>  		if (!propagate_mount_busy(mnt, 2)) {
> -			if (!list_empty(&mnt->mnt_list))
> -				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> +			umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
>  			retval = 0;
>  		}
>  	}

Eric