Re: [RFC v3 1/1] fs/namespace: remove RCU sync for MNT_DETACH umount

Ian Kent <ikent@xxxxxxxxxx> · Mon, 1 Jul 2024 16:03:41 +0800

On 1/7/24 13:50, Christian Brauner wrote:
I always thought the rcu delay was to ensure concurrent path walks "see" the
umount, not to ensure correct operation of the following mntput()(s).

Isn't the sequence of operations roughly: resolve path, lock, detach, release
lock, rcu wait, mntput() subordinate mounts, put path?
The crucial bit is really that synchronize_rcu_expedited() ensures that
the final mntput() won't happen until path walk leaves RCU mode.

This allows callers like legitimize_mnt(), which are called with only
the RCU read-lock during lazy path walk, to simply check for
MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see
that this mount is MNT_SYNC_UMOUNT then they know that the mount won't
be freed until an RCU grace period is up and so they know that they can
simply put the reference count they took _without having to actually
call mntput()_.

Because if they did have to call mntput() they might end up shutting the
filesystem down instead of umount() and that will cause said EBUSY
errors I mentioned in my earlier mails.
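
The check described above can be sketched as a small single-threaded userspace
model. This is only an illustration of the reasoning, not the kernel's code:
every toy_* name, the plain integer counter and the sequence field are
hypothetical stand-ins, and the real legitimize_mnt() handles more cases than
are shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define TOY_MNT_SYNC_UMOUNT 0x1u

struct toy_mount {
	int count;          /* reference count */
	unsigned int flags; /* TOY_MNT_SYNC_UMOUNT once umount has marked it */
	unsigned int seq;   /* bumped whenever the mount tree changes */
	bool shut_down;     /* set when a full put drops the last reference */
};

enum toy_result { TOY_OK, TOY_RETRY };

/* Full put: dropping the last reference shuts the filesystem down. */
static void toy_mntput(struct toy_mount *m)
{
	assert(m->count > 0);
	if (--m->count == 0)
		m->shut_down = true;
}

/* Model of the check made by a lazy (RCU-mode) path walk. */
static enum toy_result toy_legitimize(struct toy_mount *m, unsigned int walk_seq)
{
	m->count++;                 /* speculatively take a reference */
	if (m->seq == walk_seq)
		return TOY_OK;      /* nothing changed, the reference is good */
	if (m->flags & TOY_MNT_SYNC_UMOUNT) {
		/* umount still holds its reference until the grace period
		 * ends, so a plain decrement can never be the final put. */
		m->count--;
		return TOY_RETRY;
	}
	toy_mntput(m);              /* otherwise a real put is needed */
	return TOY_RETRY;
}
```

In this model a walker that loses the race against a marked mount only undoes
its own increment, so the last reference is always dropped by the umount side
and the walker never ends up shutting the filesystem down.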

Yes, I get that, the problem with this was always whether lockless path walks
would correctly see the mount had become invalid when being checked for
legitimacy.

So the mount gets detached in the critical section, then we wait, followed by
the mntput()(s). The catch is that not waiting might increase the likelihood
that concurrent path walks don't see the umount (so that possibly the umount
goes away before the walks see the umount) but I'm not certain. What looks to
be as much of a problem is mntput() racing with a concurrent mount because,
while the detach is done in the critical section, the super block instance list
deletion is not, and the wait will make the race possibility more likely. What's more
Concurrent mounters of the same filesystem will wait for each other via
grab_super(). That has its own logic based on sb->s_active which goes
to zero when all mounts are gone.

Yep, missed that, I'm too hasty, thanks for your patience.

mntput() delegates the mount cleanup (which deletes the list instance) to a
workqueue job so this can also occur serially in a following mount command.
No, that only happens when it's a kthread. Regular umount() call goes
via task work which finishes before the caller returns to userspace
(same as closing files work).

Umm, misread that, oops!

Ian

In fact I might have seen exactly this behavior in a recent xfs-tests run where
I was puzzled to see occasional EBUSY return on mounting of mounts that should
not have been in use following their umount.
That's usually very much other bugs. See commit 2ae4db5647d8 ("fs: don't
misleadingly warn during thaw operations") in vfs.fixes for example.

So I think there are problems here but I don't think the removal of the wait
for lazy umount is the worst of it.

The question then becomes, to start with, how do we resolve this unjustified
EBUSY return. Perhaps a completion (used between the umount and mount system
calls) would work well here?
Again, this already exists deeper down the stack...