Re: [RFC v3 1/1] fs/namespace: remove RCU sync for MNT_DETACH umount

On 28/6/24 23:13, Alexander Larsson wrote:
On Fri, Jun 28, 2024 at 2:54 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
On Fri, Jun 28, 2024 at 11:17:43AM GMT, Ian Kent wrote:
On 27/6/24 23:16, Christian Brauner wrote:
On Thu, Jun 27, 2024 at 01:54:18PM GMT, Jan Kara wrote:
On Thu 27-06-24 09:11:14, Ian Kent wrote:
On 27/6/24 04:47, Matthew Wilcox wrote:
On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote:
+++ b/fs/namespace.c
@@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
    static DECLARE_RWSEM(namespace_sem);
    static HLIST_HEAD(unmounted);    /* protected by namespace_sem */
    static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
+static bool lazy_unlock = false; /* protected by namespace_sem */
That's a pretty ugly way of doing it.  How about this?
Ha!

That was my original thought but I also didn't much like changing all
the callers.

I don't really like the proliferation of these small helper functions
either but if everyone is happy to do this I think it's a great idea.
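
For reference, Matthew's counter-patch isn't quoted in this archive view.
As a rough, hypothetical sketch of the approach being discussed (names
invented here, not taken from his actual patch), the alternative to the
file-scope boolean is to thread the flag through as a parameter and add
thin wrappers:

static void __namespace_unlock(bool lazy)
{
	/* ... move the unmounted list out, drop namespace_sem ... */

	/* hypothetical: skip the grace period for lazy (MNT_DETACH) umounts */
	if (!lazy)
		synchronize_rcu_expedited();

	/* ... mntput() the collected mounts ... */
}

static inline void namespace_unlock(void)
{
	__namespace_unlock(false);
}

static inline void namespace_unlock_lazy(void)
{
	__namespace_unlock(true);
}

Wrappers of this kind are the "proliferation of these small helper
functions" being referred to.
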
So I know you've suggested removing synchronize_rcu_expedited() call in
your comment to v2. But I wonder why is it safe? I *thought*
synchronize_rcu_expedited() is there to synchronize the dropping of the
last mnt reference (and maybe something else) - see the comment at the
beginning of mntput_no_expire() - and this change would break that?
Yes. During umount mnt->mnt_ns will be set to NULL with namespace_sem
and the mount seqlock held. mntput() doesn't acquire namespace_sem as
that would get rather problematic during path lookup. It also elides
lock_mount_hash() by looking at mnt->mnt_ns because that's set to NULL
when a mount is actually unmounted.

So iirc synchronize_rcu_expedited() will ensure that it is actually the
system call that shuts down all the mounts it put on the umounted list
and not some other task that also called mntput() as that would cause
pretty blatant EBUSY issues.

So callers that come before mnt->mnt_ns = NULL simply return of course
but callers that come after mnt->mnt_ns = NULL will acquire
lock_mount_hash() _under_ rcu_read_lock(). These callers see an elevated
reference count and thus simply return while namespace_lock()'s
synchronize_rcu_expedited() prevents the system call from making
progress.
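
The fast path Christian describes is visible in mntput_no_expire(); a
trimmed-down excerpt (simplified from fs/namespace.c, memory-barrier and
teardown details elided):

static void mntput_no_expire(struct mount *mnt)
{
	rcu_read_lock();
	if (likely(READ_ONCE(mnt->mnt_ns))) {
		/*
		 * ->mnt_ns is still set, so this cannot be the final
		 * reference: just decrement the count and return.
		 */
		mnt_add_count(mnt, -1);
		rcu_read_unlock();
		return;
	}
	/* umount set ->mnt_ns to NULL: fall back to the mount hash lock */
	lock_mount_hash();
	mnt_add_count(mnt, -1);
	if (mnt_get_count(mnt)) {
		/* elevated count: the umounting syscall does the teardown */
		rcu_read_unlock();
		unlock_mount_hash();
		return;
	}
	/* ... actual teardown of the mount ... */
}
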

But I also don't see it working without risk even with MNT_DETACH. It
still has potential to cause issues in userspace. Any program that
always passes MNT_DETACH, simply to ensure that a mount is unmounted
even in the very rare case that it is still busy, might now end up
seeing increased EBUSY failures for mounts that didn't actually need to
be unmounted with MNT_DETACH. In other words, this is only innocuous if
userspace only uses MNT_DETACH for stuff they actually know is busy when
they're trying to unmount. And I don't think that's the case.

I'm sorry, but how does an MNT_DETACH umount system call return EBUSY?
I can't see how that can happen.
The umount() call itself isn't the problem. Say you have the following
sequence:

(1) mount(ext4-device, /mnt)
     umount(/mnt, 0)
     mount(ext4-device, /mnt)

If that ext4 filesystem isn't in use anymore then umount() will succeed.
The same task can immediately issue a second mount() call on the same
device and it must succeed.

Today the behavior for this is the same whether or not the caller uses
MNT_DETACH. So:

(2) mount(ext4-device, /mnt)
     umount(/mnt, MNT_DETACH)
     mount(ext4-device, /mnt)

All that MNT_DETACH does is to skip the check for busy mounts; otherwise
it's identical to a regular umount. So (1) and (2) will behave the same
as long as the filesystem isn't used anymore.

But afaict with your changes this wouldn't be true anymore. If someone
uses (2) on a filesystem that isn't busy then they might end up getting
EBUSY on the second mount. And if I'm right then that's potentially a
rather visible change.
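
For concreteness, sequence (2) as a userspace sketch (device and
mountpoint paths are placeholders):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("/dev/sdb1", "/mnt", "ext4", 0, NULL))
		perror("mount 1");
	if (umount2("/mnt", MNT_DETACH))	/* lazy umount */
		perror("umount2");
	/*
	 * Today this succeeds if the filesystem was idle. The concern is
	 * that without the RCU sync the old super block may still be
	 * around, turning this into a spurious EBUSY.
	 */
	if (mount("/dev/sdb1", "/mnt", "ext4", 0, NULL))
		perror("mount 2");
	return 0;
}
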

I'm not sure this change affects the likelihood of an EBUSY return in
the described case; in fact it looks like it does the opposite.

I always thought the rcu delay was to ensure concurrent path walks "see"
the umount, not to ensure correct operation of the following mntput()(s).

Isn't the sequence of operations roughly: resolve path, lock, detach,
release lock, rcu wait, mntput() subordinate mounts, put path?
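
That is the shape of namespace_unlock() today (simplified from
fs/namespace.c; ex_mountpoints handling elided):

static void namespace_unlock(void)
{
	struct hlist_head head;
	struct hlist_node *p;
	struct mount *m;

	/* the detach happened under namespace_sem; take the results */
	hlist_move_list(&unmounted, &head);
	up_write(&namespace_sem);

	if (likely(hlist_empty(&head)))
		return;

	/* let concurrent RCU path walks finish with these mounts */
	synchronize_rcu_expedited();

	hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
		hlist_del(&m->mnt_umount);
		mntput(&m->mnt);
	}
}
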


So the mount gets detached in the critical section, then we wait,
followed by the mntput()(s). The catch is that not waiting might
increase the likelihood that concurrent path walks don't see the umount
(so that possibly the umount goes away before the walks see it) but I'm
not certain. What looks to be as much of a problem is mntput() racing
with a concurrent mount, because while the detach is done in the
critical section, the super block instance list deletion is not, and the
wait will make the race possibility more likely. What's more, mntput()
delegates the mount cleanup (which deletes the list instance) to a
workqueue job, so this can also occur serially in a following mount
command.
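
The delegation mentioned above is at the tail of mntput_no_expire(): the
final cleanup runs from task work on return to userspace, or from a
delayed workqueue for kthreads, so the super block can outlive the
umount system call (trimmed from fs/namespace.c):

	if (likely(!(mnt->mnt_flags & MNT_INTERNAL))) {
		struct task_struct *task = current;
		if (likely(!(task->flags & PF_KTHREAD))) {
			/* run __cleanup_mnt() when we return to userspace */
			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
				return;
		}
		/* kthread, or task_work_add() failed: defer to a workqueue */
		if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
			schedule_delayed_work(&delayed_mntput_work, 1);
		return;
	}
	cleanup_mnt(mnt);
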


In fact I might have seen exactly this behavior in a recent xfstests run
where I was puzzled to see occasional EBUSY returns on mounting of
mounts that should not have been in use following their umount.


So I think there are problems here but I don't think the removal of the
wait for lazy umount is the worst of it.


The question then becomes, to start with, how do we resolve this
unjustified EBUSY return. Perhaps a completion (used between the umount
and mount system calls) would work well here?
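
Purely as a sketch of that idea (nothing like this exists in the tree;
all names below are invented): the deferred cleanup could signal a
completion that a subsequent mount of the same device waits on briefly,
instead of failing outright:

#include <linux/completion.h>

/* invented: would hang off whatever structure outlives the umount */
struct hypothetical_sb_cleanup {
	struct completion done;		/* init_completion() at umount time */
};

/* umount side: called when the deferred cleanup has finished */
static void hypothetical_cleanup_finished(struct hypothetical_sb_cleanup *c)
{
	complete_all(&c->done);
}

/*
 * mount side: on finding a dying super block instance, wait a moment
 * for the cleanup to complete instead of returning -EBUSY immediately.
 */
static int hypothetical_wait_for_cleanup(struct hypothetical_sb_cleanup *c)
{
	if (!wait_for_completion_timeout(&c->done, HZ))
		return -EBUSY;
	return 0;
}
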


This is rather unfortunate, as the synchronize_rcu call is quite
expensive. In particular on a real-time kernel where there are no
expedited RCUs. This is causing container startup to be slow, as there
are several umount(MNT_DETACH) happening during container setup (after
the pivot_root, etc).

Maybe we can add a umount flag for users that don't need the current
behaviour wrt EBUSY? In the container usecase the important part is
that the old mounts are disconnected from the child namespace and not
really what the mount busy state is (typically it is still mounted in
the parent namespace anyway).
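
A sketch of what such an opt-in might look like (flag name, value, and
helper invented here; this is not an actual proposal or patch):

/* invented flag: caller does not rely on the EBUSY-related semantics */
#define UMOUNT_NOSYNC	0x00000010

static int do_umount(struct mount *mnt, int flags)
{
	/* ... existing permission and state checks ... */
	if (flags & MNT_DETACH) {
		umount_tree(mnt, UMOUNT_PROPAGATE);
		/*
		 * Hypothetical: tell namespace_unlock() it may skip
		 * synchronize_rcu_expedited() for this unmount.
		 */
		if (flags & UMOUNT_NOSYNC)
			set_skip_rcu_sync();	/* invented helper */
	}
	/* ... */
	return 0;
}
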

I think it's a little too soon to try and work out what to do about the
speed of umount, lazy or not.


Umount has taken progressively longer over the years and is in fact
quite slow now. I'm really not sure what to do about that, having looked
at it a number of times without joy. Nevertheless, I believe we need to
find a way to do this or something like it to reduce the delays involved
in umount.


Ian




