Re: [PATCH 17/17] RCU'd vfsmounts

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 3 Oct 2013 13:19:16 -0700

On Thu, Oct 3, 2013 at 12:43 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> In the common case it's ->mnt_ns is *not* NULL; that's what we get if
> the damn thing is still mounted.

Yeah, I misread the profile assembly code. The point being that the
nice fast case now has the smp_mb() in it, and it accounts for about
60% of the cost of that function on my performance profile.

> What we need to avoid is this:
>
> mnt_ns non-NULL, mnt_count is 2
> CPU1: umount -l                                 CPU2: mntput
> umount_tree() clears mnt_ns
> drop mount_lock.lock
> namespace_unlock() calls mntput()
> decrement mnt_count
> see that mnt_ns is NULL
> grab mount_lock.lock
> check mnt_count
>                                                 decrement mnt_count
>                                                 see old value of mnt_ns
>                                                 decide to bugger off
> see it equal to 1 (i.e. miss decrement on CPU2)
> decide to bugger off
>
> The barrier in mntput() is to prevent that combination, so that either CPU2
> would see mnt_ns cleared by CPU1, or CPU1 would see mnt_count decrement done
> by CPU2.  Its counterpart on CPU1 is provided by spin_unlock/spin_lock we've
> done between clearing mnt_ns and checking mnt_count.  Note that
> synchronize_rcu() in namespace_unlock() and rcu_read_lock() in mntput() are
> irrelevant here - the latter on CPU2 might very well have happened after the
> former on CPU1, so umount -l did *not* wait for CPU2 to do anything.
>
> Any suggestions re getting rid of that barrier?

Hmm. The CPU2 mntput can only happen under RCU readlock, right? After
the RCU grace period _and_ if the umount is going ahead, nothing
should have a mnt pointer, right?

So I'm wondering if you couldn't just have a synchronize_rcu() in that
umount path, after clearing mnt_ns. At that point you _know_ you're
the only one that should have access to the mnt.

You'd need to drop the mount-hash lock for that. But I think you can
do it in umount_tree(), right? IOW, you could make the rule be that
umount_tree() must be called with the namespace lock and the
mount-hash lock, and it will drop both. Or does that get too painful
too?

                Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html