Re: [RFC v3 1/1] fs/namespace: remove RCU sync for MNT_DETACH umount

Jan Kara <jack@xxxxxxx> · Tue, 2 Jul 2024 12:01:47 +0200

On Tue 02-07-24 15:01:54, Ian Kent wrote:
> On 2/7/24 12:58, Christian Brauner wrote:
> > On Fri, Jun 28, 2024 at 01:13:45PM GMT, Jan Kara wrote:
> > > On Fri 28-06-24 10:58:54, Ian Kent wrote:
> > > > On 27/6/24 19:54, Jan Kara wrote:
> > > > > On Thu 27-06-24 09:11:14, Ian Kent wrote:
> > > > > > On 27/6/24 04:47, Matthew Wilcox wrote:
> > > > > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote:
> > > > > > > > +++ b/fs/namespace.c
> > > > > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
> > > > > > > >     static DECLARE_RWSEM(namespace_sem);
> > > > > > > >     static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
> > > > > > > >     static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> > > > > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */
> > > > > > > That's a pretty ugly way of doing it.  How about this?
> > > > > > Ha!
> > > > > > 
> > > > > > That was my original thought but I also didn't much like changing all the
> > > > > > callers.
> > > > > > 
> > > > > > I don't really like the proliferation of these small helper functions either
> > > > > > but if everyone
> > > > > > 
> > > > > > is happy to do this I think it's a great idea.
> > > > > So I know you've suggested removing synchronize_rcu_expedited() call in
> > > > > your comment to v2. But I wonder why is it safe? I *thought*
> > > > > synchronize_rcu_expedited() is there to synchronize the dropping of the
> > > > > last mnt reference (and maybe something else) - see the comment at the
> > > > > beginning of mntput_no_expire() - and this change would break that?
> > > > Interesting, because of the definition of lazy umount I didn't look closely
> > > > enough at that.
> > > > 
> > > > But I wonder, how exactly would that race occur, is holding the rcu read
> > > > lock sufficient since the rcu'd mount free won't be done until it's
> > > > released (at least I think that's how rcu works).
> > > I'm concerned about a race like:
> > > 
> > > [path lookup]				[umount -l]
> > > ...
> > > path_put()
> > >    mntput(mnt)
> > >      mntput_no_expire(m)
> > >        rcu_read_lock();
> > >        if (likely(READ_ONCE(mnt->mnt_ns))) {
> > > 					do_umount()
> > > 					  umount_tree()
> > > 					    ...
> > > 					    mnt->mnt_ns = NULL;
> > > 					    ...
> > > 					  namespace_unlock()
> > > 					    mntput(&m->mnt)
> > > 					      mntput_no_expire(mnt)
> > > 				              smp_mb();
> > > 					      mnt_add_count(mnt, -1);
> > > 					      count = mnt_get_count(mnt);
> > > 					      if (count != 0) {
> > > 						...
> > > 						return;
> > >          mnt_add_count(mnt, -1);
> > >          rcu_read_unlock();
> > >          return;
> > > -> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount!
> > >        }
> > Yeah, I think that's a valid concern. mntput_no_expire() requires that
> > the last reference is dropped after an rcu grace period and that can
> > only be done by synchronize_rcu_*() (It could be reworked but that would
> > be quite ugly.). See also mnt_make_shortterm() caller's for kernel
> > initiated unmounts.
> 
> I've thought about this a couple of times now.
> 
> Isn't it the case here that the path lookup thread will have taken a
> reference (because it's calling path_put()) and the umount will have
> taken a reference on system call entry.

Yes, path lookup has taken a reference to mnt in this case. Umount syscall
also has a reference to the mount in its struct path it has got from
user_path_at(). But note that single umount call can end up tearing the
whole tree of mounts AFAICT (in umount_tree()) so you cannot really rely on
the fact that the syscall holds a ref to the mount it is tearing down.

Secondly, even if the path_umount() would be holding a reference to the
mount being torn down, it is trivial to extend the race window so that
the task doing 'umount -l' continues until it gets past mntput_no_expire()
in path_umount() and only then the task doing path_put() wakes up and you
get the same problem.

> So for the mount being umounted the starting count will be at lest three
> then if the umount mntput() is called from namespace_unlock() it will
> correctly see count != 0 and the path lookup mntput() to release it's
> reference finally leaving the mntput() of the path_put() from the top
> level system call function to release the last reference.
> 
> Once again I find myself thinking this should be independent of the rcu
> wait because only path walks done before the mount being detached can be
> happening and the lockless walks are done holding the rcu read lock and
> how likely is it a ref-walk path lookup (that should follow a failed
> rcu-walk in this case) has been able to grab a reference anyway?

No, it really is not independent of the RCU wait. mntput_no_expire() uses
RCU for proper synchronization of dropping of the last mount reference.
AFAIU there's a rule - after you clear mnt->mnt_ns, you must wait for RCU
period to expire before you can drop the mnt reference you are holding. RCU
path walk uses this fact but also any part of the kernel calling
mntput_no_expire() implicitely depends on this behavior. And the changes to
lazy umount path must not break this rule (or they have to come up with a
different way to synchronize dropping of the last mnt reference).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR