Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 5 Dec 2024 12:01:06 -0800

On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > >  {
> > > > > -     struct files_struct *files = current->files;
> > > > > +     struct files_struct *files;
> > > > >       struct fdtable *fdt;
> > > > >
> > > > >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > >               return;
> > > > >
> > > > > +     /*
> > > > > +      * Synchronized with expand_fdtable(), see that routine for an
> > > > > +      * explanation.
> > > > > +      */
> > > > >       rcu_read_lock_sched();
> > > > > +     files = READ_ONCE(current->files);
> > > >
> > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
> 
> ok

And rcu_dereference(), and for that matter memory_order_consume, only
orders the load of the pointer against subsequent dereferences of that
same pointer against dereferences of that same pointer preceding the
store of that pointer.

	T1				T2
	a: p->a = 1;			d: q = rcu_dereference(gp);
	b: r1 = p->b;			e: r2 = p->a;
	c: rcu_assign_pointer(gp, p);	f: p->b = 42;

Here, if (and only if!) T2's load into q gets the value stored by
T1, then T1's statements e and f are guaranteed to happen after T2's
statements a and b.  In your patch, I do not see this pattern for the
files->resize_in_progress flag.

> > > Anyway to elaborate I'm gunning for a setup where the code is
> > > semantically equivalent to having a lock around the work.
> >
> > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > only with later RCU grace periods, such as those implemented by
> > synchronize_rcu().
> 
> To my understanding the pre-case is already with the flag set upfront
> and waiting for everyone to finish (which is already taking place in
> stock code) + looking at it within the section.

I freely confess that I do not understand the purpose of assigning to
files->resize_in_progress both before (pre-existing) and within (added)
expand_fdtable().  If the assignments before and after the call to
expand_fdtable() and the checks were under that lock, that could work,
but removing that lockless check might have performance and scalability
consequences.

> > > Pretend ->resize_lock exists, then:
> > > fd_install:
> > > files = current->files;
> > > read_lock(files->resize_lock);
> > > fdt = rcu_dereference_sched(files->fdt);
> > > rcu_assign_pointer(fdt->fd[fd], file);
> > > read_unlock(files->resize_lock);
> > >
> > > expand_fdtable:
> > > write_lock(files->resize_lock);
> > > [snip]
> > > rcu_assign_pointer(files->fdt, new_fdt);
> > > write_unlock(files->resize_lock);
> > >
> > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > synchronize_rcu do it.
> >
> > OK, good, you did get the grace-period part of the puzzle.
> >
> > Howver, please keep in mind that synchronize_rcu() has significant
> > latency by design.  There is a tradeoff between CPU consumption and
> > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > several milliseconds (not microseconds or nanoseconds).  I would be very
> > surprised if expand_fdtable() users would be happy with such a long delay.
> 
> The call is already there since 2015 and I only know of one case where
> someone took an issue with it (and it could have been sorted out with
> dup2 upfront to grow the table to the desired size). Amusingly I see
> you patched it in 2018 from synchronize_sched to synchronize_rcu.
> Bottom line though is that I'm not *adding* it. latency here. :)

Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
you are saying that, because otherwise your patch could simply remove
it without additional code changes.  On the other hand, if it is a key
component of the synchronization, I don't see how that smp_rmb() can be
removed while still preserving that synchronization without adding another
synchronize_rcu() to that function to compensate.

Now, it might be that you are somehow cleverly reusing the pre-existing
synchronize_rcu(), but I am not immediately seeing how this would work.

And no, I do not recall making that particular change back in the
day, only that I did change all the calls to synchronize_sched() to
synchronize_rcu().  Please accept my apologies for my having failed
to meet your expectations.  And do not be too surprised if others have
similar expectations of you at some point in the future.  ;-)

> So assuming the above can be ignored, do you confirm the patch works
> (even if it needs some cosmetic changes)?
> 
> The entirety of the patch is about removing smp_rmb in fd_install with
> small code rearrangement, while relying on the machinery which is
> already there.

The code to be synchronized is fairly small.  So why don't you
create a litmus test and ask herd7?  Please see tools/memory-model for
documentation and other example litmus tests.  This tool does the moral
equivalent of a full state-space search of the litmus tests, telling you
whether your "exists" condition is always, sometimes, or never satisfied.

							Thnax, Paul