Re: [PATCH RFC v2 1/6] fs: add i_state helpers

On Thu, Aug 22, 2024 at 08:12:15AM +1000, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
> > 
> > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> > ---
> >  fs/inode.c         | 10 ++++++++++
> >  include/linux/fs.h | 16 ++++++++++++++++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 154f8689457f..f2a2f6351ec3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> >  		inode->i_state |= I_REFERENCED;
> >  }
> >  
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit)
> > +{
> > +	void *bit_address;
> > +
> > +	bit_address = inode_state_wait_address(inode, bit);
> > +	init_wait_var_entry(wqe, bit_address, 0);
> > +	return __var_waitqueue(bit_address);
> > +}
> > +
> >  /*
> >   * Add inode to LRU if needed (inode is unused and clean).
> >   *
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23e7d46b818a..a5b036714d74 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -744,6 +744,22 @@ struct inode {
> >  	void			*i_private; /* fs or device private pointer */
> >  } __randomize_layout;
> >  
> > +/*
> > + * Get bit address from inode->i_state to use with wait_var_event()
> > + * infrastructure.
> > + */
> > +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> > +
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit);
> > +
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
> 
> The above comment is wrong.  I think I once thought it was correct too
> but now I know better (I hope).
> A better comment might be
>        /* Insert memory barrier as recommended by wake_up_var() */
> but even that is unnecessary as we don't need the memory barrier.
> 
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_to_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
> 
> So this barrier isn't about the bit.  This barrier is about the
> wait_queue.  In particular it is about waitqueue_active() call at the
> start of wake_up_var().  If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no
> need for any barrier.  A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.

Oh, I missed that *landmine*.
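
For anyone else who missed it, the check in question is the lockless
fast path in __wake_up_bit(). From memory, kernel/sched/wait_bit.c
has roughly:

void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit)
{
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);

	/* lockless check - this is what needs the caller-side barrier */
	if (waitqueue_active(wq_head))
		__wake_up(wq_head, TASK_NORMAL, 1, &key);
}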

So almost none of the current uses of wake_up_var() have an explicit
memory barrier before them, and a lot of them are not done under
spin locks. I suspect the store ordering that the wake queue check
requires is mostly satisfied only by chance - an atomic_dec_and_test()
or smp_store_release() call happens to be made before wake_up_var(),
or the wait/wakeup is serialised by an external lock.
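
For example, a typical caller that happens to be safe looks something
like this (hypothetical refcount pattern, not from this series):

	/*
	 * atomic_dec_and_test() implies a full memory barrier, so the
	 * counter store is ordered before the waitqueue_active()
	 * check buried inside wake_up_var().
	 */
	if (atomic_dec_and_test(&obj->pending))
		wake_up_var(&obj->pending);

Nothing at the call site says that ordering is load-bearing, though.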

> On a weakly ordered arch, this test can take effect *before* the write
> of the bit.  If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test
> has taken effect, then the wake_up will never be sent.
> 
> But ....  this is all academic for this code because you don't need a
> barrier at all.  The wake_up happens in a spin_locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule.  A
> later patch has:
>      spin_unlock(); schedule(); spin_lock();
> 
> So there is no room for a race.  If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed.  When the lock is
> dropped after the bit is cleared the unlock will have all the barriers
> needed for the bit to be visible.

Right, that's exactly the problem with infrastructure that
externalises memory ordering dependencies. Most of the time they
just work because of external factors, but sometimes they don't,
and we end up with random memory barriers being cargo-culted
around the place because they seem to fix weird problems....
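
To spell out the lost wakeup Neil describes - schematically, with the
waker's waitqueue_active() load reordered ahead of its bit store:

	waker				waiter
	-----				------
	waitqueue_active() -> false
					add self to waitqueue
					test bit -> still clear
	bit store becomes visible
	no waiters seen, skip __wake_up()
					schedule() - sleeps forever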

> The only other place that the bit can be cleared is concurrent with the
> above schedule() while the spinlock isn't held by the waiter.  In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
> 
> So a better comment would be
> 
>    /* no barrier needed as both waker and waiter are in spin-locked regions */
           ^^^^^^^
	   ordering

Even better would be to fix wake_up_var() to not have an implicit
ordering requirement. Add __wake_up_var() as the "I know what I'm
doing" API, and have wake_up_var() always issue the memory barrier
like so:

void __wake_up_var(void *var)
{
	/* the current wake_up_var() body, with no implied ordering */
	__wake_up_bit(__var_waitqueue(var), var, -1);
}

void wake_up_var(void *var)
{
	/* order the caller's store against waitqueue_active() */
	smp_mb();
	__wake_up_var(var);
}

Then people who don't intimately understand memory ordering (i.e. who
just want to use a conditional wait variable), or who don't want to
think about complex ordering issues in otherwise obvious and simple
code, can simply use the safe variant.

Those who need to optimise for speed, or who know they have solid
ordering or external serialisation, can use __wake_up_var() and leave
a comment explaining why that is safe, e.g.:

	/*
	 * No wait/wakeup ordering required as both waker and waiters
	 * are serialised by the inode->i_lock.
	 */
	__wake_up_var(....);
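
FWIW, with the helpers in this patch the waiter side would then look
something like this (a sketch of the pattern from the later patch,
using __I_NEW purely as an example):

	struct wait_bit_queue_entry wqe;
	struct wait_queue_head *wq_head;

	/* entered and exited with inode->i_lock held */
	wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
	for (;;) {
		prepare_to_wait_event(wq_head, &wqe.wq_entry,
				      TASK_UNINTERRUPTIBLE);
		if (!(inode->i_state & I_NEW))
			break;
		spin_unlock(&inode->i_lock);
		schedule();
		spin_lock(&inode->i_lock);
	}
	finish_wait(wq_head, &wqe.wq_entry);

Both the bit test and the bit clear happen under i_lock, so the
unlock/lock pair provides all the ordering the wait needs.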

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



