On Thu, Aug 22, 2024 at 08:12:15AM GMT, NeilBrown wrote:
> On Thu, 22 Aug 2024, Christian Brauner wrote:
> > The i_state member is an unsigned long so that it can be used with the
> > wait bit infrastructure which expects unsigned long. This wastes 4 bytes
> > which we're unlikely to ever use. Switch to using the var event wait
> > mechanism using the address of the bit. Thanks to Linus for the address
> > idea.
> >
> > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> > ---
> >  fs/inode.c         | 10 ++++++++++
> >  include/linux/fs.h | 16 ++++++++++++++++
> >  2 files changed, 26 insertions(+)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 154f8689457f..f2a2f6351ec3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -472,6 +472,16 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
> >  		inode->i_state |= I_REFERENCED;
> >  }
> >
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit)
> > +{
> > +	void *bit_address;
> > +
> > +	bit_address = inode_state_wait_address(inode, bit);
> > +	init_wait_var_entry(wqe, bit_address, 0);
> > +	return __var_waitqueue(bit_address);
> > +}
> > +
> >  /*
> >   * Add inode to LRU if needed (inode is unused and clean).
> >   *
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 23e7d46b818a..a5b036714d74 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -744,6 +744,22 @@ struct inode {
> >  	void		*i_private; /* fs or device private pointer */
> >  } __randomize_layout;
> >
> > +/*
> > + * Get bit address from inode->i_state to use with wait_var_event()
> > + * infrastructure.
> > + */
> > +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
> > +
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +					    struct inode *inode, u32 bit);
> > +
> > +static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
> > +{
> > +	/* Ensure @bit will be seen cleared/set when caller is woken up. */
>
> The above comment is wrong.  I think I once thought it was correct too
> but now I know better (I hope).  A better comment might be
>
>    /* Insert memory barrier as recommended by wake_up_var() */
>
> but even that is unnecessary, as we don't need the memory barrier.
>
> A careful reading of memory-barriers.rst shows that *when the process is
> actually woken* there are sufficient barriers in wake_up_process() and
> prepare_to_wait_event() and the scheduler and (particularly)
> set_current_state() so that a value set before the wake_up is seen after
> the schedule().
>
> So this barrier isn't about the bit.  This barrier is about the
> wait_queue.  In particular it is about the waitqueue_active() call at
> the start of wake_up_var().  If that test wasn't there and if instead
> wake_up_var() conditionally called __wake_up(), then there would be no

Did you mean "unconditionally called"?

> need for any barrier.  A comment near wake_up_bit() makes it clear that
> the barrier is only needed when the spinlock is avoided.
>
> On a weakly ordered arch, this test can be effected *before* the write
> of the bit.  If the waiter adds itself to the wait queue and then tests
> the bit before the bit is set, but after the waitqueue_active() test is
> put in effect, then the wake_up will never be sent.
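
Just to check I'm reading this right: the lost wakeup would look roughly
like the sketch below, in the two-CPU style of memory-barriers.rst. This
is only an illustration of my understanding - I_FOO stands in for
whichever i_state bit is being waited on, and wq for the hashed
waitqueue returned by __var_waitqueue():

	CPU 0 (waker)			CPU 1 (waiter)
	=============			==============
					prepare_to_wait_event(wq, ...);
					/* now enqueued on wq */
	inode->i_state &= ~I_FOO;
	/* without a barrier, the
	 * read below may be hoisted
	 * above the i_state store */
	if (waitqueue_active(wq))	if (inode->i_state & I_FOO)
		__wake_up(wq, ...);		schedule();

If CPU 0's waitqueue_active() read takes effect before CPU 1 enqueues
itself, while CPU 1 still observes the stale i_state, then CPU 0 skips
__wake_up() and CPU 1 sleeps with nobody left to wake it.
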
> But .... this is all academic for this code, because you don't need a
> barrier at all.  The wake_up happens in a spin-locked region, and the
> wait is entirely inside the same spin_lock, except for the schedule.
> A later patch has:
>
>    spin_unlock(); schedule(); spin_lock();
>
> So there is no room for a race.  If the bit is cleared before the
> wait_var_event() equivalent, then no wakeup is needed.  When the lock
> is dropped after the bit is cleared, the unlock will have all the
> barriers needed for the bit to be visible.
>
> The only other place that the bit can be cleared is concurrent with the
> above schedule(), while the spinlock isn't held by the waiter.  In that
> case it is again clear that no barrier is needed - or that the
> spin_unlock/lock provide all the needed barriers.
>
> So a better comment would be
>
>    /* no barrier needed as both waker and waiter are in spin-locked regions */

Thanks for the analysis. I was under the impression that wait_on_inode()
was called in contexts where no barrier is guaranteed and the bit isn't
checked with spin_lock() held.
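
To restate your argument in code: condensing your description of the
later patch, I picture the waiter roughly as below. This is my
paraphrase, not the actual patch - "bit" is the bit number fed to
inode_bit_waitqueue(), and the loop shape is inferred from the
spin_unlock(); schedule(); spin_lock(); line you quoted:

	struct wait_bit_queue_entry wqe;
	struct wait_queue_head *wq_head;

	spin_lock(&inode->i_lock);
	wq_head = inode_bit_waitqueue(&wqe, inode, bit);
	for (;;) {
		prepare_to_wait_event(wq_head, &wqe.wq_entry,
				      TASK_UNINTERRUPTIBLE);
		/* The bit is always tested with inode->i_lock held ... */
		if (!(inode->i_state & (1 << bit)))
			break;
		/* ... and only the sleep itself runs outside of it. */
		spin_unlock(&inode->i_lock);
		schedule();
		spin_lock(&inode->i_lock);
	}
	finish_wait(wq_head, &wqe.wq_entry);
	spin_unlock(&inode->i_lock);

If setting and clearing the bit also always happen under
inode->i_lock, then I agree the unlock/lock pairs already provide all
the ordering and the barrier can go.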