[+fsdevel] On Thu, Nov 24, 2022 at 2:21 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote: > > On Wed, Nov 23, 2022 at 12:46:45PM +0100, Jan Kara wrote: > > Hello! > > > > We were pondering with Amir about some issues with fsnotify subsystem and > > as a building block we would need a mechanism to make sure write(2) has > > completed. For simplicity we could imagine it like a sequence > > > > write(2) > > START > > do stuff to perform write > > END > > > > and we need a mechanism to wait for all processes that already passed START > > to reach END. Ideally without blocking new writes while we wait for the > > pending ones. Now this seems like a good task for SRCU. We could do: > > > > write(2) > > srcu_read_lock(&sb->s_write_rcu); > > do stuff to perform write > > srcu_read_unlock(&sb->s_write_rcu); > > > > and use synchronize_srcu(&sb->s_write_rcu) for waiting. > > > > But the trouble with writes is there are things like aio or io_uring where > > the part with srcu_read_lock() happens from one task (the submitter) while > > the part with srcu_read_unlock() happens from another context (usually worker > > thread triggered by IRQ reporting that the HW has finished the IO). > > > > Is there any chance to make SRCU work in a situation like this? It seems to > > me in principle it should be possible to make this work but maybe there are > > some implementation constraints I'm missing... > > The srcu_read_lock_notrace() and srcu_read_unlock_notrace() functions > will work for this, though that is not their intended purpose. Plus you > might want to trace these functions, which, as their names indicate, is > not permitted. I assume that you do not intend to use these functions > from NMI handlers, though that really could be accommodated. (But why > would you need that?) > > So how about srcu_down_read() and srcu_up_read(), as shown in the > (untested) patch below? > > Note that you do still need to pass the return value from srcu_down_read() > into srcu_up_read(). I am guessing that io_uring has a convenient place > that this value can be placed. No idea about aio. > Sure, aio completion has context. > Thoughts? That looks great! Thank you. Followup question: Both fs/aio.c:aio_write() and io_uring/rw.c:io_write() do this ugly thing: /* * Open-code file_start_write here to grab freeze protection, * which will be released by another thread in * aio_complete_rw(). Fool lockdep by telling it the lock got * released so that it doesn't complain about the held lock when * we return to userspace. */ if (S_ISREG(file_inode(file)->i_mode)) { sb_start_write(file_inode(file)->i_sb); __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); } And in write completion: /* * Tell lockdep we inherited freeze protection from submission * thread. */ if (S_ISREG(inode->i_mode)) __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); file_end_write(kiocb->ki_filp); I suppose we also need to "fool lockdep" w.r.t returning to userspace with an acquired srcu? Thanks, Amir. > > Thanx, Paul > > ------------------------------------------------------------------------ > > commit 0efa1e7b5a862e9c2f1bf8c19db6bd142ad35355 > Author: Paul E. McKenney <paulmck@xxxxxxxxxx> > Date: Wed Nov 23 15:49:55 2022 -0800 > > rcu: Add srcu_down_read() and srcu_up_read() > > A pair of matching srcu_read_lock() and srcu_read_unlock() invocations > must take place within the same context, for example, within the same > task. Otherwise, lockdep complains, as is the right thing to do for > most use cases. > > However, there are use cases involving asynchronous I/O where the > SRCU reader needs to begin on one task and end on another. This commit > therefore supplies the semaphore-like srcu_down_read() and srcu_up_read(), > which act like srcu_read_lock() and srcu_read_unlock(), but permitting > srcu_up_read() to be invoked in a different context than was the matching > srcu_down_read(). > > Neither srcu_down_read() nor srcu_up_read() may be invoked from an > NMI handler. > > Reported-by: Jan Kara <jack@xxxxxxx> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx> > > diff --git a/include/linux/srcu.h b/include/linux/srcu.h > index 9b9d0bbf1d3cf..74796cd7e7a9d 100644 > --- a/include/linux/srcu.h > +++ b/include/linux/srcu.h > @@ -214,6 +214,34 @@ srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp) > return retval; > } > > +/** > + * srcu_down_read - register a new reader for an SRCU-protected structure. > + * @ssp: srcu_struct in which to register the new reader. > + * > + * Enter a semaphore-like SRCU read-side critical section. Note that > + * SRCU read-side critical sections may be nested. However, it is > + * illegal to call anything that waits on an SRCU grace period for the > + * same srcu_struct, whether directly or indirectly. Please note that > + * one way to indirectly wait on an SRCU grace period is to acquire > + * a mutex that is held elsewhere while calling synchronize_srcu() or > + * synchronize_srcu_expedited(). But if you want lockdep to help you > + * keep this stuff straight, you should instead use srcu_read_lock(). > + * > + * The semaphore-like nature of srcu_down_read() means that the matching > + * srcu_up_read() can be invoked from some other context, for example, > + * from some other task or from an irq handler. However, neither > + * srcu_down_read() nor srcu_up_read() may be invoked from an NMI handler. > + * > + * Calls to srcu_down_read() may be nested, similar to the manner in > + * which calls to down_read() may be nested. > + */ > +static inline int srcu_down_read(struct srcu_struct *ssp) __acquires(ssp) > +{ > + WARN_ON_ONCE(in_nmi()); > + srcu_check_nmi_safety(ssp, false); > + return __srcu_read_lock(ssp); > +} > + > /** > * srcu_read_unlock - unregister a old reader from an SRCU-protected structure. > * @ssp: srcu_struct in which to unregister the old reader. > @@ -254,6 +282,23 @@ srcu_read_unlock_notrace(struct srcu_struct *ssp, int idx) __releases(ssp) > __srcu_read_unlock(ssp, idx); > } > > +/** > + * srcu_up_read - unregister a old reader from an SRCU-protected structure. > + * @ssp: srcu_struct in which to unregister the old reader. > + * @idx: return value from corresponding srcu_read_lock(). > + * > + * Exit an SRCU read-side critical section, but not necessarily from > + * the same context as the maching srcu_down_read(). > + */ > +static inline void srcu_up_read(struct srcu_struct *ssp, int idx) > + __releases(ssp) > +{ > + WARN_ON_ONCE(idx & ~0x1); > + WARN_ON_ONCE(in_nmi()); > + srcu_check_nmi_safety(ssp, false); > + __srcu_read_unlock(ssp, idx); > +} > + > /** > * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock > *