On Thu, Nov 24, 2022 at 10:58:40AM +0100, Jan Kara wrote: > On Thu 24-11-22 08:21:13, Amir Goldstein wrote: > > [+fsdevel] > > > > On Thu, Nov 24, 2022 at 2:21 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote: > > > > > > On Wed, Nov 23, 2022 at 12:46:45PM +0100, Jan Kara wrote: > > > > Hello! > > > > > > > > We were pondering with Amir about some issues with fsnotify subsystem and > > > > as a building block we would need a mechanism to make sure write(2) has > > > > completed. For simplicity we could imagine it like a sequence > > > > > > > > write(2) > > > > START > > > > do stuff to perform write > > > > END > > > > > > > > and we need a mechanism to wait for all processes that already passed START > > > > to reach END. Ideally without blocking new writes while we wait for the > > > > pending ones. Now this seems like a good task for SRCU. We could do: > > > > > > > > write(2) > > > > srcu_read_lock(&sb->s_write_rcu); > > > > do stuff to perform write > > > > srcu_read_unlock(&sb->s_write_rcu); > > > > > > > > and use synchronize_srcu(&sb->s_write_rcu) for waiting. > > > > > > > > But the trouble with writes is there are things like aio or io_uring where > > > > the part with srcu_read_lock() happens from one task (the submitter) while > > > > the part with srcu_read_unlock() happens from another context (usually worker > > > > thread triggered by IRQ reporting that the HW has finished the IO). > > > > > > > > Is there any chance to make SRCU work in a situation like this? It seems to > > > > me in principle it should be possible to make this work but maybe there are > > > > some implementation constraints I'm missing... > > > > > > The srcu_read_lock_notrace() and srcu_read_unlock_notrace() functions > > > will work for this, though that is not their intended purpose. Plus you > > > might want to trace these functions, which, as their names indicate, is > > > not permitted. I assume that you do not intend to use these functions > > > from NMI handlers, though that really could be accommodated. (But why > > > would you need that?) > > > > > > So how about srcu_down_read() and srcu_up_read(), as shown in the > > > (untested) patch below? > > > > > > Note that you do still need to pass the return value from srcu_down_read() > > > into srcu_up_read(). I am guessing that io_uring has a convenient place > > > that this value can be placed. No idea about aio. > > > > > > > Sure, aio completion has context. > > > > > Thoughts? > > > > That looks great! Thank you. > > > > Followup question: > > Both fs/aio.c:aio_write() and io_uring/rw.c:io_write() do this ugly > > thing: > > > > /* > > * Open-code file_start_write here to grab freeze protection, > > * which will be released by another thread in > > * aio_complete_rw(). Fool lockdep by telling it the lock got > > * released so that it doesn't complain about the held lock when > > * we return to userspace. > > */ > > if (S_ISREG(file_inode(file)->i_mode)) { > > sb_start_write(file_inode(file)->i_sb); > > __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); > > } > > > > And in write completion: > > > > /* > > * Tell lockdep we inherited freeze protection from submission > > * thread. > > */ > > if (S_ISREG(inode->i_mode)) > > __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); > > file_end_write(kiocb->ki_filp); > > > > I suppose we also need to "fool lockdep" w.r.t returning to userspace > > with an acquired srcu? > > So AFAICT the whole point of Paul's new helpers is to not use lockdep and > thus not have to play the "fool lockdep" games. Exactly! ;-) But if you do return to userspace after invoking srcu_down_read(), it is your responsibility to make sure that -something- eventually invokes srcu_up_read(). Which might or might not be able to rely on userspace doing something sensible. I would guess that you have a timeout or rely on close() for that purpose, just as you presumably do for sb_start_write(), but figured I should mention it. Thanx, Paul