On Thu, Sep 17, 2020 at 07:52:32AM +0200, Christoph Hellwig wrote:
> On Thu, Sep 17, 2020 at 01:09:42PM +1000, Dave Chinner wrote:
> > > > iomap_dio_complete()
> > > >   generic_write_sync()
> > > >     btrfs_file_fsync()
> > > >       inode_lock()
> > > >         <deadlock>
> > >
> > > Can inode_dio_end() be called before generic_write_sync(), as it is done
> > > in fs/direct-io.c:dio_complete()?
> >
> > Don't think so. inode_dio_wait() is supposed to indicate that all
> > DIO is complete, and having the "make it stable" parts of an O_DSYNC
> > DIO still running after inode_dio_wait() returns means that we still
> > have DIO running....
> >
> > For some filesystems, ensuring the DIO data is stable may involve
> > flushing other data (perhaps we did EOF zeroing before the file
> > extending DIO) and/or metadata to the log, so we need to guarantee
> > these DIO related operations are complete and stable before we say
> > the DIO is done.
>
> inode_dio_wait really just waits for active I/O that writes to or reads
> from the file. It does not imply that the I/O is stable, just like
> i_rwsem itself doesn't.

No, but iomap_dio_rw() considers an O_DSYNC write to be incomplete
until it is stable so that it presents consistent behaviour to
anything calling inode_dio_wait().

> Various file systems have historically called
> the syncing outside i_rwsem and inode_dio_wait (in fact that is what the
> fs/direct-io.c code does, so XFS did as well until a few years ago), and
> that isn't a problem at all - we just can't return to userspace (or call
> ki_complete for in-kernel users) before the data is stable on disk.

I'm really not caring about userspace here - we use
inode_dio_wait() as an IO completion notification for the purposes
of synchronising internal filesystem state before modifying user
data via direct metadata manipulation. Hence I want sane,
consistent, predictable IO completion notification behaviour
regardless of the implementation path it goes through.

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx