On Thu, Sep 17, 2020 at 04:29:23PM +1000, Dave Chinner wrote: > > inode_dio_wait really just waits for active I/O that writes to or reads > > from the file. It does not imply that the I/O is stable, just like > > i_rwsem itself doesn't. > > No, but iomap_dio_rw() considers a O_DSYNC write to be incomplete > until it is stable so that it presents consistent behaviour to > anythign calling inode_dio_wait(). But that point is that inode_dio_wait does not care about that "consistency". It cares about when the I/O is done. I know because I wrote it (and I regret that as we should have stuck with the non-owner release of the rwsem which makes a whole lot more sense). > > > Various file systems have historically called > > the syncing outside i_rwsem and inode_dio_wait (in fact that is what the > > fs/direct-io.c code does, so XFS did as well until a few years ago), and > > that isn't a problem at all - we just can't return to userspace (or call > > ki_complete for in-kernel users) before the data is stable on disk. > > I'm really not caring about userspace here - we use inode_dio_wait() > as an IO completion notification for the purposes of synchronising > internal filesystem state before modifying user data via direct > metadata manipulation. Hence I want sane, consistent, predictable IO > completion notification behaviour regardless of the implementation > path it goes through. And none of that consistency matters. Think of it: - an O_(D)SYNC write is nothing but a write plus a ranged fsync, even if we do some optimizations to speed up the fsync by e.g. using the FUA flag - another fsync can come up at any time after we completed a write (with or without O_SYNC) - so any synchronization using inode_dio_wait (or i_rwsem for that matter) must not care if an fsync runs in parallel. - take a look at where we call inode_dio_wait to verify this - the prime original use case was truncate as we can't have I/O in progress while trunating. We then later extended it to all the truncate-like more compliated operations like hole punches, extent insert an collapse, etc. But in all that cases what matters is the actual I/O, not the sync. By having done direct I/O the page cache side of the sync doesn't matter to start with (but the callers all invalidate it anyway), so what matter is the metadata flush, aka the log force in the XFS case. And for that we absolutely do not need to be before inode_dio_wait returns. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx ---end quoted text---