On Tue, 13 Sep 2022, Dave Chinner wrote: > On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote: > > On Mon, 12 Sep 2022, Jeff Layton wrote: > > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > > > This could be expensive. > > > > > > > > There is not currently any locking around O_DIRECT writes. You cannot > > > > synchronise with them. > > > > > > > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > > > inode_lock_shared across the I/O. That was why patch #8 takes the > > > inode_lock (exclusive) across the getattr. > > > > Looking at ext4_dio_write_iter() it certain does take > > inode_lock_shared() before starting the write and in some cases it > > requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for > > the write to complete. But not in all cases. > > So I don't think it always holds the shared lock across all direct IO. > > To serialise against dio writes, one must: > > // Lock the inode exclusively to block new DIO submissions > inode_lock(inode); > > // Wait for all in flight DIO reads and writes to complete > inode_dio_wait(inode); > > This is how truncate, fallocate, etc serialise against AIO+DIO which > do not hold the inode lock across the entire IO. These have to > serialise aginst DIO reads, too, because we can't have IO in > progress over a range of the file that we are about to free.... > > > > Taking inode_lock_shared is sufficient to block out buffered and DAX > > > writes. DIO writes sometimes only take the shared lock (e.g. when the > > > data is already properly aligned). If we want to ensure the getattr > > > doesn't run while _any_ writes are running, we'd need the exclusive > > > lock. > > > > But the exclusive lock is bad for scalability. > > Serilisation against DIO is -expensive- and -slow-. It's not a > solution for what is supposed to be a fast unlocked read-only > operation like statx(). > > > > Maybe that's overkill, though it seems like we could have a race like > > > this without taking inode_lock across the getattr: > > > > > > reader writer > > > ----------------------------------------------------------------- > > > i_version++ > > > getattr > > > read > > > DIO write to backing store > > > > > > > This is why I keep saying that the i_version increment must be after the > > write, not before it. > > Sure, but that ignores the reason why we actually need to bump > i_version *before* we submit a DIO write. DIO write invalidates the > page cache over the range of the write, so any racing read will > re-populate the page cache during the DIO write. and DIO reads can also get data at some intermediate state of the write. So what? i_version cannot provide coherent caching. You needs locks and call-backs and such for that. i_version (as used by NFS) only aims for approximate caching. Specifically: 1/ max-age caching. The i_version is polled when the age of the cache reaches some preset value, and the cache is purged/reloaded if needed, and the age reset to 0. 2/ close-to-open caching. There are well-defined events where the i_version must reflect preceding events by that the same client (close and unlock) and well defined events when the client will check the i_version before trusting any cache (open and lock). There is absolutely no need ever to update the i_version *before* making a change. If you really think there is, please provide a sequence of events for two different actors/observes where having the update before the change will provide a useful benefit. Thanks, NeilBrown > > Hence buffered reads can return before the DIO write has completed, > and the contents of the read can contain, none, some or all of the > contents of the DIO write. Hence i_version has to be incremented > before the DIO write is submitted so that racing getattrs will > indicate that the local caches have been invalidated and that data > needs to be refetched. > > But, yes, to actually be safe here, we *also* should be bumping > i_version on DIO write on DIO write completion so that racing > i_version and data reads that occur *after* the initial i_version > bump are invalidated immediately. > > IOWs, to avoid getattr/read races missing stale data invalidations > during DIO writes, we really need to bump i_version both _before and > after_ DIO write submission. > > It's corner cases like this where "i_version should only be bumped > when ctime changes" fails completely. i.e. there are concurrent IO > situations which can only really be handled correctly by bumping > i_version whenever either in-memory and/or on-disk persistent data/ > metadata state changes occur..... > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >