On Mon, 12 Sep 2022, Jeff Layton wrote: > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > This could be expensive. > > > > There is not currently any locking around O_DIRECT writes. You cannot > > synchronise with them. > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > inode_lock_shared across the I/O. That was why patch #8 takes the > inode_lock (exclusive) across the getattr. Looking at ext4_dio_write_iter() it certain does take inode_lock_shared() before starting the write and in some cases it requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for the write to complete. But not in all cases. So I don't think it always holds the shared lock across all direct IO. > > > The best you can do is update the i_version immediately after all the > > O_DIRECT writes in a single request complete. > > > > > > > > To summarize, there are two main uses for the change attr in NFSv4: > > > > > > 1/ to provide change_info4 for directory morphing operations (CREATE, > > > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already > > > atomic in the current nfsd code (AFAICT) by virtue of the fact that we > > > hold the i_rwsem exclusively over these operations. The change attr is > > > also queried pre and post while the lock is held, so that should ensure > > > that we get true atomicity for this. > > > > Yes, directory ops are relatively easy. > > > > > > > > 2/ as an adjunct for the ctime when fetching attributes to validate > > > caches. We don't expect perfect consistency between read (and readlike) > > > operations and GETATTR, even when they're in the same compound. > > > > > > IOW, a READ+GETATTR compound can legally give you a short (or zero- > > > length) read, and then the getattr indicates a size that is larger than > > > where the READ data stops, due to a write or truncate racing in after > > > the read. > > > > I agree that atomicity is neither necessary nor practical. Ordering is > > important though. I don't think a truncate(0) racing with a READ can > > credibly result in a non-zero size AFTER a zero-length read. A truncate > > that extends the size could have that effect though. > > > > > > > > Ideally, the attributes in the GETATTR reply should be consistent > > > between themselves though. IOW, all of the attrs should accurately > > > represent the state of the file at a single point in time. > > > change+size+times+etc. should all be consistent with one another. > > > > > > I think we get all of this by taking the inode_lock around the > > > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant > > > solution, but it should give us the atomicity we need, and it doesn't > > > require adding extra operations or locking to the write codepaths. > > > > Explicit attribute changes (chown/chmod/utimes/truncate etc) are always > > done under the inode lock. Implicit changes via inode_update_time() are > > not (though xfs does take the lock, ext4 doesn't, haven't checked > > others). So taking the inode lock won't ensure those are internally > > consistent. > > > > I think using inode_lock_shared() is acceptable. It doesn't promise > > perfect atomicity, but it is probably good enough. > > > > We'd need a good reason to want perfect atomicity to go further, and I > > cannot think of one. > > > > > > Taking inode_lock_shared is sufficient to block out buffered and DAX > writes. DIO writes sometimes only take the shared lock (e.g. when the > data is already properly aligned). If we want to ensure the getattr > doesn't run while _any_ writes are running, we'd need the exclusive > lock. But the exclusive lock is bad for scalability. > > Maybe that's overkill, though it seems like we could have a race like > this without taking inode_lock across the getattr: > > reader writer > ----------------------------------------------------------------- > i_version++ > getattr > read > DIO write to backing store > This is why I keep saying that the i_version increment must be after the write, not before it. > > Given that we can't fully exclude mmap writes, maybe we can just > document that mixing DIO or mmap writes on the server + NFS may not be > fully cache coherent. "fully cache coherent" is really more than anyone needs. The i_version must be seen to change no earlier than the related change becomes visible, and no later than the request which initiated that change is acknowledged as complete. NeilBrown