On Sat, 10 Sep 2022, Jeff Layton wrote: > On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote: > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote: > > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote: > > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases > > > > > > > > > > > > > below, > > > > > > > > > > > > > the > > > > > > > > > > > > > application reads 'state B' as having occurred if any data was > > > > > > > > > > > > > committed to disk before the crash. > > > > > > > > > > > > > > > > > > > > > > > > > > Application Filesystem > > > > > > > > > > > > > =========== ========= > > > > > > > > > > > > > read change attr <- 'state A' > > > > > > > > > > > > > read data <- 'state A' > > > > > > > > > > > > > write data -> 'state B' > > > > > > > > > > > > > <crash>+<reboot> > > > > > > > > > > > > > read change attr <- 'state B' > > > > > > > > > > > > > > > > > > > > > > The important thing here is to not see 'state A'. Seeing 'state > > > > > > > > > > > C' > > > > > > > > > > > should be acceptable. Worst case we could merge in wall-clock > > > > > > > > > > > time > > > > > > > > > > > of > > > > > > > > > > > system boot, but the filesystem should be able to be more helpful > > > > > > > > > > > than > > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, without the crash+reboot it would still be acceptable to > > > > > > > > > see > > > > > > > > > "state A" at the end there - but preferably not for long. > > > > > > > > > From the NFS perspective, the changeid needs to update by the time > > > > > > > > > of > > > > > > > > > a > > > > > > > > > close or unlock (so it is visible to open or lock), but before that > > > > > > > > > it > > > > > > > > > is just best-effort. > > > > > > > > > > > > > > Nope. That will inevitably lead to data corruption, since the > > > > > > > application might decide to use the data from state A instead of > > > > > > > revalidating it. > > > > > > > > > > > > > > > > > The point is, NFS is not the only potential use case for change > > > > > attributes. We wouldn't be bothering to discuss statx() if it was. > > > > > > My understanding is that it was primarily a desire to add fstests to > > > exercise the i_version which motivated the statx extension. > > > Obviously we should prepare for other uses though. > > > > > Mainly. Also, userland nfs servers might also like this for obvious > reasons. For now though, in the v5 set, I've backed off on trying to > expose this to userland in favor of trying to just clean up the internal > implementation. > > I'd still like to expose this via statx if possible, but I don't want to > get too bogged down in interface design just now as we have Real Bugs to > fix. That patchset should make it simple to expose it later though. > > > > > > > > > > > I could be using O_DIRECT, and all the tricks in order to ensure > > > > > that > > > > > my stock broker application (to choose one example) has access > > > > > to the > > > > > absolute very latest prices when I'm trying to execute a trade. > > > > > When the filesystem then says 'the prices haven't changed since > > > > > your > > > > > last read because the change attribute on the database file is > > > > > the > > > > > same' in response to a statx() request with the > > > > > AT_STATX_FORCE_SYNC > > > > > flag set, then why shouldn't my application be able to assume it > > > > > can > > > > > serve those prices right out of memory instead of having to go > > > > > to disk? > > > > > > I would think that such an application would be using inotify rather > > > than having to poll. But certainly we should have a clear statement > > > of > > > quality-of-service parameters in the documentation. > > > If we agree that perfect atomicity is what we want to promise, and > > > that > > > the cost to the filesystem and the statx call is acceptable, then so > > > be it. > > > > > > My point wasn't to say that atomicity is bad. It was that: > > > - if the i_version change is visible before the change itself is > > > visible, then that is a correctness problem. > > > - if the i_version change is only visible some time after the > > > change > > > itself is visible, then that is a quality-of-service issue. > > > I cannot see any room for debating the first. I do see some room to > > > debate the second. > > > > > > Cached writes, directory ops, and attribute changes are, I think, > > > easy > > > enough to provide truly atomic i_version updates with the change > > > being > > > visible. > > > > > > Changes to a shared memory-mapped files is probably the hardest to > > > provide timely i_version updates for. We might want to document an > > > explicit exception for those. Alternately each request for > > > i_version > > > would need to find all pages that are writable, remap them read-only > > > to > > > catch future writes, then update i_version if any were writable > > > (i.e. > > > ->mkwrite had been called). That is the only way I can think of to > > > provide atomicity. > > > > > I don't think we really want to make i_version bumps that expensive. > Documenting that you can't expect perfect consistency vs. mmap with NFS > seems like the best thing to do. We do our best, but that sort of > synchronization requires real locking. > > > > O_DIRECT writes are a little easier than mmapped files. I suspect we > > > should update the i_version once the device reports that the write is > > > complete, but a parallel reader could have seem some of the write before > > > that moment. True atomicity could only be provided by taking some > > > exclusive lock that blocked all O_DIRECT writes. Jeff seems to be > > > suggesting this, but I doubt the stock broker application would be > > > willing to make the call in that case. I don't think I would either. > > Well, only blocked for long enough to run the getattr. Granted, with a > slow underlying filesystem that can take a while. Maybe I misunderstand, but this doesn't seem to make much sense. If you want i_version updates to appear to be atomic w.r.t O_DIRECT writes, then you need to prevent accessing the i_version while any write is on-going. At that time there is no meaningful value for i_version. So you need a lock (At least shared) around the actual write, and you need an exclusive lock around the get_i_version(). So accessing the i_version would have to wait for all pending O_DIRECT writes to complete, and would block any new O_DIRECT writes from starting. This could be expensive. There is not currently any locking around O_DIRECT writes. You cannot synchronise with them. The best you can do is update the i_version immediately after all the O_DIRECT writes in a single request complete. > > To summarize, there are two main uses for the change attr in NFSv4: > > 1/ to provide change_info4 for directory morphing operations (CREATE, > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already > atomic in the current nfsd code (AFAICT) by virtue of the fact that we > hold the i_rwsem exclusively over these operations. The change attr is > also queried pre and post while the lock is held, so that should ensure > that we get true atomicity for this. Yes, directory ops are relatively easy. > > 2/ as an adjunct for the ctime when fetching attributes to validate > caches. We don't expect perfect consistency between read (and readlike) > operations and GETATTR, even when they're in the same compound. > > IOW, a READ+GETATTR compound can legally give you a short (or zero- > length) read, and then the getattr indicates a size that is larger than > where the READ data stops, due to a write or truncate racing in after > the read. I agree that atomicity is neither necessary nor practical. Ordering is important though. I don't think a truncate(0) racing with a READ can credibly result in a non-zero size AFTER a zero-length read. A truncate that extends the size could have that effect though. > > Ideally, the attributes in the GETATTR reply should be consistent > between themselves though. IOW, all of the attrs should accurately > represent the state of the file at a single point in time. > change+size+times+etc. should all be consistent with one another. > > I think we get all of this by taking the inode_lock around the > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant > solution, but it should give us the atomicity we need, and it doesn't > require adding extra operations or locking to the write codepaths. Explicit attribute changes (chown/chmod/utimes/truncate etc) are always done under the inode lock. Implicit changes via inode_update_time() are not (though xfs does take the lock, ext4 doesn't, haven't checked others). So taking the inode lock won't ensure those are internally consistent. I think using inode_lock_shared() is acceptable. It doesn't promise perfect atomicity, but it is probably good enough. We'd need a good reason to want perfect atomicity to go further, and I cannot think of one. NeilBrown > > We could also consider less invasive ways to achieve this (maybe some > sort of seqretry loop around the vfs_getattr call?), but I'd rather not > do extra work in the write codepaths if we can get away with it. > -- > Jeff Layton <jlayton@xxxxxxxxxx> > >