On Thu, 2017-03-30 at 10:41 +1100, Dave Chinner wrote: > On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote: > > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > > - It's durable; the above comparison still works if there were reboots > > > > > > > between the two i_version checks. > > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > > atomically with the changes that caused them? What if a > > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > > > the inode dirty and I think that will end up with the new i_version at > > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > > > "Yeah, there could be atomicity issues there." > > > > > > > > I think providing that level of atomicity may be difficult, though > > > > maybe there's some way to make the querying of i_version block until > > > > the inode update has been journalled? > > > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. > > > Until that you can see arbitrary combination of data & i_version after the > > > crash. We do take care to keep data and metadata in sync only when there > > > are security implications to that (like exposing uninitialized disk blocks) > > > and if not, we are as lazy as we can to improve performance... > > > > > > > > > > Yeah, I think what we'll have to do here is ensure that those > > filesystems do an fsync prior to reporting the i_version getattr > > codepath. It's not pretty, but I don't see a real alternative. > > I think that's even more problematic. ->getattr currently runs > completely unlocked for performance reasons - it's racy w.r.t. to > ongoing modifications to begin with, so /nothing/ that is returned > to userspace via stat/statx can be guaranteed to be "coherent". > Linus will be very unhappy if you make his git workload (which is > /very/ stat heavy) run slower by adding any sort of locking in this > hot path. > > Even if we did put an fsync() into ->getattr() (and dealt with all > the locking issues that entails), by the time the statx syscall > returns to userspace the i_version value may not match the > data/metadata in the inode(*). IOWs, by the time i_version gets > to userspace, it is out of date and any use of it for data > versioning from userspace is going to be prone to race conditions. > > Cheers, > > Dave. > > (*) fiemap has exactly the same "stale the moment internal fs > locks are released" race conditions, which is why it cannot safely > be used for mapping holes when copying file data.... > FWIW, I'm not terribly worried about atomicity for all of the reasons you describe. My main concern is reusing an i_version value that has already been handed out when the inode is now in an entirely different state. To that end, I was only considering doing an fsync iff STATX_VERSION was requested. If it wasn't we wouldn't need to do one. A lot of ->getattr implementations already call filemap_write_and_wait, but you're correct that flushing out the metadata is different matter. It'd be nice to avoid that if we can. -- Jeff Layton <jlayton@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html