On Fri, 2023-09-22 at 13:31 -0400, Kent Overstreet wrote: > On Fri, Sep 22, 2023 at 01:14:40PM -0400, Jeff Layton wrote: > > The VFS always uses coarse-grained timestamps when updating the ctime > > and mtime after a change. This has the benefit of allowing filesystems > > to optimize away a lot metadata updates, down to around 1 per jiffy, > > even when a file is under heavy writes. > > > > Unfortunately, this has always been an issue when we're exporting via > > NFS, which traditionally relied on timestamps to validate caches. A lot > > of changes can happen in a jiffy, and that can lead to cache-coherency > > issues between hosts. > > > > NFSv4 added a dedicated change attribute that must change value after > > any change to an inode. Some filesystems (btrfs, ext4 and tmpfs) utilize > > the i_version field for this, but the NFSv4 spec allows a server to > > generate this value from the inode's ctime. > > > > What we need is a way to only use fine-grained timestamps when they are > > being actively queried. > > > > POSIX generally mandates that when the the mtime changes, the ctime must > > also change. The kernel always stores normalized ctime values, so only > > the first 30 bits of the tv_nsec field are ever used. > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > has queried the inode for the mtime or ctime. When this flag is set, > > on the next mtime or ctime update, the kernel will fetch a fine-grained > > timestamp instead of the usual coarse-grained one. > > > > Filesytems can opt into this behavior by setting the FS_MGTIME flag in > > the fstype. Filesystems that don't set this flag will continue to use > > coarse-grained timestamps. > > Interesting... > > So in bcachefs, for most inode fields the btree inode is the "master > copy"; we do inode updates via btree transactions, and then on > successful transaction commit we update the VFS inode to match. > > (exceptions: i_size, i_blocks) > > I'd been contemplating switching to that model for timestamp updates as > well, since that would allow us to get rid of our > super_operations.write_inode method - except we probably wouldn't want > to do that since it would likely make timestamp updates too expensive. > > And now with your scheme of stashing extra state in timespec, I'm glad > we didn't. > > Still, timestamp updates are a bit messier than I'd like, would be > lovely to figure out a way to clean that up - right now we have an > awkward mix of "sometimes timestamp updates happen in a btree > transaction first, other times just the VFS inode is updated and marked > dirty". > > xfs doesn't have .write_inode, so it's probably time to study what it > does... A few months ago, we talked briefly and I asked about an i_version counter for bcachefs. You were going to look into it, and I wasn't sure if you had implemented one. If you haven't, then this may be a simpler alternative. For now, these aren't much good for anything other than faking up a change attribute for NFSv4, but they should be fine for that and you wouldn't need to grow your on-disk inode to accommodate them. Cheers, -- Jeff Layton <jlayton@xxxxxxxxxx>