Re: [PATCH v8 0/5] fs: multigrain timestamps for XFS's change_cookie

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 26 Sep 2023 07:31:55 -0400

On Tue, 2023-09-26 at 08:32 +1000, Dave Chinner wrote:
> On Mon, Sep 25, 2023 at 06:14:05AM -0400, Jeff Layton wrote:
> > On Mon, 2023-09-25 at 08:18 +1000, Dave Chinner wrote:
> > > On Sat, Sep 23, 2023 at 05:52:36PM +0300, Amir Goldstein wrote:
> > > > On Sat, Sep 23, 2023 at 1:46 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > 
> > > > > On Sat, 2023-09-23 at 10:15 +0300, Amir Goldstein wrote:
> > > > > > On Fri, Sep 22, 2023 at 8:15 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > My initial goal was to implement multigrain timestamps on most major
> > > > > > > filesystems, so we could present them to userland, and use them for
> > > > > > > NFSv3, etc.
> > > > > > > 
> > > > > > > With the current implementation however, we can't guarantee that a file
> > > > > > > with a coarse grained timestamp modified after one with a fine grained
> > > > > > > timestamp will always appear to have a later value. This could confuse
> > > > > > > some programs like make, rsync, find, etc. that depend on strict
> > > > > > > ordering requirements for timestamps.
> > > > > > > 
> > > > > > > The goal of this version is more modest: fix XFS' change attribute.
> > > > > > > XFS's change attribute is bumped on atime updates in addition to other
> > > > > > > deliberate changes. This makes it unsuitable for export via nfsd.
> > > > > > > 
> > > > > > > Jan Kara suggested keeping this functionality internal-only for now and
> > > > > > > plumbing the fine grained timestamps through getattr [1]. This set takes
> > > > > > > a slightly different approach and has XFS use the fine-grained attr to
> > > > > > > fake up STATX_CHANGE_COOKIE in its getattr routine itself.
> > > > > > > 
> > > > > > > While we keep fine-grained timestamps in struct inode, when presenting
> > > > > > > the timestamps via getattr, we truncate them at a granularity of number
> > > > > > > of ns per jiffy,
> > > > > > 
> > > > > > That's not good, because user explicitly set granular mtime would be
> > > > > > truncated too and booting with different kernels (HZ) would change
> > > > > > the observed timestamps of files.
> > > > > > 
> > > > > 
> > > > > Thinking about this some more, I think the first problem is easily
> > > > > addressable:
> > > > > 
> > > > > The ctime isn't explicitly settable and with this set, we're already not
> > > > > truncating the atime. We haven't used any of the extra bits in the mtime
> > > > > yet, so we could just carve out a flag in there that says "this mtime
> > > > > was explicitly set and shouldn't be truncated before presentation".
> > > > > 
> > > > 
> > > > I thought about this option too.
> > > > But note that the "mtime was explicitly set" flag needs
> > > > to be persisted to disk so you cannot store it in the high nsec bits.
> > > > At least XFS won't store those bits if you use them - they have to
> > > > be translated to an XFS inode flag and I don't know if changing
> > > > XFS on-disk format was on your wish list.
> > > 
> > > Remember: this multi-grain timestamp thing was an idea to solve the
> > > NFS change attribute problem without requiring *any* filesystem with
> > > sub-jiffie timestamp capability to change their on-disk format to
> > > implement a persistent change attribute that matches the new
> > > requires of the kernel nfsd.
> > > 
> > > If we now need to change the on-disk format to support
> > > some whacky new timestamp semantic to do this, then people have
> > > completely lost sight of what problem the multi-grain timestamp idea
> > > was supposed to address.
> > > 
> > 
> > Yep. The main impetus for all of this was to fix XFS's change attribute
> > without requiring an on-disk format change. If we have to rev the on-
> > disk format, we're probably better off plumbing in a proper i_version
> > counter and tossing this idea aside.
> > 
> > That said, I think all we'd need for this scheme is a single flag per
> > inode (to indicate that the mtime shouldn't be truncated before
> > presentation). If that's possible to do without fully revving the inode
> > format, then we could still pursue this. I take it that's probably not
> > the case though.
> 
> Older kernels that don't know what the flag means, but that should
> be OK for an inode flag. The bigger issue is that none of the
> userspace tools (xfs_db, xfs_repair, etc) know about it, so they
> would have to be taught about it. And then there's testing it, which
> likely means userspace needs visibility of the flag (e.g. FS_XFLAG
> for it) and then there's more work....
> 
> It's really not worth it.
> 
>
> I think that Linus's suggestion of the in-memory inode timestamp
> always being a 64bit, 100ns granularity value instead of a timespec
> that gets truncated at sample time has merit as a general solution.
> 

Changing how we store timestamps in struct inode is a good idea, and
reducing the effective granularity to 100ns seems reasonable, but that
alone won't fix XFS's i_version counter, or the ordering problems that
we hit with the multigrain series that had to be reverted.

> We also must not lose sight of the fact that the lazytime mount
> option makes atime updates on XFS behave exactly as the nfsd/NFS
> client application wants. That is, XFS will do in-memory atime
> updates unless the atime update also sets S_VERSION to explicitly
> bump the i_version counter if required. That leads to another
> potential nfsd specific solution without requiring filesystems to
> change on disk formats: the nfsd explicitly asks operations for lazy
> atime updates...
> 

Not exactly. The problem with XFS's i_version is that it also bumps it
on atime updates. lazytime reduces the number of atime updates to
~1/day. To be exactly what nfsd wants, you'd need to make that 0. I
suppose you can work around it with noatime, but that has problems of
its own.

> And we must also keep in sight the fact that io_uring wants
> non-blocking timestamp updates to be possible (for all types of
> updates). Hence it looks to me like we have more than one use case
> for per-operation/application specific timestamp update semantics.
> Perhaps there's a generic solution to this problem (e.g.  operation
> specific non-blocking, in-memory pure timestamp updates) that does
> what everyone needs...

-- 
Jeff Layton <jlayton@xxxxxxxxxx>