On Sat, 2023-09-23 at 17:58 +0300, Amir Goldstein wrote:
> On Sat, Sep 23, 2023 at 1:22 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Sat, 2023-09-23 at 10:15 +0300, Amir Goldstein wrote:
> > > On Fri, Sep 22, 2023 at 8:15 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > My initial goal was to implement multigrain timestamps on most major
> > > > filesystems, so we could present them to userland and use them for
> > > > NFSv3, etc.
> > > >
> > > > With the current implementation, however, we can't guarantee that a
> > > > file with a coarse-grained timestamp modified after one with a
> > > > fine-grained timestamp will always appear to have a later value. This
> > > > could confuse programs like make, rsync, find, etc. that depend on
> > > > strict ordering requirements for timestamps.
> > > >
> > > > The goal of this version is more modest: fix XFS's change attribute.
> > > > XFS's change attribute is bumped on atime updates in addition to
> > > > other deliberate changes. This makes it unsuitable for export via
> > > > nfsd.
> > > >
> > > > Jan Kara suggested keeping this functionality internal-only for now
> > > > and plumbing the fine-grained timestamps through getattr [1]. This
> > > > set takes a slightly different approach and has XFS use the
> > > > fine-grained attr to fake up STATX_CHANGE_COOKIE in its getattr
> > > > routine itself.
> > > >
> > > > While we keep fine-grained timestamps in struct inode, when
> > > > presenting the timestamps via getattr, we truncate them at a
> > > > granularity of the number of ns per jiffy,
> > >
> > > That's not good, because an explicitly user-set granular mtime would
> > > be truncated too, and booting with different kernels (HZ) would change
> > > the observed timestamps of files.
> > >
> >
> > That's a very good point.
> >
> > > > which allows us to smooth over the fuzz that causes
> > > > ordering problems.
> > > >
> > >
> > > The reported ordering problem (e.g. cp -u) is not even limited to the
> > > scope of a single fs, right?
> > >
> >
> > It isn't. Most of the tools we're concerned with don't generally care
> > about filesystem boundaries.
> >
> > > Thinking out loud - if the QUERIED bit was not per inode timestamp
> > > but instead in a global fs_multigrain_ts variable, then all the inodes
> > > of all the mgtime fs would be using globally ordered timestamps.
> > >
> > > That should eliminate the reported issues with time reordering between
> > > fine- and coarse-grained timestamps.
> > >
> > > The risk of extra unneeded "change cookie" updates compared to a
> > > per-inode QUERIED bit may exist, but I think it is a rather small
> > > overhead, and maybe worth the tradeoff versus having to maintain a
> > > real per-inode "change cookie" in addition to a "globally ordered
> > > mgtime"?
> > >
> > > If this idea is acceptable, you may still be able to salvage the
> > > reverted ctime series for 6.7, because the change to use a global
> > > mgtime should be quite trivial?
> > >
> >
> > This is basically the idea I was going to look at next once I got some
> > other stuff settled here: when we apply a fine-grained timestamp to an
> > inode, we'd advance the coarse-grained clock that filesystems use to
> > that value.
> >
> > It could cause some write amplification: if you are streaming writes
> > to a bunch of files at the same time and someone stats one of them,
> > then they'd all end up getting an extra inode transaction. That
> > doesn't sound _too_ bad on its face, but I probably need to implement
> > it and then run some numbers to see.
> >
>
> Several journal transactions within a single jiffy tick?
> If the ctime/change_cookie of an inode is updated once within the
> scope of a single running transaction, I don't think it matters how
> many times it would be updated, but maybe I am missing something.
>
> The problem is probably going to be that the seqlock of the
> coarse-grained clock is going to be invalidated way too frequently to
> be "read mostly" in the presence of an ls -lR workload, but again, I
> did not study the implementation, so I may be way off.
>

That may end up being the case, but I think if we can minimize the
number of fine-grained updates, then the number of invalidations will
be minimal too.

I haven't rolled an implementation of this yet. This is all very much
still in the "waving of hands" stage anyway. Once the dust settles from
the atime and mtime API rework, I may still take a stab at doing this.
(A rough sketch of the idea is appended below for reference.)
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
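[Editor's note: as a concrete illustration of the scheme discussed in
this thread - hand out a fine-grained stamp when a timestamp has been
queried, and advance a shared coarse-grained "floor" past it so that a
later coarse-grained stamp can never sort earlier - here is a minimal,
self-contained userspace sketch. It is emphatically not the kernel
implementation: the names (coarse_floor, coarse_stamp(), fine_stamp())
are hypothetical, a plain mutex stands in for the seqlock Amir
mentions, and CLOCK_REALTIME_COARSE stands in for the kernel's coarse
clock.]

#define _GNU_SOURCE
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/*
 * Toy model (not kernel code): a shared floor that fine-grained stamps
 * advance, and that coarse-grained stamps are clamped to, so a coarse
 * stamp taken after a queried fine stamp can never sort before it.
 */
static pthread_mutex_t floor_lock = PTHREAD_MUTEX_INITIALIZER;
static struct timespec coarse_floor;

static bool ts_after(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec > b->tv_sec ||
	       (a->tv_sec == b->tv_sec && a->tv_nsec > b->tv_nsec);
}

/* Normal (cheap) case: read the coarse clock, clamp to the floor. */
static struct timespec coarse_stamp(void)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME_COARSE, &now);
	pthread_mutex_lock(&floor_lock);
	if (ts_after(&coarse_floor, &now))
		now = coarse_floor;
	pthread_mutex_unlock(&floor_lock);
	return now;
}

/* QUERIED case: take a fine-grained stamp and advance the floor. */
static struct timespec fine_stamp(void)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME, &now);
	pthread_mutex_lock(&floor_lock);
	if (ts_after(&now, &coarse_floor))
		coarse_floor = now;
	pthread_mutex_unlock(&floor_lock);
	return now;
}

int main(void)
{
	/* stat() queried inode A, so it got a fine-grained ctime ... */
	struct timespec fine = fine_stamp();
	/* ... a later write to inode B still takes a coarse stamp ... */
	struct timespec coarse = coarse_stamp();

	/* ... but the later coarse stamp now sorts >= the fine one. */
	printf("fine:   %lld.%09ld\n", (long long)fine.tv_sec, fine.tv_nsec);
	printf("coarse: %lld.%09ld\n", (long long)coarse.tv_sec,
	       coarse.tv_nsec);
	return ts_after(&fine, &coarse) ? 1 : 0;
}

In this model, the write-amplification concern corresponds to how often
fine_stamp() moves the floor (forcing extra "change cookie" updates on
other inodes), and the seqlock concern corresponds to how often writers
to coarse_floor would invalidate readers under a stat-heavy workload
such as ls -lR.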