On Sat, 2023-09-23 at 17:58 +0300, Amir Goldstein wrote:
> On Sat, Sep 23, 2023 at 1:22 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Sat, 2023-09-23 at 10:15 +0300, Amir Goldstein wrote:
> > > On Fri, Sep 22, 2023 at 8:15 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > My initial goal was to implement multigrain timestamps on most major
> > > > filesystems, so we could present them to userland and use them for
> > > > NFSv3, etc.
> > > >
> > > > With the current implementation, however, we can't guarantee that a
> > > > file with a coarse-grained timestamp modified after one with a
> > > > fine-grained timestamp will always appear to have a later value. This
> > > > could confuse programs like make, rsync, find, etc. that depend on
> > > > strict ordering requirements for timestamps.
> > > >
> > > > The goal of this version is more modest: fix XFS's change attribute.
> > > > XFS's change attribute is bumped on atime updates in addition to
> > > > other deliberate changes. This makes it unsuitable for export via
> > > > nfsd.
> > > >
> > > > Jan Kara suggested keeping this functionality internal-only for now
> > > > and plumbing the fine-grained timestamps through getattr [1]. This
> > > > set takes a slightly different approach and has XFS use the
> > > > fine-grained attr to fake up STATX_CHANGE_COOKIE in its getattr
> > > > routine itself.
> > > >
> > > > While we keep fine-grained timestamps in struct inode, when
> > > > presenting the timestamps via getattr, we truncate them at a
> > > > granularity of the number of ns per jiffy,
> > >
> > > That's not good, because an explicitly user-set granular mtime would
> > > be truncated too, and booting with different kernels (HZ) would change
> > > the observed timestamps of files.
> > >
> >
> > That's a very good point.
> >
> > > > which allows us to smooth over the fuzz that causes
> > > > ordering problems.
> > > >
> > >
> > > The reported ordering problem (e.g. cp -u) is not even limited to the
> > > scope of a single fs, right?
> > >
> >
> > It isn't. Most of the tools we're concerned with don't generally care
> > about filesystem boundaries.
> >
> > > Thinking out loud - if the QUERIED bit was not per inode timestamp
> > > but instead in a global fs_multigrain_ts variable, then all the inodes
> > > of all the mgtime fs would be using globally ordered timestamps.
> > >
> > > That should eliminate the reported issues with time reordering between
> > > fine- and coarse-grained timestamps.
> > >
> > > The risk of extra unneeded "change cookie" updates compared to a
> > > per-inode QUERIED bit may exist, but I think it is a rather small
> > > overhead, and maybe worth the tradeoff versus having to maintain a
> > > real per-inode "change cookie" in addition to a "globally ordered
> > > mgtime"?
> > >
> > > If this idea is acceptable, you may still be able to salvage the
> > > reverted ctime series for 6.7, because the change to use a global
> > > mgtime should be quite trivial?
> > >
> >
> > This is basically the idea I was going to look at next once I got some
> > other stuff settled here: when we apply a fine-grained timestamp to an
> > inode, we'd advance the coarse-grained clock that filesystems use to
> > that value.
> >
> > It could cause some write amplification: if you are streaming writes
> > to a bunch of files at the same time and someone stats one of them,
> > then they'd all end up getting an extra inode transaction. That
> > doesn't sound _too_ bad on its face, but I probably need to implement
> > it and then run some numbers to see.
> >
>
> Several journal transactions within a single jiffy tick?
> If the ctime/change_cookie of an inode is updated once within the
> scope of a single running transaction, I don't think it matters how
> many times it would be updated, but maybe I am missing something.
>
> The problem is probably going to be that the seqlock of the
> coarse-grained clock is going to be invalidated way too frequently to
> be "read mostly" in the presence of an ls -lR workload, but again, I
> did not study the implementation, so I may be way off.
>

That may end up being the case, but I think if we can minimize the
number of fine-grained updates, then the number of invalidations will
be minimal too.

I haven't rolled an implementation of this yet. This is all very much
still in the "waving of hands" stage anyway. Once the dust settles from
the atime and mtime API rework, I may still take a stab at doing this.
(A rough sketch of the idea is appended below for reference.)
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
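[Editor's note: as a concrete illustration of the scheme discussed in
this thread - hand out a fine-grained stamp when a timestamp has been
queried, and advance a shared coarse-grained "floor" past it so that a
later coarse-grained stamp can never sort earlier - here is a minimal,
self-contained userspace sketch. It is emphatically not the kernel
implementation: the names (coarse_floor, coarse_stamp(), fine_stamp())
are hypothetical, a plain mutex stands in for the seqlock Amir
mentions, and CLOCK_REALTIME_COARSE stands in for the kernel's coarse
clock.]

#define _GNU_SOURCE
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/*
 * Toy model (not kernel code): a shared floor that fine-grained stamps
 * advance, and that coarse-grained stamps are clamped to, so a coarse
 * stamp taken after a queried fine stamp can never sort before it.
 */
static pthread_mutex_t floor_lock = PTHREAD_MUTEX_INITIALIZER;
static struct timespec coarse_floor;

static bool ts_after(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec > b->tv_sec ||
	       (a->tv_sec == b->tv_sec && a->tv_nsec > b->tv_nsec);
}

/* Normal (cheap) case: read the coarse clock, clamp to the floor. */
static struct timespec coarse_stamp(void)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME_COARSE, &now);
	pthread_mutex_lock(&floor_lock);
	if (ts_after(&coarse_floor, &now))
		now = coarse_floor;
	pthread_mutex_unlock(&floor_lock);
	return now;
}

/* QUERIED case: take a fine-grained stamp and advance the floor. */
static struct timespec fine_stamp(void)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME, &now);
	pthread_mutex_lock(&floor_lock);
	if (ts_after(&now, &coarse_floor))
		coarse_floor = now;
	pthread_mutex_unlock(&floor_lock);
	return now;
}

int main(void)
{
	/* stat() queried inode A, so it got a fine-grained ctime ... */
	struct timespec fine = fine_stamp();
	/* ... a later write to inode B still takes a coarse stamp ... */
	struct timespec coarse = coarse_stamp();

	/* ... but the later coarse stamp now sorts >= the fine one. */
	printf("fine:   %lld.%09ld\n", (long long)fine.tv_sec, fine.tv_nsec);
	printf("coarse: %lld.%09ld\n", (long long)coarse.tv_sec,
	       coarse.tv_nsec);
	return ts_after(&fine, &coarse) ? 1 : 0;
}

In this model, the write-amplification concern corresponds to how often
fine_stamp() moves the floor (forcing extra "change cookie" updates on
other inodes), and the seqlock concern corresponds to how often writers
to coarse_floor would invalidate readers under a stat-heavy workload
such as ls -lR.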