On Sat, Sep 23, 2023 at 3:43 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.
>
> My biggest impetus was really that whole abuse of a type that I
> already disliked for other reasons.
>
> > I'm on board with the idea, but...that's likely to be as big a patch
> > series as the ctime overhaul was. In fact, it'll touch a lot of the
> > same code. I can take a stab at that in the near future though.
>
> Yea, it's likely to be fairly big and invasive. That was one of the
> reasons for my suggested "inode_time()" macro hack: using macro
> argument concatenation is really a hack to "gather" the pieces based
> on name, and while it's odd and not a very typical kernel model, I
> think doing it that way might allow the conversion to be slightly
> less painful.
>
> You'd obviously have to have the same kind of thing for assignment.
>
> Without that kind of name-based hack, you'd have to create all these
> random helper functions that just do the same thing over and over for
> the different times, which seems really annoying.
>
> > Since we're on the subject...another thing that bothers me with all
> > of the timestamp handling is that we don't currently try to mitigate
> > "torn reads" across the two different words. It seems like you could
> > fetch a tv_sec value and then get a tv_nsec value that represents an
> > entirely different timestamp if there are stores between them.
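For readers following along: the token-pasting idea Linus describes
above could look roughly like the sketch below. This is a plain
userspace mock-up with hypothetical field names (i_mtime_sec etc., not
the actual VFS layout), just to show how `##` concatenation gathers the
split sec/nsec fields by name for both read and assignment:

```c
#include <assert.h>
#include <stdint.h>

struct timespec64 { int64_t tv_sec; long tv_nsec; };

/* Hypothetical split-field layout, for illustration only */
struct inode {
	int64_t  i_atime_sec, i_mtime_sec, i_ctime_sec;
	uint32_t i_atime_nsec, i_mtime_nsec, i_ctime_nsec;
};

/* Read accessor: gathers i_<which>_sec/i_<which>_nsec by name */
#define inode_time(inode, which) \
	((struct timespec64){ (inode)->i_##which##_sec, \
			      (inode)->i_##which##_nsec })

/* The matching assignment helper you'd also need */
#define inode_time_set(inode, which, ts) do {		\
	(inode)->i_##which##_sec = (ts).tv_sec;		\
	(inode)->i_##which##_nsec = (ts).tv_nsec;	\
} while (0)
```

The point of doing it by name rather than by pointer is that one macro
covers atime/mtime/ctime without a per-timestamp helper function each.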
>
> Hmm. I think that's an issue that we have always had in theory, and
> have ignored because it's simply not problematic in practice, and
> fixing it is *hugely* painful.
>
> I suspect we'd have to use some kind of sequence lock for it (to make
> reads be cheap), and while it's _possible_ that having the separate
> accessor functions for reading/writing those times might help things
> out, I suspect the reading/writing happens for the different times
> (ie atime/mtime/ctime) together often enough that you might want to
> have the locking done at an outer level, and _not_ do it at the
> accessor level.
>
> So I suspect this is a completely separate issue (ie even an accessor
> doesn't make the "hugely painful" go away). And probably not worth
> worrying about *unless* somebody decides that they really really care
> about the race.
>
> That said, one thing that *could* help is if people decide that the
> right format for inode times is to just have one 64-bit word that has
> "sufficient resolution". That's what we did for "kernel time", ie
> "ktime_t" is a 64-bit nanosecond count, and by being just a single
> value, it avoids not just the horrible padding with 'struct
> timespec64', it is also dense _and_ can be accessed as one atomic
> value.

Just pointing out that xfs has already changed its on-disk timestamp
format to this model (a.k.a. bigtime), but the in-core inode still
uses timespec64, of course. The vfs could take inspiration from this
model.

> Sadly, that "sufficient resolution" couldn't be nanoseconds, because
> 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
> time, because 2**63 nanoseconds is 292 years, so it moved the "year
> 2038" problem to "year 2262".

Note that xfs_bigtime_to_unix(XFS_BIGTIME_TIME_MAX) is in year 2486,
not year 2262, because there was no need to use the 64-bit range to go
backwards to year 1678.
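As an aside, the sequence-lock scheme Linus mentions for the torn-read
problem boils down to the pattern below. This is only a userspace
sketch of the core idea - the writer makes the count odd around the
two-word update, and the reader retries until it sees a stable, even
count. The kernel's real seqcount API additionally deals with memory
barriers, preemption and lockdep, so don't take this as the actual
implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

struct ts64 { int64_t tv_sec; long tv_nsec; };

static _Atomic unsigned int ts_seq;	/* even: stable, odd: write in progress */
static struct ts64 ts_val;

static void ts_write(int64_t sec, long nsec)
{
	atomic_fetch_add_explicit(&ts_seq, 1, memory_order_acq_rel);
	ts_val.tv_sec = sec;
	ts_val.tv_nsec = nsec;
	atomic_fetch_add_explicit(&ts_seq, 1, memory_order_acq_rel);
}

static struct ts64 ts_read(void)
{
	struct ts64 t;
	unsigned int seq;

	do {
		/* wait out an in-progress write, then snapshot */
		while ((seq = atomic_load_explicit(&ts_seq,
						memory_order_acquire)) & 1)
			;
		t = ts_val;
		/* retry if a writer slipped in during the copy */
	} while (atomic_load_explicit(&ts_seq, memory_order_acquire) != seq);
	return t;
}
```

The cost asymmetry is the point: reads stay lock-free and cheap, which
is why Linus suggests this shape rather than a plain lock - but as he
notes, you'd likely want one sequence count around all three times at
an outer level, not one per accessor.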
>
> And that's ok when we're talking about times that are kernel running
> times and we have a couple of centuries to say "ok, we'll need to
> make it be a bigger type", but when you save the values to disk,
> things are different. I suspect filesystem people are *not* willing
> to deal with a "year 2262" issue.
>

Apparently, they are willing to handle the "year 2486" issue ;)

> But if we were to say that "a tenth of a microsecond resolution is
> sufficient for inode timestamps", then suddenly 64 bits is
> *enormous*. So we could do a
>
>         // tenths of a microsecond since Jan 1, 1970
>         typedef s64 fstime_t;
>
> and have a nice dense timestamp format with reasonable - but not
> nanosecond - accuracy. Now that 292 year range has become 29,247
> years, and filesystem people *might* find the "year-31k" problem
> acceptable.
>
> I happen to think that "100ns timestamp resolution on files is
> sufficient" is a very reasonable statement, but I suspect that we'll
> still find lots of people who say "that's completely unacceptable"
> both to that resolution, and to the 31k-year problem.
>

I am guessing that you are aware of the Windows/SMB FILETIME standard,
which is 64-bit in 100ns units (since 1601), so the 31k-year "problem"
is already very widespread.

But the resolution change runs counter to the purpose of multigrain
timestamps - if two syscalls update the same inode (or two different
inodes) within one 100ns tick, apparently there are workloads that
care to know about it, and the fs needs to store this information
persistently.

Thanks,
Amir.