Re: [GIT PULL v2] timestamp fixes

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 21 Sep 2023 12:28:13 -0700

On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> We have many, many inodes though, and 12 bytes per adds up!

That was my thinking, but honestly, who knows what other alignment
issues might eat up some - or all - of the theoreteical 12 bytes.

It might be, for example, that the inode is already some aligned size,
and that the allocation alignment means that the size wouldn't
*really* shrink at all.

So I just want to make clear that I think the 12 bytes isn't
necessarily there. Maybe you'd get it, maybe it would be hidden by
other things.

My biggest impetus was really that whole abuse of a type that I
already disliked for other reasons.

> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.

Yea, it's likely to be fairly big and invasive.  That was one of the
reasons for my suggested "inode_time()" macro hack: using the macro
argument concatenation is really a hack to "gather" the pieces based
on name, and while it's odd and not a very typical kernel model, I
think doing it that way might allow the conversion to be slightly less
painful.

You'd obviously have to have the same kind of thing for assignment.

Without that kind of name-based hack, you'd have to create all these
random helper functions that just do the same thing over and over for
the different times, which seems really annoying.

> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.

Hmm. I think that's an issue that we have always had in theory, and
have ignored because it's simply not problematic in practice, and
fixing it is *hugely* painful.

I suspect we'd have to use some kind of sequence lock for it (to make
reads be cheap), and while it's _possible_ that having the separate
accessor functions for reading/writing those times might help things
out, I suspect the reading/writing happens for the different times (ie
atime/mtime/ctime) together often enough that you might want to have
the locking done at an outer level, and _not_ do it at the accessor
level.

So I suspect this is a completely separate issue (ie even an accessor
doesn't make the "hugely painful" go away). And probably not worth
worrying about *unless* somebody decides that they really really care
about the race.

That said, one thing that *could* help is if people decide that the
right format for inode times is to just have one 64-bit word that has
"sufficient resolution". That's what we did for "kernel time", ie
"ktime_t" is a 64-bit nanosecond count, and by being just a single
value, it avoids not just the horrible padding with 'struct
timespec64', it is also dense _and_ can be accessed as one atomic
value.

Sadly, that "sufficient resolution" couldn't be nanoseconds, because
64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
time, because 2**63 nanoseconds is 292 years, so it moved the "year
2038" problem to "year 2262".

And that's ok when we're talking about times that are kernel running
times and we haev a couple of centuries to say "ok, we'll need to make
it be a bigger type", but when you save the values to disk, things are
different. I suspect filesystem people are *not* willing to deal with
a "year 2262" issue.

But if we were to say that "a tenth of microsecond resolution is
sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
So we could do a

    // tenth of a microseconds since Jan 1, 1970
    typedef s64 fstime_t;

and have a nice dense timestamp format with reasonable - but not
nanosecond - accuracy. Now that 292 year range has become 29,247
years, and filesystem people *might* find the "year-31k" problem
acceptable.

I happen to think that "100ns timestamp resolution on files is
sufficient" is a very reasonable statement, but I suspect that we'll
still find lots of people who say "that's completely unacceptable"
both to that resolution, and to the 31k-year problem.

But wouldn't it be nice to have just one single "fstime_t" for file
timestamps, the same way we have "ktime_t" for CPU timestamps?

Then we'd save even more space in the 'struct inode'....

                  Linus