adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files)

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 25 May 2022 11:11:05 +0200

On Tue, May 24 2022, Taylor Blau wrote:

> On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@xxxxxxxxxxxxx wrote:
>> On May 24, 2022 6:25 PM ,Taylor Blau write:
>> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@xxxxxxxxxxxxx wrote:
>> >> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
>> >> on some platforms, signed/unsigned, and sometimes 64-bit. We are
>> >> talking about potentially long-persistent files, as I understand this
>> >> series, so we should not be limiting times to end at 2038. That's only
>> >> 16 years off and I would wager that many clones that exist today will exist then.
>> >
>> >Note that we're using unsigned fields here, so we have until 2106 (see my earlier
>> >response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>>
>> I appreciate that, but 32-bit time_t is still signed on many
>> platforms, so when cast, it still might, at some point in another
>> series, cause issues. Please be cautious. I expect that this is the
>> particular hill on which I will die. 😉
>> --Randall
>
> Yes, definitely. There is only one spot that we turn the result of
> nth_packed_mtime() into a time_t, and that's in
> add_object_in_unpacked_pack(). The code there is something like:
>
>     time_t mtime;
>     if (pack->is_cruft)
>       mtime = nth_packed_mtime(pack, object_pos);
>     else
>       mtime = pack->mtime;
>
>     ...
>
>     add_cruft_object_entry(oid, ..., mtime);
>
> ...and the reason mtime is a time_t is because that's the type of
> pack->mtime.
>
> And we quickly convert that back to a uint32_t in
> add_cruft_object_entry(). If time_t is signed, then we'll truncate any
> values beyond 2106, and pre-epoch values will become large positive
> values. That means our error is one-sided in the favorable direction,
> i.e., that we'll keep objects around for longer instead of pruning
> something that we shouldn't have.

I must say that I really don't like this part of the format. Is it
really necessary to optimize the storage space here in a way that leaves
open questions about future time_t compatibility, and having to
introduce the first use of unsigned 32 bit timestamps to git's codebase?

Yes, this is its own self-contained format, so we don't *need* time_t
here, but it's also really handy if we can eventually consistently use
64 time_t everywhere and not worry about any compatibility issues, or
unsigned v.s. signed, or to create our own little ext4-like signed 32
bit timestamp format.

Once we hit 2038 (or near that date) this would be the only part of our
codebase & on-disk formats that I'm aware of that would differ from
time_t's signedness, but perhaps there's some I've missed.

If there isn't a demonstrable reason (as in some real numbers, or
accompanying benchmark etc.) to special-snowflake this I really think we
should just go for signed 64 bit here, i.e. matching time_t on 64 bit
systems.

If we really are trying to micro-optimize storage space here I'm willing
to bet that this is still a bad/premature optimization. There's much
better ways to store this sort of data in a compact way if that's the
concern. E.g. you'd store a 64 bit "base" timestamp in the header for
the first entry, and have smaller (signed) "delta" timestamps storing
offsets from that "base" timestamp.

This would take advantage of the fact that when we find loose objects
we're vanishingly unlikely to have them splayed over more than a
days/weeks/months or in the worst case small number of years from the
"base" (and if we ever do we could simply shrug and leave such objects
out of the pack entirely).

We could thus keep the 32 bit second-resolution timestamps you have
here, they'd just be signed deltas to the 64 bit signed "base" in a
header.

Even better (again, if micro-optimizing this is really needed) would be
to store a 64 bit signed base and a table of 16 bit signed offsets.

We'd simply declare that for our expiry times we'd "snap" any such
values to the next day. Our current GC config exposes down-to-the-second
expiry times, but in practice nobody needs that. A 16 bit signed "day
offset" would give you 2^15/365 = 89 years +/- of day-resolution expiry
for objects. To avoid thundering herds we could even fake up an exact
down-to-the-second expiry on the computed day by combining the expiry
time & the first few bits of the OID.

== BREAK

Aside about time_t being signed v.s. unsigned. This is edited from an
older off-list E-Mail of mine (from git-security): For time_t itself no
standard says that time_t must be signed, but in practice it's
ubiquitous

This thread is informative
http://mm.icann.org/pipermail/tz/2004-July/012503.html it continues the
month after: http://mm.icann.org/pipermail/tz/2004-August/thread.html

Summary: Yeah it can be unsigned in theory, but it seems like nobody's
been crazy enough to try it, so it's de-facto standardized to
signed. Everyone has a Y2038 problem, nobody has a Y2106 problem. Well,
with time_t, e.g. Linux filesystems tend to use unsigned 32 bit epochs:
https://kernelnewbies.org/y2038/vfs