On Tue, May 24 2022, Taylor Blau wrote: > On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@xxxxxxxxxxxxx wrote: >> On May 24, 2022 6:25 PM ,Taylor Blau write: >> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@xxxxxxxxxxxxx wrote: >> >> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit >> >> on some platforms, signed/unsigned, and sometimes 64-bit. We are >> >> talking about potentially long-persistent files, as I understand this >> >> series, so we should not be limiting times to end at 2038. That's only >> >> 16 years off and I would wager that many clones that exist today will exist then. >> > >> >Note that we're using unsigned fields here, so we have until 2106 (see my earlier >> >response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/). >> >> I appreciate that, but 32-bit time_t is still signed on many >> platforms, so when cast, it still might, at some point in another >> series, cause issues. Please be cautious. I expect that this is the >> particular hill on which I will die. 😉 >> --Randall > > Yes, definitely. There is only one spot that we turn the result of > nth_packed_mtime() into a time_t, and that's in > add_object_in_unpacked_pack(). The code there is something like: > > time_t mtime; > if (pack->is_cruft) > mtime = nth_packed_mtime(pack, object_pos); > else > mtime = pack->mtime; > > ... > > add_cruft_object_entry(oid, ..., mtime); > > ...and the reason mtime is a time_t is because that's the type of > pack->mtime. > > And we quickly convert that back to a uint32_t in > add_cruft_object_entry(). If time_t is signed, then we'll truncate any > values beyond 2106, and pre-epoch values will become large positive > values. That means our error is one-sided in the favorable direction, > i.e., that we'll keep objects around for longer instead of pruning > something that we shouldn't have. I must say that I really don't like this part of the format. Is it really necessary to optimize the storage space here in a way that leaves open questions about future time_t compatibility, and having to introduce the first use of unsigned 32 bit timestamps to git's codebase? Yes, this is its own self-contained format, so we don't *need* time_t here, but it's also really handy if we can eventually consistently use 64 time_t everywhere and not worry about any compatibility issues, or unsigned v.s. signed, or to create our own little ext4-like signed 32 bit timestamp format. Once we hit 2038 (or near that date) this would be the only part of our codebase & on-disk formats that I'm aware of that would differ from time_t's signedness, but perhaps there's some I've missed. If there isn't a demonstrable reason (as in some real numbers, or accompanying benchmark etc.) to special-snowflake this I really think we should just go for signed 64 bit here, i.e. matching time_t on 64 bit systems. If we really are trying to micro-optimize storage space here I'm willing to bet that this is still a bad/premature optimization. There's much better ways to store this sort of data in a compact way if that's the concern. E.g. you'd store a 64 bit "base" timestamp in the header for the first entry, and have smaller (signed) "delta" timestamps storing offsets from that "base" timestamp. This would take advantage of the fact that when we find loose objects we're vanishingly unlikely to have them splayed over more than a days/weeks/months or in the worst case small number of years from the "base" (and if we ever do we could simply shrug and leave such objects out of the pack entirely). We could thus keep the 32 bit second-resolution timestamps you have here, they'd just be signed deltas to the 64 bit signed "base" in a header. Even better (again, if micro-optimizing this is really needed) would be to store a 64 bit signed base and a table of 16 bit signed offsets. We'd simply declare that for our expiry times we'd "snap" any such values to the next day. Our current GC config exposes down-to-the-second expiry times, but in practice nobody needs that. A 16 bit signed "day offset" would give you 2^15/365 = 89 years +/- of day-resolution expiry for objects. To avoid thundering herds we could even fake up an exact down-to-the-second expiry on the computed day by combining the expiry time & the first few bits of the OID. == BREAK Aside about time_t being signed v.s. unsigned. This is edited from an older off-list E-Mail of mine (from git-security): For time_t itself no standard says that time_t must be signed, but in practice it's ubiquitous This thread is informative http://mm.icann.org/pipermail/tz/2004-July/012503.html it continues the month after: http://mm.icann.org/pipermail/tz/2004-August/thread.html Summary: Yeah it can be unsigned in theory, but it seems like nobody's been crazy enough to try it, so it's de-facto standardized to signed. Everyone has a Y2038 problem, nobody has a Y2106 problem. Well, with time_t, e.g. Linux filesystems tend to use unsigned 32 bit epochs: https://kernelnewbies.org/y2038/vfs