Re: [PATCH v5 00/17] cruft packs

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Wed, 25 May 2022 15:59:24 -0400

On 5/25/2022 3:53 AM, Jonathan Nieder wrote:
> Taylor Blau wrote:
>> On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>>>> Moreover, I can't seem to find any formats that _don't_ use that
>>>> convention.
>>>
>>> It's used in the reftable format.

The use in reftable is the only one I can find, and that implementation
is not idiomatic. Specifically, the way its four-byte header is
implemented makes it hard to extract and share with other formats.

This series does the good work of extracting oid_version() as a
common method across these formats so it is easier to share.
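
To make "sequential value" concrete, here is a rough sketch of what such
a shared helper does. This is not the code from the series: the
sketch_* names are made up for illustration, and the real helper takes
Git's hash algorithm struct and dies on an unknown algorithm rather
than asserting.

```c
#include <assert.h>

/* Hypothetical stand-in for Git's hash algorithm identifiers. */
enum sketch_hash_algo {
	SKETCH_HASH_SHA1,
	SKETCH_HASH_SHA256
};

/* Writer side: map an algorithm to the sequential value stored on disk. */
static int sketch_oid_version(enum sketch_hash_algo algo)
{
	switch (algo) {
	case SKETCH_HASH_SHA1:
		return 1;
	case SKETCH_HASH_SHA256:
		return 2;
	}
	/* Assumption: treat anything else as a programming error. */
	assert(!"unknown hash algorithm");
	return 0;
}

/*
 * Reader side: map an on-disk value back to an algorithm.
 * Returns 0 on success and -1 if the value is not recognized.
 */
static int sketch_algo_from_version(int version, enum sketch_hash_algo *out)
{
	switch (version) {
	case 1:
		*out = SKETCH_HASH_SHA1;
		return 0;
	case 2:
		*out = SKETCH_HASH_SHA256;
		return 0;
	}
	return -1;
}
```

The point is only that every format calling the same pair of functions
agrees on what "1" and "2" mean.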

> It's also used in the formats described in
> Documentation/technical/hash-function-transition.

That document describes things that have not been implemented, such as
the v3 pack-index format:

  Pack index (.idx) files use a new v3 format that supports multiple
  hash functions. They have the following format (all integers are in
  network byte order):
(...)
  * 4-byte number of object formats in this pack index: 2
  * For each object format:
    ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
    ** 4-byte length in bytes of shortened object names. This is the
      shortest possible length needed to make names in the shortened
      object name table unambiguous.
    ** 4-byte integer, recording where tables relating to this format
      are stored in this index file, as an offset from the beginning.

This was added in your 752414ae431 (technical doc: add a design doc
for hash function transition, 2017-09-27), but has not been acted upon
yet.

> [...]
>> Sounds good. Unless others have a very strong opinion, let's leave it as
>> is.
> 
> File formats are one of those things where a little time early can save
> a lot of work later.  If there were a strong reason to use "1" and "2"
> here then I'd be okay with living with it --- I'm a pragmatic person.
> But in general, using the magic numbers instead of a sequential value is
> really helpful both in making the file formats more self-explanatory and
> in making it possible to experiment with multiple new hash_algos at the
> same time.
> 
> The main argument I'm hearing for using "1" and "2" is "because some
> other formats got that wrong".  That reason is the opposite of
> compelling to me: it makes me suspect that as a project we should more
> eagerly break the old bad habits and form new ones.  I guess this
> qualifies as a very strong opinion.

Either way, these are magic numbers. One of them happens to spell out
something readable when you look at the file in a hex editor with an
ASCII preview, but that doesn't change the fact that what matters most
is that the hash function is correctly indicated by the file format and
correctly parsed by the Git executable (not by a human).
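
For anyone who wants to see the difference on disk, here is a small
sketch that stores each kind of identifier as a four-byte big-endian
field. The sketch_* helpers are hypothetical and only exist for this
illustration; 0x73686131 is simply the bytes 's', 'h', 'a', '1' read as
a 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The ASCII string "sha1" interpreted as a big-endian 32-bit integer. */
#define SKETCH_SHA1_FORMAT_ID 0x73686131u

/* Store a 32-bit value in network (big-endian) byte order. */
static void sketch_put_be32(unsigned char *out, uint32_t value)
{
	assert(out != NULL);
	out[0] = (unsigned char)(value >> 24);
	out[1] = (unsigned char)(value >> 16);
	out[2] = (unsigned char)(value >> 8);
	out[3] = (unsigned char)value;
}

/* Read a 32-bit value stored in network (big-endian) byte order. */
static uint32_t sketch_get_be32(const unsigned char *in)
{
	assert(in != NULL);
	return ((uint32_t)in[0] << 24) |
	       ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) |
	       (uint32_t)in[3];
}
```

Writing SKETCH_SHA1_FORMAT_ID yields the bytes 73 68 61 31, which a hex
editor previews as "sha1". Writing the sequential value 1 yields
00 00 00 01. The parser does the same amount of work in both cases: read
four bytes and compare against a known constant.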

I'd much rather have a consistent and proven way of specifying the
hash function (using the oid_version() helper) than invent a new
mechanism.

Thanks,
-Stolee