Re: [PATCH 1/1] Documentation/user-manual.txt: example for generating object hashes

Dirk Gouders <dirk@xxxxxxxxxxx> · Thu, 29 Feb 2024 23:35:35 +0100

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Dirk Gouders <dirk@xxxxxxxxxxx> writes:
>
>> If someone spends the time to work through the documentation, the
>> subject "hashes" can lead to contradictions:
>>
>> The README of the initial commit states hashes are generated from
>> compressed data (which changed very soon), whereas
>> Documentation/user-manual.txt says they are generated from original
>> data.
>>
>> Don't give doubts a chance: clarify this and present a simple example
>> on how object hashes can be generated manually.
>
> I'd rather not to waste readers' attention to historical wart.

Yes, but -- I should have mentioned it -- the document itself suggests
to read the initial commit.

But I don't mean to argue about that, perhaps I digged to deep into
details.

>> @@ -4095,6 +4095,39 @@ that is used to name the object is the hash of the original data
>>  plus this header, so `sha1sum` 'file' does not match the object name
>>  for 'file'.
>
> The paragraph above (part of it is hidden before the hunk) clearly
> states what the naming rules are.  We hash the original and then
> compress.  If I use an implementation of Git that drives the zlib at
> compression level 1, and if you clone from my repository with
> another implementation of Git whose zlib is driven at compression
> level 9, our .git/objects/01/2345...90 files may not be identical,
> but when uncompressed they should store the same contents, so "hash
> then compress" is the only sensible choice that is not affected by
> the compression to give stable names to objects.

Thank your for that detail.

>> +Starting with the initial commit, hashing was done on the compressed
>> +data and the file README of that commit explicitely states this:
>> +
>> +"The SHA1 hash is always the hash of the _compressed_ object, not the
>> +original one."
>> +
>> +This changed soon after that with commit
>> +d98b46f8d9a3 (Do SHA1 hash _before_ compression.).  Unfortunately, the
>> +commit message doesn't provide the detailed reasoning.
>
> These three are about Git development history, which by itself may
> be of interest for some people, but the main target audience of the
> user-manual is probably different from them.  They may be interested
> to learn how Git works, but it is only to feel that they understand
> how the "magic" things Git does, like "a cryptographic hash of
> contents is enough to uniquely identify the contents being tracked",
> works well to trust their precious contents [*].
>
>     Side note: 
>     https://lore.kernel.org/git/Pine.LNX.4.58.0504200144260.6467@xxxxxxxxxxxxxxx/
>     explains the reason behind the change to those who did not find
>     it obvious.
>
> FYI, another "breaking" change we did earlier in the history of the
> project was to update the sort order of paths in tree objects.  We
> do not need to confuse readers by talking about the original and
> updated sort order.  The only thing they need, when they want to get
> the feeling that they understand how things work, is the description
> of how things work in the version of Git they have ready access to.
> Historical mistakes we made, corrections we made and why, are
> certainly of interest but not for the target audience of this
> document.

Again thank you, very interesting reading.

> On the other hand, ...
>
>> +The following is a short example that demonstrates how hashes can be
>> +generated manually:
>> +
>> +Let's asume a small text file with the content "Hello git.\n"
>> +-------------------------------------------------
>> +$ cat > hello.txt <<EOF
>> +Hello git.
>> +EOF
>> +-------------------------------------------------
>> +
>> +We can now manually generate the hash `git` would use for this file:
>> +
>> +- The object we want the hash for is of type "blob" and its size is
>> +  11 bytes.
>> +
>> +- Prepend the object header to the file content and feed this to
>> +  sha1sum(1):
>> +
>> +-------------------------------------------------
>> +$ printf "blob 11\0" | cat - hello.txt | sha1sum
>> +7217614ba6e5f4e7db2edaa2cdf5fb5ee4358b57 .
>> +-------------------------------------------------
>> +
>
> ... something like the above (modulo coding style) would be a useful
> addition to help those who want to convince themselves they
> understand how (some parts of) Git works under the hood, and I think
> it would be a welcome addition to some subset of such readers (the
> rest of the world may feel it is way too much detail, though).
>
> I would draw the line between this one and a similar description and
> demonstration of historical mistakes, which is not as relevant as
> how things work in the current system.  In other words, to me, it is
> OK to dig a bit deep to show how the current scheme works but it is
> way too much to do the same for versions of the system that do not
> exist anymore.
>
> But others may draw the line differently and consider even the above
> a bit too much detail, which is a position I would also accept.
>
> Thanks.