Junio C Hamano <gitster@xxxxxxxxx> writes:

> Dirk Gouders <dirk@xxxxxxxxxxx> writes:
>
>> If someone spends the time to work through the documentation, the
>> subject "hashes" can lead to contradictions:
>>
>> The README of the initial commit states hashes are generated from
>> compressed data (which changed very soon), whereas
>> Documentation/user-manual.txt says they are generated from original
>> data.
>>
>> Don't give doubts a chance: clarify this and present a simple example
>> on how object hashes can be generated manually.
>
> I'd rather not to waste readers' attention to historical wart.

Yes, but -- I should have mentioned it -- the document itself suggests
reading the initial commit.  But I don't mean to argue about that;
perhaps I dug too deep into details.

>> @@ -4095,6 +4095,39 @@ that is used to name the object is the hash of the original data
>>  plus this header, so `sha1sum` 'file' does not match the object name
>>  for 'file'.
>
> The paragraph above (part of it is hidden before the hunk) clearly
> states what the naming rules are.  We hash the original and then
> compress.  If I use an implementation of Git that drives the zlib at
> compression level 1, and if you clone from my repository with
> another implementation of Git whose zlib is driven at compression
> level 9, our .git/objects/01/2345...90 files may not be identical,
> but when uncompressed they should store the same contents, so "hash
> then compress" is the only sensible choice that is not affected by
> the compression to give stable names to objects.

Thank you for that detail.

>> +Starting with the initial commit, hashing was done on the compressed
>> +data and the file README of that commit explicitly states this:
>> +
>> +"The SHA1 hash is always the hash of the _compressed_ object, not the
>> +original one."
>> +
>> +This changed soon after that with commit
>> +d98b46f8d9a3 (Do SHA1 hash _before_ compression.).
>> +Unfortunately, the commit message doesn't provide the detailed
>> +reasoning.
>
> These three are about Git development history, which by itself may
> be of interest for some people, but the main target audience of the
> user-manual is probably different from them.  They may be interested
> to learn how Git works, but it is only to feel that they understand
> how the "magic" things Git does, like "a cryptographic hash of
> contents is enough to uniquely identify the contents being tracked",
> works well to trust their precious contents [*].
>
> Side note:
> https://lore.kernel.org/git/Pine.LNX.4.58.0504200144260.6467@xxxxxxxxxxxxxxx/
> explains the reason behind the change to those who did not find
> it obvious.
>
> FYI, another "breaking" change we did earlier in the history of the
> project was to update the sort order of paths in tree objects.  We
> do not need to confuse readers by talking about the original and
> updated sort order.  The only thing they need, when they want to get
> the feeling that they understand how things work, is the description
> of how things work in the version of Git they have ready access to.
> Historical mistakes we made, corrections we made and why, are
> certainly of interest but not for the target audience of this
> document.

Again, thank you; very interesting reading.

> On the other hand, ...
>
>> +The following is a short example that demonstrates how hashes can be
>> +generated manually:
>> +
>> +Let's assume a small text file with the content "Hello git.\n"
>> +-------------------------------------------------
>> +$ cat > hello.txt <<EOF
>> +Hello git.
>> +EOF
>> +-------------------------------------------------
>> +
>> +We can now manually generate the hash `git` would use for this file:
>> +
>> +- The object we want the hash for is of type "blob" and its size is
>> +  11 bytes.
>> +
>> +- Prepend the object header to the file content and feed this to
>> +  sha1sum(1):
>> +
>> +-------------------------------------------------
>> +$ printf "blob 11\0" | cat - hello.txt | sha1sum
>> +7217614ba6e5f4e7db2edaa2cdf5fb5ee4358b57 .
>> +-------------------------------------------------
>> +
>
> ... something like the above (modulo coding style) would be a useful
> addition to help those who want to convince themselves they
> understand how (some parts of) Git works under the hood, and I think
> it would be a welcome addition to some subset of such readers (the
> rest of the world may feel it is way too much detail, though).
>
> I would draw the line between this one and a similar description and
> demonstration of historical mistakes, which is not as relevant as
> how things work in the current system.  In other words, to me, it is
> OK to dig a bit deep to show how the current scheme works but it is
> way too much to do the same for versions of the system that do not
> exist anymore.
>
> But others may draw the line differently and consider even the above
> a bit too much detail, which is a position I would also accept.
>
> Thanks.
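
For anyone who wants to repeat the quoted example without hard-coding
the size, here is a minimal sketch of the same "header plus contents"
computation.  It assumes a POSIX shell with `sha1sum` from coreutils
and the SHA-1 object format; `blob_hash` is a hypothetical helper name
chosen for illustration, not a Git command.

```shell
# Hypothetical helper (not part of Git): print the blob object name of
# a file, i.e. the SHA-1 of "blob <size>\0" followed by the raw contents.
blob_hash() {
    # Byte count of the file; strip padding that some wc versions add.
    size=$(wc -c < "$1" | tr -d '[:space:]')
    # Emit the object header, then the uncompressed contents, and hash both.
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

# Same input as in the quoted example.
printf 'Hello git.\n' > hello.txt
blob_hash hello.txt
```

If Git is installed, `git hash-object hello.txt` should print the same
name, since both hash the uncompressed data behind the same header.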