esr@xxxxxxxxxxx (Eric S. Raymond) writes:

> I have been thinking hard about the problems raised during my
> request for unique timestamps. I think I've found a better way
> to bust the box I was trying to break out of. I am therefore
> withdrawing that proposal and replacing it with this one.
>
> It's time to separate commit identification from Merkle hashing.

Documentation/technical/hash-function-transition.txt identifies a
similar problem, namely that existing signatures in signed tags, signed
commits, and merges of signed tags are signatures of their SHA-1 form.
We want to be able to verify those signatures, even if this
verification may be considered less secure now.

You want both more (stable IDs for all commits, not only those signed)
and less (you don't need verification down the tree using the IDs that
serve as commit IDs).

> One reason I am sure of this is the SHA-1 to whatever transition.
> We can't count on the successor hash to survive attack forever.
> Accordingly, git's design needs to be stable against the possibility
> of having to accommodate multiple future hash algorithms in the
> future.
>
> Here's how to do it:
>
> 1. Commit IDs and Merkle-tree hashes become separate commit
>    properties in the git filesystem.

The issue you need to consider is that for signatures to be secure,
they must be made over the verification-hash Merkle tree.

It is not only commits that are identified by hashes, but also trees,
blobs, and tags. Commits reference other commits ("parent" lines) and a
tree ("tree" line); trees reference other trees, blobs, and possibly
commits (if submodules are used). Tags can reference any object, but
most commonly reference commits. Blobs, i.e. file contents, do not
reference any other objects.

For security, all those references should use the strongest hash
function. Changing the referencing hash (e.g. "parent" uses SHA-256
instead of SHA-1) means that the contents of the object change, and
thus so does its hash.
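As a rough illustration of that last point, here is a minimal Python
sketch. It is not Git's implementation: the helper names object_name,
commit_content, and child_commit are made up for this example, the
example.com identities are placeholders, and the commit layout is
simplified. It shows that the same logical commit gets different
content, and therefore a different name, depending on which hash its
"tree" and "parent" lines use.

```python
import hashlib


def object_name(obj_type, content, algo="sha1"):
    """Hash "<type> <length>" + NUL + content, the way Git names objects."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


def commit_content(tree, parent, message):
    """Build simplified commit content; identities are placeholders."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += [
        "author A U Thor <author@example.com> 1558619302 +0200",
        "committer C O Mitter <committer@example.com> 1558628753 -0500",
        "",
        message,
    ]
    return ("\n".join(lines) + "\n").encode()


def child_commit(algo):
    """Return (content, name) of a child commit whose references use `algo`."""
    tree = object_name("tree", b"", algo)  # empty tree
    root = object_name("commit", commit_content(tree, None, "root"), algo)
    content = commit_content(tree, root, "fixes")
    return content, object_name("commit", content, algo)


sha1_content, sha1_name = child_commit("sha1")
sha256_content, sha256_name = child_commit("sha256")

# Same tree, same parent, same message, yet the two forms differ,
# because the "tree" and "parent" lines carry different identifiers.
print(sha1_content != sha256_content)
print(len(sha1_name), len(sha256_name))
```

The difference in identifier length (40 versus 64 hexadecimal digits)
is also what lets the two kinds of name be told apart.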
Documentation/technical/hash-function-transition.txt therefore talks
about SHA-256 and SHA-1 forms and SHA-256 and SHA-1 object names.

  "The sha1-name of an object is the SHA-1 of the concatenation of its
   type, length, a nul byte, and the object's sha1-content. This is the
   traditional <sha1> used in Git to name objects.

   The sha256-name of an object is the SHA-256 of the concatenation of
   its type, length, a nul byte, and the object's sha256-content."

> 2. The data structure representing a Merkle-tree hash becomes
>    a pair consisting of a value and a hash-algorithm tag. An
>    empty tag is interpreted as SHA-1. I will call this entity the
>    "verification hash" and avoid unqualified use of "hash" in the
>    rest of this proposal.

Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
are of different lengths to distinguish them (see the section "Meaning
of signatures" in Documentation/technical/hash-function-transition.txt).

There might be, I think, a problem with "tree" objects. Unlike all
other places, "tree" objects use the binary representation of the hash,
not the hexadecimal textual representation (some consider that a design
mistake).

>
> 3. The initial value of a commit's ID in a live repository is a copy
>    of its verification hash, except in one important case.
>
> 4. When a repository is exported to a stream, the commit-id is dumped
>    with other commit metadata. Thus, anything that can read a stream
>    can resolve commit references in its change comments.
>
> 5. When a stream is imported, if a commit has a commit-id field it
>    overrides the default assignment of the generated verification hash
>    to that field.

I think Documentation/technical/hash-function-transition.txt misses
considerations for the fast-import format (it talks about problems with
submodules, shallow clones, and the currently unsolved problem of
translating notes; it does not talk about git-replace, either).

>
> 6. Commit IDs are free-format and not interpreted by git except
>    as lookup keys.
>    When git changes verification-hash functions,
>    commit IDs do not change.

All right. Looks sensible at first glance.

For security, all references in the Merkle tree of hashes must use a
strong verification hash. This means that you need to be able to refer
to any object, including a commit, by the verification-hash name of its
verification-hash form (where all references inside the object, like
the "parent" and "tree" headers in commit objects, use verification
hashes).

You need to store this commit ID somewhere. The current proposal for
the transitional period in
Documentation/technical/hash-function-transition.txt talks about a
loose object index ($GIT_OBJECT_DIR/loose-object-idx) with the
following format:

  # loose-object-idx
  (sha256-name SP sha1-name LF)*

The packfile index contains separate SHA-1 and SHA-256 indices into the
packfile, providing a fast mapping from a SHA-1 name or SHA-256 name to
the position (index) of the object in the packfile.

Something similar might be needed for the commit ID mapping.

One problem is that neither the loose object index nor the packfile
index is transported alongside the objects. So we may need to put the
commit ID elsewhere...

Note that we cannot put an X-hash identifier into the X-hash object
form, that is, you cannot add an "id" header to an object (though you
might add an "other-id" header, assuming that if the ID is hash-based
it is taken over the other-id form without the other-id header).

  id <sha-1 identifier of this object>
  tree 0fa044a4d161254a3eae0bd06c0452d79e489593
  parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb
  author A U Thor <author@xxxxxxxxxxx> 1558619302 +0200
  committer C O Mitter <committer@xxxxxxxxxxx> 1558628753 -0500

  fixes

> Notice several important properties of this design.
>
> A. Git becomes absolutely future-proofed against hash-algorithm
>    changes. It can even support the use of multiple hash types over
>    the lifetime of one repo.
>
> B. All SHA-1 commit references will resolve forever even after git
>    stops generating them.
>    All future hash-based commit references will
>    also be good forever.

We might need to be able to distinguish commit IDs from the hash-based
object identifier of a commit on the command line, perhaps with
something like

  <commit-id>^{id}

This is similar to the proposed

  git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

> C. The id/verification split will be invisible from clients at start,
>    because initially they coincide and will continue to do so unless
>    an explicit decision changes either the verification-hash algorithm
>    or the way commit-IDs are initialized.

The problem may be with reusing command output as input (to refer to
objects and commits).

>
> D. My wish for forward-portable unique commit IDs is granted.
>    They're not by default eyeball-friendly, but I can live with that.
>    Furthermore, because they're preserved in streams they can be
>    eternally stable even as hash algorithms and preferred ID
>    formats change.

Good.

>
> E. There is now a unique total order on the repo, modulo highly
>    unlikely (and in principle completely avoidable) commit-ID
>    collisions. It's commit date tie-broken by commit-ID sort order.
>    It too survives hash-function changes.

Nice.

>
> F. There's no need for timestamp uniqueness any more.
>
> G. When a repository is imported from (say) Subversion, the Subversion
>    IDs *don't have to break*! They can be used to initialize the
>    commit-ID fields. Many users migrating from other VCSes will be
>    deeply, deeply grateful for this feature.

There would also need to be some support for retrieving commits by
their "commit ID" stable identifiers. It may not need to be very fast.

>
> I believe this solves every problem I walked in with except timestamp
> truncation.

Best,
--
Jakub Narębski