Re: RFC: Separate commit identification from Merkle hashing

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 23 May 2019 21:09:44 +0200

esr@xxxxxxxxxxx (Eric S. Raymond) writes:

> I have been thinking hard about the problems raised during my
> request for unique timestamps.  I think I've found a better way
> to bust the box I was trying to break out of.  I am therefore
> withdrawing that proposal and replacing it with this one.
>
> It's time to separate commit identification from Merkle hashing.

Documentation/technical/hash-function-transition.txt identifies a similar
problem, namely that existing signatures in signed tags, signed commits,
and merges of signed tags are signatures over their SHA-1 form.  We want
to be able to verify those signatures, even if this verification may be
considered less secure now.

You want both more (stable IDs for all commits, not only signed ones)
and less (the commit ID itself does not have to support verification
down the tree).

> One reason I am sure of this is the SHA-1 to whatever transition.
> We can't count on the successor hash to survive attack forever.
> Accordingly, git's design needs to be stable against the possibility
> of having to accommodate multiple future hash algorithms in the
> future.
>
> Here's how to do it:
>
> 1. Commit IDs and Merkle-tree hashes become separate commit
>    properties in the git filesystem.

The issue you need to consider is that for signatures to be secure, they
must be made over the verification-hash Merkle tree.  It is not only
commits that are identified by hashes, but also trees, blobs and tags.

Commits reference other commits ("parent" lines) and a tree ("tree");
trees reference other trees, blobs and possibly commits (if submodules
are used).  Tags can reference any object, but most commonly reference
commits.  Blobs, i.e. file contents, do not reference any other
objects.  For security, all those references should use the strongest
hash function.

Changing the hash used for references (e.g. "parent" using SHA-256
instead of SHA-1) changes the contents of the object, and thus its hash.
Documentation/technical/hash-function-transition.txt therefore talks
about the SHA-256 and SHA-1 forms of an object, and about its SHA-256
and SHA-1 object names:

 "The sha1-name of an object is the SHA-1 of the concatenation of its
  type, length, a nul byte, and the object's sha1-content. This is the
  traditional <sha1> used in Git to name objects.

  The sha256-name of an object is the SHA-256 of the concatenation of its
  type, length, a nul byte, and the object's sha256-content."
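
To make the naming rule concrete, here is a minimal Python sketch (my
own illustration, not code from Git).  The helper name object_name is
hypothetical.  It assumes the usual object header "<type> SP <length>
NUL", and it does not translate references between hash forms, so for
SHA-256 it only gives a true sha256-name for objects that reference
nothing else, such as blobs.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash 'type SP length NUL content' with the given algorithm.

    `content` must already be in the form matching `algo` (the
    sha1-content or the sha256-content of the object).
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


# A blob references no other object, so both forms share the same content
# and only the hash function differs.
empty_blob_sha1 = object_name("blob", b"")
empty_blob_sha256 = object_name("blob", b"", "sha256")
print(empty_blob_sha1)
print(empty_blob_sha256)
```

The first line printed is the familiar name of the empty blob,
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.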

> 2. The data structure representing a Merkle-tree hash becomes
>    a pair consisting of a value and a hash-algorithm tag. An
>    empty tag is interpreted as SHA-1. I will call this entity the
>    "verification hash" and avoid unqualified use of "hash" in the
>    rest of this proposal.

Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
are of different lengths to distinguish them (see the section "Meaning
of signatures" in Documentation/technical/hash-function-transition.txt).
I think there might be a problem with "tree" objects.  Unlike everywhere
else, "tree" objects use the binary representation of the hash rather
than the hexadecimal textual representation (some consider that a design
mistake).
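
A rough Python sketch of why trees are the awkward case (an
illustration only, not Git's parser; the names RAW_LEN, algo_from_hex
and parse_tree are hypothetical).  A textual name can be classified by
its length, but a tree entry is "<mode> SP <name> NUL <raw hash>" with
no delimiter after the raw bytes, so the reader has to know the hash
width, or find an algorithm tag somewhere, before it can split entries.

```python
RAW_LEN = {"sha1": 20, "sha256": 32}  # raw hash sizes in bytes


def algo_from_hex(name: str) -> str:
    """Guess the algorithm of a full hexadecimal object name by its length."""
    if not name or any(c not in "0123456789abcdef" for c in name):
        raise ValueError("not a lowercase hexadecimal object name")
    for algo, raw_len in RAW_LEN.items():
        if len(name) == 2 * raw_len:
            return algo
    raise ValueError("unexpected object name length")


def parse_tree(content: bytes, algo: str):
    """Split tree content into (mode, name, hex hash) entries."""
    raw_len = RAW_LEN[algo]
    entries = []
    pos = 0
    while pos < len(content):
        sp = content.index(b" ", pos)    # the mode never contains a space
        nul = content.index(b"\0", sp)   # the name ends at the first NUL
        end = nul + 1 + raw_len          # the raw hash has no terminator
        if end > len(content):
            raise ValueError("truncated tree entry")
        mode = content[pos:sp].decode("ascii")
        name = content[sp + 1:nul]
        entries.append((mode, name, content[nul + 1:end].hex()))
        pos = end
    return entries


empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
tree = b"100644 README\0" + bytes.fromhex(empty_blob)
print(algo_from_hex(empty_blob))
print(parse_tree(tree, "sha1"))
```

Parsing the same bytes with the wrong width either fails or silently
misaligns every following entry, which is the problem a value plus
algorithm tag pair would have to solve inside trees.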

>
> 3. The initial value of a commit's ID in a live repository is a copy
>    of its verification hash, except in one important case.
>
> 4. When a repository is exported to a stream, the commit-id is dumped
>    with other commit metadata.  Thus, anything that can read a stream
>    can resolve commit references in its change comments.
>
> 5. When a stream is imported, if a commit has a commit-id field it
>    overrides the default assignment of the generated verification hash
>    to that field.

I think Documentation/technical/hash-function-transition.txt misses
considerations for the fast-import format (it talks about the problems
with submodules and shallow clones, and about the currently unsolved
problem of translating notes; it does not talk about git-replace,
either).

>
> 6. Commit IDs are free-format and not interpreted by git except
>    as lookup keys. When git changes verification-hash functions,
>    commit IDs do not change.

All right.  Looks sensible at first glance.

For security, all references in the Merkle tree of hashes must use a
strong verification hash.  This means that you need to be able to refer
to any object, including a commit, by the verification-hash name of its
verification-hash form (the form in which all references inside the
object, such as the "parent" and "tree" headers in commit objects, use
verification hashes).

You need to store this commit ID somewhere.  The current proposal for
the transition period in
Documentation/technical/hash-function-transition.txt describes a loose
object index ($GIT_OBJECT_DIR/loose-object-idx) with the following
format:

  # loose-object-idx
  (sha256-name SP sha1-name LF)*
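
A small Python sketch of reading that format into a two-way mapping
(illustration only; the function name parse_loose_object_idx and the
sample names are made up, and I assume the first line is a "#" header
line as shown above):

```python
def parse_loose_object_idx(text: str):
    """Return (sha256 -> sha1, sha1 -> sha256) dictionaries."""
    sha256_to_sha1 = {}
    sha1_to_sha256 = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the header line and blank lines
        sha256_name, sha1_name = line.split(" ")
        if len(sha256_name) != 64 or len(sha1_name) != 40:
            raise ValueError("malformed loose-object-idx line: " + line)
        sha256_to_sha1[sha256_name] = sha1_name
        sha1_to_sha256[sha1_name] = sha256_name
    return sha256_to_sha1, sha1_to_sha256


sample = (
    "# loose-object-idx\n"
    + "a" * 64 + " " + "b" * 40 + "\n"
    + "c" * 64 + " " + "d" * 40 + "\n"
)
to_sha1, to_sha256 = parse_loose_object_idx(sample)
print(to_sha1["a" * 64])
```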

The packfile index contains separate SHA-1 and SHA-256 indices into the
packfile, providing a fast mapping from an SHA-1 name or SHA-256 name to
the position (index) of the object in the packfile.

Something similar might be needed for the commit ID mapping.

One problem is that neither the loose object index nor the packfile
index is transported along with the objects.  So we may need to put the
commit ID elsewhere...

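Note that we cannot put the X-hash identifier of an object into its
X-hash form, because the identifier is computed over the very contents
that would have to contain it.  That is, you cannot add an "id" header
to the object (though you might add an "other-id" header, assuming that
a hash-based ID is computed over the other form, without the "other-id"
header).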

  id <sha-1 identifier of this object>
  tree 0fa044a4d161254a3eae0bd06c0452d79e489593
  parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb
  author A U Thor <author@xxxxxxxxxxx> 1558619302 +0200
  committer C O Mitter <committer@xxxxxxxxxxx> 1558628753 -0500

  fixes
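
The self-reference problem can be seen in a few lines of Python (a
sketch only; object_name is a hypothetical helper, and the commit text
is the example above without its "id" line):

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash 'type SP length NUL content' with the given algorithm."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


body = (
    b"tree 0fa044a4d161254a3eae0bd06c0452d79e489593\n"
    b"parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb\n"
    b"author A U Thor <author@xxxxxxxxxxx> 1558619302 +0200\n"
    b"committer C O Mitter <committer@xxxxxxxxxxx> 1558628753 -0500\n"
    b"\n"
    b"fixes\n"
)

# Name of the commit before any "id" header is added.
name_without_id = object_name("commit", body)

# Embedding that name changes the contents, and therefore the name.
with_id = b"id " + name_without_id.encode("ascii") + b"\n" + body
name_with_id = object_name("commit", with_id)

print(name_without_id)
print(name_with_id)
```

The two names differ, so the embedded "id" never matches the name of the
object that carries it.  An ID computed over a different form of the
object does not have this problem.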

> Notice several important properties of this design.
>
> A. Git becomes absolutely future-proofed against hash-algorithm
>    changes. It can even support the use of multiple hash types over
>    the lifetime of one repo.
>
> B. All SHA-1 commit references will resolve forever even after git
>    stops generating them.  All future hash-based commit references will
>    also be good forever.

We might need to be able to distinguish commit IDs from the hash-based
object identifiers of commits on the command line, perhaps with
something like

  <commit-id>^{id}

This is similar to the proposed

  git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
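
As a sketch of how such a suffix could be recognized (Python, purely
illustrative; this is not Git's revision parser, the name split_rev is
hypothetical, and it deliberately handles only these three suffixes):

```python
import re

# Matches "<name>^{id}", "<name>^{sha1}" or "<name>^{sha256}".
_SUFFIX = re.compile(r"^(?P<name>.+)\^\{(?P<kind>id|sha1|sha256)\}$")


def split_rev(spec: str):
    """Return (name, kind); kind is None when no known suffix is present."""
    match = _SUFFIX.match(spec)
    if match:
        return match.group("name"), match.group("kind")
    return spec, None


for spec in ("abac87a^{sha1}", "f787cac^{sha256}", "r1234^{id}", "HEAD"):
    print(split_rev(spec))
```

Without a suffix the kind stays undecided, and the caller has to decide
how to resolve the name.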

> C. The id/verification split will be invisible from clients at start,
>    because initially they coincide and will continue to do so unless
>    an explicit decision changes either the verification-hash algorithm
>    or the way commit-IDs are initialized.

One problem may be reusing command output as input (to refer to objects
and commits).

>
> D. My wish for forward-portable unique commit IDs is granted.
>    They're not by default eyeball-friendly, but I can live with that.
>    Furthermore, because they're preserved in streams they can be
>    eternally stable even as hash algorithms and preferred ID
>    formats change.

Good.

>
> E. There is now a unique total order on the repo, modulo highly
>    unlikely (and in priciple completely avoidable) commit-ID
>    collisions. It's commit date tie-broken by commit-ID sort order.
>    It too survives hash-function changes.

Nice.
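
The ordering in (E) amounts to sorting on a two-part key.  A minimal
Python sketch, with hypothetical field names commit_date and commit_id
and made-up sample values:

```python
from typing import NamedTuple


class Commit(NamedTuple):
    commit_date: int  # committer timestamp, seconds since the epoch
    commit_id: str    # free-format stable ID, compared as a plain string


def total_order(commits):
    """Sort by commit date, breaking ties by commit-ID sort order."""
    return sorted(commits, key=lambda c: (c.commit_date, c.commit_id))


history = [
    Commit(1558628753, "r102"),
    Commit(1558619302, "r100"),
    Commit(1558628753, "r101"),  # same second as r102
]
ordered = [c.commit_id for c in total_order(history)]
print(ordered)
```

The order is total only as long as no two commits share both the date
and the commit ID.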

>
> F. There's no need for timestamp uniqueness any more.
>
> G. When a repository is imported from (say) Subversion, the Subversion
>    IDs *don't have to break*!  They can be used to initialize the
>    commit-ID fields. Many users migrating from other VCSes will be
>    deeply, deeply grateful for this feature.

There would also need to be some support for retrieving commits by their
stable "commit ID" identifiers.  It may not need to be very fast.

>
> I believe this solves every problem I walked in with except timestamp
> truncation.

Best,
--
Jakub Narębski