Re: [PATCH v4] technical doc: add a design doc for hash function transition

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 02 Oct 2017 18:02:21 +0900

Jonathan Nieder <jrnieder@xxxxxxxxx> writes:

> +Reading an object's sha1-content
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +The sha1-content of an object can be read by converting all newhash-names
> +its newhash-content references to sha1-names using the translation table.

Sure.

> +Fetch
> +~~~~~
> +Fetching from a SHA-1 based server requires translating between SHA-1
> +and NewHash based representations on the fly.
> +
> +SHA-1s named in the ref advertisement that are present on the client
> +can be translated to NewHash and looked up as local objects using the
> +translation table.
> +
> +Negotiation proceeds as today. Any "have"s generated locally are
> +converted to SHA-1 before being sent to the server, and SHA-1s
> +mentioned by the server are converted to NewHash when looking them up
> +locally.

Any of our alternate object store by definition is a NewHash
repository--otherwise we'd violate "no mixing" rule.  It may or may
note have the translation table for its objects.  If it no longer
has the translation table (because it migrated to NewHash only world
before we did), then we can still use it as our alternate but we
cannot use it for the purpose of common ancestore discovery.

> +After negotiation, the server sends a packfile containing the
> +requested objects.

s/objects.$/& These are all SHA-1 contents./

> +We convert the packfile to NewHash format using
> +the following steps:
> +
> +1. index-pack: inflate each object in the packfile and compute its
> +   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
> +   objects the client has locally. These objects can be looked up
> +   using the translation table and their sha1-content read as
> +   described above to resolve the deltas.

That procedure would give us the object's SHA-1 contents for
ref-delta objects.  For an ofs-delta object, by definition, its base
object should appear in the same packstream, so we should eventually
be able to get to the SHA-1 contents of the delta base, and from
there we can apply the delta to obtain the SHA-1 contents.  For a
non-delta object, we already have its SHA-1 contents in the
packstream.

So we can get SHA-1 names and SHA-1 contents of each and every
object in the packstream in this step.

Are we actually writing out a .pack/.idx pair that is usable in the
SHA-1 world at this stage?  Or are we going to read from something
we keep in-core in the step #3 below?

> +2. topological sort: starting at the "want"s from the negotiation
> +   phase, walk through objects in the pack and emit a list of them,
> +   excluding blobs, in reverse topologically sorted order, with each
> +   object coming later in the list than all objects it references.
> +   (This list only contains objects reachable from the "wants". If the
> +   pack from the server contained additional extraneous objects, then
> +   they will be discarded.)

Presumably this is a list of SHA-1 names, as we do not yet have
enough information to compute NewHash names yet at this point.  May
want to spell it out here.

Would it discard the auto-followed tags if we do the "traverse from
wants only"?  Traversing the objects in the packfile to find the
"tips" that are not referenced from any other object in the pack
might be necessary, and it shouldn't be too costly, I'd guess.

> +3. convert to newhash: open a new (newhash) packfile. Read the topologically
> +   sorted list just generated. For each object, inflate its
> +   sha1-content, convert to newhash-content, and write it to the newhash
> +   pack. Record the new sha1<->newhash mapping entry for use in the idx.

Are we doing any deltification here?  If we are computing .pack/.idx
pair that can be usable in the SHA-1 world in step #1, then reusing
blob deltas should be trivial (a good delta-base in the SHA-1 world
is a good delta-base in the NewHash world, too).  Things that have
outgoing references like trees, it might be possible that such a
heuristic may not give us the absolute best delta-base, but I guess
it would still be a good approximation to reuse the delta/base
object relationship in SHA-1 world to NewHash world, assuming that
the server did a good job choosing the bases.

> +4. sort: reorder entries in the new pack to match the order of objects
> +   in the pack the server generated and include blobs. Write a newhash idx
> +   file

OK.

> +5. clean up: remove the SHA-1 based pack file, index, and
> +   topologically sorted list obtained from the server in steps 1
> +   and 2.

Ah, OK, so we do write the SHA_1 pack/idx in the first step.  OK.

> +Push
> +~~~~
> +Push is simpler than fetch because the objects referenced by the
> +pushed objects are already in the translation table. The sha1-content
> +of each object being pushed can be read as described in the "Reading
> +an object's sha1-content" section to generate the pack written by git
> +send-pack.

OK.

> +Signed Commits
> +~~~~~~~~~~~~~~
> +We add a new field "gpgsig-newhash" to the commit object format to allow
> +signing commits without relying on SHA-1. It is similar to the
> +existing "gpgsig" field. Its signed payload is the newhash-content of the
> +commit object with any "gpgsig" and "gpgsig-newhash" fields removed.

Do we prepare for newerhash, too?  IOW, should the signed payload be
the newhash-contents with any field whose name is "gpgsig" or begins
with "gpgsig-" followed by anything?

> +This means commits can be signed
> +1. using SHA-1 only, as in existing signed commit objects
> +2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
> +   fields.
> +3. using only NewHash, by only using the gpgsig-newhash field.
> +
> +Old versions of "git verify-commit" can verify the gpgsig signature in
> +cases (1) and (2) without modifications and view case (3) as an
> +ordinary unsigned commit.

For old clients to be able to verify (2), signed payload for SHA-1
is everything in SHA-1 contents minus "gpgsig"; "gpgsig-newhash"
should not get excluded from the computation.  Am I correct?

I am primarily finding it a bit disturbing that there is a bit of
asymmetry here.

> +Signed Tags
> +~~~~~~~~~~~

This message stops here for now.