State of NewHash work, future directions, and discussion

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Sat, 9 Jun 2018 20:56:28 +0000

Since there's been a lot of questions recently about the state of the
NewHash work, I thought I'd send out a summary.

== Status

I have patches to make the entire codebase work, including passing all
tests, when Git is converted to use a 256-bit hash algorithm.
Obviously, such a Git is incompatible with the current version, but it
means that we've fixed essentially all of the hard-coded 20 and 40
constants (and therefore Git doesn't segfault).

I'm working on getting a 256-bit Git to work with SHA-1 being the
default.  Currently, this involves doing things like writing transport
code, since in order to clone a repository, you need to be able to set
up the hash algorithm correctly.  I know that this was a non-goal in the
transition plan, but since the testsuite doesn't pass without it, it's
become necessary.

Some of these patches will be making their way to the list soon.
They're hanging out in the normal places in the object-id-part14 branch
(which may be rebased).

== Future Design

The work I've done necessarily involves porting everything to use
the_hash_algo.  Essentially, when the piece I'm currently working on is
complete, we'll have a transition stage 4 implementation (all NewHash).
Stage 2 and 3 will be implemented next.

My vision of how data is stored is that the .git directory is, except
for pack indices and the loose object lookup table, entirely in one
format.  It will be all SHA-1 or all NewHash.  This algorithm will be
stored in the_hash_algo.

I plan on introducing an array of hash algorithms into struct repository
(and wrapper macros) which stores, in order, the output hash, and if
used, the additional input hash.

Functions like get_oid_hex and parse_oid_hex will acquire an internal
version, which knows about parsing things (like refs) in the internal
format, and one which knows about parsing in the UI formats.  Similarly,
oid_to_hex will have an internal version that handles data in the .git
directory, and an external version that produces data in the output
format.  Translation will take place at the outer edges of the program.

The transition plan anticipates a stage 1 where accept only SHA-1 on
input and produce only SHA-1 on output, but store in NewHash.  As I've
worked with our tests, I've realized such an implementation is not
entirely possible.  We have various tools that expect to accept invalid
object IDs, and obviously there's no way to have those continue to work.
We'd have to either reject invalid data in such a case or combine stages
1 and 2.

== Compatibility with this Work

If you're working on new features and you'd like to implement the best
possible compatibility with this work, here are some recommendations:

* Assume everything in the .git directory but pack indices and the loose
  object index will be in the same algorithm and that that algorithm is
  the_hash_algo.
* For the moment, use the_hash_algo to look up the size of all
  hash-related constants.  Use GIT_MAX_* for allocations.
* If you are writing a new data format, add a version number.
* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo.  It's
  designed specifically for that purpose.
* You can safely assume that the_hash_algo will be suitably initialized
  to the correct algorithm for your repository.
* Keep using the object ID functions and struct object_id.
* Try not to use mmap'd structs for reading and writing formats on disk,
  since these are hard to make hash size agnostic.

== Discussion about an Actual NewHash

Since I'll be writing new code, I'll be writing tests for this code.
However, writing tests for creating and initializing repositories
requires that I be able to test that objects are being serialized
correctly, and therefore requires that I actually know what the hash
algorithm is going to be.  I also can't submit code for multi-hash packs
when we officially only support one hash algorithm.

I know that we have long tried to avoid discussing the specific
algorithm to use, in part because the last discussion generated more
heat than light, and settled on referring to it as NewHash for the time
being.  However, I think it's time to pick this topic back up, since I
can't really continue work in this direction without us picking a
NewHash.

If people are interested, I've done some analysis on availability of
implementations, performance, and other attributes described in the
transition plan and can send that to the list.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
Attachment:
signature.asc

Description: PGP signature