Re: State of NewHash work, future directions, and discussion

Duy Nguyen <pclouds@xxxxxxxxx> · Mon, 11 Jun 2018 20:09:47 +0200

On Sat, Jun 9, 2018 at 10:57 PM brian m. carlson
<sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.
>
> == Status
>
> I have patches to make the entire codebase work, including passing all
> tests, when Git is converted to use a 256-bit hash algorithm.
> Obviously, such a Git is incompatible with the current version, but it
> means that we've fixed essentially all of the hard-coded 20 and 40
> constants (and therefore Git doesn't segfault).

This is so cool!

> == Future Design
>
> The work I've done necessarily involves porting everything to use
> the_hash_algo.  Essentially, when the piece I'm currently working on is
> complete, we'll have a transition stage 4 implementation (all NewHash).
> Stage 2 and 3 will be implemented next.
>
> My vision of how data is stored is that the .git directory is, except
> for pack indices and the loose object lookup table, entirely in one
> format.  It will be all SHA-1 or all NewHash.  This algorithm will be
> stored in the_hash_algo.
>
> I plan on introducing an array of hash algorithms into struct repository
> (and wrapper macros) which stores, in order, the output hash, and if
> used, the additional input hash.

I'm actually thinking that putting the_hash_algo inside struct
repository is a mistake. We have code that's supposed to work without
a repo and it shows this does not really make sense to forcefully use
a partially-valid repo. Keeping the_hash_algo a separate variable
sounds more elegant.

> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.

I quickly skimmed through that document. I have two more concerns that
are less about any specific hash algorithm:

- how does larger hash size affects git (I guess you covered cpu
aspect, but what about cache-friendliness, disk usage, memory
consumption)

- how does all the function redirection (from abstracting away SHA-1)
affects git performance. E.g. hashcmp could be optimized and inlined
by the compiler. Now it still probably can optimize the memcmp(,,20),
but we stack another indirect function call on top. I guess I might be
just paranoid and this is not a big deal after all.
-- 
Duy