Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 16 Jun 2017 14:24:14 -0700

Junio C Hamano wrote:
> Junio C Hamano <gitster@xxxxxxxxx> writes:
>> Adam Langley <agl@xxxxxxxxxx> writes:

>>> However, as I'm not a git developer, I've no opinion on whether the
>>> cost of carrying implementations of these functions is worth the speed
>>> vs using SHA-256, which can be assumed to be supported everywhere
>>> already.
>>
>> Thanks.
>>
>> My impression from this thread is that even though fast may be
>> better than slow, ubiquity trumps it for our use case, as long as
>> the thing is not absurdly and unusably slow, of course.  Which makes
>> me lean towards something older/more established like SHA-256, and
>> it would be a very nice bonus if it gets hardware acceleration more
>> widely than others ;-)
>
> Ah, I recall one thing that was mentioned but not discussed much in
> the thread: possible use of tree-hashing to exploit multiple cores
> hashing a large-ish payload.  As long as it is OK to pick a sound
> tree hash coding on top of any (secure) underlying hash function,
> I do not think the use of tree-hashing should affect which exact
> underlying hash function is to be used, and I also am not convinced
> that we really want tree hashing (some codepaths that deal with a large
> payload want to stream the data in a single pass from head to tail)
> in the context of Git, but I am not a crypto person, so ...

Tree hashing also affects single-core performance: the independent
subtrees can be hashed side by side in the lanes of SIMD instructions.

That is how software implementations of e.g. blake2bp-256 and
SHA-256x16[1] are able to compete with (and, at least in some cases,
slightly outperform) hardware implementations of SHA-256.
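
To make the structure concrete, here is a toy two-level tree hash on
top of SHA-256.  It is only a sketch: the leaf size, the 0x00/0x01
domain-separation prefixes, and the name tree_hash are my own
assumptions for illustration, and this is neither the blake2bp nor the
SHA-256x16 construction, nor a vetted one.  The point is that the leaf
digests do not depend on each other, so an implementation is free to
compute them in SIMD lanes or on separate cores.

```python
import hashlib

LEAF_SIZE = 4096  # hypothetical leaf size, chosen only for illustration


def tree_hash(data: bytes, leaf_size: int = LEAF_SIZE) -> bytes:
    """Toy two-level tree hash over SHA-256 (not a vetted construction)."""
    # Split the payload into fixed-size leaves; an empty payload is one empty leaf.
    leaves = [data[i:i + leaf_size] for i in range(0, len(data), leaf_size)] or [b""]
    # Each leaf digest is independent of the others, so these calls could
    # run in parallel. The 0x00 prefix marks "leaf" input.
    leaf_digests = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    # The root hashes the concatenated leaf digests. The 0x01 prefix marks
    # "inner node" input so a leaf can never be confused with a node.
    return hashlib.sha256(b"\x01" + b"".join(leaf_digests)).digest()
```

Note that the result differs from plain SHA-256 of the same bytes and
depends on the leaf size, so those parameters would have to be fixed as
part of the format.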

It is also satisfying that we have options like these that are faster
than SHA-1.

All that said, SHA-256 seems like a fine choice, even though it is
slower than those options.  The wide availability of reasonable-quality
implementations (e.g. in Java you can use
'MessageDigest.getInstance("SHA-256")') makes it a very tempting one.
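
The same holds in other standard libraries.  As a small sketch, Python's
hashlib provides SHA-256 out of the box, and its update() interface
covers the single-pass streaming case mentioned above.  The helper name
sha256_stream is hypothetical, not anything from Git.

```python
import hashlib


def sha256_stream(chunks):
    """Hash an iterable of byte chunks in one pass, head to tail."""
    h = hashlib.sha256()
    for chunk in chunks:
        # Feeding the data piece by piece gives the same digest as
        # hashing the concatenation in one call.
        h.update(chunk)
    return h.hexdigest()
```

For example, sha256_stream([b"a", b"bc"]) returns the same digest as
hashing b"abc" in a single call.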

Part of the reason I suggested previously that it would be helpful to
try to benchmark Git with various hash functions (which didn't go over
well, for some reason) is that it makes these comparisons more
concrete.  Without measuring, it is hard to get a sense of the
distribution of input sizes and how much practical effect the
differences we are talking about have.
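
As a rough illustration of the kind of measurement I mean, the sketch
below times a few hash functions from Python's hashlib at several input
sizes.  The sizes and the names throughput_mb_s and compare are
assumptions made up for this example; it measures whatever backs
hashlib on the machine at hand, not Git, and it says nothing about the
real distribution of object sizes in a repository.

```python
import hashlib
import time


def throughput_mb_s(name, size, target_bytes=1_000_000):
    """Approximate throughput (MB/s) of hashlib.<name> on inputs of `size` bytes."""
    data = bytes(size)
    hash_fn = getattr(hashlib, name)
    # Hash roughly target_bytes in total so small inputs are timed over
    # many calls rather than a single very short one.
    iterations = max(1, target_bytes // size)
    start = time.perf_counter()
    for _ in range(iterations):
        hash_fn(data).digest()
    # Guard against a zero reading on coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size * iterations / elapsed / 1e6


def compare(names=("sha1", "sha256", "blake2b"), sizes=(64, 4096, 1 << 20)):
    """Return {(name, size): MB/s} for every combination of name and size."""
    return {(name, size): throughput_mb_s(name, size)
            for name in names for size in sizes}


if __name__ == "__main__":
    for (name, size), rate in compare().items():
        print(f"{name:8s} {size:8d} bytes  {rate:10.1f} MB/s")
```

Small inputs tend to be dominated by per-call overhead rather than by
the compression function, which is one reason the size distribution
matters.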

Thanks,
Jonathan

[1] https://eprint.iacr.org/2012/476.pdf