Re: [PATCH v2 0/4] Hash Abstraction

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 30 Oct 2017 09:45:38 -0700

On Sat, Oct 28, 2017 at 11:12 AM, brian m. carlson
<sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
> This is a series proposing a basic abstraction for hash functions.
>
> As we get closer to converting the remainder of the codebase to use
> struct object_id, we should think about the design we want our hash
> function abstraction to take.  This series is a proposal for one idea.
> Input on any aspect of this proposal is welcome.

I like the series/idea.

> This series exposes a struct git_hash_algo that contains basic
> information about a given hash algorithm that distinguishes it from
> other algorithms: name, identifiers, lengths, implementing functions,
> and empty tree and blob constants.  It also exposes an array of hash
> algorithms, and a constant for indexing them.
>
> The series also demonstrates a simple conversion using the abstraction
> over empty blob and tree values.

That looks good, though I wonder if we'll have to give a way a little
performance during the transition period (or even indefinitely, as the
hash abstraction won't go away; we'd rather want to keep it in place
long term). I guess we can measure after all is said and done,
no big deal.

>
> In order to avoid conflicting with the struct repository work and with
> the goal of avoiding global variables as much as possible, I've pushed
> the hash algorithm into struct repository and exposed it via a #define.

If you are referring to that long series[1] that Jonathan and I were working
on, then no worries.  Unfortunately that series got some delays, but I rebased
its prepared successors (of over 100 patches) on Friday and it was surprisingly
easy to do so; just tedious.

[1] https://public-inbox.org/git/20170830064634.GA153983@xxxxxxxxxxxxxxxxxxxxxxxxx/

>
> I propose this series now as it will inform the way we go about
> converting other parts of the codebase, especially some of the pack
> algorithms.

Thanks for doing this now.

> Because we share some hash computation code between pack
> checksums and object hashing, we need to decide whether to expose pack
> checksums as struct object_id, even though they are technically not
> object IDs.

I think we used sha1 as checksums, because it was a hash function easily
accessible, not because its other properties were the best to do its job.
So I am undecided if we want to just "keep the same hash function for
checksum as well as object hashing" or pickup another checksum, that
actually *only* does checksumming (we don't need the cryptographic
properties, such that an easy hash [such as crc, djb, fnv or murmur]
would do; raw speed, barely protecting the pack file against bit flips).

For the sake of symmetry I tend to go with the former, i.e use the object
hash function for pack files, too.

>  Furthermore, if we end up needing to stuff an algorithm
> value into struct object_id, we'll no longer be able to directly
> reference object IDs in a pack without a copy.
>
> I've updated this series in some significant ways to reflect and better
> implement the transition plan as it's developed.  If there are ways
> in which this series (or future series) can converge better on the
> transition plan, that input would be valuable.
>
> This series is available from the usual places as branch hash-struct,
> based against master as of 2.15-rc2.

Thanks,
Stefan