Re: [PATCH 03/15] cache: add an algo member to struct object_id

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sun, 11 Apr 2021 13:55:57 +0200

On Sat, Apr 10 2021, brian m. carlson wrote:

> Now that we're working with multiple hash algorithms in the same repo,
> it's best if we label each object ID with its algorithm so we can
> determine how to format a given object ID. Add a member called algo to
> struct object_id.
>
> Signed-off-by: brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx>
> ---
>  hash.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/hash.h b/hash.h
> index 3fb0c3d400..dafdcb3335 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>  
>  struct object_id {
>  	unsigned char hash[GIT_MAX_RAWSZ];
> +	int algo;

A question out of curiosity, since I'm not nearly as familiar with the
multi-hash support as you are:

So struct object_id is GIT_MAX_RAWSZ, not two types of structs for
GIT_SHA1_RAWSZ and GIT_SHA256_RAWSZ. That pre-dates this series because
we'd like to not deal with two types of objects everywhere for SHA-1 and
SHA-256. Makes sense.

Before this series we'd memcmp them up to their actual length, but the
last GIT_MAX_RAWSZ-GIT_SHA1_RAWSZ bytes would be uninitialized.

Now we pad them out, so the last 96 bits of every SHA-1 are 0000....
Couldn't we also tell which hash an object uses by memcmp-ing those last
N bits and seeing if they're all zeroed?

Feels a bit hackish, and we'd need to reconsider that method if we'd
ever support other same-length hashes.
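
To make the trailing-zero check concrete, here is a minimal sketch. All
sketch_* names and constants are hypothetical stand-ins for
GIT_SHA1_RAWSZ, GIT_MAX_RAWSZ and struct object_id, not git's actual
code, and it assumes SHA-1 IDs are always zero-padded out to the full
buffer:

```c
#include <string.h>

/* Hypothetical stand-ins for GIT_SHA1_RAWSZ / GIT_SHA256_RAWSZ / GIT_MAX_RAWSZ. */
#define SKETCH_SHA1_RAWSZ   20
#define SKETCH_SHA256_RAWSZ 32
#define SKETCH_MAX_RAWSZ    SKETCH_SHA256_RAWSZ

enum sketch_algo { SKETCH_ALGO_SHA1 = 1, SKETCH_ALGO_SHA256 = 2 };

/* An object ID without an explicit algo member. */
struct sketch_oid {
	unsigned char hash[SKETCH_MAX_RAWSZ];
};

/*
 * Guess the algorithm from the padding alone: if the 12 bytes (96 bits)
 * past the SHA-1 length are all zero, assume SHA-1, otherwise SHA-256.
 */
static int sketch_guess_algo(const struct sketch_oid *oid)
{
	/* Static storage duration, so this is zero-initialized. */
	static const unsigned char zeros[SKETCH_MAX_RAWSZ - SKETCH_SHA1_RAWSZ];

	if (!memcmp(oid->hash + SKETCH_SHA1_RAWSZ, zeros, sizeof(zeros)))
		return SKETCH_ALGO_SHA1;
	return SKETCH_ALGO_SHA256;
}
```

Two limits of the sketch: a SHA-256 ID whose last 96 bits happen to be
zero would be misclassified, and an all-zero ID is ambiguous, since it
looks the same under either algorithm.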

But OTOH having these objects all padded out in memory to the same
length, while still having to carry around a "what hash algo is it"
member, yields the arguably weird hack of having a per-hash NULL_OID,
which has never been an actual object of any hash type, but just a
pseudo-object.

As another aside I had some local patches (just for playing around) to
implement SHA-256/160, i.e. a SHA-256 truncated to SHA-1 length, which
doesn't officially exist. We'd store things as full-length SHA-256 internally,
but on anything that would format them (including plumbing output) we'd
emit the truncated version(s).
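
As an illustration of "full length internally, truncated on output",
here is a minimal sketch. The function name and constants are
hypothetical and are not taken from those local patches; the sketch
assumes truncation simply keeps the leading 20 bytes of the 32-byte
hash:

```c
#include <stddef.h>

/* Hypothetical sizes: full SHA-256 storage, SHA-1-length output. */
#define SKETCH_FULL_RAWSZ  32
#define SKETCH_TRUNC_RAWSZ 20

/*
 * Format only the first SKETCH_TRUNC_RAWSZ bytes of a full-length hash
 * as lowercase hex. "out" must have room for 2 * SKETCH_TRUNC_RAWSZ + 1
 * bytes, including the terminating NUL.
 */
static void sketch_hash_to_hex_truncated(const unsigned char *hash, char *out)
{
	static const char hex[] = "0123456789abcdef";
	size_t i;

	for (i = 0; i < SKETCH_TRUNC_RAWSZ; i++) {
		out[2 * i]     = hex[hash[i] >> 4];
		out[2 * i + 1] = hex[hash[i] & 0xf];
	}
	out[2 * SKETCH_TRUNC_RAWSZ] = '\0';
}
```

Here the internal length (32 bytes) and the formatting length (20 bytes,
40 hex characters) differ, which is the case that code assuming the two
are equal would trip over.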

The idea was to support Git/SHA-256 when combined with legacy systems
that would all need DB column changes to handle hashes of a different
length.

I abandoned it as insane silliness after playing with it for about a
day, but it did reveal that much of the hash code now can assume
internal length == formatting length, which is why I'm 3 paragraphs into
this digression, i.e. maybe some of the code structure also makes having
a NULL_OID always be 256-bits when we want to format it as 160/256
painful...