Re: [PATCH 03/15] cache: add an algo member to struct object_id

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Sun, 11 Apr 2021 21:37:21 +0000

On 2021-04-11 at 11:55:57, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sat, Apr 10 2021, brian m. carlson wrote:
> 
> > Now that we're working with multiple hash algorithms in the same repo,
> > it's best if we label each object ID with its algorithm so we can
> > determine how to format a given object ID. Add a member called algo to
> > struct object_id.
> >
> > Signed-off-by: brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx>
> > ---
> >  hash.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/hash.h b/hash.h
> > index 3fb0c3d400..dafdcb3335 100644
> > --- a/hash.h
> > +++ b/hash.h
> > @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
> >  
> >  struct object_id {
> >  	unsigned char hash[GIT_MAX_RAWSZ];
> > +	int algo;
> 
> Curiosity since I'm not as familiar as you with the multi-hash support
> by far:
> 
> So struct object_id is GIT_MAX_RAWSZ, not two types of structs for
> GIT_SHA1_RAWSZ and GIT_SHA256_RAWSZ. That pre-dates this series because
> we'd like to not deal with two types of objects everywhere for SHA-1 and
> SHA-256. Makes sense.
> 
> Before this series we'd memcmp them up to their actual length, but the
> last GIT_MAX_RAWSZ-GIT_SHA1_RAWSZ would be uninitialized
> 
> Now we pad them out, so the last 96 bits of every SHA1 are 0000...;
> Couldn't we also tell which hash an object is by memcmp-ing those last N
> bits and see if they're all zero'd?

That makes a lot of assumptions about the security of the hash
algorithm that I don't want to make here.  If anyone can ever find a
SHA-256 hash with trailing 96 bits zero, then they can confuse it with a
SHA-1 hash.  That means that our security level goes from 128 bits to 96
bits.  It's also a nonstandard construction.

More importantly, it results in the null OID being treated as a SHA-1
OID.  Because we do print the null OID in some cases, we're going to
break a lot of output formats if we print all the rest of the OIDs with
64 characters and then the null OID with 40.  That's to say nothing of
the problems in binary formats.

The reason we pad these objects is because our hashmaps are broken if we
don't.  I don't remember all the gory details, but it was obvious to me
that if they weren't consistently equal, things were going to be broken.
That's the only reason, not theoretical purity.

> Feels a bit hackish, and we'd need to reconsider that method if we'd
> ever support other same-length hashes.

My hope is that we don't need to do this, but we do have SHA-3 to serve
as a backup for SHA-2.  If quantum computers don't progress
substantially, SHA-3-256 is definitely a viable candidate for
replacement if anything ever happens to SHA-256.

> But OTOH having these objects all padded out in memory to the same
> length, but having to carry around a "what hash algo" is it yields the
> arguably weird hack of having a per-hash NULL_OID, which has never been
> an actual object of any hash type, but just a pseudo-object.

Unfortunately, as I mentioned above, we need to have two null OIDs to
handle printing things out.  It's inconvenient, I agree.

> I abandoned it as insany sillyness after playing with it for about a
> day, but it did reveal that much of the hash code now can assume
> internal length == formatting length, which is why I'm 3 paragraphs into
> this digression, i.e. maybe some of the code structure also makes having
> a NULL_OID always be 256-bits when we want to format it as 160/256
> painful...

We'll always format based on the algorithm in the OID.  That's the
simplest way to make things work because unfortunately we may end up
with both types of OIDs in the same code paths (as we're converting one
to the other) and otherwise our printing functions need a lot of special
handling and even more variants than they have today.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US
Attachment:
signature.asc

Description: PGP signature