Re: RFC v3: Another proposed hash function transition plan

Hi Brandon,

On Mon, 11 Sep 2017, Brandon Williams wrote:

> On 09/08, Junio C Hamano wrote:
> > Junio C Hamano <gitster@xxxxxxxxx> writes:
> > 
> > > One thing I still do not know how I feel about after re-reading the
> > > thread, and I didn't find the above doc, is Linus's suggestion to
> > > use the objects themselves as NewHash-to-SHA-1 mapper [*1*].  
> > > ...
> > > [Reference]
> > >
> > > *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@xxxxxxxxxxxxxx>
> > 
> > I think this falls into the same category as the often-talked-about
> > addition of the "generation number" field.  It is very tempting to add
> > these "mechanically derivable but expensive to compute" pieces of
> > information to the sha3-content while converting from sha1-content and
> > creating anew.  
> 
> We didn't discuss that in the doc since this particular transition plan
> we made uses an external NewHash-to-SHA1 map instead of an internal one
> because we believe that at some point we would be able to drop
> compatibility with SHA1.

Is there even a question about that? I mean, why would *any* project that
switches entirely to SHA-256 want to carry the SHA-1 baggage around?

So even if the code to generate a bidirectional old <-> new hash mapping
might be with us forever, it *definitely* should be optional ("optional"
at least as in "config setting"), allowing developers who only work with
new-hash repositories to save the time and electrons.
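
To illustrate what "optional" could mean here, a minimal sketch in Python rather than Git's C. The names `hash_blob`, `HashMap` and `compat_sha1` are made up for this example and are not part of Git or of the transition plan. It only covers blobs, whose content is identical in both hash worlds; trees and commits would need their embedded object names rewritten first.

```python
import hashlib


def hash_blob(data: bytes, algo: str) -> str:
    """Name a blob the way Git does: hash of "blob <size>\\0" followed by the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.new(algo, header + data).hexdigest()


class HashMap:
    """Hypothetical bidirectional new-hash <-> SHA-1 map (not Git's on-disk format)."""

    def __init__(self, compat_sha1: bool) -> None:
        # Assumption: stands in for a config setting that enables SHA-1 compatibility.
        self.compat_sha1 = compat_sha1
        self.new_to_old = {}
        self.old_to_new = {}

    def add_blob(self, data: bytes) -> str:
        new_name = hash_blob(data, "sha256")
        if self.compat_sha1:
            # Only pay for the second hash when compatibility is requested.
            old_name = hash_blob(data, "sha1")
            self.new_to_old[new_name] = old_name
            self.old_to_new[old_name] = new_name
        return new_name
```

With `compat_sha1=False`, each object is hashed exactly once and both maps stay empty, which is the saving argued for above.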

> Now I suspect that won't happen for a long time but I think it would be
> preferable over carrying the SHA1 luggage indefinitely.

It should be possible to push the SHA-1 genie back into a small bottle
inside Git's source code, so to speak, i.e. encapsulate it to the point
where it is a compile-time option, in addition to a runtime option.

Of course, that only works as long as the SHA-1 calculation is not made
mandatory, as suggested above. I really shudder at the idea of requiring
SHA-1 forever. We ignored advice in 2005 against making ourselves too
dependent on SHA-1, and I would hope that we learn from this.

> At some point, then, we would be able to stop hashing objects twice
> (once with SHA1 and once with NewHash) instead of always requiring that
> we hash them with each hash function which was used historically.

Yes, please.

> > Because the "sha1-name" or the "generation number" can mechanically
> > be computed,

... as long as you do not have a shallow clone, of course...

> > as long as everybody agrees to _always_ place them in the
> > sha3-content, the same sha1-content will be converted into exactly the
> > same sha3-content without ambiguity, and converting them back to
> > sha1-content while pushing to an older repository will correctly
> > produce the original sha1-content, as it would just be the matter of
> > simply stripping these extra pieces of information.

... or Git would simply handle the absence of the generation number header
gracefully, so that sha1-content == sha3-content...

> > The same thing could happen if we decide to bake "generation number"
> > in the SHA-3 commit objects.  One possible definition would be that a
> > root commit will have gen #0; a commit with 1 or more parents will get
> > max(parents' gen numbers) + 1 as its gen number.  But somebody may
> > botch the counting and record sum(parents' gen numbers) as its gen
> > number.
> > 
> > In these cases, not just the SHA3-content but also the resulting SHA-3
> > object name would be different from the name of the object that would
> > have recorded the same contents correctly.  So converting back to
> > SHA-1 world from these botched SHA-3 contents may produce the original
> > contents, but we may end up with multiple "plausible-looking" sets of
> > SHA-3 objects that (claim to) correspond to a single SHA-1 object,
> > only one of which is a valid one.
> > 
> > Our "git fsck" already treats certain brokenness (like a tree whose
> > entry has mode that is 0-padded to the left) as broken but still
> > tolerates them.  I am not sure if it is sufficient to diagnose and
> > declare broken and invalid when we see sha3-content that records
> > these "mechanically derivable but expensive to compute" pieces of
> > information incorrectly.
> > 
> > I am leaning towards saying "yes, catching in fsck is enough" and
> > suggesting to add generation number to sha3-content of the commit
> > objects, and to add even the "original sha1 name" thing if we find
> > good use of it.  But I cannot shake this nagging feeling off that I
> > am missing some huge problems that adding these fields and opening
> > ourselves to more classes of broken objects.
> > 
> > Thoughts?

Seeing as current Git versions would always ignore the generation number
(and therefore work perfectly even with erroneous baked-in generation
numbers), and seeing as it would be easy to add a config option to force
Git to ignore the embedded generation numbers, I would consider `fsck`
catching those problems the best idea.
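
As a rough sketch of such a check, using the definition quoted above (root commits get 0, everything else gets max(parents' gen numbers) + 1): this is Python for illustration only, not Git code, and `generation_numbers` and `check_recorded` are hypothetical names. It assumes the full, acyclic history is available, i.e. no shallow clone.

```python
def generation_numbers(parents):
    """Return {commit: gen}: roots get 0, others max(parents' gens) + 1.

    `parents` maps each commit name to the list of its parents' names.
    """
    gens = {}
    for start in parents:
        stack = [start]  # explicit stack, so deep histories do not hit recursion limits
        while stack:
            commit = stack[-1]
            if commit in gens:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in gens]
            if missing:
                # Parents must be numbered before the commit itself.
                stack.extend(missing)
            else:
                gens[commit] = 1 + max((gens[p] for p in parents[commit]), default=-1)
                stack.pop()
    return gens


def check_recorded(parents, recorded):
    """Return the commits whose recorded generation number is not the derived one."""
    expected = generation_numbers(parents)
    return sorted(c for c, g in recorded.items() if expected[c] != g)


history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["C", "D"]}
# E botched as sum(parents' gens) = 2 + 2 instead of max(2, 2) + 1.
botched = {"A": 0, "B": 1, "C": 2, "D": 2, "E": 4}
print(check_recorded(history, botched))  # ['E']
```

The check is cheap once the numbers of all ancestors are known, so the cost is in walking the history, not in the comparison itself.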

It seems that every major Git hoster already has some sort of fsck on the
fly for newly-pushed objects, so that would be another "line of defense".

Taking a step back, though, it may be a good idea to leave the generation
number business for later, as much fun as it is to get sidetracked and
focus on relatively trivial stuff instead of the far more difficult and
complex task of getting the transition plan to a new hash ironed out.

For example, I am still in favor of SHA-256 over SHA3-256, after learning
some background details from in-house cryptographers: it provides
essentially the same level of security, according to my sources, while
hardware support seems to be coming to SHA-256 a lot sooner than to
SHA3-256.

Which hash algorithm to choose is a tough question to answer, and
discussing generation numbers will sadly not help us answer it any quicker.

Ciao,
Dscho


