Re: SHA1 collisions found

Mike Hommey <mh@xxxxxxxxxxxx> · Sun, 26 Feb 2017 07:09:44 +0900

On Sat, Feb 25, 2017 at 02:26:56PM -0500, Jeff King wrote:
> On Sat, Feb 25, 2017 at 06:50:50PM +0000, brian m. carlson wrote:
> 
> > > As long as the reader can tell from the format of object names
> > > stored in the "new object format" object from what era is being
> > > referred to in some way [*1*], we can name new objects with only new
> > > hash, I would think.  "new refers only to new" that stratifies
> > > objects into older and newer may make things simpler, but I am not
> > > convinced yet that it would give our users a smooth enough
> > > transition path (but I am open to be educated and persuaded the
> > > other way).
> > 
> > I would simply use multihash[0] for this purpose.  New-style objects
> > serialize data in multihash format, so it's immediately obvious what
> > hash we're referring to.  That makes future transitions less
> > problematic.
> > 
> > [0] https://github.com/multiformats/multihash
> 
> I looked at that earlier, because I think it's a reasonable idea for
> future-proofing. The first byte is a "varint", but I couldn't find where
> they defined that format.
> 
> The closest I could find is:
> 
>   https://github.com/multiformats/unsigned-varint
> 
> whose README says:
> 
>   This unsigned varint (VARiable INTeger) format is for the use in all
>   the multiformats.
> 
>     - We have not yet decided on a format yet. When we do, this readme
>       will be updated.
> 
>     - We have time. All multiformats are far from requiring this varint.
> 
> which is not exactly confidence inspiring. They also put the length at
> the front of the hash. That's probably convenient if you're parsing an
> unknown set of hashes, but I'm not sure it's helpful inside Git objects.
> And there's an incentive to minimize header data at the front of a hash,
> because every byte is one more byte that every single hash will collide
> over, and people will have to type when passing hashes to "git show",
> etc.
> 
> I'd almost rather use something _really_ verbose like
> 
>   sha256:1234abcd...
> 
> in all of the objects. And then when we get an unadorned hash from the
> user, we guess it's sha256 (or whatever), and fall back to treating it as
> a sha1.
> 
> Using a syntactically-obvious name like that also solves one other
> problem: there are sha1 hashes whose first bytes will encode as a "this
> is sha256" multihash, creating some ambiguity.
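
To make that ambiguity concrete, here is a rough sketch. It assumes the
multihash layout of <function code><digest length><digest>, with 0x12 as
the code for sha2-256 and 0x20 as its digest length, which is my reading
of the multihash table. The function name and the sample value are made
up for illustration.

```python
# Multihash layout (as assumed here): <hash code><digest length><digest>.
# 0x12 is taken to be the code for sha2-256, 0x20 (32) its digest length.
SHA256_CODE = 0x12
SHA256_LEN = 0x20


def looks_like_sha256_multihash_prefix(raw):
    """True if the first two bytes of `raw` match a sha2-256 multihash header."""
    return len(raw) >= 2 and raw[0] == SHA256_CODE and raw[1] == SHA256_LEN


# A made-up 20-byte value standing in for a legacy SHA-1 that happens to
# begin with the bytes 0x12 0x20. It is not the hash of any real object.
legacy_sha1 = bytes.fromhex("1220" + "ab" * 18)
print(looks_like_sha256_multihash_prefix(legacy_sha1))
```

At full length the two differ (20 bytes against 34), so the confusion
would mostly bite with abbreviated names, where only the leading bytes
are available.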

Indeed, multihash is only really interesting when *all* hashes use it.
And obviously, git can't change the existing sha1s.
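
Given that, the explicit-prefix idea quoted above, with a guess and a
fallback for unadorned names, could look roughly like the sketch below.
Everything in it is hypothetical: `resolve_name`, the `store` mapping,
and the guess order are assumptions for illustration, not anything git
implements.

```python
# Hypothetical sketch: resolve an object name that may carry an explicit
# "algo:" prefix. Unadorned names are tried as sha256 first, then sha1.
GUESS_ORDER = ("sha256", "sha1")  # assumed order for unadorned names


def resolve_name(name, store):
    """Return (algo, full_hex) for `name`, or None if nothing matches.

    `store` maps an algorithm name to a set of full hex object names.
    `name` may be abbreviated; it must match exactly one stored name.
    """
    if ":" in name:
        algo, hexpart = name.split(":", 1)
        candidates = [algo]
    else:
        hexpart = name
        candidates = list(GUESS_ORDER)

    hexpart = hexpart.lower()
    for algo in candidates:
        matches = [h for h in store.get(algo, ()) if h.startswith(hexpart)]
        if len(matches) == 1:
            return algo, matches[0]
        if len(matches) > 1:
            raise ValueError("ambiguous abbreviation: " + name)
    return None
```

With a prefix, "sha1:1220" can only ever name a sha1 object; without
one, "1220" is tried as sha256 first and only then as sha1.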

Mike