Re: SHA1 collisions found

Jeff King <peff@xxxxxxxx> · Fri, 24 Feb 2017 20:16:36 -0500

On Fri, Feb 24, 2017 at 04:39:45PM -0800, Linus Torvalds wrote:

> For example, what I would suggest the rules be is something like this:
> 
>  - introduce new tag2/commit2/tree2/blob2 object type tags that imply
> that they were hashed using the new hash
> 
>  - an old type obviously can never contain a pointer to a new type (ie
> you can't have a "tree" object that contains a tree2 object or a blob2
> object.
> 
>  - but also make the rule that a *new* type can never contain a
> pointer to an old type, with the *very* specific exception that a
> commit2 can have a parent that is of type "commit".

Yeah, this is exactly what I had in mind. That way everybody in
"newhash" mode has no decisions to make. They follow the same rules and
it's as if sha1 never existed, except when you follow links in
historical objects.

> [in reply...]
> Actually, I take that back. I think it might be easier to keep
> "object->type" as-is, and it would only show the current OBJ_xyz
> fields. Then writing the SHA ends up deciding whether a OBJ_COMMIT
> gets written as "commit" or "commit2".

Yeah, I think there are some data structures with limited bits for the
"type" fields (e.g., the pack format). So sticking with OBJ_COMMIT might
be nice. For commits and tags, it would be nice to have an "I'm v2"
header at the start so there's no confusion about how they are meant to
be interpreted.

Trees are more difficult, as they don't have any such field. But a valid
tree does need to start with a mode, so sticking some non-numeric flag
at the front of the object would work (it breaks backwards
compatibility, but that's kind of the point).

I dunno. Maybe we do not need those markers at all, and could get by
purely on object-length, or annotating the headers in some way (like
"parent sha256:1234abcd").

It might just be nice if we could very easily identify objects as one
type or the other without having to parse them in detail.

> So you will end up with duplicate objects, and that's not good (think
> of what it does to all our full-tree "diff" optimizations, for example
> - you no longer get the "these sub-trees are identical" across a
> format change), but realistically you'll have a very limited time of
> that kind of duplication.

Yeah, cross-flag-day diffs will be more expensive. I think that's
something we have to live with. I was thinking originally that the
sha1->newhash mapping might solve that, but it only works at the blob
level. I.e., you can compare a sha1 and a newhash like:

  if (!hashcmp(sha1_to_newhash(a), b))

without having to look at the contents. But it doesn't work recursively,
because the tree-pointing-to-newhash will have different content.

-Peff