Re: SHA1 collisions found

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 24 Feb 2017 22:10:01 -0800

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> For example, what I would suggest the rules be is something like this:
>
>  - introduce new tag2/commit2/tree2/blob2 object type tags that imply
> that they were hashed using the new hash
>
>  - an old type obviously can never contain a pointer to a new type (ie
> you can't have a "tree" object that contains a tree2 object or a blob2
> object.
>
>  - but also make the rule that a *new* type can never contain a
> pointer to an old type, with the *very* specific exception that a
> commit2 can have a parent that is of type "commit".

OK, I think that is what Peff was suggesting in his message, and I
do not have problem with such a transition plan.  Or the *very*
specific exception could be that a reference to "commit" can use old
name (which would allow binding a submodule before transition to a
new project).

We probably do not need "blob2" object as they do not embed any
pointer to another thing.  A loose blob with old name can be made
available on the filesystem also under new name without much "heavy"
transition, and an in-pack blob can be pointed at with _two_ entries
in the updated pack index file under old and new names, both for the
base (just deflated) representation and also ofs-delta.  A ref-delta
based on another blob with old name may need a bit of special
handling, but the deltification would not be visible at the "struct object"
layer, so probably not such a big deal.

We may also be able to get away without "commit2" and "tag2" as
their pointers can be widened and parse_{commit,tag}_object() should
be able to deal with objects with new names transparently.  "tree2"
may be a bit tricky, though, but offhand it seems to me that nothing
is insurmountable.

> That way everything "converges" towards the new format: the only way
> you can stay on the old format is if you only have old-format objects,
> and once you have a new-format object all your objects are going to be
> new format - except for the history.

Yes.

> So you will end up with duplicate objects, and that's not good (think
> of what it does to all our full-tree "diff" optimizations, for example
> - you no longer get the "these sub-trees are identical" across a
> format change), but realistically you'll have a very limited time of
> that kind of duplication.
>
> I'd furthermore suggest that from a UI standpoint, we'd
>
>  - convert to 64-character hex numbers (32-byte hashes)
>
>  - (as mentioned earlier) default to a 40-character abbreviation
>
>  - make the old 40-character SHA1's just show up within the same
> address space (so they'd also be encoded as 32-byte hashes, just with
> the last 12 bytes zero).

Yes to all of the above.

>  - you'd see in the "object->type" whether it's a new or old-style hash.

I am not sure if this is needed.  We may need to abstract tree_entry walker
a little bit as a preparatory step, but I suspect that the hash (and
more importantly the internal format) can be kept as an internal
knowledge to the object layer (i.e. {commit,tree,tag}.c).

So,... thanks for straightening me out.  I was thinking we would
need mixed mode support for smoother transition, but it now seems to
me that the approach to stratify the history into old and new is
workable.