Re: SHA1 collisions found

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 26 Feb 2017 10:55:09 -0800

Jeff King <peff@xxxxxxxx> writes:

> Trees are more difficult, as they don't have any such field. But a valid
> tree does need to start with a mode, so sticking some non-numeric flag
> at the front of the object would work (it breaks backwards
> compatibility, but that's kind of the point).

Just as the object header format does not inherently limit the
length of the objects the system can handle, and the tree entry
format does not limit the number of mode bits we can use in an
entry [*1*], the format in which tags and commits refer to other
objects does not dictate which hash is used for these references
[*2*].

The object names in the tree format are the oddball; being binary
20-byte fields without any other hint, they do limit us to SHA-1.
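
To make the limitation concrete, here is a minimal sketch of a
parser for one entry in the current tree format (octal mode in
ASCII, a space, the name, a NUL, then 20 raw bytes).  The names
sketch_entry and sketch_parse_entry are made up for illustration and
are not Git's own; the point is only that nothing in the data says
how long the hash is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHA1_RAWSZ 20 /* the only hash width the current format can express */

/* Hypothetical, simplified view of one tree entry (not Git's name_entry). */
struct sketch_entry {
	unsigned int mode;
	const char *name;          /* points into the buffer, NUL-terminated */
	const unsigned char *hash; /* points at SHA1_RAWSZ raw bytes */
};

/*
 * Parse one entry from buf[0..size).  Returns the number of bytes
 * consumed, or 0 if the entry is malformed or truncated.
 */
static size_t sketch_parse_entry(const unsigned char *buf, size_t size,
				 struct sketch_entry *out)
{
	size_t i = 0;
	unsigned int mode = 0;
	const unsigned char *nul;

	assert(buf && out);

	/* mode: one or more octal digits, terminated by a space */
	while (i < size && buf[i] >= '0' && buf[i] <= '7') {
		mode = (mode << 3) | (unsigned int)(buf[i] - '0');
		i++;
	}
	if (i == 0 || i >= size || buf[i] != ' ')
		return 0;
	i++;

	/* name: runs up to the next NUL */
	nul = memchr(buf + i, '\0', size - i);
	if (!nul)
		return 0;
	out->name = (const char *)(buf + i);
	i = (size_t)(nul - buf) + 1;

	/* hash: the width is implied, never recorded in the entry */
	if (size - i < SHA1_RAWSZ)
		return 0;
	out->mode = mode;
	out->hash = buf + i;
	return i + SHA1_RAWSZ;
}
```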

I think the helper functions in tree-walk.h, namely 

	init_tree_desc();
	tree_entry_extract();
	update_tree_entry();

and the associated data structures can be updated to read a tree
object in a new format without affecting the readers too much.  By
having an "I am in a new format" byte at the beginning that cannot be
a valid first byte in the current tree format (non-octal is a good
thing to use here), init_tree_desc() can set things up in the desc
structure to expect that the data that will be read by
tree_entry_extract() and update_tree_entry() is formatted in a new
way, and by varying that "tree-format signature" byte, we can update
the format in the future.
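
A rough sketch of that dispatch, under loud assumptions: the
signature byte 'N', the 32-byte width of the new hash, and every
sketch_* name below are invented for illustration.  This is not the
real struct tree_desc or init_tree_desc(); it only shows how the
first byte can select the layout the later readers should expect.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_tree_format {
	SKETCH_TREE_LEGACY, /* current format: 20-byte SHA-1 names */
	SKETCH_TREE_NEW,    /* hypothetical new format */
	SKETCH_TREE_UNKNOWN /* signature byte this reader does not know */
};

#define SKETCH_SIG_NEW 'N'       /* hypothetical; any non-octal byte works */
#define SKETCH_NEW_HASH_RAWSZ 32 /* hypothetical width of the new hash */

struct sketch_tree_desc {
	const unsigned char *buffer; /* first entry, past any signature byte */
	size_t size;                 /* bytes remaining at buffer */
	enum sketch_tree_format format;
	size_t hash_rawsz;           /* what the entry readers should consume */
};

static void sketch_init_tree_desc(struct sketch_tree_desc *desc,
				  const void *buffer, size_t size)
{
	const unsigned char *p = buffer;

	assert(desc);
	desc->buffer = p;
	desc->size = size;

	if (size == 0 || (p[0] >= '0' && p[0] <= '7')) {
		/* empty tree, or first byte is the start of an octal mode */
		desc->format = SKETCH_TREE_LEGACY;
		desc->hash_rawsz = 20;
	} else if (p[0] == SKETCH_SIG_NEW) {
		/* skip the signature; entries follow in the new layout */
		desc->format = SKETCH_TREE_NEW;
		desc->hash_rawsz = SKETCH_NEW_HASH_RAWSZ;
		desc->buffer = p + 1;
		desc->size = size - 1;
	} else {
		/* a future signature; an old reader has to give up here */
		desc->format = SKETCH_TREE_UNKNOWN;
		desc->hash_rawsz = 0;
	}
}
```

The entry readers would then consult desc->hash_rawsz instead of a
hard-coded 20, and a later format revision only needs another
signature byte and another branch.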

So at the loose-object format level, we may not even need "tree2";
we can view this update in a way similar to the change we did when
we started supporting submodules/gitlinks.  Older Git would have
said "There is an object that is not tree or blob recorded" and
barfed, but a newer one takes such a tree just fine.  This "we are
now introducing a new hash, and a tree can either have objects all
named by SHA-1 or all named by the new (non-SHA-1) hash" update can
be treated the same way, methinks.

The normal flow to write tree objects is (supposed to be) all
contained in cache-tree.c.  As long as we can tell from "struct
object" which hash names the object (i.e. struct object_id may
become an enum and a union), we should be able to use it to convert
objects near the tip of the existing history to new hashes
incrementally. Ideally, the flag-day for one tip of a dag may be
just a matter of

	git commit --allow-empty -m "object name hash update"

without anything else.  The commit by default would want to name
itself with the new hash, which requires it to get its tree named
with the new hash, which may read the old tree and associated blobs
all named with SHA-1, but write_index_as_tree() should be able to
(1) read the tree with its SHA-1 name to learn what is contained;
(2) read the contents of blobs with their SHA-1 names, and compute
their names with the new hash; and (3) write out a containing tree
object in the updated format and named with the new hash.  And that
would give us the tree object named with the new hash that the
command can write into the new commit object on its "tree" line.
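
The "enum and a union" shape mentioned above could look roughly like
the following.  This is only a sketch: the sketch_* names and the
32-byte width of the new hash are assumptions, and the real struct
object_id need not end up looking like this.  What matters is that
each name carries the hash that produced it, so SHA-1 and new-hash
names can coexist while history is converted incrementally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_hash_algo {
	SKETCH_HASH_SHA1,
	SKETCH_HASH_NEW /* hypothetical successor hash */
};

struct sketch_object_id {
	enum sketch_hash_algo algo; /* which hash names this object */
	union {
		unsigned char sha1[20];
		unsigned char newhash[32]; /* hypothetical width */
	} hash;
};

/* Number of meaningful raw bytes in the union for this name. */
static size_t sketch_oid_rawsz(const struct sketch_object_id *oid)
{
	assert(oid);
	return oid->algo == SKETCH_HASH_SHA1 ? sizeof(oid->hash.sha1)
					      : sizeof(oid->hash.newhash);
}

/* Two names match only if they use the same hash and the same bytes. */
static int sketch_oid_eq(const struct sketch_object_id *a,
			 const struct sketch_object_id *b)
{
	assert(a && b);
	return a->algo == b->algo &&
	       !memcmp(&a->hash, &b->hash, sketch_oid_rawsz(a));
}
```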

[Footnotes]

*1* These lengths and mode bits are spelled out in ASCII without any
    fixed limit on the number of bytes in the ASCII string that
    represents the value.  The current code may happen to read them
    into unsigned long and unsigned int, which does impose a limit
    on the individual reader in the sense that if your ulong is only
    32-bit, you cannot have an object larger than 4GB.  But that is
    not an inherent limit in the format; you can lift it by
    upgrading the reader.
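
As an illustration of "the limit lives in the reader", here is a
sketch that reads the ASCII decimal length into uintmax_t and
reports overflow instead of silently truncating.  The function name
sketch_parse_size is made up and this is not how Git's own header
parser is written; widening the integer type is all it takes to
raise the ceiling.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Parse a NUL-terminated ASCII decimal string into *out.
 * Returns 0 on success, -1 if the string is empty, contains a
 * non-digit, or does not fit in this reader's uintmax_t.
 */
static int sketch_parse_size(const char *s, uintmax_t *out)
{
	uintmax_t v = 0;

	assert(s && out);
	if (*s < '0' || *s > '9')
		return -1;
	for (; *s >= '0' && *s <= '9'; s++) {
		unsigned int d = (unsigned int)(*s - '0');

		/* would v * 10 + d exceed UINTMAX_MAX? */
		if (v > (UINTMAX_MAX - d) / 10)
			return -1;
		v = v * 10 + d;
	}
	if (*s != '\0')
		return -1;
	*out = v;
	return 0;
}
```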

*2* They are also spelled out in ASCII and there is no length limit.
    The existing implementation may happen to assume that they are
    all SHA-1, but the readers and the writers can be updated to
    allow other hashes.  As long as a reference that uses a new hash
    is marked so that it is distinguishable from SHA-1 references,
    this does not break existing code when we are only using SHA-1.