Jeff King <peff@xxxxxxxx> writes:

> Trees are more difficult, as they don't have any such field. But a valid
> tree does need to start with a mode, so sticking some non-numeric flag
> at the front of the object would work (it breaks backwards
> compatibility, but that's kind of the point).

Just like the object header format does not inherently impose a
maximum length the system can handle on our objects or the number
of mode bits we can use in an entry in the tree object [*1*], the
format in which tags and commits refer to other objects does not
impose what hash is used for these references [*2*].

The object names in the tree format are an oddball; being a binary
20-byte field without any other hint, they do limit us to SHA-1.

I think the helper functions in tree-walk.h, namely

    init_tree_desc();
    tree_entry_extract();
    update_tree_entry();

and the associated data structures can be updated to read a tree
object in a new format without affecting the readers too much. By
having an "I am in a new format" byte at the beginning that cannot
be a valid first byte in the current tree format (non-octal is a
good thing to use here), init_tree_desc() can set things up in the
desc structure to expect that the data that will be read by
tree_entry_extract() and update_tree_entry() are formatted in a new
way, and by varying that "tree-format signature" byte, we can update
the format in the future.

So at the loose-object format level, we may not even need "tree2";
we can view this update in a way similar to the change we did when
we started supporting submodules/gitlinks. Older Git would have
said "There is an object that is not tree or blob recorded" and
barfed, but a newer one takes such a tree just fine. This "we are now
introducing a new hash, and a tree can either have objects all named
by SHA-1 or all named by the new (non SHA-1) hash" update can be
treated the same way, methinks.

The normal flow to write tree objects is (supposed to be) all
contained in cache-tree.c.
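
To make the "tree-format signature" byte idea above concrete, here is
a rough sketch in Python rather than C. Only the legacy entry layout
(octal mode, space, name, NUL, 20-byte binary SHA-1) is Git's actual
format. The signature byte "N", the 32-byte name length, and the
function names detect_tree_format() and tree_entries() are all
assumptions made up for illustration; nothing above defines them.
detect_tree_format() merely plays the role suggested for
init_tree_desc().

```python
# Sketch only: the "new format" is hypothetical. Only the legacy
# entry layout below matches what Git writes today.

SHA1_LEN = 20

# Hypothetical signature bytes, mapped to the length of the binary
# object name stored in each entry. Any byte outside "0".."7" cannot
# start a legacy tree, because a legacy tree starts with an octal mode.
NEW_FORMATS = {ord("N"): 32}


def detect_tree_format(buf):
    """Return (offset of the first entry, object name length)."""
    if buf and buf[0] in NEW_FORMATS:
        return 1, NEW_FORMATS[buf[0]]
    if buf and not (ord("0") <= buf[0] <= ord("7")):
        raise ValueError("unknown tree-format signature: %r" % buf[:1])
    return 0, SHA1_LEN


def tree_entries(buf):
    """Yield (mode, name, object_name) for each entry in a tree buffer."""
    pos, name_len = detect_tree_format(buf)
    while pos < len(buf):
        space = buf.index(b" ", pos)      # end of the octal mode
        nul = buf.index(b"\0", space)     # end of the path name
        end = nul + 1 + name_len          # end of the binary object name
        if end > len(buf):
            raise ValueError("truncated tree entry")
        yield int(buf[pos:space], 8), buf[space + 1:nul], buf[nul + 1:end]
        pos = end


legacy_tree = b"100644 a.txt\0" + bytes(20)
new_tree = b"N" + b"100644 a.txt\0" + bytes(32)
```

The callers of tree_entries() do not care which format they were
handed; only the setup step looks at the first byte, which is the
property argued for above.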
As long as we can tell from "struct object" which hash names the
object (i.e. struct object_id may become an enum and a union), we
should be able to use it to convert objects near the tip of the
existing history to new hashes incrementally. Ideally, the flag-day
for one tip of a dag may be just a matter of

    git commit --allow-empty -m "object name hash update"

without anything else. The commit by default would want to name
itself with the new hash, which requires it to get its tree named
with the new hash, which may read the old tree and associated blobs
all named with SHA-1, but write_index_as_tree() should be able to

 (1) read the tree with its SHA-1 name to learn what is contained;

 (2) read the contents of blobs with their SHA-1 names, and compute
     their names with the new hash; and

 (3) write out a containing tree object in the updated format and
     named with the new hash.

And that would give us the tree object named with the new hash that
the command can write into the new commit object on its "tree" line.


[Footnote]

*1* These lengths and mode bits are spelled out in ASCII without any
    fixed limit on the number of bytes in the ASCII string that
    represents the length. The current code may happen to read them
    into unsigned long and unsigned int, which does impose a limit on
    the individual reader in the sense that if your ulong is only
    32-bit, you cannot have an object larger than 4GB. But that is
    not an inherent limit in the format; you can lift it by upgrading
    the reader.

*2* They are also spelled out in ASCII and there is no length limit.
    The existing implementation may happen to assume that they are
    all SHA-1, but the readers and the writers can be updated to
    allow other hashes to be used in a way that does not break
    existing code when we are only using SHA-1, by marking a
    reference that uses a new hash so that it is distinguishable
    from SHA-1 references.
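
The three steps listed for write_index_as_tree() can be sketched
roughly as below, again in Python rather than C. The way an object is
named (hash of "<type> <size>", a NUL, then the body) is how Git
names objects today. Everything else is an assumption for
illustration: SHA-256 stands in for the unnamed new hash, "N" is a
made-up tree-format signature byte, the object store is a plain dict
keyed by SHA-1 name, and convert_tree() is a hypothetical helper.
Gitlink entries and writing objects to disk are left out.

```python
import hashlib


def object_name(obj_type, body, algo):
    """Name an object as Git does: hash of '<type> <size>\\0' + body."""
    header = b"%s %d\0" % (obj_type.encode(), len(body))
    return hashlib.new(algo, header + body).digest()


def parse_legacy_tree(body):
    """Yield (mode, name, sha1) from a tree in the current format."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        yield body[pos:space], body[space + 1:nul], body[nul + 1:nul + 21]
        pos = nul + 21


def convert_tree(store, sha1_name, new_algo="sha256", signature=b"N"):
    """Return (new_name, new_body) for the tree known by sha1_name.

    store maps a binary SHA-1 name to an (object type, body) pair.
    """
    _, body = store[sha1_name]                  # (1) read tree via SHA-1
    out = [signature]
    for mode, name, child in parse_legacy_tree(body):
        child_type, child_body = store[child]   # (2) read child via SHA-1
        if child_type == "tree":
            new_child, _ = convert_tree(store, child, new_algo, signature)
        else:
            new_child = object_name(child_type, child_body, new_algo)
        out.append(mode + b" " + name + b"\0" + new_child)
    new_body = b"".join(out)                    # (3) tree in updated format
    return object_name("tree", new_body, new_algo), new_body


# A one-blob history tip, stored under SHA-1 names.
blob = b"hello\n"
blob_sha1 = object_name("blob", blob, "sha1")
tree_body = b"100644 hello.txt\0" + blob_sha1
tree_sha1 = object_name("tree", tree_body, "sha1")
store = {blob_sha1: ("blob", blob), tree_sha1: ("tree", tree_body)}

new_name, new_body = convert_tree(store, tree_sha1)
```

Note that the blob bodies are untouched; only their names are
recomputed, and only the containing trees need to be rewritten.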