On Fri, Feb 24, 2017 at 3:39 PM, Jeff King <peff@xxxxxxxx> wrote:
>
> One thing I worry about in a mixed-hash setting is how often the two
> will be mixed.

Honestly, I think that a primary goal for a new hash implementation
absolutely needs to be to minimize mixing.

Not for security issues, but because of combinatorics. You want to
have a model that basically reads old data, but that very
aggressively approaches "new data only" in order to avoid the
situation where you have basically the exact same tree state, just
_represented_ differently.

For example, what I would suggest the rules be is something like this:

 - introduce new tag2/commit2/tree2/blob2 object type tags that imply
   that they were hashed using the new hash

 - an old type obviously can never contain a pointer to a new type (ie
   you can't have a "tree" object that contains a tree2 object or a
   blob2 object).

 - but also make the rule that a *new* type can never contain a
   pointer to an old type, with the *very* specific exception that a
   commit2 can have a parent that is of type "commit".

That way everything "converges" towards the new format: the only way
you can stay on the old format is if you only have old-format objects,
and once you have a new-format object all your objects are going to be
new format - except for the history.

Obviously, if somebody stays in old format, you might end up still
getting some object duplication when you continue to merge from him,
but that tree can never merge back without converting to new-format,
so it will be a temporary situation.

So you will end up with duplicate objects, and that's not good (think
of what it does to all our full-tree "diff" optimizations, for example
- you no longer get the "these sub-trees are identical" across a
format change), but realistically you'll have a very limited time of
that kind of duplication.
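
To make the type rules listed earlier concrete, here is a minimal
sketch of the format-mixing check. The function name `may_reference`
and the `as_parent` flag are made up for illustration and are not
existing Git code. It only checks old/new format mixing, not whether a
given pairing makes sense otherwise (for example, it does not reject a
blob pointing at a tree).

```python
# Type tags as named in the proposal. Nothing here is an existing Git API.
OLD_TYPES = {"tag", "commit", "tree", "blob"}
NEW_TYPES = {"tag2", "commit2", "tree2", "blob2"}


def may_reference(container: str, target: str, as_parent: bool = False) -> bool:
    """Return True if `container` may hold a pointer to `target`.

    Only the old/new format rule is checked. `as_parent` (hypothetical
    flag) says the pointer is a commit's parent link.
    """
    known = OLD_TYPES | NEW_TYPES
    if container not in known or target not in known:
        raise ValueError("unknown object type")

    if container in OLD_TYPES:
        # An old type can never contain a pointer to a new type.
        return target in OLD_TYPES

    if target in NEW_TYPES:
        # New objects pointing at new objects is the normal case.
        return True

    # A new type pointing at an old type: allowed only for the one
    # specific exception, a commit2 whose parent is an old "commit".
    return container == "commit2" and target == "commit" and as_parent
```

With this check, `may_reference("commit2", "commit", as_parent=True)`
is the only new-to-old combination that is accepted, which is what
makes everything converge towards the new format.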

I'd furthermore suggest that from a UI standpoint, we'd

 - convert to 64-character hex numbers (32-byte hashes)

 - (as mentioned earlier) default to a 40-character abbreviation

 - make the old 40-character SHA1's just show up within the same
   address space (so they'd also be encoded as 32-byte hashes, just
   with the last 12 bytes zero).

 - you'd see in the "object->type" whether it's a new or old-style hash.

I suspect it shouldn't be too painful to do it that way.

             Linus
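
A rough sketch of the address-space idea above, assuming the names
`widen_sha1` and `abbreviate` (both hypothetical, not Git functions):
an old 20-byte SHA1 is zero-padded on the right to 32 bytes, and names
are shown with a 40-character abbreviation by default.

```python
import string

SHA1_HEX_LEN = 40      # 20-byte SHA1 as hex
FULL_HEX_LEN = 64      # 32-byte hash as hex
DEFAULT_ABBREV = 40    # default abbreviation length from the proposal


def widen_sha1(sha1_hex: str) -> str:
    """Encode an old SHA1 as a 32-byte name with the last 12 bytes zero."""
    if len(sha1_hex) != SHA1_HEX_LEN or any(
        c not in string.hexdigits for c in sha1_hex
    ):
        raise ValueError("expected 40 hex characters")
    # 12 zero bytes are 24 hex zeros.
    return sha1_hex.lower() + "00" * 12


def abbreviate(full_hex: str, length: int = DEFAULT_ABBREV) -> str:
    """Return the leading `length` hex characters of a 64-character name."""
    if len(full_hex) != FULL_HEX_LEN:
        raise ValueError("expected 64 hex characters")
    return full_hex[:length]
```

One consequence of this layout is that the default abbreviation of a
widened old name is just the original 40-character SHA1, so old names
keep looking the same in output.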