"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes: > I will say that the pack format will likely require some changes, > because it assumes ... > The reason is that we can't have an unambiguous parse of the current > objects if two hash algorithms are in use.... > So when we look at a new hash, we need to provide an unambiguous way to > know what hash is in use. The two choices are to either require all > object use the new hash, or to extend the objects to include the hash. > Until a couple days ago, I had planned to do the former. I had not even > considered using a multihash approach due to the complexity. Objects in Git identify themselves, but once you introduce the second hash function (as opposed to replacing the hash function to a new one), you would allow people to call the same object by two names. That has interesting implications. Let's say you have a blob at path F in a top-level tree object and create a commit. You have three objects in total, the tree knows the blob as one name based on SHA-1 and the commit knows the tree as one name based on SHA-1. The same contents of the blob and the tree could have different names based on SHA-256 in the future Git. Let's further say you have a future Git and clone from the above repository with three objects. You get a pack stream, containing the data for one commit, tree and blob each. These objects do not carry their own name as extra pieces of information. You only get their contents, and it is up to you to name them by hashing. .idx files are created by running index-pack while receiving the pack data stream. You _somehow_ need to know that these three objects need to be hashed with SHA-1, even though you are SHA-256 capable, because otherwise the object name recorded in the tree object for the blob would not match what your .idx file would call the blob data. Also the object name recorded in the ref to point at the commit would not match the commit object's object name, unless you hash with SHA-1. 
It is a possibility to always hash these objects twice and record
_both_ hashes in the updated .idx file; after all, .idx files are
a strictly local matter.

Now let's further say that you update the file F in the working
tree, and do "git commit -a" with an updated version of Git.  What
should happen?

Assuming that we are trying to migrate to a different hashing
algorithm over time, we would want to create a new blob under an
object name based on SHA-256, add that to the index and write a new
tree out, named by hashing with SHA-256.  We then record that
longer-named tree in a commit whose parent commit is still named
with a SHA-1 based hash, and the new commit in turn is named by
hashing with SHA-256.

Then you push the result back.  Let's assume by now the place you
cloned from is also SHA-256 capable.  You look at the tips of refs
at your clone-source and discover that you would need to only send
the new commit, its tree and the updated blob.  You send the data in
these three objects.  The receiving end would now need to do the
same "magically choose the hash to make sure the new blob gets the
name that is recorded in the new tree (and the new tree the name
recorded in the new commit)" thing.

The same discussion applies if somebody else clones from you at this
point.  The objects introduced by the second commit all need to be
hashed with the new hash to be named, while the other objects need
to be hashed with the old hash.

Continuing this thought process, I do not see a good way to allow us
to wean ourselves off of the old hash, unless we _break_ the pack
stream format so that each object in the pack carries not just the
data but also the hash algorithm to be used to _name_ it, so that
new objects will never be referred to using the old hash.

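
The idea of a pack stream whose entries say which hash names them
can be sketched as follows.  This is only an illustration in Python
under stated assumptions: the Entry layout and the toy index_pack
below are hypothetical and do not describe any existing or proposed
pack format.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical pack entry: data plus the hash used to *name* it."""
    algo: str       # "sha1" or "sha256"
    obj_type: str   # "blob", "tree" or "commit"
    body: bytes

def name_of(entry: Entry) -> str:
    """Name the object with the algorithm the entry itself declares."""
    header = f"{entry.obj_type} {len(entry.body)}".encode() + b"\0"
    return hashlib.new(entry.algo, header + entry.body).hexdigest()

def index_pack(entries):
    """Toy .idx: object name -> position of the entry in the stream."""
    return {name_of(e): pos for pos, e in enumerate(entries)}

# An object from the old history and one created after the switch.
old_blob = Entry("sha1", "blob", b"hello\n")
new_blob = Entry("sha256", "blob", b"hello, world\n")

idx = index_pack([old_blob, new_blob])
for name, pos in idx.items():
    print(pos, len(name), name)
```

With the algorithm carried per entry, the receiver no longer has to
guess: old objects keep their 40-hex-digit names and new ones are
only ever known by their 64-hex-digit names.
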
It matters performance-wise that the weaning process go as quickly
as possible, once the system becomes capable of the new hash
algorithm, because during the transition period, we'd have to suffer
the full tree-diff becoming inefficient (Note: don't limit your
thinking to just "git diff" and "git log"; the same inefficiency
hits "git checkout" to switch branches and "git merge" to walk three
trees in parallel), because we cannot skip descending into
subdirectories based on the tree object names being equal, which
guarantees that everything under the hierarchy is equal.
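
The tree-diff cost can be illustrated with a toy model in Python.
It is a sketch only: the Tree class, its serialization and the diff
function are simplified stand-ins invented for this example, not
Git's data structures.  The point it shows is that equal tree names
let the walk skip a whole subdirectory, and that the same contents
named under two different hashes defeat that shortcut.

```python
import hashlib

def name(obj_type, body, algo):
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(algo, header + body).hexdigest()

class Tree:
    """Toy tree: maps an entry name to a Tree or to blob contents (bytes)."""
    def __init__(self, entries, algo):
        self.entries = entries
        parts = []
        for key in sorted(entries):
            value = entries[key]
            child = value.name if isinstance(value, Tree) else name("blob", value, algo)
            parts.append(f"{key}\0{child}\n".encode())
        # The tree's own name depends on the names of everything below it.
        self.name = name("tree", b"".join(parts), algo)

def diff(a, b, stats):
    """Return changed paths; count how many tree pairs had to be opened."""
    if a.name == b.name:
        return []                    # equal names: skip the whole hierarchy
    stats["visited"] += 1
    changed = []
    for key in sorted(set(a.entries) | set(b.entries)):
        x, y = a.entries.get(key), b.entries.get(key)
        if isinstance(x, Tree) and isinstance(y, Tree):
            changed += [f"{key}/{p}" for p in diff(x, y, stats)]
        elif x != y:                 # the toy compares blob contents directly
            changed.append(key)
    return changed

files = {"a.txt": b"a\n", "b.txt": b"b\n"}
sub_old = Tree(files, "sha1")
sub_new = Tree(files, "sha256")      # same contents, different name

root_old = Tree({"F": b"v1\n", "sub": sub_old}, "sha1")
root_same_hash = Tree({"F": b"v2\n", "sub": sub_old}, "sha1")
root_mixed = Tree({"F": b"v2\n", "sub": sub_new}, "sha256")

stats_same = {"visited": 0}
changed_same = diff(root_old, root_same_hash, stats_same)
stats_mixed = {"visited": 0}
changed_mixed = diff(root_old, root_mixed, stats_mixed)
print(changed_same, stats_same)
print(changed_mixed, stats_mixed)
```

Both comparisons report only F as changed, but the mixed-hash case
has to open "sub" as well, because its two names no longer compare
equal even though nothing under it differs.
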