Thanks for mentioning the 100K limit, which I didn't know about. Will have to try to see how to split it into two patches. The intent is to increase the cost of a malicious attack, which requires generating two different files with the same SHA-1 value, detect such an attack early, and to slow such an attack down - because of Git's rule that the first object with a SHA-1 value is the one the repository has, if it takes longer to generate a collision than the time it takes to get the original object into all repositories (which is done manually by multiple individuals), the forged file will never appear in any "official" repository. The additional CRC (easily changed to whatever message digest one might prefer) makes a malicious attack far more difficult: the modified file has to have both the same SHA-1 hash (including the Git header) and the same CRC (not including the Git header). An efficient algorithm to do both simultaneously does not yet exist. So, if we could generate a SHA-1 collision in one second, it would presumably take billions of seconds (many decades of continuous computation) to generate a SHA-1 hash with the same CRC, and well before a year has elapsed, the original object should have been in all the repositories, preventing a forged object from being inserted. Of course, eventually you might need a real message digest. The weakness of a CRC as an integrity check is not an issue since it is never used alone: it's use is more analogous to the few extra bits added to a data stream when error-detecting codes are used. I used a CRC in the initial implementation rather than a message digest because it is faster, and because the initial goal was to get things to work correctly. In any case, the patch does not eliminate any code in which Git already does a byte-by-byte comparison. In cases where Git currently assumes that two objects are the same because the SHA-1 hashes are the same, the patch compares CRCs as an additional test. Regarding your [Jeff's] second concern, "how does this alternative digest have any authority?" there are two things to keep in mind. First, it is a supplement to the existing digest. Second, any value of the CRC that is stored permanently (baring bugs, in my implementation, of course) is computed locally - when a loose object is created or when a pack file's index is created. At no point is a CRC that was obtained from another repository trusted. While the patch modifies Git so that it can send CRCs when using the git protocol, these CRCs are never stored, but are instead used only for cross checks. If one side or the other "lies", you get an error. To give a concrete example, during a fetch, the git protocol currently sends "have" messages that contain the SHA-1 hashes of commits. The extension allows two CRCs to be sent along with each hash. If these do not match the local values (tested only if the local values exist), something is wrong and you get an error report that the server sends to the client, but the server never uses these CRCs for any other purpose and the server never sends its CRCs to the client because of backwards-compatibility issues. For objects that are transferred, you end up with a pack file, with index-pack called to build the index (and with the patch, the corresponding MDS file), but index-pack already does a byte-by-byte comparison to detect collisions - the comparison is much faster than the SHA-1 computation index-pack has to do anyway. Where this helps is when one is using multiple repositories. If you fetch a commit from repository B, which we'll assume has a forged blob (different content, but the original SHA-1 hash), and then run fetch using repository A, which has has the same commit with the original blob, the forged blob will not be transferred from Server A and the client will not be notified that there is an inconsistency - the protocol is "smart" enough to know that the client already has the commit and assumes there is nothing to do regarding it. BTW, regarding your [Jeff's] discussion about putting an additional header in commit messages - I tried that. The existing versions of Git didn't like it: barring a bug in my test code, it seems that Git expects headers in commit messages to be in a particular order and treats deviations from that to be an error. I even tried appending blank lines at the end of a commit, with spaces and tabs encoding an additional CRC, and that didn't work either - at least it never got through all the test programs, failing in places like the tests involving notes. In any case, you'd have to phase in such a change gradually, first putting in the code to read the new header if it is there, and subsequently (after ample time so that everyone is running a sufficiently new version) enabling the code to create the new header. Also, regarding "At that point, I really wonder if a flag day to switch to a new repository format is all that bad," if that turns out to be the decision, I'd recommend doing it sooner rather than later. The reason is cost, which grows with the number of git users and the number and size of Git repositories. Bill -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html