[Your original message was almost certainly bounced by git@vger because
it exceeded the 100K limit; I'll try to quote liberally for those who
missed the original. You may want to split your content and repost.]

On Mon, Nov 28, 2011 at 10:02:43PM -0800, Bill Zaumen wrote:

> Maintains a database of CRCs of Git objects to allow SHA-1 hash
> collisions to be detected with high probability (1 - 1/2^32) and with
> little computational overhead. The CRCs cover the content of Git
> objects, but not the header. For loose objects, these are stored in
> subdirectories of GIT_DIRECTORY/objects/crcs, with each subdirectory's
> name consisting of the first two hexadecimal digits of the
> corresponding object's SHA-1 hash. For each pack file, FILE.pack, the
> CRCs are stored in a FILE.mds, in the same order as the SHA-1 hashes
> that appear in FILE.idx. Checks for hash collisions are made whenever
> a new loose object is created.

I'm confused. Is this about avoiding accidental collisions, or about
avoiding malicious collision attacks? Let's assume the former for a
minute.

The usual way of considering collision likelihood is not by
probabilities, but by the number of items that must be selected to
achieve a >50% probability that there is a collision, which is roughly
the square root of the total number of items, or half the bit-space. So
we expect to see a collision in 160-bit SHA-1 after writing about 2^80
objects.

The linux-2.6 repository has about 2.2 million objects after 6.5 years
of development. If development continues at this pace, we would expect
a collision in a mere 10^18 years. Assuming your 32-bit CRC is
statistically independent of the SHA-1 value, it adds 32 bits to the
hash, or 16 bits to the number of objects we expect to generate (i.e.,
2^96). That bumps us up to 10^23 years. However, it may be of interest
that the Sun is expected to burn out in a mere 10^10 years[1].

So I'm not sure there is really any point in adding a few bits to the
hash length to detect an accidental collision. It's already
fantastically unlikely. Adding another probability on top does make it
less likely, but in the same ballpark of fantastic. You can argue about
whether linux-2.6 is a representative sample, or whether the pace of
object creation might increase. But the simple answer is that we're
many orders of magnitude away from having to care.
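Back-of-the-envelope, in Python (and assuming the linux-2.6 rate of
roughly 2.2 million objects per 6.5 years stays flat forever, which of
course it won't):

  # Years until we expect a >50% chance of an accidental collision.
  objects_per_year = 2.2e6 / 6.5

  # Birthday bound: ~50% odds of a collision after about 2^(bits/2) objects.
  years_sha1       = 2**80 / objects_per_year   # 160-bit SHA-1: ~3.6e18 years
  years_sha1_crc32 = 2**96 / objects_per_year   # plus a 32-bit CRC: ~2.3e23 years

  print("%.1e %.1e" % (years_sha1, years_sha1_crc32))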
However, in your patch you write:

> +Security-Issue Details
> +----------------------
> +
> +Without hash-collision detection, Git has a higher risk of data
> +corruption due to the obvious hash-collision vulnerability, so the
> +issue is really whether a usable vulnerability exists. Recent research
> +has shown that SHA-1 collisions can be found in 2^63 operations or
> +less. While one result claimed 2^53 operations, the paper claiming
> +that value was withdrawn from publication due to an error in the
> +estimate. Another result claimed a complexity of between 2^51 and 2^57
> +operations, and still another claimed a complexity of 2^57.5 SHA-1
> +computations. A summary is available at
> +<http://hackipedia.org/Checksums/SHA/html/SHA-1.htm#SHA-1>. Given the
> +number of recent attacks, possibly by governments or large-scale
> +criminal enterprises
> +(<http://www.csmonitor.com/World/terrorism-security/2011/0906/Iranian-government-may-be-behind-hack-of-Dutch-security-firm>,
> +<http://en.wikipedia.org/wiki/Operation_Aurora>,
> +<http://en.wikipedia.org/wiki/Botnet#Historical_list_of_botnets>),
> +which include botnets with an estimated 30 million computers, there is
> +reason for some concern: while generating a SHA-1 collision for
> +purposes of damaging a Git repository is extremely expensive
> +computationally, it is possibly within reach of very well funded
> +organizations. 2^32 operations, even if the operations are as
> +expensive as computing a SHA-1 hash of a modest source-code file, can
> +be performed in a reasonably short period of time on the type of
> +hardware widely used in desktop or laptop computers at present. With
> +sufficient parallelism, 30 million personal computers sufficient for
> +playing the latest video games could perform 2^56 operations in a
> +reasonable time.

...which makes me think that you do care about malicious collisions.

All of what you wrote above seems fairly accurate. Let's leave aside
that those numbers are for a collision attack as opposed to a pre-image
attack (collision attacks are hard to execute, as they require you to
generate two "matched" objects, one good and one evil, and then
convince somebody to first accept your good object, only to later
replace it with the evil one). I have two concerns with this as a
security mechanism:

First, let us assume that the implementation details of your scheme
work, and that git users will effectively be checking not just a SHA-1,
but now a SHA-1 plus your additional digest. In that case, why use
crc32 as the digest? It seems like a terrible choice. Even assuming it
were cryptographically secure, it would add a mere 16 bits to the
numbers you mentioned above. IOW, it's at best a band-aid that pushes
attacks off for a few years. But more importantly, it's _not_ secure,
and can be trivially forged.

But that's a relatively simple problem. crc32 could be replaced in your
scheme with any of the SHA-2 family, or the upcoming SHA-3, or
whatever. That brings me to my second concern: how does this
alternative message digest have any authority?

For example, your patch teaches the git protocol a new extension to
pass these digests along with the object sha1s. But how do we know the
server isn't feeding us a bad digest along with the bad object?

The "usual" security model discussed in git is that of verifying a
signed tag. Linus signs a tag and pushes it to a server. I fetch the
tag, and can verify the signature on the tag. I want to know that I
have the exact same objects that Linus had. But I can't assume the
server is trustworthy; it may have fed me a bad object with a collided
sha1.

But what's in the signed part of the tag object? Only the sha1 of the
commit the tag points to, not the new digest. So an attacker can
replace the commit with one that collides, and it can in turn point to
arbitrary trees and blobs.

You can fix this by including an extra header in the signed part of the
tag that says "also, the digest of the commit I point to is X". Then
you know you have the same commit that Linus had. But the commit points
to a tree by its sha1. So you have to add a similar header in the
commit object that says "also, the digest of the tree I point to is X".
And ditto for all of the parent pointers, if you want to care about
signing history.
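To make that concrete, I'm picturing something like the following. The
header names, their placement, and all the hashes are made up here for
illustration; this is a sketch of the idea, not what your patch
currently writes. In the tag, with the new header inside the part
covered by the gpg signature:

  object <sha1 of the tagged commit>
  type commit
  tag v3.2
  tagger A U Thor <author@example.com> 1322607323 -0800
  object-md-sha256 <sha-256 of the tagged commit's content>

  tag message here
  -----BEGIN PGP SIGNATURE-----
  ...

and in the commit it points to:

  tree <sha1 of the tree>
  parent <sha1 of the parent commit>
  author A U Thor <author@example.com> 1322607323 -0800
  committer A U Thor <author@example.com> 1322607323 -0800
  tree-md-sha256 <sha-256 of the tree's content>
  parent-md-sha256 <sha-256 of the parent's content>

  commit message here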
And then you have the same problem in the tree: each sub-tree and blob
is referenced by its sha1.

In other words, authority flows from the gpg-signed tag portion, and
links in the chain to each object are made by referencing sha1s. Every
time such a link is made, you need to also include the digest of the
object, or you are giving the attacker a place to insert a collided
object.

For tag and commit objects, this actually isn't that hard; they have a
relatively small number of links, and they have room for arbitrary
headers. I.e., add a "tree-md-sha256" header that gives the expected
sha-256 of the tree object referenced by sha-1 in the "tree" header.
Older versions of git will skip over this header (though obviously they
won't be smart enough to do the more-secure verification).

However, it's harder for trees. Each entry needs to have the new digest
added, but there simply isn't room in the format. So we would have to
change the tree format, breaking interoperability with older versions
of git. And all of your tree sha1s would change as soon as you wrote
them with the new format. That's only slightly better than just
swapping sha1 out for a better algorithm.

One trick you could do is to include the digest of each blob in the
commit object itself. This really bloats the size of commit objects,
though. You can store a digest of their digests instead (which your
patch actually calculates, but AFAICT does not actually insert into the
commit object); [2] below sketches the idea. That is small and
relatively cheap (it turns commit from an O(1) operation into an
O(number of files) operation, but the per-file constant is pretty
inexpensive). But it comes at the expense of not being able to tell
which of the constituent blobs was actually attacked.

So I think all of that would work for verification starting at a signed
tag that points to a commit or a blob. For a tag pointing straight to a
tree, it could include the same "digest of digests" that the commit
would.

But it can never handle direct sha1 references outside of git. For
example, a bug tracker or CI system that references a commit in git
will do so by sha1. But that's an insecure link in the chain; you
really need it to refer to both the sha1 _and_ the digest, and then you
can verify that the object you have under that sha1 matches the digest.
So you'd have to teach a whole generation of tools that they can't
trust git sha1 ids, and that you need an "extended" id that includes
the digest.

At that point, I really wonder if a flag day to switch to a new
repository format is all that bad. You'd have to either rewrite
existing history (which means re-signing tags if you care about
them!), or just decide to leave the new and old histories disconnected.
Whereas a scheme based around an added-on digest could keep the old
history connected. But at the same time, the old history will have zero
value from a security perspective. If sha1 is broken, then those old
signatures are worthless anyway. And just switching to a new algorithm
means the implementation remains very simple.

-Peff

[1] Fun fact of the day: if linux development continues at the same
    rate until the Sun burns out, there is a 1/2^60 chance of a
    collision!
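[2] A toy sketch in Python of the "digest of digests" idea, just to
    make the cost and the limitation concrete; this is not how your
    patch computes it, and the "blobs-md-sha256" header name is made
    up:

      import hashlib

      def digest_of_digests(paths):
          # One strong digest per blob, folded into a single value that
          # a commit could carry in one extra header, e.g.
          #   blobs-md-sha256 <result>
          # O(number of files), but the per-file cost is just a sha-256
          # of the content.
          h = hashlib.sha256()
          for path in sorted(paths):  # fixed order, so the result is stable
              with open(path, "rb") as f:
                  blob_digest = hashlib.sha256(f.read()).hexdigest()
              h.update((blob_digest + "\n").encode("ascii"))
          return h.hexdigest()

      # A mismatch tells you that *some* blob was tampered with, but
      # not which one.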