On Thu, 17 Aug 2006, David Rientjes wrote:
>
> I'd like to solicit some comments about implementing support for md5 as a
> hash function that could be determined at runtime by the user during a
> project init-db.

I would _strongly_ suggest against this. At least not md5.

I can see the point of configurable hashes, but it would be for a stronger
hash than sha1, not for a (much) weaker one.

md5 is not only shorter, it's known to be broken, and there are attacks
out there that generate documents with the same md5 checksum quickly and
undetectably (ie depending on what the "document format" is, you might
actually not _see_ the corruption).

There's a real-life example of this (just google for "same md5") with a
postscript file, which when printed out still looks "valid".

In contrast, sha1 is still considered "hard", in that while you can
obviously always brute-force _any_ hash, the sha1 brute-forcing attack is
considered to be impractical, and nobody has shown any realistic version
of the above postscript kind of hack.

In my fairly limited performance analysis, I've actually been surprised by
the fact that the hashing has never really shown up as a major issue in
any of my profiles. All the _real_ performance issues have been related to
memory usage, and things like the hash lookup (ie "memcmp()" was pretty
high on the list - just from comparing object names during lookup).

We've also had compression issues (initial check-in) and obviously the
delta selection used to be a _huge_ time-waster until the pack info reuse
code went in. But I don't think we've ever had a load that was really
hashing-limited.

So considering that md5 isn't _that_ much faster to compute (let's say
that it's ~30% faster), the biggest advantage of md5 would likely be just
the fact that 16 bytes is smaller than 20 bytes, and thus commit objects
and tree objects in particular could be smaller.
But you'd be better off just using the first 16 bytes of the sha1 than the
md5 hash, if that was the main goal.

So yes, maybe we'll want to make the hash choice a setup-time option, but
if we ever do, I don't think we should make md5 even a choice. It's just
not a very good hash, and no new program should start using it.

		Linus
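
To make the "first 16 bytes of the sha1" suggestion concrete, here is a rough sketch in Python. It is only an illustration and not how git computes object names: `object_name` is a hypothetical helper, it hashes the raw bytes with no object header, and the 16-byte length is just the figure from the discussion above.

```python
import hashlib


def object_name(data: bytes, length: int = 20) -> bytes:
    """Return the first `length` bytes of the SHA-1 digest of `data`.

    Hypothetical helper for illustration only: it hashes the raw bytes
    and does not add any object header before hashing.
    """
    if not 1 <= length <= 20:
        raise ValueError("length must be between 1 and 20 bytes")
    return hashlib.sha1(data).digest()[:length]


blob = b"hello world\n"
full = object_name(blob)        # full 20-byte sha1 name
short = object_name(blob, 16)   # same size as an md5 digest

# The short name is simply a prefix of the full one, so it has the
# 16-byte footprint of md5 while still being derived from sha1.
print(len(full), len(short), full.startswith(short))
```

The point is only the size: a truncated sha1 takes the same 16 bytes per object name that md5 would, without switching to a hash with known collision attacks.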