On Mon, Nov 28, 2011 at 22:02, Bill Zaumen <bill.zaumen+git@xxxxxxxxx> wrote:
> Several years ago (in 2006) there was a discussion about whether
> SHA-1 is adequate given the very small but non-zero probability of a
> hash collision, particularly given the possibility of a malicious
> attempt to generate a collision. At roughly the same time, Git was
> modified to support "thin packs" during data transfers. These allow
> one to send deltas based on objects that are not in the pack file
> being transferred. As a result a previously undetected hash collision
> could result in a corrupted repository when (for example) the same
> delta is applied to different objects that have the same SHA-1 hash.

I don't think you understand how these thin packs are processed.

If the pack contains <100 objects, it is inflated to loose objects. If
the receiving side (so client in fetch, server in push) already has an
object by that SHA-1, the new object is discarded.

If the pack contains >=100 objects, and the receiving side already has
the object, it is compared byte-for-byte to ensure the incoming copy
exactly matches the already existing copy.

Either way the first object to arrive always wins.

The recipient has to trust that the remote side is providing it
something reasonable. If the recipient has *ZERO* trust in the sender,
then s/he should read the content of all newly arrived objects before
compiling or executing them. This is one reason why Git does not run
hooks that are transported as part of the repository.

If the recipient thinks reading the content is too onerous or
impossible, then they have to make a trust assertion on the remote
side. This trust assertion should be derived from the community, and
not so much around the machine claiming the content is what it says it
is.

We have yet to solve the halting problem, so we have yet to construct
a machine that can verify those Linux kernel sources you downloaded
don't contain a local root exploit (for example).
Instead we have to trust the community of developers and users who
work on and run that code to have confidence that the code works as
expected, etc. We base our trust on reputable people making statements
like "Linus' kernel at
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git is
pretty good", and "kernel.org is where Linus pushes directly to, so
it's reasonable to trust the kernel.org server".

The recipient should have some understanding of the remote server's
security policies, or pay attention to notices posted by others who
are also fetching and reviewing content from the same repository. At
some level, the community using a repository from a given site should
be policing that site and establishing trust that the host is not
providing garbage content.

After an incident, it is possible to pick up again by rebuilding the
environment from an already known repository that people trust. In the
case of the recent kernel.org environment rebuild, that is exactly
what they did: the community picked up again from Linus' personal
repository.

DNS could be abused to send you the wrong IP for a site, but most
people don't use random DNS servers; they have some level of trust in
their DNS provider. DNSSEC is helping to improve the security of the
name->IP translation process, and using protocols like HTTPS with SSL
certificate verification can help to reduce the chances that a forged
DNS entry sends you to an evil source, rather than the
community-trusted one. (Although SSL certificates seem to be getting
forged left and right these days, so again you can't really rely on
strong cryptography to magically solve security problems when the
attacker holds the private key you have decided to trust with no
further verification.)

But trust aside, consider an object C that is sent as a delta to the
remote side. The delta base B is not included in the pack, and is
referenced by SHA-1.
When the remote side processes delta C, it looks up a copy of base B
from its own repository. We assume this content of B is correct, due
to the "first to arrive wins" rule, and the community
review/trust/notification process. The inflated length of B is checked
against a size that is stored in the front of the delta instruction
stream that describes C. These lengths must match exactly; if they do
not, the delta application aborts, the pack is rejected, and any
temporary data is removed from the filesystem.

As Peff pointed out elsewhere in this thread, the odds of a SHA-1
collision in a project are low, on the order of 1/(2^80). Although
there are some attacks on message digest functions like MD5 or SHA-1
that might make it possible to generate a duplicate in 2^57 time, any
that I have read require producing content of a different length than
the original you are trying to replace. Assuming the copy of B on the
remote system actually inflates and hashes to the correct SHA-1 for B,
it probably does not also have the correct length if it is an object
with the correct hash but the wrong content. So delta application
should still be checking for collision with a 1/(2^80) probability.

Assuming the remote's copy of B passed the size check, the delta is
applied on top, and the SHA-1 of the result buffer is computed. The
attacker must craft the delta such that the SHA-1 of C is the result,
otherwise connectivity checks will fail.

Assuming the attacker successfully stores a C' that has the right
SHA-1, but wrong content... the community around that repository will
eventually notice this and message that the source site cannot be
trusted. I refer you back to the statement above about trusting the
site you pull from, or trusting the users that you authorize to push
into a repository.

But thin packs aside, this problem exists in any form of a packed
object. An attacker can try to send object P' (as a non-delta) in
place of P. SHA1(P') = SHA1(P), but the content differs.
This is far easier to construct than the thin pack delta case you
think is a problem, and is the most likely approach for an attacker to
take. I refer you back to the statement above about trusting the site
you pull from, or trusting the users that you authorize to push into a
repository, or reading every object you receive.

Even if you magically fix the hash function somehow to decrease the
odds of collision (e.g. by switching to a member of the SHA-2 or SHA-3
family), there is no way to prevent a bug or root exploit from
entering the project except by never adding new code, or by carefully
reviewing everything that is submitted, and building up a basis of
trust around that review method.

It is far more likely for an attacker to try and submit a plain text
patch to the Linux kernel mailing list that reads completely correct,
hashes to the correct SHA-1s when applied in Git, etc... but just
"happens" to contain an off-by-one pointer bug in some weird case that
allows the attacker to overflow a critical memory buffer and later
inject some code that can be used to create an exploit. If they are
ever "caught" they may just claim "I AM A MORON I AM SORRY I MISSED
THAT BUFFER CHECK AND SO DID YOU DURING CODE REVIEW SO ITS NOT ALL MY
FAULT LEAVE ME ALONE" and get away with it.

Trust. Review. Verify.

I don't know about you, but I don't just pull random code from
arbitrary sites on the Internet. Nor do I compile or execute that code
on my workstation. I do trust some individuals based on their
reputation on the Internet, or my past experiences working with their
code. And I also trust some hosting environments like kernel.org, or
GitHub, or code.google.com, to provide reasonably secure hosting, and
to aggressively react to any event that might make it harder for me to
trust the content they supply. And I also read a lot of code that I
pull.

It really isn't the problem you try to claim it is.
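
For anyone who wants the receive-side checks above in a more concrete
form, here is a rough Python sketch. It is not Git's code and should
not be read as such. The object store is a plain dict, the delta is a
simplified dict of copy/insert operations rather than Git's binary
delta encoding, and the names PackRejected, object_id, receive_object,
apply_delta and receive_thin_delta are all made up for illustration.
The expected_id argument stands in for the connectivity check, i.e. the
ID that some other object refers to. The sketch always compares an
incoming object against an existing copy, which corresponds to the
>=100 object case.

```python
import hashlib


class PackRejected(Exception):
    """Raised when an incoming object or delta fails a check."""


def object_id(obj_type: str, content: bytes) -> str:
    # Git names an object by the SHA-1 of "<type> <size>" + NUL + content.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def receive_object(store: dict, obj_type: str, content: bytes) -> str:
    """Store a non-delta object; an existing copy is never overwritten."""
    oid = object_id(obj_type, content)
    existing = store.get(oid)
    if existing is None:
        store[oid] = content  # first copy to arrive wins
    elif existing != content:
        # Same ID, different bytes: keep what we have, reject the pack.
        raise PackRejected(f"collision on {oid}")
    return oid


def apply_delta(base: bytes, delta: dict) -> bytes:
    """Apply a simplified (hypothetical) delta to an existing base."""
    # The size recorded at the front of the delta must match the base.
    if len(base) != delta["base_size"]:
        raise PackRejected("delta base size mismatch")
    out = bytearray()
    for op in delta["ops"]:
        if op[0] == "copy":
            _, offset, length = op
            if offset < 0 or length < 0 or offset + length > len(base):
                raise PackRejected("copy runs past end of base")
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise PackRejected(f"unknown delta op {op[0]!r}")
    if len(out) != delta["result_size"]:
        raise PackRejected("delta result size mismatch")
    return bytes(out)


def receive_thin_delta(store: dict, base_id: str, delta: dict,
                       expected_id: str, obj_type: str = "blob") -> str:
    """Resolve a delta whose base is not in the pack, then verify it."""
    base = store.get(base_id)  # looked up in the receiver's own repository
    if base is None:
        raise PackRejected(f"missing delta base {base_id}")
    result = apply_delta(base, delta)
    if object_id(obj_type, result) != expected_id:
        # Stand-in for the connectivity check failing.
        raise PackRejected("result does not hash to the referenced ID")
    return receive_object(store, obj_type, result)
```

Note that none of this makes a colliding object harmless if it is the
first one to arrive. The checks only guarantee that a later arrival
cannot silently replace it, which is why the trust and review points
above still carry the weight.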