On Wed, Aug 21, 2013 at 04:20:07PM -0700, Junio C Hamano wrote: > I do not understand the last bit. If two copies of an object exist > but you have only one slot for the object in the index, and another > object names it as its base with ref-delta, then reconstituting it > should work just fine---whichever representation of the base object > is recorded in the .idx, that first needs to be reconstituted before > the delta is applied to it, and both copies should yield identical > contents for the delta base object, no? > > In any case, ejecting one from the pack .idx would not help in the > presense of either broken or malicious packer that reuses delta too > aggressively. Suppose you have objects A and B and somehow manage > to create a cycle of deltas, A names B as its delta base and B names > A as its delta base. The packer may notice its mistake and then add > another copy of A as a base object. The pack contains two copies of > A (one is delta based on B, the other is full) and B (delta against > A). Yes, that is the potentially problematic scenario... > If B refers to the copy of A that is delta against B using ofs-delta, > fixing the .idx file will have no effect. read_packed_sha1(B) will > read the delta data of B, finds the offset to start reading the data > for A which was excised from the .idx and unless that codepath is > updated to consult revindex (i.e. you mark one copy of A in the .pack > as bad, and when B refers to that bad copy of A via ofs-delta, you > check the offset against revindex to get the object name of A and go > to the good copy of A), you will never finish reading B because > reading the bad copy of A will lead you to first reconstitute B. Yes. There are two levels of brokenness here. One is that the pack has a delta base offset for B that leads to the "wrong" A, creating a cycle. This is an utterly broken pack and cannot be fixed automatically (you do not even know the name of the cycled A because you cannot constitute it to find its name and realize that it has a duplicate!). But we should notice this during index-pack, because we cannot reconstruct the object. Another situation is that your delta base points to the "right" A. You can reconstruct either A or B, because although there is a duplicate, there are no cycles in the delta chain (i.e., the chain does not care about object name, only offset, and offsets point only one way). So we do not have a problem at all with reconstructing the object data. We do have two ways we might access A from the same pack. For regular object lookup, that is probably not a big deal. For operations like repacking that look more closely at the on-disk representation, I do not know. We may mark A as "has a delta" or "does not have a delta" at one point in the code, but then later find the other copy of A which does not match. I did not trace through all of pack-objects to find whether that is possible, or what bugs it might cause. But indexing only one copy as the "master" means that we will always get a consistent view of the object (from that particular pack, at least). Now consider REF_DELTA. We lookup the base object by name, so we may get a different answer depending on how it is indexed. E.g., index-pack keeps an internal hash table, whereas later callers will use the .idx file. When looking at the .idx file, we may use a regular binary search, or sha1_entry_pos. The exact order of the search may even change from git version to git version. So you may have a situation where index-pack finds the "right" A and properly reconstructs the file while creating the .idx, and thinks all is well. But later lookups may find the "wrong" A, and fail. In other words, we cannot trust that data that makes it through index-pack is not corrupted (and it is index-pack's output that lets receive-pack commit to the client that we have their objects, so we want to be reasonably sure we have uncorrupted copies at that point). Choosing a single "master" A to go in the .idx means we will get a consistent view of A once the .idx is generated. But it may not be the right one, and it may not be the one we used to check that we can resolve A and B in the first place. The right fix for that situation is to keep both entries in the .idx, detect delta cycles, then look up extra copies of A, but being aware that you already found one copy in the .idx and that sha1_entry_pos (or the vanilla binary search) should return any other copies, if they exist. I do not think we detect delta cycles at all (though I didn't check), but certainly we do not have a way to tell sha1_entry_pos any different information so that it will find the "other" A. So I think it is a recoverable state, but it is a non-trivial bit of code for something that should never happen. Rejecting on push/fetch via index-pack is simple and easy, and we can deal with the fallout if and when it ever actually happens. > > I think this line of "fixing" should probably be scrapped. > > I tend to agree. OK. Let's drop the "skip" patch, then. I'll re-roll with a patch to flip the default to "reject". -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html