On Fri, Dec 08, 2023 at 09:30:07PM -0500, Taylor Blau wrote: > > I did wonder how expensive to recompute and validate the "distinct" > > information (in other words, is it too expensive for the consumers > > of an existing midx file to see which packs are distinct on demand > > before they stream contents out of the underlying packs?), as the > > way the packs are marked as distinct looked rather error prone (you > > can very easily list packfiles with overlaps with "+" prefix and the > > DISK chunk writer does not even notice that you lied to it). As long > > as "git fsck" catches when two packs that are marked as distinct share > > an object, that is OK, but the arrangement did look rather brittle > > to me. > > It's likely too expensive to do on the reading side for every > pack-objects operation or MIDX load. This made me think a bit. Obviously we can check for disjointedness in O(n log k), where n is the total number of objects and k is the number of packs, by doing a k-merge of the sorted lists. But that's a non-starter, because we may be serving a request that is much smaller than all "n" objects (e.g., any small fetch, but also even clones when the pack contains a bunch of irrelevant objects). But we can relax our condition a bit. The packs only need to be disjoint with respect to the set of objects that we are going to output (we'll call that "m"). So at the very least, you can do O(mk) lookups (each one itself "log n", of course). We know that the work is already going to be linear in "m". In theory we want to generally keep "k" small, but part of the point of using midx in this way is to let "k" grow a bit. So that might turn out pretty bad in practice. So let's take one more step back. One thing I didn't feel like I saw either in this patch or the cover letter is exactly why we care about disjointedness. IIRC, the main issue is making sure that for any object X we send verbatim, it is either a non-delta or its delta base is viable. And the two reasons I can think of for the base to be non-viable are: 1. We are not sending the base at all. 2. The base is in another pack, and we are worried about creating a cycle (i.e., in pack A we have X as a delta against Y, and in pack B we have Y as a delta against X, and we send both deltas). We already deal with (1) for the single-pack case by finding the base object offset, converting it to a pack position, and then making sure that position is also marked for verbatim reuse. The interesting one is (2), which is new for midx (since a single pack cannot have two copies of an object). But I'm not sure if it's possible. The verbatim reuse code depends on using bitmaps in the first place. And there is already a preference-order to the packs in the midx bitmaps. That is, we'll choose all of the objects for pack A over any duplicates in B, and duplicates from B over C, and so on. If we likewise try verbatim reuse in that order, then we know that an object in pack A can never have a base that is selected from pack B or C (because if they do have duplicates, we'd have selected A's copy to put in the midx bitmap). And likewise, a copy of an object in pack B will always have its base either in A or B, but never in C. So it kind of seems to me that this would "just work" if try_partial_reuse() did something like for the midx case: - convert offset of base object into an object id hash using the pack revindex (similar to offset_to_pack_pos) - look up the object id in the midx to get a pack/offset combo - use the midx revindex to convert that into a bit position - check if that bit is marked for verbatim reuse The assumption there is that in the second step, the midx has resolved duplicates by putting all of pack A before pack B, and so on, as above. It also assumes that we are trying verbatim reuse in the same order (though a different order would not produce wrong results, it would only result in less verbatim reuse). All of which makes me think I'm missing some other case that is a problem. While I wait for you to explain it, though, let's continue our thought experiment for a moment. If we assume that any cross-pack deltas are a problem, we could always just skip them for verbatim reuse. That is, once we look up the object id in the midx, we can see if it's in the same pack we're currently processing. If not, we could punt and let the non-verbatim code paths handle it as usual. That still leaves the problem of realizing we skipped over a chunk of the packfile (say we are pulling object X from pack B, and it is a delta of Y, but we already decided to reuse Y from pack A). But the reuse code already has to accommodate gaps. I think it happens naturally in write_reused_pack_one(), where we feed the actual byte offsets into record_reused_object(). You'd have to take some care not to go past gaps in the blind copy that happens write_reused_pack_verbatim(). So you might need to mark the first such gap somehow (if it's hard, I'd suggest just skipping write_reused_pack_verbatim() entirely; it is a fairly minor optimization compared to the rest of it). And of course there's a bunch of hard work teaching all of those functions about midx's and multiple packs in the first place, but you already had to do all that in the latter part of your series, and it would still be applicable. OK, this is the part where you tell me what I'm missing. ;) -Peff