On Thu, Aug 16, 2018 at 11:52:13AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
>
> > I think that makes sense. We already see duplicates from
> > for_each_packed_object() when they're in multiple packs, and callers
> > just need to be ready to deal with it (and depending on what you're
> > doing, you may actually _want_ the duplicates).
>
> You of course would also see dups between loose and packed until
> prune-packed is run.

Yes, although there are two separate calls to look at those two
sources, so it's a bit more obvious that the caller has to de-dup.

I'm not opposed to a flag to ask the function to de-dup for us, but it
really only makes sense if we do a combined for_each_object() call.
De-duping is potentially expensive, and there's little point in
de-duping the packs if we have to then de-dup the result with the
loose objects.

One benefit of having the iterator function de-dup is that it _could_
be done more efficiently. Right now, even before my recent patches the
cat-file de-dup happens by creating a secondary list, sorting it, and
then skipping the duplicates. But if a function knew _all_ of the
sources we were going to look at (any loose directories, including
alternates, and all available packs), and it could walk each
individual source in sorted order (easy for packs by .idx, not too bad
for loose by sorting the readdir() results), then we could do it as an
online list-merge, with a single walk through the results.
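Just to sketch the shape of that merge step: with two pre-sorted
sources (say, one array built from the packs and one from the loose
objects) and a made-up "struct oid" and callback rather than anything
from git's actual object-store API, it would look something like:

#include <stddef.h>
#include <string.h>

struct oid {
	unsigned char hash[20];
};

typedef void (*emit_fn)(const struct oid *oid, void *data);

/*
 * Merge two sorted id lists, calling emit() once per unique id.
 * Because both inputs are sorted, any duplicates (within one list or
 * across the two) come out adjacent and can be skipped on the fly.
 */
static void merge_dedup(const struct oid *packed, size_t nr_packed,
			const struct oid *loose, size_t nr_loose,
			emit_fn emit, void *data)
{
	size_t i = 0, j = 0;
	const struct oid *last = NULL;

	while (i < nr_packed || j < nr_loose) {
		const struct oid *next;

		if (i >= nr_packed)
			next = &loose[j++];
		else if (j >= nr_loose)
			next = &packed[i++];
		else if (memcmp(packed[i].hash, loose[j].hash, 20) <= 0)
			next = &packed[i++];
		else
			next = &loose[j++];

		if (last && !memcmp(last->hash, next->hash, 20))
			continue; /* same as the id we just emitted */

		emit(next, data);
		last = next;
	}
}

Going from two sources to N would mostly mean swapping the two cursors
for a small heap keyed on each source's next id; the de-dup part stays
the same.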
In practice I don't know if it's worth the effort. If you're accessing
the object contents, sorted order really is bad. And if you're just
collecting the hashes, then you're likely generating some data
structure for future lookups, and you can just de-dup as part of that
process.

> I also was thinking about the same thing after Derrick's response,
> but unless you are a very specialized caller, it does not do you much
> good to learn that object X exists as a loose object locally, also as
> a loose object in our alternate, and also in pack A, but not in other
> packs. You need a way to say "Read the contents of object X from that
> place, not from any other place", "Remove that copy of object X at
> that place, but not at any other place" etc. to make effective use of
> that kind of information.

We do have those functions, and use them. E.g., fsck uses
read_loose_object() to actually check that particular copy of the
object. That's obviously a specialized caller as you note, but then
anybody iterating over all of the objects is pretty specialized
already.

> The codepath that implements runtime access has "I found a copy
> here, but it is unusable, so let's go on to look for another usable
> copy" fallback. This is a tangent but it is something we should not
> lose in the midx-enabled world.

Yeah, good catch. That's orthogonal to most of this discussion, I
think, but it's a possible downside of the midx, because it has
literally discarded those duplicate index entries (or at least that's
my understanding).

If we kept the .idx for each pack as a fallback for older Gits, then
it's easy to solve: in the unlikely case that the .midx-referenced
copy fails, you look in each individual pack for another copy. But I
think eventually you'd want to stop generating those .idx files, too,
if you know that you'll only use a modern version of Git.

I wonder if the .midx format should have an extension for "here are
all the duplicates I found". They could even be part of the main index
(it's easy enough for a binary-search lookup to always point to the
start of a run of duplicates), but if you had a ton of duplicates they
would slow your binary searches a little. I'll leave all that to
Stolee to ponder. :)

-Peff