Re: [PATCH 0/7] speeding up cat-file by reordering object access

Jeff King <peff@xxxxxxxx> · Thu, 16 Aug 2018 15:45:38 -0400

On Thu, Aug 16, 2018 at 11:52:13AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > I think that makes sense. We already see duplicates from
> > for_each_packed_object() when they're in multiple packs, and callers
> > just need to be ready to deal with it (and depending on what you're
> > doing, you may actually _want_ the duplicates).
> 
> You of course would also see dups between loose and packed until
> prune-packed is run.

Yes, although there are two separate calls to look at those two sources,
so it's a bit more obvious that the caller has to de-dup. I'm not
opposed to a flag to ask the function to de-dup for us, but it really
only makes sense if we do a combined for_each_object() call.
De-duping is potentially expensive, and there's little point in
de-duping the packs if we have to then de-dup the result with the loose
objects.

One benefit of having the iterator function de-dup is that it _could_ be
done more efficiently. Right now, even before my recent patches the
cat-file de-dup happens by creating a secondary list, sorting it, and
then skipping the duplicates.

But if a function knew _all_ of the sources we were going to look at
(any loose directories, including alternates, and all available packs),
and it could walk each individual source in sorted order (easy for packs
by .idx, not too bad for loose by sorting the readdir() results), then
we could do it as an online list-merge, with a single walk through the
results.

In practice I don't know if it's worth the effort. If you're accessing
the object contents, sorted order really is bad. And if you're just
collecting the hashes, then you're likely generating some data structure
for future lookups, and you can just de-dup as part of that process.

> I also was thinking about the same thing after Derrick's response,
> but unless you are very specialized caller, it does not allow you do
> very much to learn that object X exists as a loose object locally,
> also as a loose object in our alternate, and also in pack A, but not
> in other packs.  You need a way to say "Read the contents of object
> X from that place, not from any other place", "Remove that copy of
> object X at that place, but not at any other place" etc. to make
> effective use of that kind of information.

We do have those functions, and use them. E.g., fsck uses
read_loose_object() to actually check that particular copy of the
object. That's obviously a specialized caller as you note, but then
anybody iterating over all of the objects is pretty specialized already.

> The codepath that implements runtime access has "I found a copy
> here, but it is unusable, so let's go on to look for another usable
> copy" fallback.  This is a tangent but it is something we should not
> lose in the midx-enabled world.

Yeah, good catch. That's orthogonal to most of this discussion, I think,
but it's a possible downside of the midx because it has literally
discarded those duplicate index entries (or at least that's my
understanding). If we kept the .idx for each pack as a fallback for
older Gits, then it's easy to solve: in the unlikely case the
.midx-referenced copy fails, you look in each individual pack for
another copy.

But I think eventually you'd want to stop generating those .idx files,
too, if you know that you'll only use a modern version of Git. I wonder
if the .midx format should have an extension for "here are all the
duplicates I found". They could even be part of the main index (it's
easy enough for a binary-search lookup to always point to the start of a
run of duplicates), but if you had a ton of duplicates they would slow
your binary searches a little.

I'll leave all that to Stolee to ponder. :)

-Peff