Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 25, 2020 at 08:58:46AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> > or "git repack -Ad" is going to pack everything and delete the old
> > packs. So I think we'd want to remove a midx there.
> 
> AFAIK, the pack-redundant subcommand is used by nobody in-tree, so
> nobody is doing "there are three packfiles, but all the objects in
> one of them are contained in the other two, so let's remove that
> redundant one".

OK, that's the part I was missing. The discussion here and the statement
from git-repack(1):

  -d
      After packing, if the newly created packs make some existing packs
      redundant, remove the redundant packs. Also run git prune-packed
      to remove redundant loose object files.

made me think that it was running pack-redundant. But it doesn't seem
to. It looks like we stopped doing so in 6ed64058e1 (git-repack: do not
do complex redundancy check., 2005-11-19).

As an aside, we tried using pack-redundant at GitHub several years ago
for dropping packs that were replicated in alternates storage. It
performs very poorly (quadratically, perhaps?) to the point that we
found it unusable, and I implemented a "local-redundant" command to do
the same thing more quickly. We ended up dropping that as we now migrate
packs wholesale (rather than via fetch), so I never upstreamed it.

I mention that here to warn people away from pack-redundant (I wondered
at the time why nobody had noticed its terrible performance, but now I
think the answer is just that nobody ever runs it). I wonder if we
should even consider deprecating and removing it.

I'm also happy to share local-redundant if anybody is interested (though
as I've mentioned elsewhere, I think the larger scheme it was part of it
also performed poorly and isn't worth pursuing).

> > And "git repack -d" I think of as deleting only loose objects that we
> > just packed. But I guess it could also remove a pack that has now been
> > made redundant? That seems like a rare case in practice, but I suppose
> > is possible.
> 
> Meaning it can become reality?  Yes.  Or it already can happen?  I
> doubt it.

I thought the latter, but after this thread I agree that it can't.

> > I'm also a little curious how bad it is to have a midx whose pack has
> > gone away. I guess we'd answer queries for "yes, we have this object"
> > even if we don't, which is bad. Though in practice we'd only delete
> > those packs if we have their objects elsewhere.
> 
> Hmph, object O used to be in pack A and pack B, midx points at the
> copy in pack B but we deleted it because the pack is redundant.
> Now, midx says O exists, but midx cannot be used to retrieve O.  We
> need to ask A.idx about O to locate it.
> 
> That sounds brittle.  I am not sure our error fallback is all that
> patient.

It has to be that patient even without a midx, I think. We might open
A.idx and mmap its contents, without opening the matching A.pack (or we
might open it and later close it due to memory or descriptor
constraints). And we have two layers there:

  - when object_info_extended sees a bad return from
    packed_object_info(), it will mark the entry as bad and recurse,
    giving us an opportunity to find another one

  - when we can't find the object at all (e.g., because we marked that
    particular pack's version as inaccessible), we'll hit the
    reprepare_packed_git() path and look for a new idx

Those same things should be kicking in for midx lookups, too (I didn't
test them, but from a cursory look at the organization of the code, I
think they will).

> > In that line of thinking, do we even need to delete midx files if one of
> > their packs goes away? The reading side probably ought to be able to
> > handle that gracefully.
> 
> But at that point, is there even a point to have that midx file that
> knows about objects (so it can answer has_object()? correctly and
> quickly) but does not know the correct location of half of the objects?
> Instead of going directly to A.idx to locate O, we need to go to midx
> to learn the location of O in B (which no longer exists), and then
> fall back to it, that is a pure overhead.

It is overhead for sure. But my thinking was that you're trading one
efficiency for another:

  - the midx may incur an extra lookup for objects in the redundant
    pack, but that pack may be a small portion of what the midx indexes
    (and this is likely in the usual case that you have one big
    cumulative pack and many recently-created smaller ones; the big one
    will never be made redundant and dominates the object count of the
    midx)

  - by keeping the midx, you're improving lookup speed for all of the
    other objects, which don't have to hunt through separate pack idx
    files

So my guess is that it would usually be a win. Though really the correct
answer is: if you are mucking with packs you should just generate a new
midx (or you should pack all-into-one, which generates a single idx
accomplishing the same thing).

> > And the more interesting case is when you repack everything with "-ad"
> > or similar, at which point you shouldn't even need to look up what's in
> > the midx to see if you deleted its packs. The point of your operation is
> > to put it all-into-one, so you know the old midx should be discarded.
> 
> Old one, yes.  Do we need to have the new one in that case?

You shouldn't need one since you'd only have a single pack now.

-Peff



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux