On Fri, Jan 29, 2021 at 05:10:20PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:
> > So it may be reasonable to go that direction, which is really defining
> > a totally separate strategy from git-gc's "repack, and occasionally
> > objects age out". Especially if we find that the
> > assume-kept-packs-closed route is too risky (i.e., has too many cases
> > where it's possible to cause corruption if our assumptions aren't met).
>
> Yeah, this whole conversation has made me very nervous about using
> reachability. Fundamentally, this isn't about reachability at all. The
> operation is as simple as telling pack-objects a list of packs that you
> do and don't want objects from, making a new pack out of that, and then
> optionally dropping the packs that you rolled up.
>
> So, I think that teaching pack-objects a way to understand a caller
> that says "include objects from packs X, Y, and Z, but not if they
> appear in packs A, B, or C, and also pull in any loose objects" is the
> best way forward here.
>
> Of course, you're going to be dragging along unreachable objects until
> you decide to do a full repack, but I'm OK with that since we wouldn't
> expect anybody to be solely relying on geometric repacks without
> occasionally running 'git repack -ad'.

While writing my other response, I had some thoughts that this "dragging
along" might not be so bad.

Just to lay out the problem as I see it, if you do:

  - frequently roll up all small packs and loose objects into a new
    pack, without regard to reachability

  - occasionally run "git repack -ad" to do a real traversal

then the problem is that unreachable objects never age out:

  - a loose unreachable object starts with a recent-ish mtime

  - the frequent roll-up rolls it into a pack, freshening its mtime

  - the full "repack -ad" doesn't delete it, because its pack mtime is
    too recent; it explodes it loose again

  - repeat forever

We know that "repack -d" is not 100% accurate because of similar
"closed under reachability" assumptions (see my other email). But it's
OK, because the worst case is an object that doesn't quite get packed
yet, not one that gets deleted.

So you could do something like:

  - roll up loose objects into a pack with "repack -d"; that's mostly
    accurate, but doesn't suck up unreachable objects

  - roll up small packs into a bigger pack without regard for
    reachability; this includes the pack created in the first step, but
    we know everything in it is actually reachable

  - eventually run "repack -ad" to do a real traversal

That would extend the lifetime of unreachable objects which were found
in a pack (they get dragged forward during the rollups). But they'd
eventually get exploded loose during a "repack -ad", and then _not_
sucked back into a roll-up pack. And then eventually "repack -ad"
removes them.

The downsides are:

  - doing a separate "repack -d" plus a roll-up repack is wasted work;
    but I think the two could be combined into a single step (at the
    cost of some extra complexity in the implementation)

  - using "--unpacked" still means traversing every commit; that's much
    faster than traversing the whole object graph, but it still scales
    with the size of the repo, not the size of the new objects. That
    might be acceptable, though.
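Roughly, that schedule might look like the sketch below. Note that
"--stdin-packs" is just hypothetical shorthand for the pack-objects
interface you describe above (it doesn't exist today), and the pack
names are placeholders:

  # frequent: roll up loose objects with a real (commit-level)
  # traversal; this won't suck unreachable loose objects into the
  # new pack
  git repack -d

  # frequent: roll up the small packs, including the one made above,
  # with no traversal at all; "^" marks a pack whose objects should
  # be excluded rather than included
  printf '%s\n' pack-1234.pack pack-5678.pack ^pack-big.pack |
  git pack-objects --stdin-packs .git/objects/pack/pack

  # occasional: a full traversal; this is where dragged-along
  # unreachable objects finally get exploded loose and age out
  git repack -ad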
I do think the original problem goes away entirely if we can keep
better track of the mtimes. I.e., if we had packs marked with ".cruft"
instead of exploding loose, then the logic is:

  - roll up all loose objects and any objects in a pack that isn't
    marked as cruft (or keep); never delete a cruft pack at this stage

  - occasionally "repack -ad"; this does delete old cruft packs
    (because we'd have rescued any reachable objects they might have
    contained)

I'm not sure I want to block this topic on having cruft packs, though.
Of course there are tons of _other_ reasons to want them (like not
causing operational headaches when a repo's disk and inode usage grows
by 10x due to exploding loose objects). So maybe it's not a bad idea to
work on them together. I dunno.

-Peff