Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jan 29, 2021 at 12:30:38PM -0800, Junio C Hamano wrote:

> > I think this would generally happen if the .keep packs are generated
> > using something like "git repack -a", which packs everything reachable
> > together. So if you do:
> >
> >   git repack -ad
> >   touch .git/objects/pack/pack-whatever.keep
> >   ... some more packs come in, perhaps via pushes ...
> >   # imagine repack knew how to pass this along...
> >   git repack -a --assume-kept-packs-closed
> >
> > then you'd repack just the objects that aren't in the big pack.
> 
> Yeah.  As a tool to help the above workflow, where you are only
> creating another .keep out of youngest objects (i.e. those that are
> either loose or in non-kept packs), because by definition anything
> in .keep cannot be pointing back at these younger objects, it does
> make sense to take advantage of "the set of packs with .keep as a
> whole is closed".
> 
> It may become tricky once we start talking about creating a new
> .keep out of youngest objects PLUS a few young keep packs, though.

Right. You'd have to make sure the younger packs were also created with
that reachability in mind. I.e., if you are in a situation where you've
got:

  - a big "old" pack
  - N new packs from pushes

then you can't assume anything about the reachability for those
individual N packs. It would be wrong to split any of them into the
"keep" side. They need to have all their objects traversed until we hit
something in the old pack.

But if you have a situation with:

  - a big "old" pack
  - M packs made from previous rollups on top of the old pack
  - N new packs from pushes

Then I think you can still take the old pack + the M packs as a
cohesive unit closed under reachability.

The tricky part is knowing which packs are which (size is a heuristic,
but it can be wrong; people may make a big push to a previously-small
repository).

> Starting from all on-disk .keep packs, you'd mark them as in-core
> keep bit, then drop in-core keep bit from the few young keep packs
> that you intend to coalesce with the youngest objects---that is how
> I would imagine your repacking strategy would go.  The set of all
> the on-disk .keep packs may give us "closed" guarantee, but if we 
> exclude a few latest packs from that set, would the remainder still
> give us the "closed" guarantee we can take advantage of, in order to
> pack these youngest objects (including the ones in the kept packs
> that we are coalescing)?

Yeah, I think we are going along the same lines. Except I think it is
dangerous to use on-disk ".keep" as your marker, because we will racily
see incoming push packs with a ".keep" (which receive-pack/index-pack
use as a lockfile until the refs are updated).

So repack has to "somehow" get the list of which is which.

None of which is disputing your "it may become tricky", of course. ;) It
is exactly this trickiness that I am worried about. And I am not being
coy with "somehow", as if we have some custom not-yet-shared layer on
top of repack that tracks this. We are still figuring out whether this
is a good direction in the first place. :)

One of the things that led us to this reachability traversal, away from
"just suck up all of the objects from these packs", is that this is how
--unpacked works. I had always assumed it was implemented as "if it's
loose, then put it in the pack". But it's not. It's attached to the
revision traversal. And it actually gets some cases wrong!

It will walk every commit, so you don't have to worry about a packed
commit referring to an unpacked one. But it doesn't look at the trees of
the packed commits (for the quite obvious reason that doing so is orders
of magnitude more expensive). That means that if there is a packed
commit that refers to an unpacked blob (which is not referenced by an
unpacked commit), then "rev-list --unpacked" will not report it (and
likewise "git repack -d" would not pack it).

It's easy to create such a situation manually, but I've included a
more plausible sequence involving "--amend" and push/fetch unpackLimit
at the end of this email.

At the core, --unpacked is assuming certain things about reachability of
loose/packed objects that aren't necessarily true. And this
--assume-kept-pack-closed stuff is basically doing the same thing for a
particular set of packs (albeit more so; I believe the patches here cut
off traversal of parent pointers, not just commit-to-tree pointers).

One of the reasons I think nobody noticed with --unpacked is that the
stakes are pretty low. If our assumption is wrong, the worst case is
that a loose object remains unpacked in "repack -d". But we'd never
delete it based on that information (instead, git prune would do its own
traversal to find the correct reachability). And it would eventually get
picked up by "git repack -ad".

But for repacking, the general strategy is to put things you want to
keep into the new pack, and then delete the old ones (not marked as
keep, of course). So if our assumption is ever wrong, it means we'd
potentially drop packs that have reachable objects not found elsewhere,
and we'd end up corrupting the repository.

So I think the paths forward are either:

  - come up with an air-tight system of making sure that we know packs
    we claim are closed under reachability really are (perhaps some
    marker that says "I was generated by repack -a")

  - have a "roll-up" mode that does not care about reachability at all,
    and just takes any objects from a particular set of packs (plus
    probably loose objects)

I'm still thinking aloud here, and not really sure which is a better
path. I do feel like the failure modes for the second one are less
risky.

Anyway, here's the --unpacked example, if you're curious. It's based on
fetching, but you could invert it to do pushes (in which case it is
repacking in "parent" that gets the wrong result).

-- >8 --
# two repos, one a clone of the other
git init parent
git -C parent commit --allow-empty -m base
git clone parent child

# now there's a small fetch, which will get
# exploded into loose objects.
(
	cd parent
	echo small >small
	git add small
	git commit -m small
)
git -C child fetch

# We can verify that "rev-list --unpacked" reports these
# objects.
git -C child rev-list --objects --unpacked origin

# and now a bigger one that will remain a pack (we'll
# tweak unpackLimit instead of making a really big commit,
# but the concept is the same)
#
# There are two key things here:
#
#   - the "small" commit is no longer reachable, but the big one
#     still contains the "small" blob object. Using --amend is
#     a plausible mechanism for this happening.
#
#   - we are using bitmaps, which give us an exact answer for the set of
#     objects to send. Otherwise, pack-objects on the server actually
#     fails to notice the other side has told us it has the small blob, and
#     sends another copy of it.
(
	cd parent
	git repack -adb
	echo big >big
	git add big
	git commit --amend -m big
)
git -C child -c fetch.unpackLimit=1 fetch

# So now in the child we have a packed object
# whose ancestor is an unpacked one. rev-list
# now won't report the "small" blob (ac790413).
git.compile -C child rev-list --objects --unpacked origin

# Even though we can see that it's present only as an
# unpacked object.
show_objects() {
	for i in child/.git/objects/pack/*.idx; do
		git show-index <$i
	done | cut -d' ' -f2 | sed 's/^/packed: /'
	find child/.git/objects/??/* |
		perl -F/ -alne 'print " loose: $F[-2]$F[-1]"'
}

# If we were to do an incremental repack now, it wouldn't be packed.
# (Note we have to kill off the reflog, which still references the
# rewound commit).
rm -rf child/.git/logs
git -C child repack -d
show_objects



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux