Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse

On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote:

> The old code to reuse deltas from an existing packfile
> just tried to dump a whole segment of the pack verbatim.
> That's faster than the traditional way of actually adding
> objects to the packing list, but it didn't kick in very
> often. This new code is really going for a middle ground:
> do _some_ per-object work, but way less than we'd
> traditionally do.

This is a good intro to the general problem, but...

> For instance, packing torvalds/linux on GitHub servers
> just now reused 6.5M objects, but only needed ~50k
> chunks.

What the heck is a chunk? :)

I know, because I wrote the patches, but I think we need to sketch out
the solution a bit for the reader.

I think the flow here is:

  - we want to do some per-object work (i.e., your intro above)

  - the general strategy is to make a bitmap of the objects from the
    packfile that we'll include, and then iterate over it, writing out
    each object exactly as it is in our on-disk pack, but _not_ adding
    it to our packlist (which costs memory, and increases the search
    space for deltas); there's a rough sketch of that loop after this
    list

  - one complication is that if we're omitting some objects, we can't
    set a delta against a base that we're not sending. So we have to
    check each object in try_partial_reuse() to make sure that, if it
    is a delta, its base is also being reused (actually, the third big
    comment in that function is incomplete, I think; it talks about
    sending the object later, not as part of the reused pack, but we
    might not send it at all!).

  - another complication is that when we omit parts of the packfile,
    that screws up delta base offsets. So to handle that, we have to
    keep track of which chunks we send (which is the bits you included
    below about chunks)

  - and then we can talk about performance; in the worst case, sent
    and omitted objects are interleaved one-by-one, and we'd have as
    many chunks as objects. But in practice we send big chunks, so
    that's where the 6.5M objects vs 50k chunks comes in.
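
To make that concrete, the writing side ends up shaped roughly like
this (a sketch only: reuse_packfile, reuse_packfile_bitmap, and
bitmap_get() are real names, but copy_pack_chunk() is a made-up stand-in
for the copy-and-fix-offsets step):

  static void write_reused_chunks(struct hashfile *out)
  {
          uint32_t i = 0;

          while (i < reuse_packfile->num_objects) {
                  uint32_t chunk_start;

                  /* skip the hole: objects we are not reusing */
                  if (!bitmap_get(reuse_packfile_bitmap, i)) {
                          i++;
                          continue;
                  }

                  /* find the extent of this run of set bits */
                  chunk_start = i;
                  while (i < reuse_packfile->num_objects &&
                         bitmap_get(reuse_packfile_bitmap, i))
                          i++;

                  /*
                   * Copy the on-disk bytes for objects in
                   * [chunk_start, i) verbatim, except that any
                   * OFS_DELTA whose base offset reaches back across
                   * an earlier hole must have its offset rewritten.
                   */
                  copy_pack_chunk(out, reuse_packfile, chunk_start, i);
          }
  }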

> Additional checks are added in have_duplicate_entry()
> and obj_is_packed() to avoid duplicate objects in the
> reuse bitmap. It was probably buggy to not have such a
> check before.
> 
> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the first thing), and then later adding it to
> the packlist (due to the second). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.
> 
> The simplest place to fix this is right in add_ref_tag,
> where we could avoid peeling the tag at all if we know that
> we are already including it. However, I've pushed the check
> instead into have_duplicate_entry(). This fixes not only
> this case, but also means that we cannot have any similar
> problems lurking in other code.

I think this part could go in its own commit. If we introduce
reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk
at the beginning of the file with the existing code, then we can
introduce the extra checks in have_duplicate_entry() on top of that.

And then it would be safe to do the cleanup in show_objects_from_type()
that triggers the test failure in patch 4.

>  builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
>  pack-bitmap.c          | 150 ++++++++++++++++++++--------
>  pack-bitmap.h          |   3 +-
>  3 files changed, 288 insertions(+), 87 deletions(-)

It might be worth having a perf test here. The case this is helping the
most, I think, is when somebody cloning wants all of the objects from
the beginning of the pack, but then there's a bunch of _other_ stuff.

That could be unreachable objects kept by "repack -k", or maybe objects
reachable outside of heads and tags. Or objects that are part of other
delta islands. This last is the most plausible, really, because we'll
put all of the root-island objects at the beginning of the pack. So
maybe a good perf test would be to take an existing repository, add a
bunch of new commits in refs/foo/, and then repack with delta islands
such that refs/heads and refs/tags are in one (root) island, but
refs/foo is in another.

The old code should fail to do the pack-reuse thing, but we should get
pretty good reuse with the new code (and less CPU and peak RAM,
hopefully, though the perf suite doesn't measure RAM directly).


Answering some questions from Jonathan's response in the last round
(some of the quotes are his, some are yours):

> I still don't understand this part. Going off what's written in the
> commit message here, it seems to me that the issue is that (1) an object
> can be added to the reuse bitmap without going through to_pack, and (2)
> this is done when the client asks for a tag by SHA-1. But all objects
> when specified by SHA-1 go through to_pack, don't they?

No, if it's part of the reused chunk, we'd avoid adding it to to_pack at
all (and I think the commit message should make that more clear, as I
described above).
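
Concretely, the new check in have_duplicate_entry() amounts to roughly
this (a paraphrase, not the exact patch text):

  /* in have_duplicate_entry(), before consulting to_pack: */
  if (reuse_packfile_bitmap &&
      bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
          return 1; /* covered by verbatim reuse; don't add it again */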

> >> No tests, because git does not actually exhibit this "ask
> >> for it and also include-tag" behavior. We do one or the
> >> other on clone, depending on whether --single-branch is set.
> >> However, libgit2 does both.
>
> So my wild guess would be that maybe the reason is that some of this
> code was shared (or intended to be shared) with libgit2?

No, the question is how the client behaves, and how we on the server
react to it. Git as a client would never ask for both include-tag and
for the tag itself by sha1. But libgit2 will, so a libgit2 client
cloning from a Git server would trigger the bug.
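
Schematically (ignoring pkt-line framing), such a client's request
contains both of these, which is the combination Git itself never
produces:

  want <tag-oid> include-tag side-band-64k ...
  want <commit-oid>

The explicit want can land the tag in the reuse bitmap, while
include-tag later tries to add the same tag to the packlist.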

> > +     if (pos >= bitmap_git->pack->num_objects)
> > +             return; /* not actually in the pack */
>
> Is this case possible? try_partial_reuse() is only called when there is
> a 1 at the bit location.

Yes, it's possible. The result of a bitmap walk can be larger than a
given pack, because we have to extend it to match whatever objects the
caller asked for. E.g., imagine we have commit A, repack it into a
bitmapped pack, and then somebody pushes up commit B. Now I want to
clone, getting both A and B.

Our bitmap result represents the whole answer, and so must include both
objects. But since "B" isn't in the pack, it doesn't have an assigned
bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
bit (valid only for that walk result!). But of course for pack reuse, we
only care about the objects that were actually in the bitmapped pack.
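
In other words, the bit positions of a walk result are laid out like
this (just an illustration of the data structure, not code from the
patch):

  /*
   * [0 .. pack->num_objects-1]  objects in the bitmapped pack,
   *                             in pack index order
   * [pack->num_objects .. N-1]  ext_index objects (commit B above),
   *                             assigned on the fly, valid only for
   *                             this walk result
   */

So try_partial_reuse() has to ignore anything at or past
pack->num_objects.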

> > +     /* Don't mark objects not in the packfile */
> > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>
> Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

This is the same issue as above. Instead of two objects, imagine we
have 42 words' worth. But if only 20 words' worth are in the packfile,
then we have to stop there.
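
With numbers (BITS_IN_EWORD is 64 here; the figures are made up for
the example):

  /*
   * A pack with 1280 objects covers 1280 / 64 = 20 full words. If the
   * walk result has 42 fully-set words (the extra bits coming from
   * ext_index objects), the clamp kicks in:
   *
   *     i = 42;
   *     if (i > 1280 / 64)
   *             i = 20;   /* words 20..41 describe no in-pack objects */
   */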

Now we _could_ figure this out for each individual object through the
same logic as try_partial_reuse(). But we try to take an even faster
path for the initial chunk of objects. We don't even walk over them
looking to see if they're deltas or not. We just send that chunk of the
pack verbatim.

I don't have numbers for how often we hit this path versus the
individual try_partial_reuse() path. Possibly the stats could
differentiate that.

Thinking on it more, though, I wonder if there's a bug hiding here. We
know that we can send the whole initial chunk verbatim for OFS_DELTA
objects, because the relative offsets will remain unchanged (i.e., there
are no "holes" that trigger our chunk code). But if there were a
REF_DELTA in that initial chunk, we have no certainty that the base is
being sent.

In practice, this doesn't happen, because the objects in question have
to be in a pack with bitmaps, which means we wrote it ourselves with
git-repack. And we'd never write REF_DELTA objects there.

But I wonder if it's worth being a bit more careful (and what the
performance impact is; it would mean checking the type of every object
in that initial chunk).
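
Something like this is what I have in mind (a sketch, hand-waving over
error handling; not part of the patch):

  static int chunk_has_ref_delta(struct packed_git *p, uint32_t nr)
  {
          struct pack_window *w_curs = NULL;
          uint32_t i;
          int found = 0;

          /*
           * Peek at the header of each of the first nr objects; a
           * REF_DELTA names its base by oid, so we cannot prove the
           * base is inside the verbatim chunk just from offsets.
           */
          for (i = 0; i < nr && !found; i++) {
                  off_t curpos = nth_packed_object_offset(p, i);
                  unsigned long size;

                  if (unpack_object_header(p, &w_curs, &curpos, &size) ==
                      OBJ_REF_DELTA)
                          found = 1;
          }
          unuse_pack(&w_curs);
          return found;
  }

That's an extra pass of header parsing over the whole chunk for each
pack we write, which is probably measurable on a 6.5M-object reuse, so
it may not be worth it given the git-repack invariant above.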

-Peff


