Re: [PATCH v3] repack: enable bitmaps by default on bare repos

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 24 May 2019 06:33:00 -0400

On 5/24/2019 3:24 AM, Jeff King wrote:
> On Thu, May 23, 2019 at 08:53:56AM -0400, Derrick Stolee wrote:
> 
>>> I spent a while thinking and experimenting with this tonight. The result
>>> is the patch below. Ævar, do you still have a copy of the repo that
>>> misbehaved? I'd be curious to see how it fares.
>>
>> This patch caught my attention, and I think I understand some of the issues
>> at hand. I'm not well-versed specifically in Git's bitmap implementation.
>> The fundamental ideas are there, but my perspective is biased by my own
>> independent bitmap implementation for Azure Repos. What worked there may not
>> work at all here.
> 
> Thanks for looking at this. There are a lot of shortcomings in the
> current bitmap implementation, so if there's a better way to do things,
> I'm not opposed to moving to a new format. :)
> 
>>> Finding the right trees to explore is a little tricky with bitmaps.  In
>>> a normal traversal, we consider the "edges" to be worth exploring.
>>> I.e., the places where an UNINTERESTING commit is the parent of an
>>> interesting one.
>>
>> This is the "commit frontier" which I bring up again below.
> 
> Right. I actually had trouble coming up with a succinct way of
> describing this, and stole the definition from your recent blog post. ;)
> 
>>> But with bitmaps, we don't have that information in the same way,
>>> because we're trying to avoid walking the commits in the first place. So
>>> e.g., in a history like this:
>>>
>>>   A--B--C
>>>       \
>>>        D
>>>
>>> Let's imagine we're computing "C..D", and that D has a bitmap on disk
>>> but C does not.
>>
>> (As I read your discussion below, I'm confused. For "C..D", C is a have
>> and D is a want. We should explore all the haves _first_, then walk the
>> wants, right?)
> 
> I think I may have confused things by starting my description with a
> hypothetical combined want/have walk. To take a step back: the problem
> we were discussing is that we spend a lot of time reading trees to fill
> in the "have" bitmap, even though most of those objects are unlikely to
> be in the "want" in the first place (only the frontier trees are really
> useful).

Thank you for resolving my confusion.

[snip]

> As I said earlier in the thread, I suspect our commit selection is not
> great. It's mostly some heuristics we threw together in 2013, and I
> don't think it was tested very thoroughly. So fixing that may be a
> simpler approach.

It's a hard problem! There are no _sure_ answers here, and what works in
some cases will probably not work in other cases.

> What I was wondering here was whether we could get an easy fix based on
> the same frontier ideas that the non-bitmap walk uses.

[snip]

> But doing that commit walk to find the frontier negates part of the
> purpose of using the bitmaps, which is avoiding even walking commits.
> Going back to a variant of our example:
> 
>   A -- B -- C_1 -- .. -- C_1000
>         \
>          D_1 -- .. -- D_1000
> 
> If we have a bitmap at C_1000 and D_1000, we don't have to walk any
> commits at all. But finding the frontier requires walking 2000 commits.

In my opinion, walking commits is easy (easier with the commit-graph)
and walking trees is hard. We probably _should_ do more commit walking if
it helps avoid walking some trees.

> Is there a way to find it with just bitmaps? You can certainly find the
> intersection, but you don't have any idea which of the many shared
> commits is the merge base. Of course in this example you don't actually
> care about the frontier (you have the full answer immediately). But how
> do you decide which way to optimize: try to avoid walking commits to
> get a quick answer from bitmaps, or prefer to walk some commits to find
> the frontier and avoid opening too many trees?

With a new perspective on the problem, I think perhaps trying to solve this
problem with bitmaps is incorrect. If you want to use bitmaps for C..D and
you don't have any bitmaps in the range D..C, then maybe we should just use
the old-fashioned method of walking trees? In your examples above, the
new method would walk trees for the commits in {B, C_i, D_j} while the
bitmap method would walk trees for the commits in {B, C_i, A_k}. I would
expect the list of {A_k} commits to be the largest in most cases.

The approach here would be to do the commit frontier walk, and check for
commits with bitmaps along the way. If we don't find an UNINTERESTING
bitmap, then use the non-bitmap way instead. Perhaps worth a shot.

I'll bring up this code snippet again:

        /*
         * if we have a HAVES list, but none of those haves is contained
         * in the packfile that has a bitmap, we don't have anything to
         * optimize here
         */
        if (haves && !in_bitmapped_pack(bitmap_git, haves))
                goto cleanup;

In addition to this, we can fill the "haves" set with the commits
in D..C (with boundary) and then check if any of those commits have a
precomputed bitmap. If not, goto cleanup.
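
To make that concrete, here is a rough, self-contained sketch of the check
over a toy graph. Everything in it is hypothetical (toy_commit, example,
mark_reachable, range_has_bitmap); none of it is Git's API. It also cheats
by marking everything reachable from the want up front, where a real
implementation would presumably do an ordered frontier walk instead. The
graph is a shortened version of your example, with bitmaps at A and D_2
only.

```c
#include <assert.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Hypothetical toy commit for illustration; not Git's struct commit. */
struct toy_commit {
	const char *name;
	int parents[MAX_PARENTS];	/* indexes into the graph, -1 if unused */
	int has_bitmap;			/* 1 if a precomputed bitmap exists */
};

/*
 * Shortened version of the example history:
 *
 *   A -- B -- C_1 -- C_2
 *         \
 *          D_1 -- D_2
 */
static struct toy_commit example[] = {
	{ "A",   { -1, -1 }, 1 },
	{ "B",   {  0, -1 }, 0 },
	{ "C_1", {  1, -1 }, 0 },
	{ "C_2", {  2, -1 }, 0 },
	{ "D_1", {  1, -1 }, 0 },
	{ "D_2", {  4, -1 }, 1 },
};

/* Mark every commit reachable from 'start' (inclusive) in 'seen'. */
static void mark_reachable(const struct toy_commit *g, int start, char *seen)
{
	int stack[MAX_COMMITS], top = 0;

	assert(start >= 0 && start < MAX_COMMITS);
	seen[start] = 1;
	stack[top++] = start;
	while (top) {
		int c = stack[--top];
		for (int i = 0; i < MAX_PARENTS; i++) {
			int p = g[c].parents[i];
			if (p >= 0 && !seen[p]) {
				seen[p] = 1;
				stack[top++] = p;
			}
		}
	}
}

/*
 * Return 1 if any commit in want..have, or on its boundary, has a
 * precomputed bitmap. A have that is already reachable from the want
 * is treated like a boundary commit. Simplification: a real walk
 * would not mark the whole history reachable from the want first.
 */
static int range_has_bitmap(const struct toy_commit *g, int want, int have)
{
	char from_want[MAX_COMMITS] = { 0 };
	char visited[MAX_COMMITS] = { 0 };
	int stack[MAX_COMMITS], top = 0;

	assert(have >= 0 && have < MAX_COMMITS);
	mark_reachable(g, want, from_want);
	visited[have] = 1;
	stack[top++] = have;
	while (top) {
		int c = stack[--top];
		if (g[c].has_bitmap)
			return 1;	/* an UNINTERESTING bitmap exists */
		if (from_want[c])
			continue;	/* boundary commit: stop walking here */
		for (int i = 0; i < MAX_PARENTS; i++) {
			int p = g[c].parents[i];
			if (p >= 0 && !visited[p]) {
				visited[p] = 1;
				stack[top++] = p;
			}
		}
	}
	return 0;
}
```

With C_2 as the have and D_2 as the want, range_has_bitmap(example, 5, 3)
visits C_2, C_1 and B, finds no bitmap, and returns 0, which would be the
"goto cleanup" case. The bitmap at A is never reached because the walk
stops at the boundary.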

Thanks,
-Stolee