Re: [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Tue, 25 Apr 2023 14:53:46 -0400

On 4/24/2023 8:00 PM, Taylor Blau wrote:
> When reachability bitmap coverage exists in a repository, Git will use a
> different (and hopefully faster) traversal to compute revision walks.

Before getting into the code...

> Consider a set of positive and negative tips (which we'll refer to with
> their standard bitmap parlance by, "wants", and "haves"). In order to
> figure out what objects exist between the tips, the existing traversal
> in prepare_bitmap_walk() looks something like:

>   4. Construct a reachability bitmap for the "haves" side by walking
>      from the revision tips down to any existing bitmaps, OR-ing in any
>      bitmaps as they are found.
> 
>   5. Then do the same for the "wants" side, stopping at any objects that
>      appear in the "haves" bitmap.

One important thing about this older algorithm is that it will avoid
including trees and blobs that are reachable from the "haves" that
are not reachable from the boundary between "haves" and "wants". The
most obvious case is that someone force-pushed a branch after rewording
the commit messages but keeping the trees and blobs identical.

This case is rare, and usually the objects that would be repeated are
low enough in count and size (with deltas) that it's not a big deal, but
it is worth bringing up. I didn't see a reference to this side of the
difference in your message.

> One of the benefits, however, is that even if the walk is slower, bitmap
> traversals are guaranteed to provide an *exact* answer. Unlike the
> traditional object traversal algorithm, which can over-count the results
> by not opening trees for older commits, the bitmap walk builds an exact
> reachability bitmap for either side, meaning the results are never
> over-counted.

So the new algorithm is a new, third state of "not overcounting things
reachable beyond the commit-walk boundary" and "overcounting things
reachable from the wants..haves commit region".

> But producing non-exact results is OK for our traversal here (both in
> the bitmap case and not), as long as the results are over-counted, not
> under.

True, our response will not create a pack-file that cannot be applied
locally, but it might be larger than required.

For these reasons, I'm surprised that this patch completely replaces
the old algorithm for the new one. I would prefer to see a config
option that enables this new algorithm. It would be safer to deploy
that way, too.

> The new result performs significantly better in many cases, particularly
> when the distance from the boundary commit(s) to an existing bitmap is
> shorter than the distance from (all of) the have tips to the nearest
> bitmapped commit.

Aside: we may even gain benefits by adjusting the commits selected for
bitmaps by trying for this data shape. One way to attempt this is to
not focus on refs but on history common to "most refs" using first-
parent history as a good heuristic. If a commit appears in a lot of
first-parent histories, then it is more likely to be helpful to these
kinds of walks.

> Note that when using the old bitmap traversal algorithm, the results can
> be *slower* than without bitmaps! Under the new algorithm, the result is
> computed faster with bitmaps than without (at the cost of over-counting
> the true number of objects in a similar fashion as the non-bitmap
> traversal):
>
>     # (Computing objects unique to 'master' among all branches, not
>     # using bitmaps).
>     $ time git rev-list --count --objects master --not --exclude=master --branches
>     54

I like how you included the output of --count here. It might be interesting
to demonstrate the different forms of overcounting in a carefully constructed
test case, which would help us be sure that we are using the desired
algorithm during a test case.

>     real	0m1.012s  (no bitmaps)
>     real	0m29.571s (bitmaps, old)
>     real	0m2.279s  (bitmaps, new)

> The new algorithm is still slower than not using bitmaps at all, but it
> is nearly a 13-fold improvement over the existing traversal.

> In a more realistic setting (using my local copy of git.git), I can
> observe a similar speed-up:
...
> Here, the new algorithm is also still slower than not using bitmaps, but
> represents a 4-fold improvement over the existing traversal in a more
> modest example.

Interesting that the new bitmap algorithm is worse in all presented cases.

How "atypical" do we need to be in order for bitmaps to outperform the
tree-parsing algorithm? Do we need to try "--branches --not master~1000"
to replicate a stale single-branch clone getting every latest commit?

(I'll leave code review to a separate reply.)

Thanks,
-Stolee