Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 25 Apr 2023 14:57:21 -0400

On Tue, Apr 25, 2023 at 02:30:01PM -0400, Derrick Stolee wrote:
> On 4/25/2023 2:02 PM, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> >> This improves many cases where using bitmaps was significantly *slower*
> >> than regular, non-bitmap traversal. In some instances, using bitmaps is
> >> still slower than the non-bitmap case, but it is a substantial
> >> improvement over the classic bitmap walk.
> >> ...
> >> In a large repository on GitHub.com, the timings to compute the objects
> >> unique to "master", as in:
> >>
> >>     $ git rev-list --count --objects master --not --exclude=master --branches
> >>
> >> improve as follows:
> >
> > Curious---when would it be significantly faster to use bitmaps?
> > "Most of the time"?  "In not-too-large repository?"
>
> This is an interesting question that is very difficult to predict
> in advance.

Indeed ;-).

> It would be interesting if we used the initial commit walk to the
> boundary as a predictor of whether bitmaps would be good, or if
> we should use the tree-parsing walk instead.

I think that this is tough, too. When making a prediction, it would be
good if we could avoid walking down to the nearest bitmap, since by that
point, you might as well have used bitmaps (if you found one before
reaching the root(s) of history).

I think the best heuristic we have available is if we have is something
like "# of UNINTERESTING commits between the tips and boundary", since
we OR those in ahead of time, and then the fill-in traversal will
(usually, not always) be cheap.

But I think that heuristic probably isn't too predictive, since it would
often give you the wrong prediction, say, when there are no
UNINTERESTING commits between the tips and boundary, but the boundary is
bitmapped.

Another approach you could take is to cut off the fill-in traversal at
some point, or start with whichever is presumed to be faster and then
dynamically switch to the other at some point during the walk. But that
also depends on repository shape/size, and machine configuration.

So I think it's extremely difficult to get "right", but if others have
suggestions, I would certainly be receptive to them :-).

> But perhaps this series gets the bitmap walk "close enough" in
> the typical case that it's better to use bitmaps in the chance that
> we have accidentally hit a case where we'd normally need to walk a
> lot of trees.

Yeah, that is definitely part of this series's goal. Another framing is,
for configurations that are using bitmaps everywhere, this should reduce
the number of cases where using bitmaps is *significantly* slower than
not.

Thanks,
Taylor