Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Tue, 25 Apr 2023 16:27:21 -0400

On 4/25/2023 3:01 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 02:06:25PM -0400, Derrick Stolee wrote:
>> For my curiosity, and since you already have a test environment set up,
>> could you redo the "without bitmaps" case with pack.useSparse true and
>> false? When the option was created and defaulted to true, we never
>> really explored comparing it to the bitmap case. In fact, I assumed the
>> bitmap case was faster in important cases like this (where there is a
>> substantial difference in object counts), but your data is surprising me
>> that the sparse algorithm is outperforming bitmaps, even with this new
>> algorithm.
>>
>> The main question I'd like to ask is: is pack.useSparse contributing
>> in an important way to the performance here?
> 
> I don't know enough about pack.useSparse to say with certainty, but
> trying again on the same repository (which is reasonably well-packed at
> the moment), they appear about the same:

Thanks for running that. I apologize, but I didn't sync in my head
that you're purposefully using 'git rev-list' to do the walk and the
config only makes a difference for 'git pack-objects'. The suspiciously
close timings is indeed too suspicious. Thus, the performance numbers
you are considering are not identical to what would happen in 'git
pack-objects' by default.

So, I created my own example using git.git, which included this input:

	refs/remotes/origin/master
	refs/remotes/origin/maint
	refs/remotes/origin/next
	refs/remotes/origin/seen
	^refs/remotes/origin/master~1000

And there is a clear preference for pack.useSparse=false in this
case:

Benchmark 1: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-true.txt" ~/_git/git/git-review/git -c pack.useSparse=true -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.857 s ±  0.031 s    [User: 6.850 s, System: 0.378 s]
  Range (min … max):    2.814 s …  2.922 s    10 runs

Benchmark 2: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-false.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.682 s ±  0.029 s    [User: 6.678 s, System: 0.388 s]
  Range (min … max):    2.653 s …  2.753 s    10 runs

and I used the perf traces to extract that the enumerate objects
phase took ~550ms for the sparse walk and ~375ms for the non-sparse
version.

But using this same input, I'm able to construct an example where
your modified bitmap walk is faster than the non-sparse object walk:

Benchmark 1: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-true.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=true pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.045 s ±  0.039 s    [User: 6.464 s, System: 0.334 s]
  Range (min … max):    1.966 s …  2.089 s    10 runs

Benchmark 2: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-false.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.138 s ±  0.034 s    [User: 6.070 s, System: 0.333 s]
  Range (min … max):    2.058 s …  2.182 s    10 runs

(Here, the enumerate objects phase takes ~230ms on average.)

I can easily swap this by changing my negative ref to

  ^revs/remotes/origin/master

Benchmark 1: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-true.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=true pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):     521.0 ms ±  28.8 ms    [User: 1190.9 ms, System: 103.6 ms]
  Range (min … max):   496.5 ms … 589.8 ms    10 runs

Benchmark 2: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-false.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):     498.5 ms ±  13.2 ms    [User: 1089.1 ms, System: 123.3 ms]
  Range (min … max):   471.3 ms … 513.6 ms    10 runs

which gives us reason to think about "small fetches" don't need
bitmaps, but "big fetches" do. But that can be done separately
from this patch.

Context about the sparse walk:

The sparse walk uses the names of the paths with respect to the root 
trees as a way to walk only the "new" objects without walking the
entire object set at the boundary. This makes a big difference in
repos with many files at HEAD, but the risk is that those path names
become significant overhead if there are enough new changes spread
across a large fraction of the tree. For clients pushing the work
of a single developer, the sparse algorithm outperforms the default
walk.

I wrote about this years ago in [1] and almost brought it up earlier
as a description of the commit boundary (called the frontier in [1]),
but the sparse object walk is helpful to visualize.

[1] https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/

Thanks,
-Stolee