Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 20 Mar 2025 16:18:32 -0400

On 3/12/2025 4:47 PM, Taylor Blau wrote:
> On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> In the above three examples we see some trade-offs between pack size and
> the time it took to generate it. I think it's worth discussing whether
> or not the potential benefit of such a trade-off is worth the
> significant complexity and code that this feature will introduce. (To be
> clear, I don't have a strong opinion here one way or the other, but I do
> think that it's at least worth discussing).
> 
> I wonder how much of the benefits of path-walk over the hash v2 approach
> could be had by simply widening the pack.window during delta selection?
> 
> I tried to run a similar experiment as you did above on the
> microsoft/fluentui repository and got the following:
> 
>     Repack Method       Pack Size       Time
>     ------------------------------------------
>     Hash v1              447.2MiB      932.41s
>     Hash v2              154.1MiB      404.35s
>     Hash v2 (window=20)  146.7MiB      472.66s
>     Hash v2 (window=50)  138.3MiB      622.13s
>     Path Walk            140.8MiB      168.86s
> 
> In your experiment above on the same repository, the path walk feature
> represents an 11.873% reduction in pack size, but at the cost of a 30.9%
> regression in runtime.
> 
> When I set pack.window to "50" (over the default value of "10"), I get a
> ~10.3% reduction in pack size at the cost of a 54% increase in runtime
> (relative to just --name-hash-version=2 with the default pack.window
> settings).
> 
> But when I set the pack.window to "20", the relative values (again
> comparing against --name-hash-version=2 with the default pack.window)
> are 4.8% reduction in pack size and a 16.9% increase in runtime.

You're right that I wasn't including data around the --window option in
my analysis. This option presents folks with the opportunity to add CPU
time in order to improve the possibility of better compression due to
considering more object pairs.

But it's also important to note that that option still works with
--path-walk, except that the --path-walk option is focused on improving
the quality of objects being considered within a window. There's also the
aspect that there are two passes (one path-based and one name-hash-based)
so increasing the --window size has a larger impact on the --path-walk
option.

With regards to the microsoft/fluentui repo, I had previously been using
an old clone using --bare. The size changes if I use --mirror as well,
since it will get the fork hint refs corresponding to objects in public
forks that are not actually in the core repo. This changes the clone size
as well as the repacked size.

(To save time, I didn't repeat the --window option tests for name hash
v1 as name hash v2 is clearly superior to that option in this repo.)

Cloned with --bare:

| Type         | Window: 10     | Window: 20     | Window: 50     |
|--------------|----------------|----------------|----------------|
| name hash v1 | 451 M | 1m 42s |       |        |       |        |
| name hash v2 | 160 M | 35.4 s | 151 M | 25.4 s | 141 M | 31.0 s |
| --path-walk  | 141 M | 31.0 s | 136 M | 35.7 s | 129 M | 49.3 s |

Cloned with --mirror:

| Type         | Window: 10     | Window: 20     | Window: 50      |
|--------------|----------------|----------------|-----------------|
| name hash v1 | 882 M | 3m 27s |       |        |       |         |
| name hash v2 | 584 M | 70.4 s | 554 M | 54.6 s | 530 M | 69.4 s  |
| --path-walk  | 548 M | 79.8 s | 523 M | 93.9 s | 507 M | 126.2 s |

Running on a slightly-larger Javascript repo with the same CHANGLOG
filename issue, I get these results:

| Type         | Window: 10     | Window: 20     | Window: 50     |
|--------------|----------------|----------------|----------------|
| name hash v1 | 6.4 G | 36m 9s |       |        |       |        |
| name hash v2 | 920 M | 7m 39s | 767 M | 5m 49s | 665 M | 6m 12s |
| --path-walk  | 834 M | 4m 48s | 697 M | 7m 39s | 615 M | 8m 42s |

> But these numbers are pretty confusing to me, TBH. The reduction in pack
> sizes makes sense, and here I see numbers that are on-par with what you
> noted above for the same repository. But the runtimes are wildly
> different (e.g., hash v1 takes you just 87s while mine takes 932s).

I wonder if it's related to threading? I'm using as many cores as I can.

> There must be something in our environment that is different. I'm
> starting with a bare clone of microsoft/fluentui from GitHub, and made
> several 'cp -al' copies of it for the different experiments. In the
> penultimate one, I ran:
> 
>     $ time git.compile -c pack.window=50 repack --name-hash-version=2 \
>         -adF --no-write-bitmap-index

There's also some strange things with my numbers because I'm not copying
the same data into multiple places but instead running the test on the
same repo. Thus, the "input size" is changing with each run and this is
probably a big factor in the larger tests.

So the order in my tables is left-to-right, top-to-bottom, like reading
a page in English. Thus, the short time for --path-walk --window=10 in
the last example is maybe a bit faster because it is starting from the
665 M from the --name-hash-version=2 --window=50 example. 
> In any event, it seems like at least in this example we can get
> performance that is on-par with path-walk by simply widening the
> pack.window when using hash v2. On my machine that seems to cost more
> time than it does for you to the point where it's slower than my
> path-walk. But I think I need to understand what the differences are
> here before we can draw any conclusions on the size or timing.

I'd be very curious to see if more folks have bandwidth to do similar
testing. My default mode is that I like giving users more options to
explore which may work better for them. 
> If the overwhelming majority of cases where the --path-walk feature
> presents a significant benefit over hash v2 at various pack.window sizes
> (where we could get approximately the same reduction in pack size with
> approximately the same end-to-end runtime of 'git repack'), then I feel
> we might want to reconsider whether or not the complexity of this feature
> is worthwhile.
> 
> But if the --path-walk feature either gives us a significant size
> benefit that we can't get with hash v2 and a wider pack.window without
> paying a significant runtime cost (or vice-versa), then this feature
> would indeed be worthwhile.
> 
> I also have no idea how representative the above is of your intended
> use-case, which seems much more oriented around pushes than from-scratch
> repacks, which would also affect our conclusions here.The push story is valuable, but I'm also interested in helping users shrink their local repositories in whatever means they are
willing to wait for.

---

Meta-response to your patch review: I have made adjustments to my local
branch in response to the points you brought up. I'll hold off on v2
for a few more days to give more opportunity for review.

Thanks,
-Stolee