On 3/12/2025 4:47 PM, Taylor Blau wrote: > On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote: >> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > In the above three examples we see some trade-offs between pack size and > the time it took to generate it. I think it's worth discussing whether > or not the potential benefit of such a trade-off is worth the > significant complexity and code that this feature will introduce. (To be > clear, I don't have a strong opinion here one way or the other, but I do > think that it's at least worth discussing). > > I wonder how much of the benefits of path-walk over the hash v2 approach > could be had by simply widening the pack.window during delta selection? > > I tried to run a similar experiment as you did above on the > microsoft/fluentui repository and got the following: > > Repack Method Pack Size Time > ------------------------------------------ > Hash v1 447.2MiB 932.41s > Hash v2 154.1MiB 404.35s > Hash v2 (window=20) 146.7MiB 472.66s > Hash v2 (window=50) 138.3MiB 622.13s > Path Walk 140.8MiB 168.86s > > In your experiment above on the same repository, the path walk feature > represents an 11.873% reduction in pack size, but at the cost of a 30.9% > regression in runtime. > > When I set pack.window to "50" (over the default value of "10"), I get a > ~10.3% reduction in pack size at the cost of a 54% increase in runtime > (relative to just --name-hash-version=2 with the default pack.window > settings). > > But when I set the pack.window to "20", the relative values (again > comparing against --name-hash-version=2 with the default pack.window) > are 4.8% reduction in pack size and a 16.9% increase in runtime. You're right that I wasn't including data around the --window option in my analysis. This option presents folks with the opportunity to add CPU time in order to improve the possibility of better compression due to considering more object pairs. But it's also important to note that that option still works with --path-walk, except that the --path-walk option is focused on improving the quality of objects being considered within a window. There's also the aspect that there are two passes (one path-based and one name-hash-based) so increasing the --window size has a larger impact on the --path-walk option. With regards to the microsoft/fluentui repo, I had previously been using an old clone using --bare. The size changes if I use --mirror as well, since it will get the fork hint refs corresponding to objects in public forks that are not actually in the core repo. This changes the clone size as well as the repacked size. (To save time, I didn't repeat the --window option tests for name hash v1 as name hash v2 is clearly superior to that option in this repo.) Cloned with --bare: | Type | Window: 10 | Window: 20 | Window: 50 | |--------------|----------------|----------------|----------------| | name hash v1 | 451 M | 1m 42s | | | | | | name hash v2 | 160 M | 35.4 s | 151 M | 25.4 s | 141 M | 31.0 s | | --path-walk | 141 M | 31.0 s | 136 M | 35.7 s | 129 M | 49.3 s | Cloned with --mirror: | Type | Window: 10 | Window: 20 | Window: 50 | |--------------|----------------|----------------|-----------------| | name hash v1 | 882 M | 3m 27s | | | | | | name hash v2 | 584 M | 70.4 s | 554 M | 54.6 s | 530 M | 69.4 s | | --path-walk | 548 M | 79.8 s | 523 M | 93.9 s | 507 M | 126.2 s | Running on a slightly-larger Javascript repo with the same CHANGLOG filename issue, I get these results: | Type | Window: 10 | Window: 20 | Window: 50 | |--------------|----------------|----------------|----------------| | name hash v1 | 6.4 G | 36m 9s | | | | | | name hash v2 | 920 M | 7m 39s | 767 M | 5m 49s | 665 M | 6m 12s | | --path-walk | 834 M | 4m 48s | 697 M | 7m 39s | 615 M | 8m 42s | > But these numbers are pretty confusing to me, TBH. The reduction in pack > sizes makes sense, and here I see numbers that are on-par with what you > noted above for the same repository. But the runtimes are wildly > different (e.g., hash v1 takes you just 87s while mine takes 932s). I wonder if it's related to threading? I'm using as many cores as I can. > There must be something in our environment that is different. I'm > starting with a bare clone of microsoft/fluentui from GitHub, and made > several 'cp -al' copies of it for the different experiments. In the > penultimate one, I ran: > > $ time git.compile -c pack.window=50 repack --name-hash-version=2 \ > -adF --no-write-bitmap-index There's also some strange things with my numbers because I'm not copying the same data into multiple places but instead running the test on the same repo. Thus, the "input size" is changing with each run and this is probably a big factor in the larger tests. So the order in my tables is left-to-right, top-to-bottom, like reading a page in English. Thus, the short time for --path-walk --window=10 in the last example is maybe a bit faster because it is starting from the 665 M from the --name-hash-version=2 --window=50 example. > In any event, it seems like at least in this example we can get > performance that is on-par with path-walk by simply widening the > pack.window when using hash v2. On my machine that seems to cost more > time than it does for you to the point where it's slower than my > path-walk. But I think I need to understand what the differences are > here before we can draw any conclusions on the size or timing. I'd be very curious to see if more folks have bandwidth to do similar testing. My default mode is that I like giving users more options to explore which may work better for them. > If the overwhelming majority of cases where the --path-walk feature > presents a significant benefit over hash v2 at various pack.window sizes > (where we could get approximately the same reduction in pack size with > approximately the same end-to-end runtime of 'git repack'), then I feel > we might want to reconsider whether or not the complexity of this feature > is worthwhile. > > But if the --path-walk feature either gives us a significant size > benefit that we can't get with hash v2 and a wider pack.window without > paying a significant runtime cost (or vice-versa), then this feature > would indeed be worthwhile. > > I also have no idea how representative the above is of your intended > use-case, which seems much more oriented around pushes than from-scratch > repacks, which would also affect our conclusions here.The push story is valuable, but I'm also interested in helping users shrink their local repositories in whatever means they are willing to wait for. --- Meta-response to your patch review: I have made adjustments to my local branch in response to the points you brought up. I'll hold off on v2 for a few more days to give more opportunity for review. Thanks, -Stolee