On 9/22/24 2:37 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@xxxxxxxxx> writes:
>
>> Combining the two features actually ends up with very similar performance
>> to what `--full-name-hash` already does. It's actually important that the
>> `--path-walk` option does a full pass of the objects via the standard
>> name-hash after its first pass in groups based on the path.
>> ...
>> I was not clear about this, but the RFC is 30 patches so it's possible to see
>> the big picture, but I will be breaking it into at least four series in
>> sequence for actual review. They match the four sections described above, but
>> will be in the opposite order:
>>
>> A. `git repack --full-name-hash`
>> B. `git pack-objects --path-walk`
>> C. `git survey`
>> D. `git backfill`
>>
>> (It's possible that `git survey` and `git backfill` may be orthogonal enough
>> that they could be under review at the same time. Alternatively, `git backfill`
>> may jump the line because it's so simple to implement once the path-walk API
>> is established.)
>
> I actually was hoping to hear something like "since it turns out
> that --path-walk gives a better performance and it does not regress
> small incremental transfer like --full-name-hash does, the real
> series drops --full-name hash", i.e. without part (A). That reduces
> things we need to worry about (like having to either keep track of
> two "hashes" per object, or making small incremental transfer more
> costly) greatly.

I believe that the --full-name-hash version still has some benefits, in
that it could better integrate with reachability bitmaps and delta
islands:

1. The .bitmap file format would need a modification in order to signal
   which hash function is being used for compatibility reasons, but this
   does seem within reach without too much work.

2. The delta islands feature integrates seamlessly with --full-name-hash
   and seems difficult to integrate with the --path-walk feature. Either
   we would need to have a second object walk to get the delta island
   markers, or somehow put the passing of the object markers into the
   path-walk API itself (similar to how it needs to push the
   UNINTERESTING bit around during the walk).

I'm not recommending any version that requires tracking two hash values
per object, as I have not been able to demonstrate any improvement when
doing so.

But, it would be helpful to know if the --full-name-hash feature should
not be pursued due to the --path-walk feature being prepared shortly
after it. I can see an argument for either direction: having a new hash
algorithm provides a smaller change to get most of the results for the
full repack case, but gets worse performance in many push scenarios.

This is the point of an RFC, to get questions like this worked out based
on the "big picture" view of everything. Perhaps I should pause the
--full-name-hash topic and focus on getting the --path-walk topic up and
running.

I am curious to hear from folks who are currently running Git servers
about their thoughts on these trade-offs and potential uses in their
environment. My needs on the client side are solved by the --path-walk
approach.

Thanks,
-Stolee