Re: [PATCH 00/30] [RFC] Path-walk API and applications

On 9/22/24 2:37 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@xxxxxxxxx> writes:
>
>> Combining the two features actually ends up with very similar performance
>> to what `--full-name-hash` already does. It's actually important that the
>> `--path-walk` option does a full pass of the objects via the standard
>> name-hash after its first pass in groups based on the path.
>> ...
>> I was not clear about this, but the RFC is 30 patches so it's possible to see
>> the big picture, but I will be breaking it into at least four series in
>> sequence for actual review. They match the four sections described above, but
>> will be in the opposite order:
>>
>>   A. `git repack --full-name-hash`
>>   B. `git pack-objects --path-walk`
>>   C. `git survey`
>>   D. `git backfill`
>>
>> (It's possible that `git survey` and `git backfill` may be orthogonal enough
>> that they could be under review at the same time. Alternatively, `git backfill`
>> may jump the line because it's so simple to implement once the path-walk API
>> is established.)
>
> I actually was hoping to hear something like "since it turns out
> that --path-walk gives a better performance and it does not regress
> small incremental transfer like --full-name-hash does, the real
> series drops --full-name-hash", i.e. without part (A).  That reduces
> things we need to worry about (like having to either keep track of
> two "hashes" per object, or making small incremental transfer more
> costly) greatly.

I believe that the --full-name-hash version still has some benefits, in
that it could better integrate with reachability bitmaps and delta
islands:

 1. The .bitmap file format would need a modification in order to signal
    which hash function is being used for compatibility reasons, but
    this does seem within reach without too much work.

 2. The delta islands feature integrates seamlessly with
    --full-name-hash and seems difficult to integrate with the
    --path-walk feature. Either we would need to have a second object
    walk to get the delta island markers, or somehow put the passing of
    the object markers into the path-walk API itself (similar to how it
    needs to push the UNINTERESTING bit around during the walk).

I'm not recommending any version that requires tracking two hash values
per object, as I have not been able to demonstrate any improvement when
doing so.

But it would be helpful to know whether the --full-name-hash feature
should not be pursued because the --path-walk feature would be prepared
shortly after it. I can see an argument for either direction: a new hash
algorithm is a smaller change that gets most of the results in the full
repack case, but it performs worse in many push scenarios. This is the
point of an RFC: to get questions like this worked out based on the
"big picture" view of everything.
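
For anyone weighing the hash side of this trade-off, here is a minimal
Python sketch of why the choice of hash matters. name_hash() is meant to
mirror Git's standard name-hash (pack_name_hash() in pack-objects.h),
which depends only on the last sixteen non-whitespace characters of a
path. full_path_hash_sketch() is NOT the function from this series; it is
a hypothetical stand-in (32-bit FNV-1a over the whole path) used only to
show what "hash the full name" changes. The example paths are made up.

```python
WHITESPACE = b" \t\n\v\f\r"  # bytes that C's isspace() skips in the "C" locale


def name_hash(path: str) -> int:
    """Sketch of Git's standard name-hash, kept to 32 bits.

    Each byte lands in the top bits and older bytes are shifted right by
    two, so anything older than sixteen characters falls off the end.
    """
    h = 0
    for byte in path.encode("utf-8"):
        if byte in WHITESPACE:
            continue
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


def full_path_hash_sketch(path: str) -> int:
    """Hypothetical stand-in for a full-path hash (32-bit FNV-1a).

    Assumption for illustration only: the real series uses its own
    function. The point is that every byte of the path affects the result.
    """
    h = 2166136261  # FNV-1a 32-bit offset basis
    for byte in path.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV-1a 32-bit prime
    return h


# Two made-up paths that share their last sixteen-plus characters.
a = "pkg-a/lib/CHANGELOG.json"
b = "pkg-b/lib/CHANGELOG.json"

# Standard name-hash: both paths land in the same bucket, so their
# objects are sorted next to each other when looking for delta bases.
print(name_hash(a) == name_hash(b))

# Full-path stand-in: the differing directory separates them.
print(full_path_hash_sketch(a) == full_path_hash_sketch(b))
```

The first comparison is true and the second is false. Grouping by the
tail of the path helps when files move between directories, but in a
repository with many same-named files in different directories it mixes
unrelated objects in the delta window, which is the case a full-path
hash is trying to improve.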

Perhaps I should pause the --full-name-hash topic and focus on getting
the --path-walk topic up and running. I am curious to hear from folks
who are currently running Git servers about their thoughts on these
trade-offs and potential uses in their environment. My needs on the
client side are solved by the --path-walk approach.

Thanks,
-Stolee



