Derrick Stolee <stolee@xxxxxxxxx> writes:

>> Jeff King <peff@xxxxxxxx> writes:
>>
>>> That's interesting. I wonder which cases get worse, and if a larger
>>> window size might help. I.e., presumably we are pushing the candidates
>>> further away in the sorted delta list.
>
> I think the cases that make things get worse with --full-name-hash are:
>
> 1. The presence of renames, partitioning objects that used to fit into
>    the same bucket (in the case of directory renames).
>
> 2. Some standard kinds of files may appear several times across the
>    tree but do not change very often and are similar across paths.
>
> 3. Common patterns across similar file types, such as similar includes
>    in .c and .h files or other kinds of boilerplate in different
>    languages.
> ...
> In a depth 1 shallow clone, there are no repeated paths, so any hash
> collisions are true collisions instead of good candidates for deltas.

Or #2 and #3 above, where large boilerplates are shared across
similarly named files.

> Yes. This is the downside of the --full-name-hash compared to the
> standard name hash. When repacking an entire repository, the effect
> of these renames is typically not important in the long run as it's
> basically a single break in the delta chain. The downside comes in
> when doing a small fetch or push where the rename has more impact.

Yes. Due to the --depth limit, we need to break delta chains somewhere
anyway, and a rename boundary is just as good a place as any other in a
sufficiently long chain.

> The --path-walk approach does not suffer from this problem because
> it has a second pass that sorts by the name hash and looks for
> better deltas than the ones that already exist. Thus, it gets the
> best of both worlds.

Yes, at the cost of being more complex :-)

> The performance impact of the two passes of the --path-walk
> approach is interesting, as you'd typically expect this to always
> be slower. However:
>
> 1. The delta compression within each batch only compares the
>    objects within that batch. We do not compare these objects to
>    unrelated objects, which can be expensive and wasteful. This
>    also means that small batches may even be smaller than the
>    delta window, reducing the number of comparisons.

Interesting.

> 2. In the second pass, the delta calculation can short-circuit if
>    the computed delta would be larger than the current-best delta.
>    Thus, the good deltas from the first pass make the second pass
>    faster.

Yes, the early give-up codepath helps when you have already found a
semi-decent delta base.

Thanks for a good summary.
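
To make the "bucket" discussion above a bit more concrete, here is a
rough sketch of the two hashing styles being compared. It is
illustrative only, not the exact code in pack-objects.h, and the
function names are made up for the example:

    #include <ctype.h>
    #include <stdint.h>

    /*
     * Illustrative only; not the exact code in pack-objects.h.
     *
     * A suffix-weighted hash like this is dominated by the last few
     * characters of the path, so "a/Makefile" and "b/Makefile" fall
     * into the same bucket and remain delta candidates for each other
     * even after a directory rename.
     */
    static uint32_t suffix_weighted_hash(const char *name)
    {
    	uint32_t c, hash = 0;

    	while ((c = (unsigned char)*name++) != 0) {
    		if (isspace(c))
    			continue;
    		/* later characters dominate the result */
    		hash = (hash >> 2) + (c << 24);
    	}
    	return hash;
    }

    /*
     * A full-path hash mixes in every character equally (FNV-1a here,
     * purely for illustration), so moving a file to another directory
     * changes its bucket completely -- which is what partitions
     * objects across (directory) renames.
     */
    static uint32_t full_path_hash(const char *name)
    {
    	uint32_t hash = 2166136261u;

    	while (*name) {
    		hash ^= (unsigned char)*name++;
    		hash *= 16777619u;
    	}
    	return hash;
    }

With the first style a rename keeps the old and new blobs in the same
bucket; with the second the rename moves them apart, which is the
delta-chain break discussed above.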
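
Similarly, the "early give-up" point in the second pass amounts to
treating the size of the best delta found so far as a hard ceiling
while computing a new candidate. A toy illustration of that pattern
(not git's actual diff-delta code; the naive byte-wise "delta" is
purely for demonstration):

    #include <stddef.h>

    /*
     * Illustrative only; not git's actual diff-delta code.  max_size
     * is the size of the best delta found so far (e.g. from the
     * path-grouped first pass), and we stop as soon as the candidate
     * can no longer beat it.
     */
    static long bounded_delta_size(const unsigned char *src, size_t src_len,
    			       const unsigned char *dst, size_t dst_len,
    			       size_t max_size)
    {
    	size_t i, out = 0;

    	for (i = 0; i < dst_len; i++) {
    		/* bytes differing from the source cost one literal byte */
    		if (i >= src_len || src[i] != dst[i])
    			out++;
    		if (out >= max_size)
    			return -1;	/* give up; keep the existing delta */
    	}
    	return (long)out;
    }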