Derrick Stolee <stolee@xxxxxxxxx> writes:

>> Jeff King <peff@xxxxxxxx> writes:
>>
>>> That's interesting. I wonder which cases get worse, and if a larger
>>> window size might help. I.e., presumably we are pushing the candidates
>>> further away in the sorted delta list.
>
> I think the cases that make things get worse with --full-name-hash are:
>
> 1. The presence of renames, partitioning objects that used to fit into
>    the same bucket (in the case of directory renames).
>
> 2. Some standard kinds of files may appear several times across the
>    tree but do not change very often and are similar across paths.
>
> 3. Common patterns across similar file types, such as similar includes
>    in .c and .h files or other kinds of boilerplate in different
>    languages.
> ...
> In a depth 1 shallow clone, there are no repeated paths, so any hash
> collisions are true collisions instead of good candidates for deltas.

Or #2 and #3 above, where large boilerplates are shared across
similarly named files.

> Yes. This is the downside of the --full-name-hash compared to the
> standard name hash. When repacking an entire repository, the effect
> of these renames is typically not important in the long run as it's
> basically a single break in the delta chain. The downside comes in
> when doing a small fetch or push where the rename has more impact.

Yes. Due to the --depth limit, we need to break delta chains somewhere
anyway, and a rename boundary is just as good a place as any other in a
sufficiently long chain.

> The --path-walk approach does not suffer from this problem because
> it has a second pass that sorts by the name hash and looks for
> better deltas than the ones that already exist. Thus, it gets the
> best of both worlds.

Yes, at the cost of being more complex :-)

> The performance impact of the two passes of the --path-walk
> approach is interesting, as you'd typically expect this to always
> be slower. However:
>
> 1. The delta compression within each batch only compares the
>    objects within that batch. We do not compare these objects to
>    unrelated objects, which can be expensive and wasteful. This
>    also means that small batches may even be smaller than the
>    delta window, reducing the number of comparisons.

Interesting.

> 2. In the second pass, the delta calculation can short-circuit if
>    the computed delta would be larger than the current-best delta.
>    Thus, the good deltas from the first pass make the second pass
>    faster.

Yes, the early give-up codepath helps when you have already found a
semi-decent delta base.

Thanks for a good summary.
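
To make the "bucket" discussion above a bit more concrete, here is a
rough sketch of the two hashing styles being compared. It is
illustrative only, not the exact code in pack-objects.h, and the
function names are made up for the example:

    #include <ctype.h>
    #include <stdint.h>

    /*
     * Illustrative only; not the exact code in pack-objects.h.
     *
     * A suffix-weighted hash like this is dominated by the last few
     * characters of the path, so "a/Makefile" and "b/Makefile" fall
     * into the same bucket and remain delta candidates for each other
     * even after a directory rename.
     */
    static uint32_t suffix_weighted_hash(const char *name)
    {
    	uint32_t c, hash = 0;

    	while ((c = (unsigned char)*name++) != 0) {
    		if (isspace(c))
    			continue;
    		/* later characters dominate the result */
    		hash = (hash >> 2) + (c << 24);
    	}
    	return hash;
    }

    /*
     * A full-path hash mixes in every character equally (FNV-1a here,
     * purely for illustration), so moving a file to another directory
     * changes its bucket completely -- which is what partitions
     * objects across (directory) renames.
     */
    static uint32_t full_path_hash(const char *name)
    {
    	uint32_t hash = 2166136261u;

    	while (*name) {
    		hash ^= (unsigned char)*name++;
    		hash *= 16777619u;
    	}
    	return hash;
    }

With the first style a rename keeps the old and new blobs in the same
bucket; with the second the rename moves them apart, which is the
delta-chain break discussed above.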
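
Similarly, the "early give-up" point in the second pass amounts to
treating the size of the best delta found so far as a hard ceiling
while computing a new candidate. A toy illustration of that pattern
(not git's actual diff-delta code; the naive byte-wise "delta" is
purely for demonstration):

    #include <stddef.h>

    /*
     * Illustrative only; not git's actual diff-delta code.  max_size
     * is the size of the best delta found so far (e.g. from the
     * path-grouped first pass), and we stop as soon as the candidate
     * can no longer beat it.
     */
    static long bounded_delta_size(const unsigned char *src, size_t src_len,
    			       const unsigned char *dst, size_t dst_len,
    			       size_t max_size)
    {
    	size_t i, out = 0;

    	for (i = 0; i < dst_len; i++) {
    		/* bytes differing from the source cost one literal byte */
    		if (i >= src_len || src[i] != dst[i])
    			out++;
    		if (out >= max_size)
    			return -1;	/* give up; keep the existing delta */
    	}
    	return (long)out;
    }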