On Mon, Jan 13, 2025 at 04:40:28PM +0100, Patrick Steinhardt wrote:

> On Thu, Jan 09, 2025 at 03:46:49AM -0500, Jeff King wrote:
> > So my conclusion is that it probably does help a little, but it's
> > mostly lost in the noise. I could see an argument for keeping it, as
> > the complexity is hidden away in functions that do not often need to
> > be touched. But it does make them more confusing than necessary
> > (despite some detailed explanations from the author of that commit;
> > it just took me a while to wrap my head around what was going on)
> > and prevents further refactoring of the combine_diff_path struct. So
> > let's drop it.
>
> A 1% performance speedup does not feel like a good argument to me, so
> I'm perfectly fine with dropping the code, even if most of it is
> actually in the form of comments. But that already shows that it needs
> quite a bit of explanation.
>
> I wonder though: did you also use e.g. Valgrind to compare the number
> of allocations? glibc tends to be heavily optimized with regards to
> small allocations, so you typically don't notice the performance
> impact caused by them even when the number of saved allocations is
> significant. So the effect might be more pronounced with other libcs
> that aren't optimized for such usecases, like e.g. musl libc.

I didn't use valgrind, but I did confirm via some hacky printf() calls
that the optimization does kick in. Here's a version with counting:

diff --git a/tree-diff.c b/tree-diff.c
index d9237ffd9b..60db2b2f51 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -154,6 +154,11 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
  *
  * p->parent[] remains uninitialized.
  */
+static int hit, total;
+void show_counter(void)
+{
+	warning("%d / %d\n", hit, total);
+}
 static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 	int nparent, const struct strbuf *base, const char *path, int pathlen,
 	unsigned mode, const struct object_id *oid)
@@ -168,6 +173,11 @@ static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 		FREE_AND_NULL(p);
 	}
 
+	if (!total++)
+		atexit(show_counter);
+	if (p)
+		hit++;
+
 	if (!p) {
 		p = xmalloc(alloclen);

It seems to kick in about half of the time when running "git log --raw"
on git.git and linux.git.

The absolute best case for the optimization is comparing two trees with
all entries of the same size, and all changed, like:

  git init
  blob1=$(echo one | git hash-object -w --stdin)
  blob2=$(echo two | git hash-object -w --stdin)
  mktree() {
	perl -e '
		printf "100644 blob %s\tpath%08d\n", $ARGV[0], $_
			for (1..1000000)
	' $1
  }
  git tag tree1 $(mktree $blob1 | git mktree)
  git tag tree2 $(mktree $blob2 | git mktree)
  git diff-tree tree1 tree2

In that optimal case I see ~3% speedup on glibc. If somebody on a
platform with a different allocator can show a bigger change, that
would definitely be interesting. I suspect it won't make that big a
difference even with a slower allocator, though, because each changed
path involves other allocations (like creating a diff_pair).

Running under valgrind with that optimal case, the old code does ~3M
allocations (so 3 per entry). Now we do 4 per entry. So if we really
care about micro-optimizing, I suspect a more productive path would be
getting a better allocator. ;)
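(For anybody who wants to replicate those allocation counts without the
printf hack, memcheck's default heap summary should be enough; a
minimal sketch, assuming the usual "total heap usage" summary line:

  # memcheck is valgrind's default tool; the exact summary wording may
  # vary between valgrind versions
  valgrind ./git -C repo diff-tree tree1 tree2 2>&1 |
	grep 'total heap usage'

and then divide the alloc count by the million tree entries.)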
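To compare allocators directly, the whole old/new versus glibc/jemalloc
matrix fits in a single hyperfine invocation, something like:

  # one quoted command per benchmark; an empty LD_PRELOAD keeps the
  # default glibc allocator
  hyperfine \
	'LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2' \
	'LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2' \
	'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2' \
	'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2'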
Here are hyperfine results for the existing code ("old") versus my
series ("new") with the glibc allocator versus jemalloc:

  Benchmark 1: LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2
    Time (mean ± σ):     625.3 ms ±  13.3 ms    [User: 547.9 ms, System: 77.3 ms]
    Range (min … max):   599.8 ms … 649.9 ms    10 runs

  Benchmark 2: LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2
    Time (mean ± σ):     650.8 ms ±  14.5 ms    [User: 568.2 ms, System: 82.5 ms]
    Range (min … max):   632.2 ms … 673.6 ms    10 runs

  Benchmark 3: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2
    Time (mean ± σ):     563.9 ms ±   9.2 ms    [User: 538.4 ms, System: 25.3 ms]
    Range (min … max):   545.4 ms … 571.0 ms    10 runs

  Benchmark 4: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2
    Time (mean ± σ):     582.9 ms ±  10.8 ms    [User: 545.1 ms, System: 37.7 ms]
    Range (min … max):   568.6 ms … 595.5 ms    10 runs

  Summary
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2 ran
      1.03 ± 0.03 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2
      1.11 ± 0.03 times faster than LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2
      1.15 ± 0.03 times faster than LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2

So rather than saving 2-3%, a better allocator gives you 10-15% (again,
these are pretty synthetic numbers because this is a pathological test
case). It is still faster to do fewer allocations with jemalloc, but
both the relative and absolute improvements are smaller.

-Peff