On Mon, Jan 13, 2025 at 04:40:28PM +0100, Patrick Steinhardt wrote:

> On Thu, Jan 09, 2025 at 03:46:49AM -0500, Jeff King wrote:
> > So my conclusion is that it probably does help a little, but it's
> > mostly lost in the noise. I could see an argument for keeping it, as
> > the complexity is hidden away in functions that do not often need to
> > be touched. But it does make them more confusing than necessary
> > (despite some detailed explanations from the author of that commit;
> > it just took me a while to wrap my head around what was going on)
> > and prevents further refactoring of the combine_diff_path struct. So
> > let's drop it.
>
> A 1% performance speedup does not feel like a good argument to me, so
> I'm perfectly fine with dropping the code, even if most of it is
> actually in the form of comments. But that already shows that it needs
> quite a bit of explanation.
>
> I wonder though: did you also use e.g. Valgrind to compare the number
> of allocations? glibc tends to be heavily optimized with regards to
> small allocations, so you typically don't notice the performance
> impact caused by them even when the number of saved allocations is
> significant. So the effect might be more pronounced with other libcs
> that aren't optimized for such usecases, like e.g. musl libc.

I didn't use valgrind, but I did confirm via some hacky printf() calls
that the optimization does kick in. Here's a version with counting:

diff --git a/tree-diff.c b/tree-diff.c
index d9237ffd9b..60db2b2f51 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -154,6 +154,11 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
  *
  * p->parent[] remains uninitialized.
  */
+static int hit, total;
+void show_counter(void)
+{
+	warning("%d / %d\n", hit, total);
+}
 static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 	int nparent, const struct strbuf *base, const char *path, int pathlen,
 	unsigned mode, const struct object_id *oid)
@@ -168,6 +173,11 @@ static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 		FREE_AND_NULL(p);
 	}
 
+	if (!total++)
+		atexit(show_counter);
+	if (p)
+		hit++;
+
 	if (!p) {
 		p = xmalloc(alloclen);

It seems to kick in about half of the time when running "git log --raw"
on git.git and linux.git.

The absolute best case for the optimization is comparing two trees with
all entries of the same size, and all changed, like:

  git init
  blob1=$(echo one | git hash-object -w --stdin)
  blob2=$(echo two | git hash-object -w --stdin)
  mktree() {
	perl -e '
		printf "100644 blob %s\tpath%08d\n", $ARGV[0], $_
			for (1..1000000)
	' $1
  }
  git tag tree1 $(mktree $blob1 | git mktree)
  git tag tree2 $(mktree $blob2 | git mktree)
  git diff-tree tree1 tree2

In that optimal case I see ~3% speedup on glibc. If somebody on a
platform with a different allocator can show a bigger change, that
would definitely be interesting. I suspect it won't make that big a
difference even with a slower allocator, though, because each changed
path involves other allocations (like creating a diff_pair).

Running under valgrind with that optimal case, the old code does ~3M
allocations (so 3 per entry). Now we do 4 per entry. So if we really
care about micro-optimizing, I suspect a more productive path would be
getting a better allocator. ;)
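(For anybody who wants to replicate those allocation counts without the
printf hack, memcheck's default heap summary should be enough; a
minimal sketch, assuming the usual "total heap usage" summary line:

  # memcheck is valgrind's default tool; the exact summary wording may
  # vary between valgrind versions
  valgrind ./git -C repo diff-tree tree1 tree2 2>&1 |
	grep 'total heap usage'

and then divide the alloc count by the million tree entries.)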
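To compare allocators directly, the whole old/new versus glibc/jemalloc
matrix fits in a single hyperfine invocation, something like:

  # one quoted command per benchmark; an empty LD_PRELOAD keeps the
  # default glibc allocator
  hyperfine \
	'LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2' \
	'LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2' \
	'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2' \
	'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2'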
Here are hyperfine results for the existing code ("old") versus my
series ("new") with the glibc allocator versus jemalloc:

  Benchmark 1: LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2
    Time (mean ± σ):     625.3 ms ±  13.3 ms    [User: 547.9 ms, System: 77.3 ms]
    Range (min … max):   599.8 ms … 649.9 ms    10 runs

  Benchmark 2: LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2
    Time (mean ± σ):     650.8 ms ±  14.5 ms    [User: 568.2 ms, System: 82.5 ms]
    Range (min … max):   632.2 ms … 673.6 ms    10 runs

  Benchmark 3: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2
    Time (mean ± σ):     563.9 ms ±   9.2 ms    [User: 538.4 ms, System: 25.3 ms]
    Range (min … max):   545.4 ms … 571.0 ms    10 runs

  Benchmark 4: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2
    Time (mean ± σ):     582.9 ms ±  10.8 ms    [User: 545.1 ms, System: 37.7 ms]
    Range (min … max):   568.6 ms … 595.5 ms    10 runs

  Summary
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.old -C repo diff-tree tree1 tree2 ran
      1.03 ± 0.03 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./git.new -C repo diff-tree tree1 tree2
      1.11 ± 0.03 times faster than LD_PRELOAD= ./git.old -C repo diff-tree tree1 tree2
      1.15 ± 0.03 times faster than LD_PRELOAD= ./git.new -C repo diff-tree tree1 tree2

So rather than saving 2-3%, a better allocator gives you 10-15% (again,
these are pretty synthetic numbers because this is a pathological test
case). It is still faster to do fewer allocations with jemalloc, but
both the relative and absolute improvements are smaller.

-Peff