Re: [PATCH 3/5] combine-diff: handle binary files as binary

On Mon, May 30, 2011 at 10:36:27AM -0400, Jeff King wrote:

>   1. Grab each blob, check binary-ness, and free. This double-loads in
>      the common, non-binary case.
> [...]
>
> I'll try to take a look at it this week and get some measurements on (1)
> versus (2) for both speed and peak memory usage. And then see if I can
> do better with (3), and implement the "peek" solution both here and in
> regular diff.

I was curious about this, so I stole a few minutes to do some
preliminary benchmarks this morning.

The first thing to look at is the performance of the original code, which
does not check binary-ness at all. It represents the best we can do with
any strategy. So I tried:

  git log -p --cc --merges origin/master

on git.git using both v1.7.5.3 and the jk/combine-diff-binary-etc
branch. And it turns out that the extra loads really don't make a
difference in practice. My best-of-5 for the two cases were:

  $ time git.v1.7.5.3 log -p --cc --merges origin/master >/dev/null
  real    0m59.518s
  user    0m58.672s
  sys     0m0.688s

  $ time git.jk.binary-combined-diff log -p --cc \
      --merges origin/master >/dev/null
  real    0m58.949s
  user    0m58.220s
  sys     0m0.572s
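
For anybody who wants to reproduce the numbers, a rough best-of-5 driver
looks something like this (untested sketch; CMD here is just a
placeholder, so substitute the git invocation above):

```shell
#!/bin/sh
# Best-of-5 wall-clock measurement. CMD is a stand-in; in the real runs
# it was "git log -p --cc --merges origin/master" against git.git.
CMD='true'
best=
for i in 1 2 3 4 5; do
    start=$(date +%s.%N)
    sh -c "$CMD" >/dev/null
    end=$(date +%s.%N)
    # elapsed time for this run, in seconds
    t=$(awk -v a="$start" -v b="$end" 'BEGIN { printf "%.3f", b - a }')
    # keep the minimum over all runs
    if [ -z "$best" ] || awk -v t="$t" -v b="$best" 'BEGIN { exit !(t < b) }'
    then
        best=$t
    fi
done
echo "best-of-5 wall time: ${best}s"
```

Taking the minimum rather than the mean is deliberate: the fastest run is
the one least perturbed by cache state and other system noise.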

The new code actually came out slightly faster.  One reason may be that
there are 3 combined diffs of git-gui/lib/git-gui.ico that we now avoid
doing (and just say "Binary files differ"). That's not much work saved,
but it gives us a tiny edge (one that is close to the noise between
runs). Still, I think it implies that the extra loads in the common
non-binary case are not actually measurable.

The peak memory use between the two should be the same (since we free
each blob immediately), but I didn't measure it.

So I think in practice it's not a big deal. I'll still take a look at
the "peek" optimization later this week, since that can make a
difference in some corner cases. And as part of that, it will probably
make sense to keep the buffers around for small-ish files, so we'll get
the optimization I mentioned more or less for free. I'll also do the
check for duplicated sha1s that you mentioned.

-Peff