On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:

> >> As we already link to the zlib library, we can perform the compression
> >> without even requiring gzip on the host machine.
> >
> > Very cool. It's nice to drop a dependency, and this should be a bit
> > more efficient, too.
>
> Getting rid of dependencies is good, and using zlib is the obvious way
> to generate .tgz files. Last time I tried something like that, a
> separate gzip process was faster, though -- at least on Linux [1]. How
> does this one fare?

I'd expect a separate gzip to be faster in wall-clock time on a
multi-core machine, but to consume more CPU overall. I'm slightly
surprised that your timings show it winning on total CPU, too.

Here are best-of-five times for "git archive --format=tar.gz HEAD" on
linux.git (the machine is a quad-core):

  [before, separate gzip]
  real    0m21.501s
  user    0m26.148s
  sys     0m0.619s

  [after, internal gzwrite]
  real    0m25.156s
  user    0m25.059s
  sys     0m0.096s

which does show what I expect (longer overall, but less total CPU).

Which one you prefer depends on your situation, of course. A user on a
workstation with multiple cores probably cares most about end-to-end
latency and using all of their available horsepower. A server hosting
repositories and receiving many unrelated requests probably cares more
about total CPU (though the differences there are small enough that it
may not even be worth having a config knob to un-parallelize it).

> Doing compression in its own thread may be a good idea.

Yeah. It might even make the patch simpler, since I'd expect it to be
implemented with start_async() and a descriptor, making it look just
like a gzip pipe to the caller. :)

-Peff
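For reference, a minimal sketch of the "internal gzwrite" approach
being measured above. This is hypothetical code, not the patch under
discussion; it assumes only zlib's gzdopen()/gzwrite()/gzclose(), with
die() being git's usual error helper and gzip_block() standing in for
wherever the tar writer hands off its data:

  #include <zlib.h>

  static gzFile gz;

  /* Point zlib's gzFile layer at the output descriptor. */
  static void gzip_open(int fd)
  {
          gz = gzdopen(fd, "wb");
          if (!gz)
                  die("gzdopen failed");
  }

  /* Compress one block of tar data in-process; no child gzip needed. */
  static void gzip_block(const void *buf, size_t len)
  {
          if (gzwrite(gz, buf, len) != (int)len)
                  die("gzwrite failed");
  }

  /* Flush remaining output and write the gzip trailer. */
  static void gzip_close(void)
  {
          if (gzclose(gz) != Z_OK)
                  die("gzclose failed");
  }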
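And a rough sketch of the threaded version, using start_async() from
run-command.h (again hypothetical; the .in/.out pipe semantics here are
from memory, and gzip_filter() is made up for illustration -- the real
patch may well look different):

  /* Runs in its own thread: read raw tar data, write gzipped data. */
  static int gzip_filter(int in, int out, void *data)
  {
          gzFile gz = gzdopen(out, "wb");
          char buf[8192];
          ssize_t len;

          if (!gz)
                  return error("gzdopen failed");
          while ((len = xread(in, buf, sizeof(buf))) > 0)
                  if (gzwrite(gz, buf, (unsigned)len) != (int)len)
                          return error("gzwrite failed");
          close(in);
          if (gzclose(gz) != Z_OK)
                  return error("gzclose failed");
          return len < 0 ? -1 : 0;
  }

  ...
  struct async filter;

  memset(&filter, 0, sizeof(filter));
  filter.proc = gzip_filter;
  filter.in = -1;   /* ask start_async() for a pipe we can write into */
  filter.out = 1;   /* compressed output goes straight to stdout */
  if (start_async(&filter))
          die("unable to start gzip filter");
  /* the tar writer then streams to filter.in, just like a pipe to gzip */
  ...
  close(filter.in);
  if (finish_async(&filter))
          die("gzip filter failed");

That would keep the compression off the main thread, so the wall-clock
gap in the timings above should shrink while keeping the dependency
drop.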