On Wed, Jun 15 2022, René Scharfe wrote:

> Git uses zlib for its own object store, but calls gzip when creating tgz
> archives. Add an option to perform the gzip compression for the latter
> using zlib, without depending on the external gzip binary.
>
> Plug it in by making write_block a function pointer and switching to a
> compressing variant if the filter command has the magic value "git
> archive gzip". Does that indirection slow down tar creation? Not
> really, at least not in this test:
>
> $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
>   './git -C ../linux archive --format=tar HEAD # {rev}'

Shameless plug:
https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@xxxxxxxxxxxxxxxxxxx/

I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
easier. You'll find that you need less or no --warmup with it, since the
checkout flip-flopping and re-making (and the resulting FS and other
cache eviction) go away, as we'll use a different "git worktree" for
each "rev".

(Also, putting those on a ramdisk really helps.)

> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>   Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
>   Range (min … max):    4.038 s …  4.059 s    10 runs
>
> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>   Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
>   Range (min … max):    4.038 s …  4.066 s    10 runs
>
> How does tgz creation perform?
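For reference, the worktree-per-rev setup that wrapper automates can be
sketched roughly like this (a Python sketch; make_worktrees and the
temp-dir layout are illustrative names of mine, not anything from the
wrapper itself):

```python
import os
import subprocess
import tempfile


def make_worktrees(repo: str, revs: list[str]) -> list[str]:
    """Create one throwaway worktree per rev.

    With a separate checkout per rev, a benchmark never has to
    flip-flop the working tree (and evict FS caches) in a single
    directory between runs.
    """
    base = tempfile.mkdtemp(prefix="bench-wt-")
    dirs = []
    for rev in revs:
        d = os.path.join(base, rev.replace("/", "-"))
        # a commit-ish argument gives a detached-HEAD worktree,
        # so the same branch can stay checked out elsewhere
        subprocess.run(
            ["git", "-C", repo, "worktree", "add", d, rev],
            check=True, capture_output=True,
        )
        dirs.append(d)
    return dirs


# Usage idea (build each worktree once, then benchmark without -p/-w):
#   dirs = make_worktrees(".", ["HEAD", "origin/main"])
#   hyperfine -L dir <dir1>,<dir2> '{dir}/git -C ../linux archive --format=tar HEAD'
```

Putting `base` on a ramdisk (e.g. under /dev/shm on Linux) gives the
same benefit mentioned above.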
>
> $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
>   './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
>
> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>   Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
>   Range (min … max):   20.395 s … 20.414 s    10 runs
>
> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>   Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
>   Range (min … max):   23.782 s … 23.857 s    10 runs
>
> Summary
>   './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>     1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>
> So the internal implementation takes 17% longer on the Linux repo, but
> uses 2% less CPU time. That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.
>
> What are the benefits? Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.

I had been keeping one eye on this series, but didn't look at it in any
detail. I found this after reading 6/6, which I think in any case could
really use some "why" summary, which seems to be mostly covered here.

I.e. it's unclear whether the "drop the dependency on gzip(1)" in 6/6
refers to the GZIP test dependency, or to users being unlikely to have
"gzip(1)" on their systems.

If it's the latter, I'd much rather (as a user) take a 17% wallclock
improvement over a 2% saving in CPU time. I mostly care about my own
time, not that of the CPU.

Can't we have our 6/6 cake much more easily and eat it too by learning a
"fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
use the "internal" one?
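For illustration, both the internal approach (gzip-format output from
zlib alone, no gzip(1) binary involved) and the proposed fallback can be
sketched in a few lines of Python (the function names are mine; git's
actual implementation is in C):

```python
import gzip
import shutil
import subprocess
import zlib


def gzip_compress_internal(data: bytes, level: int = 6) -> bytes:
    """Compress to gzip format using only zlib.

    wbits = 16 + MAX_WBITS tells zlib to emit a gzip header and
    trailer instead of a raw zlib stream -- the same capability the
    internal implementation relies on.
    """
    c = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return c.compress(data) + c.flush()


def gzip_compress(data: bytes) -> bytes:
    """Fallback sketch: prefer the external gzip, else go internal."""
    if shutil.which("gzip"):
        return subprocess.run(
            ["gzip", "-cn"], input=data, capture_output=True, check=True
        ).stdout
    return gzip_compress_internal(data)


payload = b"hello tgz" * 1000
out = gzip_compress_internal(payload)
assert out[:2] == b"\x1f\x8b"           # gzip magic bytes
assert gzip.decompress(out) == payload  # round-trips
```

(A real fallback in git would presumably also have to handle gzip dying
mid-stream, not just being absent, which is where it gets less trivial.)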
Re the "eco mode": I also wonder how much of the overhead you're seeing
for both that 17% and 2% would go away if you pinned both processes to
the same CPU. I can't recall the command offhand, but IIRC taskset or
numactl can do that. I.e. is this really measuring IPC overhead, or
inter-CPU overhead on your system?
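For what it's worth, the pinning can be done from the shell with
taskset(1) (e.g. "taskset -c 0 <benchmark command>") or, on Linux, from
inside the process itself; a minimal sketch in Python (the choice of CPU
is arbitrary):

```python
import os

# Pin this process -- and any children it spawns, such as an external
# gzip -- to a single CPU, so the external-vs-internal comparison
# measures IPC cost rather than the benefit of a second core.
# os.sched_setaffinity() is Linux-only, hence the guard; elsewhere,
# launch the benchmark under taskset(1) or numactl(1) instead.
if hasattr(os, "sched_setaffinity"):
    cpu = min(os.sched_getaffinity(0))   # pick one CPU we may use
    os.sched_setaffinity(0, {cpu})       # restrict to just that CPU
    assert os.sched_getaffinity(0) == {cpu}
```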