On 15.06.22 at 22:32, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Jun 15 2022, René Scharfe wrote:
>
>> Git uses zlib for its own object store, but calls gzip when creating tgz
>> archives.  Add an option to perform the gzip compression for the latter
>> using zlib, without depending on the external gzip binary.
>>
>> Plug it in by making write_block a function pointer and switching to a
>> compressing variant if the filter command has the magic value "git
>> archive gzip".  Does that indirection slow down tar creation?  Not
>> really, at least not in this test:
>>
>>   $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
>>     './git -C ../linux archive --format=tar HEAD # {rev}'
>
> Shameless plug:
> https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@xxxxxxxxxxxxxxxxxxx/
>
> I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
> easier.
>
> You'll find that you need less or no --warmup with it, since the
> checkout flip-flopping and re-making (and resulting FS and other cache
> eviction) will go away, as we'll use different "git worktree"'s for the
> two "rev".

OK, but requiring hyperfine alone is burden enough for reviewers.

I had a try anyway, and it took me a while to realize that git-hyperfine
requires setting the Git config option hyperfine.run-dir and that it
ignores it on my system.  I had to hard-code it in the script.

> (Also, putting those on a ramdisk really helps)
>
>> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>>   Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
>>   Range (min … max):    4.038 s …  4.059 s    10 runs
>>
>> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>>   Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
>>   Range (min … max):    4.038 s …  4.066 s    10 runs
>>
>> How does tgz creation perform?
>>
>>   $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
>>     './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
>> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>>   Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
>>   Range (min … max):   20.395 s … 20.414 s    10 runs
>>
>> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>>   Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
>>   Range (min … max):   23.782 s … 23.857 s    10 runs
>>
>> Summary
>>   './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>>     1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>>
>> So the internal implementation takes 17% longer on the Linux repo, but
>> uses 2% less CPU time.  That's because the external gzip can run in
>> parallel on its own processor, while the internal one works sequentially
>> and avoids the inter-process communication overhead.
>>
>> What are the benefits?  Only an internal sequential implementation can
>> offer this eco mode, and it allows avoiding the gzip(1) requirement.
>
> I had been keeping one eye on this series, but didn't look at it in any
> detail.
>
> I found this after reading 6/6, which I think in any case could really
> use some "why" summary, which seems to mostly be covered here.
>
> I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a
> reference to the GZIP test dependency, or that our users are unlikely to
> have "gzip(1)" on their systems.

It's to avoid a runtime dependency; the build/test dependency remains.

> If it's the latter I'd much rather (as a user) take a 17% wallclock
> improvement over a 2% cost of CPU.  I mostly care about my own time, not
> that of the CPU.

Understandable, and you can set tar.tgz.command='gzip -cn' to get the old
behavior.  Saving energy is a better default, though.
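To spell that out: tar.tgz.command is an existing config key, so opting
back into the external gzip is a one-liner.  A small sketch (the
throwaway repo here is just for demonstration; normally you'd set this
in your global or system config):

```shell
# Demonstrate restoring the old behavior via tar.tgz.command.
tmp=$(mktemp -d)
git init -q "$tmp"
# Per-repo here; use --global (or the system config) for all repos.
git -C "$tmp" config tar.tgz.command 'gzip -cn'
git -C "$tmp" config tar.tgz.command   # prints: gzip -cn
```

With that set, "git archive --format=tgz" pipes through the external
gzip again instead of compressing internally.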
The runtime in the real world probably includes lots more I/O time.  The
tests above are repeated and warmed up to get consistent measurements,
but big repos are probably not fully kept in memory like that.

> Can't we have our 6/6 cake much easier and eat it too by learning a
> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
> use the "internal" one?

Interesting idea, but I think the existing config option suffices.  E.g.
a distro could set it in the system-wide config file if/when gzip is
installed.

> Re the "eco mode": I also wonder how much of the overhead you're seeing
> for both that 17% and 2% would go away if you pin both processes to the
> same CPU, I can't recall the command offhand, but IIRC taskset or
> numactl can do that.  I.e. is this really measuring IPC overhead, or
> I-CPU overhead on your system?

I'd expect that running git archive and gzip on the same CPU core takes
more wall-clock time than using zlib, because inflating the object files
and deflating the archive are done sequentially in both scenarios.  I
can't test it on macOS because it doesn't offer a way to pin programs to
a certain core, but e.g. someone with access to a Linux system could
check that using taskset(1).

René
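P.S.: For anyone on Linux who wants to run that pinning experiment, a
sketch of the mechanics (untested here, since macOS lacks CPU pinning;
the trivial gzip round-trip is a stand-in for the real git archive
invocation from the benchmarks above):

```shell
# taskset -c 0 confines the command and its children to CPU core 0
# (Linux-only, from util-linux).  Payload is a placeholder round-trip;
# substitute the git archive command line to measure the pinned case.
printf 'hello' | taskset -c 0 gzip -cn | gzip -dc
```

Comparing that against an unpinned run should show whether the 17%/2%
gap is really IPC overhead or just the second core doing the work.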