Re: [PATCH v3 0/5] Avoid spawning gzip in git archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi René,

On Sun, 12 Jun 2022, René Scharfe wrote:

> It's been a while, let's try again.

Thank you for picking this up again!

> Changes:
> - Use our own zlib helpers instead of the gz* functions of zlib,
> - ... which allows us to set the OS_CODE header consistently.
> - Pseudo-command "git archive gzip" to select the internal
>   implementation in config.
> - Use a function pointer to plug in the internal gzip.
> - Tests.
> - Discuss performance in commit message.

Makes sense. Here is the range-diff:

-- snip --
-:  ----------- > 1:  9847267888e archive: rename archiver data field to filter_command
1:  7d50f52e490 ! 2:  a98ef655af9 archive: factor out writing blocks into a separate function
    @@
      ## Metadata ##
    -Author: Rohit Ashiwal <rohit.ashiwal265@xxxxxxxxx>
    +Author: René Scharfe <l.s.r@xxxxxx>

      ## Commit message ##
    -    archive: factor out writing blocks into a separate function
    +    archive-tar: factor out write_block()

    -    The `git archive --format=tgz` command spawns `gzip` to perform the
    -    actual compression. However, the MinGit flavor of Git for Windows
    -    comes without `gzip` bundled inside.
    +    All tar archive writes have the same size and are done to the same file
    +    descriptor.  Move them to a common function, write_block(), to reduce
    +    code duplication and make it easy to change the destination.

    -    To help with that, we will teach `git archive` to let zlib perform the
    -    actual compression.
    -
    -    In preparation for this, we consolidate all the block writes into the
    -    function `write_block_or_die()`.
    -
    -    Note: .tar files have a well-defined, fixed block size. For that reason,
    -    it does not make any sense to pass anything but a fully-populated,
    -    full-length block to the `write_block_or_die()` function, and we can
    -    save ourselves some future trouble by not even allowing to pass an
    -    incorrect `size` parameter to it.
    -
    -    Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@xxxxxxxxx>
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
    +    Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@xxxxxxxxx>
    +    Signed-off-by: René Scharfe <l.s.r@xxxxxx>
    +    Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>

      ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
      #define USTAR_MAX_MTIME 077777777777ULL
      #endif

    -+/* writes out the whole block, or dies if fails */
    -+static void write_block_or_die(const char *block) {
    -+	write_or_die(1, block, BLOCKSIZE);
    ++static void write_block(const void *buf)
    ++{
    ++	write_or_die(1, buf, BLOCKSIZE);
     +}
     +
      /* writes out the whole block, but only if it is full */
    @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
      {
      	if (offset == BLOCKSIZE) {
     -		write_or_die(1, block, BLOCKSIZE);
    -+		write_block_or_die(block);
    ++		write_block(block);
      		offset = 0;
      	}
      }
    @@ archive-tar.c: static void do_write_blocked(const void *data, unsigned long size
      	}
      	while (size >= BLOCKSIZE) {
     -		write_or_die(1, buf, BLOCKSIZE);
    -+		write_block_or_die(buf);
    ++		write_block(buf);
      		size -= BLOCKSIZE;
      		buf += BLOCKSIZE;
      	}
    @@ archive-tar.c: static void write_trailer(void)
      	int tail = BLOCKSIZE - offset;
      	memset(block + offset, 0, tail);
     -	write_or_die(1, block, BLOCKSIZE);
    -+	write_block_or_die(block);
    ++	write_block(block);
      	if (tail < 2 * RECORDSIZE) {
      		memset(block, 0, offset);
     -		write_or_die(1, block, BLOCKSIZE);
    -+		write_block_or_die(block);
    ++		write_block(block);
      	}
      }

2:  ac2b2488a1b < -:  ----------- archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
3:  4ea94a87848 ! 3:  5e3c0d79589 archive: optionally use zlib directly for gzip compression
    @@
      ## Metadata ##
    -Author: Rohit Ashiwal <rohit.ashiwal265@xxxxxxxxx>
    +Author: René Scharfe <l.s.r@xxxxxx>

      ## Commit message ##
    -    archive: optionally use zlib directly for gzip compression
    +    archive-tar: add internal gzip implementation

    -    As we already link to the zlib library, we can perform the compression
    -    without even requiring gzip on the host machine.
    +    Git uses zlib for its own object store, but calls gzip when creating tgz
    +    archives.  Add an option to perform the gzip compression for the latter
    +    using zlib, without depending on the external gzip binary.

    -    Note: the `-n` flag that `git archive` passed to `gzip` wants to ensure
    -    that a reproducible file is written, i.e. no filename or mtime will be
    -    recorded in the compressed output. This is already the default for
    -    zlib's `gzopen()` function (if the file name or mtime should be
    -    recorded, the `deflateSetHeader()` function would have to be called
    -    instead).
    +    Plug it in by making write_block a function pointer and switching to a
    +    compressing variant if the filter command has the magic value "git
    +    archive gzip".  Does that indirection slow down tar creation?  Not
    +    really, at least not in this test:

    -    Note also that the `gzFile` datatype is defined as a pointer in
    -    `zlib.h`, i.e. we can rely on the fact that it can be `NULL`.
    +    $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
    +    './git -C ../linux archive --format=tar HEAD # {rev}'
    +    Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
    +      Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
    +      Range (min … max):    4.038 s …  4.059 s    10 runs

    -    At this point, this new mode is hidden behind the pseudo command
    -    `:zlib`: assign this magic string to the `archive.tgz.command` config
    -    setting to enable it.
    +    Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
    +      Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
    +      Range (min … max):    4.038 s …  4.066 s    10 runs

    -    Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@xxxxxxxxx>
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
    +    How does tgz creation perform?

    - ## archive-tar.c ##
    -@@ archive-tar.c: static unsigned long offset;
    -
    - static int tar_umask = 002;
    -
    -+static gzFile gzip;
    -+
    - static int write_tar_filter_archive(const struct archiver *ar,
    - 				    struct archiver_args *args);
    +    $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
    +    './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
    +    Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
    +      Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
    +      Range (min … max):   20.395 s … 20.414 s    10 runs
    +
    +    Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
    +      Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
    +      Range (min … max):   23.782 s … 23.857 s    10 runs
    +
    +    Summary
    +      './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
    +        1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
    +
    +    So the internal implementation takes 17% longer on the Linux repo, but
    +    uses 2% less CPU time.  That's because the external gzip can run in
    +    parallel on its own processor, while the internal one works sequentially
    +    and avoids the inter-process communication overhead.
    +
    +    What are the benefits?  Only an internal sequential implementation can
    +    offer this eco mode, and it allows avoiding the gzip(1) requirement.
    +
    +    This implementation uses the helper functions from our zlib.c instead of
    +    the convenient gz* functions from zlib, because the latter doesn't give
    +    the control over the generated gzip header that the next patch requires.
    +
    +    Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@xxxxxxxxx>
    +    Signed-off-by: René Scharfe <l.s.r@xxxxxx>
    +    Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
    +
    + ## Documentation/git-archive.txt ##
    +@@ Documentation/git-archive.txt: tar.<format>.command::
    + 	format is given.
    + +
    + The "tar.gz" and "tgz" formats are defined automatically and default to
    +-`gzip -cn`. You may override them with custom commands.
    ++`gzip -cn`. You may override them with custom commands. An internal gzip
    ++implementation can be used by specifying the value `git archive gzip`.

    + tar.<format>.remote::
    + 	If true, enable `<format>` for use by remote clients via
    +
    + ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    + #define USTAR_MAX_MTIME 077777777777ULL
    + #endif

    - /* writes out the whole block, or dies if fails */
    - static void write_block_or_die(const char *block) {
    --	write_or_die(1, block, BLOCKSIZE);
    -+	if (!gzip)
    -+		write_or_die(1, block, BLOCKSIZE);
    -+	else if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
    -+		die(_("gzwrite failed"));
    +-static void write_block(const void *buf)
    ++static void tar_write_block(const void *buf)
    + {
    + 	write_or_die(1, buf, BLOCKSIZE);
      }

    ++static void (*write_block)(const void *) = tar_write_block;
    ++
      /* writes out the whole block, but only if it is full */
    -@@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    - 	filter.use_shell = 1;
    - 	filter.in = -1;
    + static void write_if_needed(void)
    + {
    +@@ archive-tar.c: static int write_tar_archive(const struct archiver *ar,
    + 	return err;
    + }

    --	if (start_command(&filter) < 0)
    --		die_errno(_("unable to start '%s' filter"), argv[0]);
    --	close(1);
    --	if (dup2(filter.in, 1) < 0)
    --		die_errno(_("unable to redirect descriptor"));
    --	close(filter.in);
    -+	if (!strcmp(":zlib", ar->data)) {
    -+		struct strbuf mode = STRBUF_INIT;
    ++static git_zstream gzstream;
    ++static unsigned char outbuf[16384];
     +
    -+		strbuf_addstr(&mode, "wb");
    ++static void tgz_deflate(int flush)
    ++{
    ++	while (gzstream.avail_in || flush == Z_FINISH) {
    ++		int status = git_deflate(&gzstream, flush);
    ++		if (!gzstream.avail_out || status == Z_STREAM_END) {
    ++			write_or_die(1, outbuf, gzstream.next_out - outbuf);
    ++			gzstream.next_out = outbuf;
    ++			gzstream.avail_out = sizeof(outbuf);
    ++			if (status == Z_STREAM_END)
    ++				break;
    ++		}
    ++		if (status != Z_OK && status != Z_BUF_ERROR)
    ++			die(_("deflate error (%d)"), status);
    ++	}
    ++}
     +
    -+		if (args->compression_level >= 0 && args->compression_level <= 9)
    -+			strbuf_addf(&mode, "%d", args->compression_level);
    ++static void tgz_write_block(const void *data)
    ++{
    ++	gzstream.next_in = (void *)data;
    ++	gzstream.avail_in = BLOCKSIZE;
    ++	tgz_deflate(Z_NO_FLUSH);
    ++}
     +
    -+		gzip = gzdopen(fileno(stdout), mode.buf);
    -+		if (!gzip)
    -+			die(_("Could not gzdopen stdout"));
    -+		strbuf_release(&mode);
    -+	} else {
    -+		if (start_command(&filter) < 0)
    -+			die_errno(_("unable to start '%s' filter"), argv[0]);
    -+		close(1);
    -+		if (dup2(filter.in, 1) < 0)
    -+			die_errno(_("unable to redirect descriptor"));
    -+		close(filter.in);
    -+	}
    -
    - 	r = write_tar_archive(ar, args);
    ++static const char internal_gzip_command[] = "git archive gzip";
    ++
    + static int write_tar_filter_archive(const struct archiver *ar,
    + 				    struct archiver_args *args)
    + {
    +@@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    + 	if (!ar->filter_command)
    + 		BUG("tar-filter archiver called with no filter defined");

    --	close(1);
    --	if (finish_command(&filter) != 0)
    --		die(_("'%s' filter reported error"), argv[0]);
    -+	if (gzip) {
    -+		int ret = gzclose(gzip);
    -+		if (ret == Z_ERRNO)
    -+			die_errno(_("gzclose failed"));
    -+		else if (ret != Z_OK)
    -+			die(_("gzclose failed (%d)"), ret);
    -+	} else {
    -+		close(1);
    -+		if (finish_command(&filter) != 0)
    -+			die(_("'%s' filter reported error"), argv[0]);
    ++	if (!strcmp(ar->filter_command, internal_gzip_command)) {
    ++		write_block = tgz_write_block;
    ++		git_deflate_init_gzip(&gzstream, args->compression_level);
    ++		gzstream.next_out = outbuf;
    ++		gzstream.avail_out = sizeof(outbuf);
    ++
    ++		r = write_tar_archive(ar, args);
    ++
    ++		tgz_deflate(Z_FINISH);
    ++		git_deflate_end(&gzstream);
    ++		return r;
     +	}
    ++
    + 	strbuf_addstr(&cmd, ar->filter_command);
    + 	if (args->compression_level >= 0)
    + 		strbuf_addf(&cmd, " -%d", args->compression_level);
    +
    + ## t/t5000-tar-tree.sh ##
    +@@ t/t5000-tar-tree.sh: test_expect_success GZIP 'remote tar.gz can be disabled' '
    + 		>remote.tar.gz
    + '

    - 	strbuf_release(&cmd);
    - 	return r;
    ++test_expect_success 'git archive --format=tgz (internal gzip)' '
    ++	test_config tar.tgz.command "git archive gzip" &&
    ++	git archive --format=tgz HEAD >internal_gzip.tgz
    ++'
    ++
    ++test_expect_success 'git archive --format=tar.gz (internal gzip)' '
    ++	test_config tar.tar.gz.command "git archive gzip" &&
    ++	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
    ++	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
    ++'
    ++
    ++test_expect_success GZIP 'extract tgz file (internal gzip)' '
    ++	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
    ++	test_cmp_bin b.tar internal_gzip.tar
    ++'
    ++
    + test_expect_success 'archive and :(glob)' '
    + 	git archive -v HEAD -- ":(glob)**/sh" >/dev/null 2>actual &&
    + 	cat >expect <<EOF &&
-:  ----------- > 4:  af27bea4fc3 archive-tar: use OS_CODE 3 (Unix) for internal gzip
4:  0e5826e3f25 ! 5:  62038b8e911 archive: use the internal zlib-based gzip compression by default
    @@
      ## Metadata ##
    -Author: Johannes Schindelin <Johannes.Schindelin@xxxxxx>
    +Author: René Scharfe <l.s.r@xxxxxx>

      ## Commit message ##
    -    archive: use the internal zlib-based gzip compression by default
    +    archive-tar: use internal gzip by default

    -    We just introduced support for compressing `.tar.gz` archives in the
    -    `git archive` process itself, using zlib directly instead of spawning
    -    `gzip`.
    +    Drop the dependency on gzip(1) and use our internal implementation to
    +    create tar.gz and tgz files.

    -    While this takes less CPU time overall, on multi-core machines, this is
    -    slightly slower in terms of wall clock time (it seems to be in the
    -    ballpark of 15%).
    -
    -    It does reduce the number of dependencies by one, though, which makes it
    -    desirable to turn that mode on by default.
    -
    -    Changing the default benefits most notably the MinGit flavor of Git for
    -    Windows (which intends to support 3rd-party applications that want to
    -    use Git and want to bundle a minimal set of files for that purpose, i.e.
    -    stripping out all non-essential files such as interactive commands,
    -    Perl, and yes, also `gzip`).
    -
    -    We also can now remove the `GZIP` prerequisite from quite a number of
    -    test cases in `t/t5000-tar-tree.sh`.
    -
    -    This closes https://github.com/git-for-windows/git/issues/1970
    -
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
    +    Signed-off-by: René Scharfe <l.s.r@xxxxxx>
    +    Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
      	format is given.
      +
      The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands.
    -+`:zlib`, triggering an in-process gzip compression. You may override
    -+them with custom commands, e.g. `gzip -cn` or `pigz -cn`.
    +-`gzip -cn`. You may override them with custom commands. An internal gzip
    +-implementation can be used by specifying the value `git archive gzip`.
    ++the magic value `git archive gzip`, which invokes an internal
    ++implementation of gzip. You may override them with custom commands.

      tar.<format>.remote::
      	If true, enable `<format>` for use by remote clients via
    @@ archive-tar.c: void init_tar_archiver(void)
      	register_archiver(&tar_archiver);

     -	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
    -+	tar_filter_config("tar.tgz.command", ":zlib", NULL);
    ++	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
      	tar_filter_config("tar.tgz.remote", "true", NULL);
     -	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
    -+	tar_filter_config("tar.tar.gz.command", ":zlib", NULL);
    ++	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
      	tar_filter_config("tar.tar.gz.remote", "true", NULL);
      	git_config(git_tar_config, NULL);
      	for (i = 0; i < nr_tar_filters; i++) {
    @@ t/t5000-tar-tree.sh: test_expect_success 'only enabled filters are available rem
      	git archive --output=j3.tar.gz HEAD &&
      	test_cmp_bin j.tgz j3.tar.gz
      '
    -
    -+test_expect_success 'use `archive.tgz.command=:zlib` explicitly' '
    -+	git -c archive.tgz.command=:zlib archive --output=j4.tgz HEAD &&
    -+	test_cmp_bin j.tgz j4.tgz
    -+'
    -+
    - test_expect_success GZIP 'extract tgz file' '
    - 	gzip -d -c <j.tgz >j.tar &&
    +@@ t/t5000-tar-tree.sh: test_expect_success GZIP 'extract tgz file' '
      	test_cmp_bin b.tar j.tar
      '

    @@ t/t5000-tar-tree.sh: test_expect_success 'only enabled filters are available rem
      	git config tar.tar.gz.remote false &&
      	test_must_fail git archive --remote=. --format=tar.gz HEAD \
      		>remote.tar.gz
    + '
    +
    +-test_expect_success 'git archive --format=tgz (internal gzip)' '
    +-	test_config tar.tgz.command "git archive gzip" &&
    +-	git archive --format=tgz HEAD >internal_gzip.tgz
    ++test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
    ++	test_config tar.tgz.command "gzip -cn" &&
    ++	git archive --format=tgz HEAD >external_gzip.tgz
    + '
    +
    +-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
    +-	test_config tar.tar.gz.command "git archive gzip" &&
    +-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
    +-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
    ++test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
    ++	test_config tar.tar.gz.command "gzip -cn" &&
    ++	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
    ++	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
    + '
    +
    +-test_expect_success GZIP 'extract tgz file (internal gzip)' '
    +-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
    +-	test_cmp_bin b.tar internal_gzip.tar
    ++test_expect_success GZIP 'extract tgz file (external gzip)' '
    ++	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
    ++	test_cmp_bin b.tar external_gzip.tar
    + '
    +
    + test_expect_success 'archive and :(glob)' '
-- snap --

All of these changes look sensible to me, and the performance
implications, while at first glance unfavorable because wallclock time
increases even as CPU time decreases, are actually quite good. As Peff
said in
https://lore.kernel.org/git/20190501181807.GC4109@xxxxxxxxxxxxxxxxxxxxx/t/#u

> [...] whatever has the lowest overall CPU time is generally preferable
> [...]

By the way, the main reason why I did not work more is that in
http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
Mark Adler (the zlib maintainer) announced that...

> [...] There are many well-tested performance improvements in zlib
> waiting in the wings that will be incorporated over the next several
> months. [...]

This was in December 2019. And now it's June 2022 and I kind of wonder
whether those promised improvements will still come.

In the meantime, however, a viable alternative seems to have cropped up:
https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
zlib should have become after above-quoted announcement.

In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
be very interesting and I would not be completely surprised if building
Git with your patches and linking against zlib-ng would paint a very
favorable picture not only in terms of CPU time but also in terms of
wallclock time. Sadly, I have not been able to set aside time to look into
that angle, but maybe I can peak your interest?

Thanks,
Dscho

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux