[PATCH v14 0/7] unpack-objects: support streaming blobs to disk

Han Xin <chiyutianyi@xxxxxxxxx> · Fri, 10 Jun 2022 22:46:00 +0800

This series makes "unpack-objects" capable of streaming large objects
to disk.

As 7/7 shows streaming e.g. a 100MB blob now uses ~5MB of memory
instead of ~105MB. This streaming method is slower if you've got
memory to handle the blobs in-core, but if you don't it allows you to
unpack objects at all, as you might otherwise OOM.

Changes since v13:

* Make the error checking in the loop of get_data() the same way as
  we do in the non dry-run mode.

* Add batched disk flushes for stream_loose_object(). This is pointed
  out by Neeraj Singh[1].

* Minor typo/grammar/comment etc. fixes throughout.

1. https://lore.kernel.org/git/7ba4858a-d1cc-a4eb-b6d6-4c04a5dd6ce7@xxxxxxxxx/

Han Xin (4):
  unpack-objects: low memory footprint for get_data() in dry_run mode
  object-file.c: refactor write_loose_object() to several steps
  object-file.c: add "stream_loose_object()" to handle large object
  unpack-objects: use stream_loose_object() to unpack large objects

Ævar Arnfjörð Bjarmason (3):
  object-file.c: do fsync() and close() before post-write die()
  object-file.c: factor out deflate part of write_loose_object()
  core doc: modernize core.bigFileThreshold documentation

 Documentation/config/core.txt   |  33 +++--
 builtin/unpack-objects.c        | 106 ++++++++++++--
 object-file.c                   | 240 +++++++++++++++++++++++++++-----
 object-store.h                  |   8 ++
 t/t5351-unpack-large-objects.sh |  76 ++++++++++
 5 files changed, 408 insertions(+), 55 deletions(-)
 create mode 100755 t/t5351-unpack-large-objects.sh

Range-diff against v13:
1:  6703df6350 ! 1:  bf600a2fa8 unpack-objects: low memory footprint for get_data() in dry_run mode
    @@ Commit message

         Because in dry_run mode, "get_data()" is only used to check the
         integrity of data, and the returned buffer is not used at all, we can
    -    allocate a smaller buffer and reuse it as zstream output. Therefore,
    -    in dry_run mode, "get_data()" will release the allocated buffer and
    -    return NULL instead of returning garbage data.
    +    allocate a smaller buffer and use it as zstream output. Make the function
    +    return NULL in the dry-run mode, as no callers use the returned buffer.

         The "find [...]objects/?? -type f | wc -l" test idiom being used here
         is adapted from the same "find" use added to another test in
    @@ builtin/unpack-objects.c: static void use(int bytes)
      }

     +/*
    -+ * Decompress zstream from stdin and return specific size of data.
    ++ * Decompress zstream from the standard input into a newly
    ++ * allocated buffer of specified size and return the buffer.
     + * The caller is responsible to free the returned buffer.
     + *
     + * But for dry_run mode, "get_data()" is only used to check the
    @@ builtin/unpack-objects.c: static void *get_data(unsigned long size)
     +		if (dry_run) {
     +			/* reuse the buffer in dry_run mode */
     +			stream.next_out = buf;
    -+			stream.avail_out = bufsize;
    ++			stream.avail_out = bufsize > size - stream.total_out ?
    ++						   size - stream.total_out :
    ++						   bufsize;
     +		}
      	}
      	git_inflate_end(&stream);
2:  6e289d25c1 = 2:  a327f484f7 object-file.c: do fsync() and close() before post-write die()
3:  46f9def06c ! 3:  9bc8002282 object-file.c: refactor write_loose_object() to several steps
    @@ object-file.c: static int create_tmpfile(struct strbuf *tmp, const char *filenam
     + *
     + * - End the compression of zlib stream.
     + * - Get the calculated oid to "oid".
    -+ * - fsync() and close() the "fd"
     + */
     +static int end_loose_object_common(git_hash_ctx *c, git_zstream *stream,
     +				   struct object_id *oid)
4:  5a95ebede6 = 4:  7c73815f18 object-file.c: factor out deflate part of write_loose_object()
5:  26847541aa ! 5:  28a9588f9c object-file.c: add "stream_loose_object()" to handle large object
    @@ Commit message
         path.

         "freshen_packed_object()" or "freshen_loose_object()" will be called
    -    inside "stream_loose_object()" after obtaining the "oid".
    +    inside "stream_loose_object()" after obtaining the "oid". After the
    +    temporary file is written, we wants to mark the object to recent and we
    +    may find that where indeed is already the object. We should remove the
    +    temporary and do not leave a new copy of the object.

         Helped-by: René Scharfe <l.s.r@xxxxxx>
         Helped-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
     +	char hdr[MAX_HEADER_LEN];
     +	int hdrlen;
     +
    ++	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
    ++		prepare_loose_object_bulk_checkin();
    ++
     +	/* Since oid is not determined, save tmp file to odb path. */
     +	strbuf_addf(&filename, "%s/", get_object_directory());
     +	hdrlen = format_object_header(hdr, sizeof(hdr), OBJ_BLOB, len);
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
     +		die(_("write stream object %ld != %"PRIuMAX), stream.total_in,
     +		    (uintmax_t)len + hdrlen);
     +
    -+	/* Common steps for write_loose_object and stream_loose_object to
    ++	/*
    ++	 * Common steps for write_loose_object and stream_loose_object to
     +	 * end writing loose oject:
     +	 *
     +	 *  - End the compression of zlib stream.
6:  eb962b60b9 ! 6:  dea5c4172b core doc: modernize core.bigFileThreshold documentation
    @@ Documentation/config/core.txt: You probably do not need to adjust this value.
     +Files above the configured limit will be:
      +
     -Common unit suffixes of 'k', 'm', or 'g' are supported.
    -+* Stored deflated, without attempting delta compression.
    ++* Stored deflated in packfiles, without attempting delta compression.
     ++
     +The default limit is primarily set with this use-case in mind. With it
     +most projects will have their source code and other text files delta
    @@ Documentation/config/core.txt: You probably do not need to adjust this value.
     +usage, at the slight expense of increased disk usage.
     ++
     +* Will be treated as if though they were labeled "binary" (see
    -+  linkgit:gitattributes[5]). This means that e.g. linkgit:git-log[1]
    -+  and linkgit:git-diff[1] will not diffs for files above this limit.
    ++  linkgit:gitattributes[5]). e.g. linkgit:git-log[1] and
    ++  linkgit:git-diff[1] will not diffs for files above this limit.
     ++
     +* Will be generally be streamed when written, which avoids excessive
     +memory usage, at the cost of some fixed overhead. Commands that make
7:  88a2754fcb ! 7:  d236230a4c unpack-objects: use stream_loose_object() to unpack large objects
    @@ t/t5351-unpack-large-objects.sh: test_description='git unpack-objects with large
      }

      test_expect_success "create large objects (1.5 MB) and PACK" '
    +@@ t/t5351-unpack-large-objects.sh: test_expect_success "create large objects (1.5 MB) and PACK" '
    + 	test_commit --append foo big-blob &&
    + 	test-tool genrandom bar 1500000 >big-blob &&
    + 	test_commit --append bar big-blob &&
    +-	PACK=$(echo HEAD | git pack-objects --revs pack)
    ++	PACK=$(echo HEAD | git pack-objects --revs pack) &&
    ++	git verify-pack -v pack-$PACK.pack >out &&
    ++	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\).*/\1/p" \
    ++		<out >obj-list
    + '
    + 
    + test_expect_success 'set memory limitation to 1MB' '
     @@ t/t5351-unpack-large-objects.sh: test_expect_success 'set memory limitation to 1MB' '
      '

    @@ t/t5351-unpack-large-objects.sh: test_expect_success 'set memory limitation to 1
     +	test_dir_is_empty dest.git/objects/pack
     +'
     +
    ++BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
    ++
    ++test_expect_success 'unpack big object in stream (core.fsyncmethod=batch)' '
    ++	prepare_dest 1m &&
    ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
    ++		git -C dest.git $BATCH_CONFIGURATION unpack-objects <pack-$PACK.pack &&
    ++	grep fsync/hardware-flush trace2.txt &&
    ++	test_dir_is_empty dest.git/objects/pack &&
    ++	git -C dest.git cat-file --batch-check="%(objectname)" <obj-list >current &&
    ++	cmp obj-list current
    ++'
    ++
     +test_expect_success 'do not unpack existing large objects' '
     +	prepare_dest 1m &&
     +	git -C dest.git index-pack --stdin <pack-$PACK.pack &&
-- 
2.36.1