This series makes "unpack-objects" capable of streaming large objects
to disk.

As 7/7 shows, streaming e.g. a 100MB blob now uses ~5MB of memory
instead of ~105MB. This streaming method is slower if you've got
memory to handle the blobs in-core, but if you don't it allows you to
unpack objects at all, as you might otherwise OOM.

Changes since v13:

* Make the error checking in the loop of get_data() the same as in
  the non-dry-run mode.

* Add batched disk flushes for stream_loose_object(). This was pointed
  out by Neeraj Singh[1].

* Minor typo/grammar/comment etc. fixes throughout.

1. https://lore.kernel.org/git/7ba4858a-d1cc-a4eb-b6d6-4c04a5dd6ce7@xxxxxxxxx/

Han Xin (4):
  unpack-objects: low memory footprint for get_data() in dry_run mode
  object-file.c: refactor write_loose_object() to several steps
  object-file.c: add "stream_loose_object()" to handle large object
  unpack-objects: use stream_loose_object() to unpack large objects

Ævar Arnfjörð Bjarmason (3):
  object-file.c: do fsync() and close() before post-write die()
  object-file.c: factor out deflate part of write_loose_object()
  core doc: modernize core.bigFileThreshold documentation

 Documentation/config/core.txt   |  33 +++--
 builtin/unpack-objects.c        | 106 ++++++++++++--
 object-file.c                   | 240 +++++++++++++++++++++++++++-----
 object-store.h                  |   8 ++
 t/t5351-unpack-large-objects.sh |  76 ++++++++++
 5 files changed, 408 insertions(+), 55 deletions(-)
 create mode 100755 t/t5351-unpack-large-objects.sh

Range-diff against v13:
1:  6703df6350 ! 1:  bf600a2fa8 unpack-objects: low memory footprint for get_data() in dry_run mode
    @@ Commit message
         Because in dry_run mode, "get_data()" is only used to check the
         integrity of data, and the returned buffer is not used at all, we can
    -    allocate a smaller buffer and reuse it as zstream output. Therefore,
    -    in dry_run mode, "get_data()" will release the allocated buffer and
    -    return NULL instead of returning garbage data.
    +    allocate a smaller buffer and use it as zstream output. Make the function
    +    return NULL in the dry-run mode, as no callers use the returned buffer.

         The "find [...]objects/?? -type f | wc -l" test idiom being used here
         is adapted from the same "find" use added to another test in
    @@ builtin/unpack-objects.c: static void use(int bytes)
      }

     +/*
    -+ * Decompress zstream from stdin and return specific size of data.
    ++ * Decompress zstream from the standard input into a newly
    ++ * allocated buffer of specified size and return the buffer.
     + * The caller is responsible to free the returned buffer.
     + *
     + * But for dry_run mode, "get_data()" is only used to check the
    @@ builtin/unpack-objects.c: static void *get_data(unsigned long size)
     +		if (dry_run) {
     +			/* reuse the buffer in dry_run mode */
     +			stream.next_out = buf;
    -+			stream.avail_out = bufsize;
    ++			stream.avail_out = bufsize > size - stream.total_out ?
    ++						   size - stream.total_out :
    ++						   bufsize;
     +		}
      	}
      	git_inflate_end(&stream);
2:  6e289d25c1 = 2:  a327f484f7 object-file.c: do fsync() and close() before post-write die()
3:  46f9def06c ! 3:  9bc8002282 object-file.c: refactor write_loose_object() to several steps
    @@ object-file.c: static int create_tmpfile(struct strbuf *tmp, const char *filenam
     + *
     + * - End the compression of zlib stream.
     + * - Get the calculated oid to "oid".
    -+ * - fsync() and close() the "fd"
     + */
     +static int end_loose_object_common(git_hash_ctx *c, git_zstream *stream,
     +				   struct object_id *oid)
4:  5a95ebede6 = 4:  7c73815f18 object-file.c: factor out deflate part of write_loose_object()
5:  26847541aa ! 5:  28a9588f9c object-file.c: add "stream_loose_object()" to handle large object
    @@ Commit message
         path.

         "freshen_packed_object()" or "freshen_loose_object()" will be called
    -    inside "stream_loose_object()" after obtaining the "oid".
    +    inside "stream_loose_object()" after obtaining the "oid". After the
    +    temporary file is written, we wants to mark the object to recent and we
    +    may find that where indeed is already the object. We should remove the
    +    temporary and do not leave a new copy of the object.

         Helped-by: René Scharfe <l.s.r@xxxxxx>
         Helped-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
     +	char hdr[MAX_HEADER_LEN];
     +	int hdrlen;
     +
    ++	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
    ++		prepare_loose_object_bulk_checkin();
    ++
     +	/* Since oid is not determined, save tmp file to odb path. */
     +	strbuf_addf(&filename, "%s/", get_object_directory());
     +	hdrlen = format_object_header(hdr, sizeof(hdr), OBJ_BLOB, len);
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
     +		die(_("write stream object %ld != %"PRIuMAX), stream.total_in,
     +		    (uintmax_t)len + hdrlen);
     +
    -+	/* Common steps for write_loose_object and stream_loose_object to
    ++	/*
    ++	 * Common steps for write_loose_object and stream_loose_object to
     +	 * end writing loose oject:
     +	 *
     +	 * - End the compression of zlib stream.
6:  eb962b60b9 ! 6:  dea5c4172b core doc: modernize core.bigFileThreshold documentation
    @@ Documentation/config/core.txt: You probably do not need to adjust this value.
     +Files above the configured limit will be:
     +
     -Common unit suffixes of 'k', 'm', or 'g' are supported.
    -+* Stored deflated, without attempting delta compression.
    ++* Stored deflated in packfiles, without attempting delta compression.
     ++
     +The default limit is primarily set with this use-case in mind. With it
     +most projects will have their source code and other text files delta
    @@ Documentation/config/core.txt: You probably do not need to adjust this value.
     +usage, at the slight expense of increased disk usage.
     ++
     +* Will be treated as if though they were labeled "binary" (see
    -+  linkgit:gitattributes[5]). This means that e.g. linkgit:git-log[1]
    -+  and linkgit:git-diff[1] will not diffs for files above this limit.
    ++  linkgit:gitattributes[5]). e.g. linkgit:git-log[1] and
    ++  linkgit:git-diff[1] will not diffs for files above this limit.
     ++
     +* Will be generally be streamed when written, which avoids excessive
     +memory usage, at the cost of some fixed overhead. Commands that make
7:  88a2754fcb ! 7:  d236230a4c unpack-objects: use stream_loose_object() to unpack large objects
    @@ t/t5351-unpack-large-objects.sh: test_description='git unpack-objects with large
      }

      test_expect_success "create large objects (1.5 MB) and PACK" '
    +@@ t/t5351-unpack-large-objects.sh: test_expect_success "create large objects (1.5 MB) and PACK" '
    + 	test_commit --append foo big-blob &&
    + 	test-tool genrandom bar 1500000 >big-blob &&
    + 	test_commit --append bar big-blob &&
    +-	PACK=$(echo HEAD | git pack-objects --revs pack)
    ++	PACK=$(echo HEAD | git pack-objects --revs pack) &&
    ++	git verify-pack -v pack-$PACK.pack >out &&
    ++	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\).*/\1/p" \
    ++		<out >obj-list
    + '
    +
    + test_expect_success 'set memory limitation to 1MB' '
     @@ t/t5351-unpack-large-objects.sh: test_expect_success 'set memory limitation to 1MB' '
      '

    @@ t/t5351-unpack-large-objects.sh: test_expect_success 'set memory limitation to 1
     +	test_dir_is_empty dest.git/objects/pack
     +'
     +
    ++BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
    ++
    ++test_expect_success 'unpack big object in stream (core.fsyncmethod=batch)' '
    ++	prepare_dest 1m &&
    ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
    ++	git -C dest.git $BATCH_CONFIGURATION unpack-objects <pack-$PACK.pack &&
    ++	grep fsync/hardware-flush trace2.txt &&
    ++	test_dir_is_empty dest.git/objects/pack &&
    ++	git -C dest.git cat-file --batch-check="%(objectname)" <obj-list >current &&
    ++	cmp obj-list current
    ++'
    ++
     +test_expect_success 'do not unpack existing large objects' '
     +	prepare_dest 1m &&
     +	git -C dest.git index-pack --stdin <pack-$PACK.pack &&
-- 
2.36.1