On Tue, Mar 29, 2022 at 03:56:10PM +0200, Ævar Arnfjörð Bjarmason wrote:
> From: Han Xin <hanxin.hx@xxxxxxxxxxxxxxx>
>
> If we want to unpack and write a loose object using "write_loose_object()",
> we have to feed it a buffer of the same size as the object, which
> consumes lots of memory and may cause OOM. This can be improved by
> feeding data to "stream_loose_object()" in a stream.
>
> Add a new function "stream_loose_object()", which is a streaming version
> of "write_loose_object()" with a low memory footprint. We will use this
> function to unpack large blob objects in a later commit.

Just a thought for an optimization you might want to try on top of this
series: use mmap on both the source and target files of your stream. Use
a big 'window' for the mmap (multiple MB) to reduce the TLB flush costs;
those costs should be minimal anyway if Git is single-threaded.

If you can point zlib's source and target buffers at the source and
destination mappings respectively, you'd eliminate two copies of data
into Git's stack buffers.

You might need to over-allocate the destination file if you don't know
the size up front, but an over-allocate followed by a truncate should be
pretty cheap if you're working with a big file.

Thanks,
Neeraj