Re: git --archive

On 23.09.22 at 18:30, Junio C Hamano wrote:
> "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
>
>> Maybe they can technically be stored in any order, but people don't want
>> git archive to produce non-deterministic archives...
>> ...  I feel like it would be very difficult to achieve the
>> speedups you want and still produce a deterministic archive.
>
> I am not going to work on it myself, but I think the only possible
> parallelism would come from making the reading of F(n+1) and
> subsequent objects overlap the writing of F(n), given a deterministic
> order of files in the resulting archive.  When we decide which file
> should come first, and learn that it is F(0), it probably comes from
> the tree object of the root level, and it is very likely that we would
> already know what F(1) and F(2) are by that time, so it should be
> possible to dispatch reading and applying content filtering on F(1)
> and keep the result in core, while we are still writing F(0) out.

That's what git grep does.  It can be seen as a very lossy compression
with output printed in a deterministic order.

git archive compresses a small file by reading it fully and writing the
result in one go.  Big files are streamed instead, i.e. read, compressed
and written in small pieces.  That won't work as easily if multiple files
are compressed in parallel.

Allowing multiple streams would require storing their results in temporary
files.  Perhaps it would already help to allow only a single stream and to
start it only when its turn to be output has come, though.

Giving up on deterministic order would reduce the memory needed for keeping
compressed small files.  That only matters if the product of
core.bigFileThreshold (default value 512 MiB), the number of parallel
threads and the compression ratio exceeds the available memory.  The same
effect could be achieved by using temporary files.  We'd still have to keep
up to core.bigFileThreshold times the number of threads of uncompressed
data in memory, though.

If I/O latency instead of CPU usage is the limiting factor and prefetching
would help, then starting git grep or git archive in the background might
work.  If the order of visited blobs needs to be randomized, then perhaps
something like this would be better:

   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null

No idea how to randomize the order of tree object visits.

René
