On 23.09.22 18:30, Junio C Hamano wrote:
> "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
>
>> Maybe they can technically be stored in any order, but people don't want
>> git archive to produce non-deterministic archives...
>> ... I feel like it would be very difficult to achieve the
>> speedups you want and still produce a deterministic archive.
>
> I am not going to work on it myself, but I think the only possible
> parallelism would come from making the reading for F(n+1) and
> subsequent objects overlap writing of F(n), given a deterministic
> order of files in the resulting archive.  When we decide which file
> should come first, and learns that it is F(0), it probably comes the
> tree object of the root level, and it is very likely that we would
> already know what F(1) and F(2) are by that time, so it should be
> possible to dispatch reading and applying content filtering on F(1)
> and keeping the result in core, while we are still writing F(0) out.

That's what git grep does.  It can be seen as a very lossy compression
with output printed in a deterministic order.

git archive compresses a small file by reading it fully and writing the
result in one go.  It streams big files, though, i.e. it reads,
compresses and writes them in small pieces.  That won't work as easily
if multiple files are compressed in parallel.  Allowing multiple streams
would require storing their results in temporary files.  Perhaps it
would already help to allow only a single stream and start it only when
its turn to be output comes, though.

Giving up on deterministic order would reduce the memory usage for
keeping compressed small files.  That only matters if the product of
core.bigFileThreshold (default value 512 MB), the number of parallel
threads, and the compression ratio exceeds the available memory.  The
same effect could be achieved by using temporary files.  We'd still have
to keep up to core.bigFileThreshold times the number of threads of
uncompressed data in memory, though.
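To make that bound concrete, a back-of-the-envelope sketch (the thread
count and the worst-case ratio of ~1 for incompressible data are assumed
values, not measurements):

```shell
# Worst-case size of buffered compressed results with parallel workers:
# core.bigFileThreshold x number of threads x compression ratio.
threshold=$((512 * 1024 * 1024))   # core.bigFileThreshold default, in bytes
threads=8                          # assumed number of parallel workers
echo $((threshold * threads))      # ratio ~1 (incompressible): prints 4294967296
```

So with eight workers and incompressible input, up to 4 GiB of
compressed results could pile up in core before being written out.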
If I/O latency rather than CPU usage is the limiting factor and
prefetching would help, then starting git grep or git archive in the
background might work.  If the order of visited blobs needs to be
randomized, then perhaps something like this would be better:

   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null

No idea how to randomize the order of tree object visits.

René
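For a truly random order, `shuf` (GNU coreutils) could replace the
`sort` step, which only reorders by object ID.  A minimal sketch; the
throwaway repository here is illustrative only, so the pipeline has
data to read:

```shell
# Set up a throwaway repo (illustrative only).
dir=$(mktemp -d) && cd "$dir"
git init -q
echo one >a.txt
echo two >b.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm init

# $3 of ls-tree's output is the object ID; shuf randomizes the order
# before cat-file reads each blob.
git ls-tree -r HEAD | awk '{print $3}' | shuf |
        git cat-file --batch >/dev/null && echo done
```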