RE: git --archive

Hi Rene,

>  git archive compresses a small file by reading it fully and writing the result in one go.  It streams big files, though, i.e. reads, compresses and writes them in small pieces.  That won't work as easily if multiple files are compressed in parallel.

The ask was not to parallelize the compression step (scaling CPU efficiency locally), but to heat up the *remote filesystem* metadata and data caches with massively parallel/multithreaded reads, so that the existing, strictly sequential, one-after-another archiving steps complete faster.
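For illustration, a minimal warm-up sketch, assuming the object store lives on the remote filesystem under .git/objects and that GNU xargs with -P is available (the path and the degree of parallelism are placeholders):

  # Warm the remote filesystem's metadata and data caches with 16 parallel
  # readers, then run the unchanged, strictly sequential archive step.
  find .git/objects -type f -print0 | xargs -0 -P 16 -n 32 cat >/dev/null
  git archive -o snapshot.tar.gz HEAD

The archive step itself stays untouched; it merely finds the caches already hot.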

> Giving up on deterministic order would reduce the memory usage for keeping compressed small files.  

That was just one idea. Retaining a deterministic order, so that the resulting archive files have stable secure hashes, is a valuable property.

> That only matters if the product of core.bigFileThreshold (default value 512 MB), the number of parallel threads and the compression ratio exceeds the available memory.  The same effect could be achieved by using temporary files.  We'd still have to keep up to core.bigFileThreshold times the number of threads of uncompressed data in memory, though.

Again, none of the data read during the preparatory, highly concurrent step would need to linger anywhere. Just heating up the *remote* filesystem metadata and data caches is sufficient to dramatically reduce access and read latency, which is the primary factor determining how long it takes to prepare an archive (e.g. ~500 seconds for 80k files / 1.3 GB uncompressed / 250 MB compressed when the data partially or mostly resides on cold storage).

> If I/O latency instead of CPU usage is the limiting factor and prefetching would help then starting git grep or git archive in the background might work.  If the order of visited blobs needs to be randomized then perhaps something like this would be better:
>
>   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null

Isn't the second git process, the one receiving input from stdin, running single-threaded?

Maybe something like

  git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4 - splitted
  for i in splitted????; do git cat-file --batch <"$i" >/dev/null & done
  wait
  rm -f splitted????

to parallelize the reading of the objects?
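A sketch of the same fan-out without temporary files, assuming GNU xargs with -P is available; each batch of 100 object IDs is fed on stdin to its own cat-file reader:

  git ls-tree -r HEAD | awk '{print $3}' | sort \
    | xargs -P 8 -n 100 sh -c 'printf "%s\n" "$@" | git cat-file --batch >/dev/null' sh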

Otherwise, the main problem, the lack of concurrency while reading in all the objects with cold metadata/data caches, would only be moved from the archive step to the cat-file step, and overall completion would not be any faster.

> No idea how to randomize the order of tree object visits.

To heat up data caches, the order in which objects are visited is not relevant; the order of the IOs issued against the actual object files is what matters. Trivial sequential reads (from start to end) typically get marked for cache eviction right after having been delivered once to the client, so that cache memory becomes available for immediate overwrite. To increase their "stickiness" in the caches of remote filesystem servers, the object reads would need to be performed in a pseudo-random fashion: e.g. with an IO block size of 1 MB, accessing blocks in an order like 10,1,9,4,7,3,8,2,6,5 would have them marked for longer cache retention.
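As a hypothetical illustration of such an access pattern, assuming GNU coreutils (stat, shuf, dd) and a placeholder pack file name:

  f=.git/objects/pack/pack-XXXX.pack   # placeholder
  size=$(stat -c %s "$f")
  blocks=$(( (size + 1048575) / 1048576 ))
  # Read the file's 1 MiB blocks in shuffled order instead of start-to-end,
  # nudging the remote file server towards longer cache retention.
  seq 0 $(( blocks - 1 )) | shuf | while read -r b; do
    dd if="$f" bs=1M skip="$b" count=1 of=/dev/null status=none
  done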

Richard




