On 24.09.22 at 13:34, Scheffenegger, Richard wrote:
>
>> If I/O latency instead of CPU usage is the limiting factor and
>> prefetching would help then starting git grep or git archive in the
>> background might work.  If the order of visited blobs needs to be
>> randomized then perhaps something like this would be better:
>>
>>    git ls-tree -r HEAD | awk '{print $3}' | sort |
>>    git cat-file --batch >/dev/null
>
> Isn't the 2nd git, receiving input from stdin, running
> single-threaded?

Yes.

> Maybe
>
> Git ls-tree -r HEAD | awk '{print $3}' | sort |
> split -d -l 100 -a 4 - splitted ;
> for i in $(ls splitted????) ; do "git cat-file --batch > /dev/null &"; done;
> rm -f splitted????
>
> To parallelize the reading of the objects?

Sure, but in a repository with 100000 files you'd end up with 1000
parallel processes, which may be a few too many.  Splitting the list
into similar-sized parts based on a given degree of parallelism is
probably more practical.  It could be done by relying on the randomness
of the object IDs and partitioning by a sub-string.  Or perhaps using
pseudo-random numbers is sufficient:

   git ls-tree -r HEAD | awk '{print $3}' | sort |
   awk -v pieces=8 -v prefix=file '
      {
         piece = int(rand() * pieces)
         filename = prefix piece
         print $0 > filename
      }'

So how much does such a warmup help in your case?

>> No idea how to randomize the order of tree object visits.
>
> To heat up data caches, the order of objects visited is not relevant,
> the order or IOs issued to the actual object is relevant.

What's the difference?

NB: When I wrote "tree objects" I meant the type of objects from Git's
object store (made up of packs and loose files) that represent
sub-directories, and with "visit" I meant reading them to traverse the
hierarchy of Git blobs and trees.

Here's an idea after all: Using "git ls-tree" without "-r" and handling
recursion in the prefetch script would allow traversing trees in a
different order and even in parallel.
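For what it's worth, here is the partitioning example fleshed out into
a complete prefetch sketch, with one background git cat-file reader per
piece.  The throwaway demo repository at the top, the piece count of 8,
and the file0..file7 names are just assumptions for illustration; in a
real repository you'd skip the setup and run the rest as-is:

```shell
# Demo setup: a throwaway repository, just for illustration.
dir=$(mktemp -d) && cd "$dir" && git init -q
for i in 1 2 3 4; do echo "content $i" >"f$i.txt"; done
git add . && git -c user.name=t -c user.email=t@example.com commit -qm demo

# Partition the blob IDs into up to 8 piece files (file0..file7)...
git ls-tree -r HEAD | awk '{print $3}' | sort |
awk -v pieces=8 -v prefix=file '
   {
      piece = int(rand() * pieces)
      filename = prefix piece
      print $0 > filename
   }'

# ...then read each piece with its own git cat-file process.
for f in file[0-7]
do
   test -e "$f" || continue  # rand() may leave some pieces empty
   git cat-file --batch <"$f" >/dev/null &
done
wait
rm -f file[0-7]
```

Note that awk's rand() produces the same sequence on every run unless
srand() is called, which is fine here: the split only needs to be
roughly even, not unpredictable.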
Not sure how to limit parallelism to a sane degree.

René
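P.S.: One way to cap the degree of parallelism might be xargs -P, where
it's available (-P is a GNU/BSD extension, not in POSIX).  A sketch,
again with a throwaway demo repository standing in for a real one:

```shell
# Demo setup: a throwaway repository, just for illustration.
dir=$(mktemp -d) && cd "$dir" && git init -q
for i in 1 2 3; do echo "content $i" >"f$i.txt"; done
git add . && git -c user.name=t -c user.email=t@example.com commit -qm demo

# At most 4 concurrent readers, up to 100 object IDs each: every sh
# instance feeds its arguments to one git cat-file process on stdin.
git ls-tree -r HEAD | awk '{print $3}' | sort |
xargs -n 100 -P 4 sh -c 'printf "%s\n" "$@" | git cat-file --batch >/dev/null' sh
```

The trailing "sh" becomes $0 of the inner shell, so "$@" holds exactly
the object IDs that xargs appended.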