Changes since v1:

This series is now rebased on top of Ævar's run-command refactor
series[1], which allows new functions to be easily added to
run_processes_parallel() without having to change other callers.

The config option has been renamed to submodule.diffJobs. This name
more accurately reflects the functionality that is affected. (The
timings below were taken under the old name, status.parallelSubmodules.)
While other APIs parse config options from the initial command, this
config option is parsed in the diff-lib.c library. This is because
there are so many entry points into run_diff_files() that significant
code changes would be required to pass the option in from every caller.

I also wanted to pose another question to the list regarding defaults
for parallel processes. For jobs that clearly scale with the number of
processes (i.e. jobs that are mostly processor bound), it is obvious
that defaulting the number of processes to the number of available
cores is optimal. However, this changes when the job is mostly I/O
bound or is a mix of I/O and processing. Looking at my use case for
`status` on a cold cache (see below), we notice that increasing the
number of parallel processes speeds up status, but past a certain
number it actually starts slowing down. This is because `status`
spends most of its time reading from the index. While an SSD can
service a certain number of reads in parallel, that limit can be
significantly smaller than the number of cores, and going past it
makes the reads contend with each other, slowing down `status`. With
an HDD, parallel reads aren't even possible, and while I haven't
tested this patch on an HDD, I suspect that adding multiple processes
would make `status` slower than the baseline.

How should the default be set, then? In my case the safe option would
be to default to 1, but then many users wouldn't discover this
optimization. There are also multiple places in the Git documentation
where we say that setting a parallelism config option to 0 results in
a "reasonable amount" of processes, which generally means the number
of available cores. Can this be problematic for other parallel
processes that spend significant time in I/O? Should my option even
allow 0, given these pitfalls?
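To make that concrete, here is a minimal sketch (not the literal
patch; the helper name and the unset-default are illustrative) of how
diff-lib.c can lazily read the option itself rather than having every
run_diff_files() caller pass it in, with 0 mapping to online_cpus()
in line with how other parallelism knobs document their 0 value:

	/*
	 * Illustrative sketch only. Lazily reads submodule.diffJobs
	 * the first time it is needed, caching the result so the
	 * config is not re-read on every run_diff_files() call.
	 */
	#include "cache.h"
	#include "config.h"
	#include "thread-utils.h"

	static int submodule_diff_jobs(void)
	{
		static int jobs = -1;	/* -1: not yet read */

		if (jobs < 0) {
			if (git_config_get_int("submodule.diffjobs", &jobs))
				jobs = 1;	/* unset: conservative serial default */
			if (!jobs)
				jobs = online_cpus();	/* 0: "reasonable amount" */
			if (jobs < 0)
				die(_("submodule.diffJobs cannot be negative"));
		}
		return jobs;
	}

Whether 0 should be accepted at all is exactly the open question
above: for an I/O-bound job, online_cpus() may well overshoot the
sweet spot.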
Original cover letter:

When running 'git status' in a superproject, git spawns a subprocess
in series to run status for every submodule. For projects with many
large submodules, parallelizing the status subprocesses can
significantly speed up the runtime for both cold and warm caches.
While my initial intention was to speed up status, it turns out that
much of the runtime spent on status is in diff-lib.c, so many other
commands that use this library are sped up as well.

Here are some timing tests from running status on the Android Open
Source Project (AOSP). My machine has an SSD and 48 cores.

Warm cache:

  git 2.37
    Time (mean ± σ):     17.685 s ±  2.040 s    [User: 5.041 s, System: 22.799 s]
    Range (min … max):   16.168 s … 22.804 s    10 runs

  this patch (status.parallelSubmodules=1)
    Time (mean ± σ):     13.102 s ±  0.500 s    [User: 4.894 s, System: 19.533 s]
    Range (min … max):   12.841 s … 14.447 s    10 runs

  this patch (status.parallelSubmodules=5)
    Time (mean ± σ):      3.994 s ±  0.152 s    [User: 4.998 s, System: 20.805 s]
    Range (min … max):    3.744 s …  4.163 s    10 runs

  this patch (status.parallelSubmodules=10)
    Time (mean ± σ):      3.445 s ±  0.085 s    [User: 5.151 s, System: 20.208 s]
    Range (min … max):    3.319 s …  3.586 s    10 runs

  this patch (status.parallelSubmodules=20)
    Time (mean ± σ):      3.626 s ±  0.109 s    [User: 5.087 s, System: 20.366 s]
    Range (min … max):    3.438 s …  3.763 s    10 runs

We can see that there are diminishing returns, and even slightly worse
performance, past a certain number of max processes, but at the
optimum there is a speedup factor of around 5.

Cold cache:

  git 2.37                                   mean of 3 runs: 6m32s
  this patch (status.parallelSubmodules=1)   mean of 3 runs: 5m34s
  this patch (status.parallelSubmodules=5)   mean of 3 runs: 2m23s
  this patch (status.parallelSubmodules=10)  mean of 3 runs: 2m45s
  this patch (status.parallelSubmodules=20)  mean of 3 runs: 3m23s

We see the same phenomenon as above; at the optimum there is a speedup
factor of around 2.7.

Patch 1 adds output piping to run_processes_parallel_opts so the
output from each submodule can be parsed. Patches 2 and 3 move
preexisting functionality into separate functions and refactor code to
prepare for patch 4, which implements the parallelization.

Future work: the reason status is much slower on a cold cache than on
a warm cache is mainly the refreshing of the index. It is worth
investigating whether this is entirely necessary.

[1] https://lore.kernel.org/git/cover-00.15-00000000000-20220930T111343Z-avarab@xxxxxxxxx/

Calvin Wan (4):
  run-command: add pipe_output_fn to run_processes_parallel_opts
  submodule: move status parsing into function
  diff-lib: refactor match_stat_with_submodule
  diff-lib: parallelize run_diff_files for submodules

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 108 ++++++++++++--
 run-command.c                      |  12 +-
 run-command.h                      |  22 +++
 submodule.c                        | 230 +++++++++++++++++++++++++----
 submodule.h                        |   9 ++
 t/helper/test-run-command.c        |  18 +++
 t/t0061-run-command.sh             |  65 ++++----
 t/t4027-diff-submodule.sh          |  19 +++
 t/t7506-status-submodule.sh        |  19 +++
 10 files changed, 441 insertions(+), 73 deletions(-)

-- 
2.38.0.rc1.362.ged0d419d3c-goog