When running 'git status' in a superproject, git spawns one subprocess
per submodule, one after another, to run status in each. For projects
with many large submodules, running these status subprocesses in
parallel can significantly reduce the runtime for both cold and warm
caches.

Here are some timing tests from running status on the Android Open
Source Project (AOSP). My machine has an SSD and 48 cores.

  Warm Cache:

    git 2.37
      Time (mean ± σ):     17.685 s ±  2.040 s    [User: 5.041 s, System: 22.799 s]
      Range (min … max):   16.168 s … 22.804 s    10 runs

    this patch (status.parallelSubmodules=1)
      Time (mean ± σ):     13.102 s ±  0.500 s    [User: 4.894 s, System: 19.533 s]
      Range (min … max):   12.841 s … 14.447 s    10 runs

    this patch (status.parallelSubmodules=5)
      Time (mean ± σ):      3.994 s ±  0.152 s    [User: 4.998 s, System: 20.805 s]
      Range (min … max):    3.744 s …  4.163 s    10 runs

    this patch (status.parallelSubmodules=10)
      Time (mean ± σ):      3.445 s ±  0.085 s    [User: 5.151 s, System: 20.208 s]
      Range (min … max):    3.319 s …  3.586 s    10 runs

    this patch (status.parallelSubmodules=20)
      Time (mean ± σ):      3.626 s ±  0.109 s    [User: 5.087 s, System: 20.366 s]
      Range (min … max):    3.438 s …  3.763 s    10 runs

There are diminishing returns, and even slightly worse performance,
beyond a certain number of max processes. At the best setting measured
(status.parallelSubmodules=10), the speed-up over git 2.37 is a factor
of around 5.

  Cold Cache:

    git 2.37
      mean of 3 runs: 6m32s
    this patch (status.parallelSubmodules=1)
      mean of 3 runs: 5m34s
    this patch (status.parallelSubmodules=5)
      mean of 3 runs: 2m23s
    this patch (status.parallelSubmodules=10)
      mean of 3 runs: 2m45s
    this patch (status.parallelSubmodules=20)
      mean of 3 runs: 3m23s

The same pattern shows up here. At the best setting measured
(status.parallelSubmodules=5), the speed-up over git 2.37 is a factor
of around 2.7.

Patch 1 adds output piping to run_processes_parallel so the output
from each submodule can be parsed. Patches 2 and 3 move preexisting
functionality into separate functions and refactor code to prepare
for patch 4, which implements the parallelization.
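As a quick check of the speed-up factors quoted for the timing tests,
here is a small sketch that recomputes them from the reported means.
It assumes the factors are measured against the git 2.37 baseline
rather than against status.parallelSubmodules=1; the helper name
best_speedup is made up for illustration and is not part of the series.

```python
# Mean wall-clock times in seconds, copied from the timing tests.
# Integer keys are status.parallelSubmodules values.
BASELINE = "git 2.37"

warm = {BASELINE: 17.685, 1: 13.102, 5: 3.994, 10: 3.445, 20: 3.626}
cold = {
    BASELINE: 6 * 60 + 32,  # 6m32s
    1: 5 * 60 + 34,         # 5m34s
    5: 2 * 60 + 23,         # 2m23s
    10: 2 * 60 + 45,        # 2m45s
    20: 3 * 60 + 23,        # 3m23s
}


def best_speedup(times):
    """Return (jobs, factor) for the fastest patched run.

    Assumption: the factor is baseline mean / fastest patched mean.
    """
    patched = {jobs: t for jobs, t in times.items() if jobs != BASELINE}
    jobs = min(patched, key=patched.get)
    return jobs, times[BASELINE] / patched[jobs]


warm_jobs, warm_factor = best_speedup(warm)
cold_jobs, cold_factor = best_speedup(cold)

print(f"warm: best at {warm_jobs} jobs, {warm_factor:.2f}x over {BASELINE}")
print(f"cold: best at {cold_jobs} jobs, {cold_factor:.2f}x over {BASELINE}")
```

Note that the fastest setting differs between the two cases: 10 jobs
for the warm cache and 5 jobs for the cold cache.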

Future work: status is much slower on a cold cache than on a warm
cache mainly because of refreshing the index. It is worth
investigating whether that refresh is entirely necessary.

Calvin Wan (4):
  run-command: add pipe_output to run_processes_parallel
  submodule: move status parsing into function
  diff-lib: refactor functions
  diff-lib: parallelize run_diff_files for submodules

 Documentation/config/status.txt |   6 +
 add-interactive.c               |   2 +-
 builtin/add.c                   |   4 +-
 builtin/commit.c                |   6 +
 builtin/diff-files.c            |   2 +-
 builtin/diff.c                  |   2 +-
 builtin/merge.c                 |   2 +-
 builtin/stash.c                 |   2 +-
 builtin/submodule--helper.c     |   4 +-
 diff-lib.c                      | 120 +++++++++++++-----
 diff.h                          |   2 +-
 run-command.c                   |   6 +-
 run-command.h                   |   9 ++
 submodule.c                     | 213 +++++++++++++++++++++++++++-----
 submodule.h                     |   9 ++
 t/helper/test-run-command.c     |  31 ++++-
 t/t0061-run-command.sh          |  26 ++++
 wt-status.c                     |   6 +-
 wt-status.h                     |   1 +
 19 files changed, 372 insertions(+), 81 deletions(-)


base-commit: 5502f77b6944eda8e26813d8f542cffe7d110aea
--
2.37.3.998.g577e59143f-goog