Hi Christian,

I investigated this as well about 2 months ago and am happy to share my
findings with you :)

> When updating the submodules, only the fetching part is done in parallel
> (with config submodule.fetchjobs or --jobs) but the checkout is done
> sequentially

Correct.

> What I've recognized when cloning with
> - scalar clone --full-clone --recurse-submodules <URL>
> or
> - git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
>
> We lose performance, as the fetch of the blobs is done in the sequential
> checkout part, instead of in the parallel part.
>
> Furthermore, the utilization - without partial clone - of network and
> hard disk is not always good, as first the network is utilized (fetch)
> and then the hard disk (checkout)

Also an astute observation: separating the parallelized fetch from the
sequential checkout doesn't allow us to fully use our resources.

> As the checkout part is local to the submodule (no shared resources to
> block), it would be great if we could move the checkout into the
> parallelized part.
> E.g. by doing fetch and checkout (with blob fetching) in one step with
> e.g. run_processes_parallel_tr2
>
> I expect that this significantly improves the performance, especially
> when using partial clones.
>
> Do you think this is possible? Do I miss anything in my thoughts?

Sort of. The issue with run_processes_parallel_tr2 is that each process it
spawns runs a single git command. There is no git command we can invoke
that does both the correct fetch and the checkout, so first you would have
to create a new option/command for that (and what happens if we want to
add to that parallelization in the future? Create yet another
option/command?). I think we can do better than that!

`git submodule update`, when called from clone, essentially does 4 things
to each submodule: init, clone, checkout, and recursively calling itself
for child submodules. One idea I had was to separate out the individual
tasks that `git submodule update` does and create a new submodule--helper
command (eg. git submodule--helper update-helper) that calls those
individual tasks. Then, clone would directly call
run_processes_parallel_tr2 with the new submodule--helper command,
spawning one process per submodule.

This is the general idea of what I imagine `git clone --recurse-submodules`
would look like:

superproject cloning
    run_processes_parallel_tr2(git submodule--helper update-helper)
        Init
        Clone
        Checkout
        Recursive git submodule--helper update-helper

The benefits of this approach, as I see them:

- The entirety of submodule update would be parallelized, so network and
  hard disk resources can be used together
- There only needs to be one config option that controls how many parallel
  processes to spawn
- Any new features added to submodule update are automatically parallelized

The drawback is that any new feature that would cause a race condition if
run in parallel would need additional locking code written for it, since
separating it out of the parallel section would be difficult. In this
case, only adding lines to .gitmodules in init is at risk of a race
condition, but fortunately that step can be run first, in series, before
everything else runs in parallel.
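To make that a bit more concrete, here is a rough, untested sketch of how
clone could drive run_processes_parallel_tr2 for this. Everything below is
hypothetical: "git submodule--helper update-helper" and its --path option
do not exist, the struct and function names are made up, and I'm
paraphrasing the parallel-processes callback signatures from run-command.h
from memory, so they may not match the header exactly.

	/* git-internal headers; assumes the existing parallel-processes API */
	#include "cache.h"
	#include "run-command.h"
	#include "strvec.h"
	#include "string-list.h"

	struct parallel_update_ctx {
		const struct string_list *paths;  /* one entry per submodule path */
		size_t next;                      /* index of next submodule to hand out */
	};

	/* Hand out the next "update-helper" process, one per submodule. */
	static int next_update_task(struct child_process *cp, struct strbuf *err,
				    void *cb, void **task_cb)
	{
		struct parallel_update_ctx *ctx = cb;
		const char *path;

		if (ctx->next >= ctx->paths->nr)
			return 0;	/* no submodules left; stop spawning */

		path = ctx->paths->items[ctx->next++].string;
		*task_cb = (void *)path;

		cp->git_cmd = 1;
		/* hypothetical subcommand doing clone + checkout + recursion
		 * for a single submodule */
		strvec_pushl(&cp->args, "submodule--helper", "update-helper",
			     "--path", path, NULL);
		return 1;
	}

	static int update_task_finished(int result, struct strbuf *err,
					void *cb, void *task_cb)
	{
		if (result)
			strbuf_addf(err, "failed to update submodule '%s'\n",
				    (const char *)task_cb);
		return 0;
	}

	/* Called from clone after the superproject checkout. */
	static int update_submodules_in_parallel(const struct string_list *paths,
						 int max_jobs)
	{
		struct parallel_update_ctx ctx = { .paths = paths };

		/*
		 * The init step (which appends to .gitmodules) would run
		 * serially before this call, so the parallel processes
		 * never race on shared files.
		 */
		return run_processes_parallel_tr2(max_jobs, next_update_task,
						  NULL /* default start_failure */,
						  update_task_finished, &ctx,
						  "submodule", "parallel/update-helper");
	}

The nice property is that the parallelization lives entirely in the
superproject's clone code: anything the helper learns to do per submodule
in the future gets parallelized for free, which is the third benefit
listed above.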
I haven't started implementing this and am not planning to fix this in the
near future. This is because we are planning a more long-term solution
(2+ years out) to solve problems like this (notice how much simpler it
would have been to add parallelization if we didn't have to create a
subprocess for every separate git command and could instead call a variety
of library functions). So if you need the parallelization sooner or want
to scratch your itch, you're more than welcome to implement it. Happy to
bounce ideas around and review any patches for this!

Thanks,
Calvin