Re: Parallelism for submodule update

Calvin Wan <calvinwan@xxxxxxxxxx> · Thu, 19 Jan 2023 21:39:11 +0000

Hi Christian,

I investigated this as well about 2 months ago and am happy to share my
findings with you :)

> When updating the submodules, only the fetching part is done in parallel (with config submodule.fetchjobs or --jobs) but the checkout is done sequentially

Correct.

> What I’ve recognized when cloning with
> - scalar clone --full-clone --recurse-submodules <URL>
> or
> - git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
> 
> We lose performance, as the fetch of the blobs is done in the sequential checkout part, instead of in the parallel part.
> 
> Furthermore, the utilization - without partial clone - of network and harddisk is not always good, as first the network is utilized (fetch) and then the harddisk (checkout)

Also an astute observation: keeping fetch and checkout in separate
phases doesn't allow us to fully use our resources.
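
To put rough numbers on that, here is a toy model in Python. The
durations, the job count, and all function names are made up for
illustration, and the model ignores contention between network and disk,
so treat it as a sketch of the effect rather than a measurement.

```python
import heapq


def makespan(durations, jobs):
    """Greedy scheduling: each task goes to the worker that frees up first."""
    workers = [0] * jobs
    heapq.heapify(workers)
    for duration in durations:
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + duration)
    return max(workers)


def two_phase(fetch, checkout, jobs):
    """Parallel fetch phase followed by a sequential checkout phase."""
    return makespan(fetch, jobs) + sum(checkout)


def combined(fetch, checkout, jobs):
    """Fetch and checkout of each submodule run as one parallel task."""
    return makespan([f + c for f, c in zip(fetch, checkout)], jobs)


# Hypothetical per-submodule costs in arbitrary time units.
FETCH = [4, 3, 2, 1]
CHECKOUT = [2, 2, 1, 1]

if __name__ == "__main__":
    print("two-phase:", two_phase(FETCH, CHECKOUT, jobs=2))
    print("combined: ", combined(FETCH, CHECKOUT, jobs=2))
```

With these made-up numbers and 2 jobs, the two-phase schedule takes 11
units and the combined one takes 8.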

> As the checkout part is local to the submodule (no shared resources to block), it would be great if we could move the checkout into the parallelized part.
> E.g. by doing fetch and checkout (with blob fetching) in one step with e.g. run_processes_parallel_tr2
> 
> I expect that this significantly improves the performance, especially when using partial clones.
> 
> Do you think this is possible? Do I miss anything in my thoughts?

Sort of. The issue with run_processes_parallel_tr2 is that it creates a
subprocess with a git command. There is no git command that we can call
that lets us do both the correct fetch and checkout command, so first
you would have to create a new option/command for that (and what happens
if we want to add to that parallelization in the future? Create another
option/command?). I think we can do better than that!

`git submodule update`, when called from clone, essentially does 4
things to each submodule: init, clone, checkout, and a recursive call
to itself for child submodules. One idea I had was to separate out the
individual tasks that `git submodule update` does and create a new
submodule--helper command (e.g. git submodule--helper update-helper)
that calls those individual tasks. Then, clone would directly call
run_processes_parallel_tr2 with the new submodule--helper command,
spawning one process per submodule.

This is roughly what I imagine `git clone --recurse-submodules` would
look like:
superproject cloning
run_processes_parallel_tr2(git submodule--helper update-helper)
        Init
        Clone
        Checkout
        Recursive git submodule--helper update-helper
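
The shape of that outline can be sketched in Python. This is only an
illustration: update_helper and update_all are hypothetical names, the
steps just record themselves instead of running git, threads stand in
for the subprocesses, and child submodules are handled sequentially
inside their parent's worker to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor

# The per-submodule steps from the outline, in order.
STEPS = ("init", "clone", "checkout")


def update_helper(path, children, log):
    """Stand-in for the proposed helper: run every step for one submodule,
    then recurse into its child submodules."""
    for step in STEPS:
        log.append((path, step))  # a real helper would do the work here
    for child in children.get(path, []):
        update_helper(child, children, log)
    return path


def update_all(top_level, children, jobs):
    """Run one helper per top-level submodule, at most `jobs` at a time."""
    # One log per top-level submodule, so workers never share mutable state.
    logs = {path: [] for path in top_level}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [
            pool.submit(update_helper, path, children, logs[path])
            for path in top_level
        ]
        for future in futures:
            future.result()  # re-raise any error from a worker
    return logs


if __name__ == "__main__":
    # Hypothetical layout: submodules "a" and "b", where "a" contains "a/x".
    for path, log in update_all(["a", "b"], {"a": ["a/x"]}, jobs=2).items():
        print(path, log)
```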

I'll discuss what I think are the benefits of this approach:
- The entirety of submodule update would be parallelized so network and
  hard disk resources can be used together
- There only needs to be one config option that controls how many
  parallel processes to spawn
- Any new features to submodule update are automatically parallelized

The drawback is that any new feature that would cause a race condition
when run in parallel would need additional locking code, since
separating it out of the parallel part would be difficult. Currently,
only adding lines to .gitmodules in init is at risk of a race
condition, but fortunately that can be handled first in series before
running everything else in parallel.
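
A minimal Python sketch of that "serial first, then parallel" split,
under the assumption that the only shared state is one file-like list.
The names register, clone_and_checkout and update are hypothetical, and
the strings stand in for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def register(shared_config, name):
    """Race-prone step: appends to state shared by all submodules."""
    shared_config.append(f"registered {name}")


def clone_and_checkout(name):
    """Independent step: touches only this submodule's own data."""
    return f"{name}: cloned and checked out"


def update(names, jobs):
    shared_config = []
    # Phase 1: the shared write runs in series, so it needs no locking.
    for name in names:
        register(shared_config, name)
    # Phase 2: everything else runs in parallel, one task per submodule.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(clone_and_checkout, names))
    return shared_config, results


if __name__ == "__main__":
    config, results = update(["a", "b", "c"], jobs=2)
    print(config)
    print(results)
```

The trade-off is that phase 1 stays sequential, which only pays off as
long as the shared step is cheap compared to clone and checkout.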

I haven't started implementing this and am not planning to work on it
in the near future. This is because we are planning a more long-term
solution (2+ years out) to solve problems like this (notice how much
simpler it would've been to add parallelization if we didn't have to
create subprocesses for every separate git command and could instead
call a variety of library functions). So if you need the
parallelization sooner or want to scratch your itch, you're more than
welcome to implement it. I'm happy to bounce ideas around and review
any patches for this!

Thanks,
Calvin