Re: [PATCH v4 5/5] diff-lib: parallelize run_diff_files for submodules

Elijah Newren <newren@xxxxxxxxx> · Mon, 28 Nov 2022 21:13:52 -0800

On Tue, Nov 8, 2022 at 11:32 AM Calvin Wan <calvinwan@xxxxxxxxxx> wrote:
>
> During the iteration of the index entries in run_diff_files, whenever
> a submodule is found and needs its status checked, a subprocess is
> spawned for it. Instead of spawning the subprocess immediately and
> waiting for its completion to continue, hold onto all submodules and
> relevant information in a list. Then use that list to create tasks for
> run_processes_parallel. Subprocess output is duplicated and passed to
> status_pipe_output which parses it.
>
> Add config option submodule.diffJobs to set the maximum number
> of parallel jobs. The option defaults to 1 if unset. If set to 0, the
> number of jobs is set to online_cpus().
>
> Since run_diff_files is called from many different commands, I chose
> to grab the config option in the function rather than adding variables
> to every git command and then figuring out how to pass them all in.
>
> Signed-off-by: Calvin Wan <calvinwan@xxxxxxxxxx>
> ---
>  Documentation/config/submodule.txt |  12 +++
>  diff-lib.c                         |  80 +++++++++++++--
>  submodule.c                        | 154 +++++++++++++++++++++++++++++
>  submodule.h                        |   9 ++
>  t/t4027-diff-submodule.sh          |  19 ++++
>  t/t7506-status-submodule.sh        |  19 ++++
>  6 files changed, 287 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt
> index 6490527b45..1144a5ad74 100644
> --- a/Documentation/config/submodule.txt
> +++ b/Documentation/config/submodule.txt
> @@ -93,6 +93,18 @@ submodule.fetchJobs::
>         in parallel. A value of 0 will give some reasonable default.
>         If unset, it defaults to 1.
>
> +submodule.diffJobs::
> +       Specifies how many submodules are diffed at the same time. A
> +       positive integer allows up to that number of submodules diffed
> +       in parallel. A value of 0 will give the number of logical cores.

Why hardcode that 0 gives the number of logical cores?  Why not just
state that a value of 0 "gives a guess at optimal parallelism",
allowing us to adjust it in the future if we can do some smart
heuristics?  It'd be nice to not have us tied down and prevented from
taking a smarter approach.

> +       If unset, it defaults to 1. The diff operation is used by many
> +       other git commands such as add, merge, diff, status, stash and
> +       more. Note that the expensive part of the diff operation is
> +       reading the index from cache or memory. Therefore multiple jobs
> +       may be detrimental to performance if your hardware does not
> +       support parallel reads or if the number of jobs greatly exceeds
> +       the amount of supported reads.

So, in the future, someone who wants to speed things up is going to
need to configure submodule.diffJobs, submodule.fetchJobs,
submodule.checkoutJobs, submodule.grepJobs, submodule.mergeJobs, etc.?
 I worry that we're headed towards a bit of a suboptimal user
experience here.  It'd be nice to have a more central configuration of
"yes, I want parallelism; please don't make me benchmark things in
order to take advantage of it", if that's possible.  It may just be
that the "optimal" parallelism varies significantly between commands,
and also varies a lot based on hardware, repository sizes, background
load on the system, etc. such that we can't provide a reasonable
suggestion for those that want a value greater than 1.  Or maybe in
the future we allow folks somehow to request our best guess at a good
parallelization level and then let users override with these
individual flags.  I'm just a little worried we might be making users
do work that we should somehow figure out.