Re: [PATCH] fetch: add "--parallel", which fetches all remotes in parallel

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 18 Jul 2019 10:45:04 -0700

Palmer Dabbelt <palmer@xxxxxxxxxx> writes:

> * I'm not sure if it's safe to access the .git database from multiple
>   processes at the same time.

It is supposed to, and also you are supposed to keep it that way ;-)

> * As I was writing the documentation I found "fetch --jobs".  It seems
>   like I should use that instead of adding a new argrument, but I wasn't
>   sure.

Why not?  What makes you feel it is a bad idea to follow that
pattern?

Ah, --jobs that is taken already is right now too tied to fetches
that happen in submodules, which arguably was a design mistake.

    -j::
    --jobs=<n>::
            Number of parallel children to be used for fetching
            submodules.  Each will fetch from different submodules,
            such that fetching many submodules will be faster. By
            default submodules will be fetched one at a time.

The simplest endgame would be to replace "submodule" with
"repository" in the above description, perhaps like

	Number of parallel jobs to be used for fetching from
	multiple repositories (both fetching with "--multiple" from
	multiple repositories, and also fetching updated contents
	for submodules).  By default, fetching from multiple
	repositories and submodules is done one at a time.

and nobody would have complained if the system were like so from the
beginning.  Existing users, however, may want extra flexibility, and
would complain loudly if we did the above, in which case, we may
have to

 - introduce --fetch-jobs=<n> for what you are adding;

 - introduce --submodule-fetch-jobs=<n> as a synonym for existing
   --jobs=<n> and deprecate the current use of --jobs=<n>;

 - eventually repurpose --jobs=<n> as a short-hand to give both
   --fetch-jobs and --submoduje-fetch-jobs at the same time.

> +static int parallel = 0;

Don't explicitly "= 0;" initialize file-scope static.  Instead let
BSS take care of it.

>  static int git_fetch_config(const char *k, const char *v, void *cb)
>  {
> @@ -178,6 +179,8 @@ static struct option builtin_fetch_options[] = {
>  			TRANSPORT_FAMILY_IPV6),
>  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
>  			N_("report that we have only objects reachable from this object")),
> +	OPT_BOOL(0, "parallel", &parallel,
> +		 N_("fetch in parallel from each remote")),
>  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
>  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
>  		 N_("run 'gc --auto' after fetching")),
> @@ -1456,12 +1459,15 @@ static void add_options_to_argv(struct argv_array *argv)
>  
>  }
>  
> -static int fetch_multiple(struct string_list *list)
> +static int fetch_multiple(struct string_list *list, int i)
>  {
> -	int i, result = 0;

'i' is perfectly a good name for a local variable that is used for
loop control purposes, but makes a horrible name for a parameter.

Existing 'list' is not any better either---we know it is a list by
its type already, the name should say what the list is about, what
it represents.  But having a horribly named parameter already is not
a good reason to make the code even worse.

And as you said, recursion makes the code structure harder to follow
here.  Keeping an array of --jobs=<n> cmd structures, looping to
fill them by starting, doing wait() to reap any of the started ones
that first exits to refill the slot just opened, etc. would be easier
to see if done in a loop, I think.