Re: [PATCH] Enable parallelism in git submodule update.

Heiko Voigt <hvoigt@xxxxxxxxxx> · Sat, 28 Jul 2012 12:22:11 +0200

Hi Stefan,

neat patch. See below for a few notes.

On Fri, Jul 27, 2012 at 11:37:34AM -0700, Stefan Zager wrote:
> diff --git a/git-submodule.sh b/git-submodule.sh
> index dba4d39..761420a 100755
> --- a/git-submodule.sh
> +++ b/git-submodule.sh
> @@ -491,6 +492,20 @@ cmd_update()
>  		-r|--rebase)
>  			update="rebase"
>  			;;
> +		-j|--jobs)
> +			case "$2" in
> +			''|-*)
> +				jobs="0"
> +				;;
> +			*)
> +				jobs="$2"
> +				shift
> +				;;
> +			esac
> +			# Don't preserve this arg.
> +			shift
> +			continue
> +			;;
>  		--reference)
>  			case "$2" in '') usage ;; esac
>  			reference="--reference=$2"
> @@ -529,6 +544,12 @@ cmd_update()
>  		cmd_init "--" "$@" || return
>  	fi
>  
> +	if test "$jobs" != "1"
> +	then
> +		module_list "$@" | awk '{print $4}' | xargs -L 1 -P "$jobs" git submodule update $orig_args

I do not see orig_args set anywhere in submodule.sh. It seems the
existing usage of it in cmd_status() is a leftover from commit
98dbe63 when this variable got renamed to orig_flags.

I will follow up with a patch to that location.

Another problem here is the passing of arguments. Have a look at
a7eff1a8 to see how this was solved for other locations.

The next thing I noticed is that the parallelism is not recursive. You
drop the option and only execute the first depth in parallel. How about
using the amount of modules defined by arguments left in $@ as an
indicator whether you need to fork parallel execution or not. If there
is exactly one you do the update if there are more you do the parallel
thing. That way you can just keep passing the --jobs flag to the
subprocesses.

The next question to solve is UI: Since the output lines of the parallel
update jobs will be mixed we need some way to distinguish them. Imagine
one of the update fails somewhere how do we find out which it was?

Two possible solutions come to my mind:

 1. Prefix each line with a job number. This way you can distinguish
    which process outputted what and still have immediate feedback.

 2. Cache the output (to stderr and stdout) of each job and output it
    once one job is done. I imagine this needs some infrastructure which
    we need to implement. We already have some ideas how to collect such
    output in C here[1].

I would prefer solution 2 since the output of 1 will be hard to read but
I guess we could start with 1 and then move over to 2 later on.

Cheers Heiko

[1] http://article.gmane.org/gmane.comp.version-control.git/197747
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html