Re: [PATCH 8/8] git submodule update: Have a dedicated helper for cloning

Stefan Beller <sbeller@xxxxxxxxxx> · Wed, 21 Oct 2015 14:06:37 -0700

On Wed, Oct 21, 2015 at 1:47 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Stefan Beller <sbeller@xxxxxxxxxx> writes:
>
>> This introduces a new helper function in git submodule--helper
>> which takes care of cloning all submodules, which we want to
>> parallelize eventually.
>>
>> Some tests (such as empty URL, update_mode==none) are required in the
>> helper to make the decision for cloning. These checks have been moved
>> into the C function as well. (No need to repeat them in the shell
>> script)
>>
>> As we can only access the stderr channel from within the parallel
>> processing engine, so we need to reroute the error message for
>> specified but initialized submodules to stderr. As it is an error
>> message, this should have gone to stderr in the first place, so a
>> bug fix along the way.
>
> The last paragraph is hard to parse; perhaps it is slightly
> ungrammatical.

I seem to have started a habit starting my sentences with "so..."
even in spoken English. If left out, this may be easier to read:

    As we can only access the stderr channel from within the parallel
    processing engine, we need to reroute the error message for
    "specified but initialized submodules" to stderr. As it is an error
    message, this should have gone to stderr in the first place.
    It's a bug fix along the way.

>
> It would be a really good idea to split the small bit to redirect
> the output that should have gone to the standard error to where it
> should as a preparatory step before showing this patch.

ok.

>
> I sense that this one is still a WIP/RFC, so I'll only skim it in
> this round (but I may come back and read it again later with finer
> toothed comb).
>
>> +static int get_next_task(void **pp_task_cb,
>> +                      struct child_process *cp,
>> +                      struct strbuf *err,
>> +                      void *pp_cb)
>
> Will you have only one caller of the parallel run-command API in
> this file, or will you be adding more to allow various different
> operations run in parallel as more things are rewritten?  I am
> guessing that it would be the latter, but if that is the case,
> perhaps the function wants to be named a bit more specificly for
> this first user, no?  Same for start_failure and task_finished.

Ok, will rename.
Although I am not sure if I need to rewrite more in C for "git submodule".

I only rewrite git submodule update because git clone --recurse is just
blindly calling out to git submodule update.  So instead of parallelizing
"submodule update" I could have put a parallel submodule clone into
the clone command. (That looks strangely appealing now, because it
may be even faster as there is no downstream pipe with sequential
checkouts, so you could have one parallel pool with chained clone
and checkout commands).

>
>> diff --git a/git-submodule.sh b/git-submodule.sh
>> index 8b0eb9a..ea883b9 100755
>> --- a/git-submodule.sh
>> +++ b/git-submodule.sh
>> @@ -655,17 +655,18 @@ cmd_update()
>>               cmd_init "--" "$@" || return
>>       fi
>>
>> -     cloned_modules=
>> -     git submodule--helper list --prefix "$wt_prefix" "$@" | {
>> +     git submodule--helper update-clone ${GIT_QUIET:+--quiet} \
>> +             ${wt_prefix:+--prefix "$wt_prefix"} \
>> +             ${prefix:+--recursive_prefix "$prefix"} \
>> +             ${update:+--update "$update"} \
>> +             ${reference:+--reference "$reference"} \
>> +             ${depth:+--depth "$depth"} \
>> +             "$@" | {
>>       err=
>> -     while read mode sha1 stage sm_path
>> +     while read mode sha1 stage just_cloned sm_path
>>       do
>
> I wonder if you really want this to be upstream of a pipe.  When the
> downstream loop needs to abort, what happens to the remainder of the
> "clone" part of the processing that is still ongoing in the upstream
> of the pipe?  I would imagine that the "update-clone" network
> accessing phase is the more human-time consuming part, so I suspect
> that it would be much better to let the cloning part go and finish
> first (during which time the human-user can spend time for other
> things, like getting cup of coffee or filling expense reports) and
> before moving to the loop that can stop and ask the human-user for
> help.
>
> The fix for the above could be trivial (do not pipe, just take the
> output to a temporary file, and then feed the "while read" loop from
> that temporary file), and I suspect it would make a big difference
> for usability.

I'd like to counter your argument with quoting code from update_clone
method:

     run_processes_parallel(1, get_next_task, start_failure,
task_finished, &pp);

     if (pp.print_unmatched) {
         printf("#unmatched\n");
         return 1;
     }

     for_each_string_list_item(item, &pp.projectlines) {
         utf8_fprintf(stdout, "%s", item->string);
     }

So we do already all the cloning first, and then once we did all of that
we just put out all accumulated lines of text. (It was harder to come up with
a sufficient file name than just storing it in memory. I don't think
memory is an
issue here, only a few bytes per submodule. So even 1000 submodules would
consume maybe 100kB)

Having a file though would allow us to continue after human interaction fixed
one problem.

>
> Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html