Re: [PATCH 03/10] run-command: add an asynchronous parallel child processor

Stefan Beller <sbeller@xxxxxxxxxx> · Fri, 18 Sep 2015 09:36:30 -0700

On Thu, Sep 17, 2015 at 6:05 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Stefan Beller <sbeller@xxxxxxxxxx> writes:
>
>> On Thu, Sep 17, 2015 at 2:44 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>
>>> Hmm, you are relying on the fact that a valid pid can never be 0, so
>>> you can just use pp->children[i].child.pid to see if a "slot" is
>>> occupied without even using pp->slots[] (or pp->children[i].in_use).
>>
>> We could either use the pid as accessed via children[i].process.pid
>> or we could rely on the children[i].process.err != -1 as start_process
>> will set it to an actual fd, and when it's done we reset it to -1.
>>
>> I am unsure if this make it less readable though (all your other
>> suggestions improve readability a lot!)
>
> Sorry, the above was not a suggestion but merely an observation with
> a bit of thinking aloud mixed in.  I should have marked it as such
> more clearly.

Ok, I see. no harm done, I did not take that as a suggestion.

>
>>> ... and then picks a new owner of the output channel.
>>>
>>> Up to this point it looks sensible.
>>>
>>>> +                     fputs(pp->err[pp->foreground_child].buf, stderr);
>>>> +                     strbuf_reset(&pp->err[pp->foreground_child]);
>>>
>>> I do not think these two lines need to be here, especially if you
>>> follow the above advice of separating buffering and draining.
>>
>> They are outputting the buffer, if the next selected child is still running.
>> I mean it would output eventually anyway, but it would only output after
>> the next start of processes and poll() is done. Yeah maybe that's what we
>> want (starting new processes earlier is better than outputting earlier as we're
>> talking microseconds here).
>
> I am not talking about microseconds but refraining from doing the
> same thing in multiple places.

ok

>
>>> But calling start_new() unconditionally feels sloppy.  It should at
>>> least be something like
>>>
>>>         if (pp.nr_processes < pp.max_processes &&
>>>             !pp.all_task_started)
>>>                 start_new_process()
>>>
>>> no?
>>
>> That's the exact condition we have in start_new_process
>> so I don't want to repeat it here?
>
> You are advocating to have a function whose definition is "this may
> or may not start a new process and the caller should not care", and
> then make the caller keep calling it, knowing/hoping that the caller
> does not care if we spawn a new thing or not.
>
> I somehow find it a highly questionable design taste to base the
> decision on "don't want to repeat it here".
>
> Stepping back and thinking about it, what is suggested is more
> explicit in the caller.  "If we know we need a new worker and we can
> have a new worker, then start it." is the logic in the caller in the
> suggested structure, and we would have a well defined helper whose
> sole task is to start a new worker to be called from the caller.
> The helper may want to check if the request to start a new one makes
> sense (e.g. if slots[] are all full, it may even want to return an
> error), but the checks are primarily for error checking (i.e. "we
> can have max N processes, so make sure we do not exceed that limit"),
> not a semantic one (i.e. the caller _could_ choose to spawn less
> than that max when there is a good reason to do so).

I would not put the decision to spawn fewer processes in the caller either,
My understanding is to have one high level function which outlines the algorithm
like:

    loop:
        spawn_children_as_necessary
        make_sure_pipes_don't_overflow
        offer_early_output
        take_care_of_finished_children

such that the main function reads more like a bullet point list explaining
how it roughly works. Each helper function should come up with its own strategy,
so spawn_children_as_necessary could be

    spawn_children_as_necessary:
        while less than n children and there are more tasks:
            spawn_one_child

for now. Later we can add more logic if necessary there. But I'd prefer to have
these decisions put into the helpers.

Having written this, I think I'll separate the function to drain the pipes and
the early output and separate the helper to start children into two:
one to make the decision and one to actually start just one child.

>
>
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html