Stefan Beller <sbeller@xxxxxxxxxx> writes:

>  run-command.c          | 264 +++++++++++++++++++++++++++++++++++++++++++++++++
>  run-command.h          |  36 +++++++
>  t/t0061-run-command.sh |  20 ++++
>  test-run-command.c     |  24 +++++
>  4 files changed, 344 insertions(+)

I think we are almost there, but there were still a few more "huh?"
moments in this patch.

> +/* returns 1 if a process was started, 0 otherwise */

This claim is dubious.  The last four lines of this function mark the
i-th slot as in_use and increment nr_processes, so that exit codepath
clearly is about "a process was started" and should be returning 1,
but the code returns 0, which looks incorrect.

> +static int pp_start_one(struct parallel_processes *pp)
> +{
> +	int i;
> +
> +	for (i = 0; i < pp->max_processes; i++)
> +		if (!pp->children[i].in_use)
> +			break;
> +	if (i == pp->max_processes)
> +		die("BUG: bookkeeping is hard");
> +
> +	if (!pp->get_next_task(pp->data,
> +			       &pp->children[i].process,
> +			       &pp->children[i].err))
> +		return 1;

And this one, when get_next_task() says "nothing more to do", is
clearly "we returned without starting anything", so according to the
comment it should be returning 0, but the code returns 1, which looks
incorrect.

> +	if (start_command(&pp->children[i].process))
> +		pp->start_failure(pp->data,
> +				  &pp->children[i].process,
> +				  &pp->children[i].err);

What should happen if start_failure returns without dying?  Shouldn't
this function return something, without doing the remainder of it?
I.e.

	if (start_command(...)) {
		pp->start_failure(...);
		return SOMETHING;
	}

According to the comment at the beginning, that SOMETHING ought to be
0, because we did not spawn anything, i.e. we did not increment
nr_processes.  But don't make that change yet; I do not think "did we
or did we not add a new process?" is a great interface anyway (see
below).
> +	set_nonblocking(pp->children[i].process.err);
> +
> +	pp->nr_processes++;
> +	pp->children[i].in_use = 1;
> +	pp->pfd[i].fd = pp->children[i].process.err;
> +	return 0;

And this is "we spawned" that ought to have returned 1.

Perhaps the comment at the beginning is wrong, as the code
consistently does the opposite of what the comment says.  But it does
not really matter, as I do not think this should be returning "did we
or did we not start a new process?" (see below).

> +}

> +int run_processes_parallel(int n, void *data,
> +			   get_next_task_fn get_next_task,
> +			   start_failure_fn start_failure,
> +			   return_value_fn return_value)
> +{
> +	struct parallel_processes pp;
> +	pp_init(&pp, n, data, get_next_task, start_failure, return_value);
> +
> +	while (1) {
> +		while (pp.nr_processes < pp.max_processes &&
> +		       !pp_start_one(&pp))
> +			; /* nothing */
> +		if (!pp.nr_processes)
> +			break;

This inner loop is why I think "did we or did we not spawn a new
process?" is not a great interface.

The reason why it is not a great interface is that there are two
possible reasons why pp_start_one() does not spawn a new process, and
this caller wants to behave differently depending on why it did not
spawn one.  They are:

 * get_next_task() truly ran out of things to do.

 * get_next_task() gave us a task, but it did not start, and
   start_failure was set not to die (e.g. the function can be used to
   tell the next_task machinery that it needs to return a replacement
   task for the one that failed to run.  That way, upon the next call
   to get_next_task(), a replacement task can run instead of the old
   one that failed).

For the former, we want to stop looping; for the latter, we
definitely do want to keep looping, as we want to make another call
to get_next_task() to grab the replacement task for the one that just
failed.

So I think it makes more sense to define the meaning of the return
value from pp_start_one() differently from the way this patch defines
it.
"Return 0 when we truly ran out of things to do, otherwise return non-zero", for example, would make more sense. The return value does not tell you if the call resulted in one more process, but that is not any loss, as you can look at pp.nr_processes yourself if you really cared. With that, the above caller could be updated, with optional gradual ramp_up, like so: #define RAMP_UP_LIMIT 2 while (1) { int ramp_up; int no_more_task; for (no_more_task = 0, ramp_up = RAMP_UP_LIMIT; !no_more_task && ramp_up && pp.nr_processes < pp.max_processes; ramp_up--) if (!pp_start_one(&pp)) no_more_task = 1; if (!pp.nr_processes && no_more_task) break; If you prefer to swamp the system with a thundering herd at the beginning, you can define RAMP_UP_LIMIT to really a high value instead, e.g. "#define RAMPUP_LIMIT pp.max_processes". I however would not recommend it because doing so would hurt the perceived latency at the beginning. After the system goes into a steady state, how you set RAMP_UP_LIMIT would not make that much difference, as your slots should be almost always full and you will be replenishing an open slot with a single task as each running task finishes, and you would not be running more than one pp_start_one() at a time anyway. > + pp_buffer_stderr(&pp); > + pp_output(&pp); > + pp_collect_finished(&pp); > + } > + > + pp_cleanup(&pp); > + > + return 0; > +} -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html