Stefan Beller <sbeller@xxxxxxxxxx> writes:

>  run-command.c          | 264 +++++++++++++++++++++++++++++++++++++++++++++++++
>  run-command.h          |  36 +++++++
>  t/t0061-run-command.sh |  20 ++++
>  test-run-command.c     |  24 +++++
>  4 files changed, 344 insertions(+)

I think we are almost there, but there were still a few more "huh?"
moments in this patch.

> +/* returns 1 if a process was started, 0 otherwise */

This claim is dubious.  The last four lines of this function mark the
i-th slot as in_use and increment nr_processes, so that exit codepath
clearly is about "a process was started" and should be returning 1,
but the code returns 0, which looks incorrect.

> +static int pp_start_one(struct parallel_processes *pp)
> +{
> +	int i;
> +
> +	for (i = 0; i < pp->max_processes; i++)
> +		if (!pp->children[i].in_use)
> +			break;
> +	if (i == pp->max_processes)
> +		die("BUG: bookkeeping is hard");
> +
> +	if (!pp->get_next_task(pp->data,
> +			       &pp->children[i].process,
> +			       &pp->children[i].err))
> +		return 1;

And this one, when get_next_task() says "nothing more to do", is
clearly "we returned without starting anything", so according to the
comment it should be returning 0, but the code returns 1, which looks
incorrect.

> +	if (start_command(&pp->children[i].process))
> +		pp->start_failure(pp->data,
> +				  &pp->children[i].process,
> +				  &pp->children[i].err);

What should happen if start_failure returns without dying?  Shouldn't
this function return something, without doing the remainder of it?
I.e.

	if (start_command(...)) {
		pp->start_failure(...);
		return SOMETHING;
	}

According to the comment at the beginning, that SOMETHING ought to be
0, because we did not spawn anything, i.e. we did not increment
nr_processes.  But don't make that change yet; I do not think "did we
or did we not add a new process?" is a great interface anyway (see
below).
> +	set_nonblocking(pp->children[i].process.err);
> +
> +	pp->nr_processes++;
> +	pp->children[i].in_use = 1;
> +	pp->pfd[i].fd = pp->children[i].process.err;
> +	return 0;

And this is "we spawned" that ought to have returned 1.

Perhaps the comment at the beginning is wrong, as the code
consistently does the opposite of what the comment says.  But it does
not really matter, as I do not think this should be returning "did we
or did we not start a new process?" (see below).

> +}

> +int run_processes_parallel(int n, void *data,
> +			   get_next_task_fn get_next_task,
> +			   start_failure_fn start_failure,
> +			   return_value_fn return_value)
> +{
> +	struct parallel_processes pp;
> +	pp_init(&pp, n, data, get_next_task, start_failure, return_value);
> +
> +	while (1) {
> +		while (pp.nr_processes < pp.max_processes &&
> +		       !pp_start_one(&pp))
> +			; /* nothing */
> +		if (!pp.nr_processes)
> +			break;

This inner loop is why I think "did we or did we not spawn a new
process?" is not a great interface.

The reason why it is not a great interface is that there are two
possible reasons why pp_start_one() does not spawn a new process, and
this caller wants to behave differently depending on why it did not
spawn one.  They are:

 * get_next_task() truly ran out of things to do.

 * get_next_task() gave us a task, but it did not start, and
   start_failure was set not to die (e.g. the function can be used to
   tell the next_task machinery that it needs to return a replacement
   task for the one that failed to run.  That way, upon the next call
   to get_next_task(), a replacement task can run instead of the old
   one that failed).

For the former, we want to stop looping; for the latter, we
definitely do want to keep looping, as we want to make another call
to get_next_task() to grab the replacement task for the one that just
failed.

So I think it makes more sense to define the meaning of the return
value from pp_start_one() differently from the way this patch defines
it.
"Return 0 when we truly ran out of things to do, otherwise return non-zero", for example, would make more sense. The return value does not tell you if the call resulted in one more process, but that is not any loss, as you can look at pp.nr_processes yourself if you really cared. With that, the above caller could be updated, with optional gradual ramp_up, like so: #define RAMP_UP_LIMIT 2 while (1) { int ramp_up; int no_more_task; for (no_more_task = 0, ramp_up = RAMP_UP_LIMIT; !no_more_task && ramp_up && pp.nr_processes < pp.max_processes; ramp_up--) if (!pp_start_one(&pp)) no_more_task = 1; if (!pp.nr_processes && no_more_task) break; If you prefer to swamp the system with a thundering herd at the beginning, you can define RAMP_UP_LIMIT to really a high value instead, e.g. "#define RAMPUP_LIMIT pp.max_processes". I however would not recommend it because doing so would hurt the perceived latency at the beginning. After the system goes into a steady state, how you set RAMP_UP_LIMIT would not make that much difference, as your slots should be almost always full and you will be replenishing an open slot with a single task as each running task finishes, and you would not be running more than one pp_start_one() at a time anyway. > + pp_buffer_stderr(&pp); > + pp_output(&pp); > + pp_collect_finished(&pp); > + } > + > + pp_cleanup(&pp); > + > + return 0; > +} -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html