Re: run-command: output owner picking strategy

Stefan Beller <sbeller@xxxxxxxxxx> · Fri, 20 May 2016 11:39:48 -0700

On Fri, May 20, 2016 at 11:29 AM, William Duclot
<william.duclot@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> When running in parallel we already may be out of order
>> (relative to serial processing). See the second example in the
>> commit message to produce a different order.
>
> Right, I could (should) have understood that by myself.
>
>> Consider we scheduled tasks to be run in 3 parallel processes:
>> (As the NEEDSWORK comment only addresses the output selection,
>> let's assume this is a fixed schedule, which we cannot alter.
>> That is true if we only change the code you quoted, which picks
>> the process to output.)
>>
>> [...]
>
>> The output is produced by the current algorithm:
>> (1) Start with process 1 (A) whose output will be live
>> (2) Once A is done, flush all other done things, (B)
>> (3) live output will be round robin, so process 2 (D)
>> (4) Once D is done, flush all other done things (C, F, E)
>>     in order of who finished first
>>
>>
>> (1) is uncontroversial. We have no information about tasks A,B,C,
>>     so pick a random candidate. We hardcoded process 1 for now.
>>
>> (2) also uncontroversial IMHO. There is not much we can do different.
>
> Agreed
>
>> (3) is what this NEEDSWORK comment is about. Instead of outputting D
>>     we might have chosen C. (for $REASONS, e.g.: C is running longer than
>>     D already, so we expect it to finish sooner, by assuming
>>     any task takes the same expected time to finish. And as C
>>     is expected to finish earlier than D, we may have smoother
>>     output. "Less buffered bursts")
>>
>> [...]
>>
>> This seems to be better than the current behavior as we have more
>> different tasks with "live" output, i.e. you see stuff moving.
>> I made up the data to make the point though. We would need to use
>> live data and experiment with different strategies to find a
>> good/better solution.
>
> We should probably settle on what is the behavior we want to obtain,
> before trying to find a strategy to implement (or approximate) it:
> - Do we want to be as close as possible to a serial processing output?
> - Do we want to see as much live output as possible?
>
> I do not think that being close to serial processing is a relevant
> behavior: we applied an arbitrary order to tasks when naming them for
> explanations (A, B, C...), but the tasks aren't really sorted in any
> way (and that's why the parallelization is relevant). Neither the user
> nor git has any interest in getting these outputs in a specific order.

IIRC, in serial processing the output followed the sort order
within the tree. I agree that this sorting property is of no value to the user.

>
> Therefore, a "as much live output as possible" behavior would be more
> sensible.

I chose "as much live output" as an approximation of "least amount buffered
over time", i.e. if you were to integrate the buffer size over time, that
integral should be minimized. (cf. users waiting for output:
http://imgur.com/gallery/lhjhbB9)
I am not sure if that is the ultimate thing to optimize for, though.
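
To make "integrate the buffer size over time" a bit more concrete, here is
a rough sketch in Python (not the run-command.c code). Everything in it is
an assumption for illustration: the task timings are made up, each running
task is assumed to emit one unit of output per tick, and the names
(Task, buffered_area, pick_round_robin, pick_oldest) are hypothetical.
The schedule is fixed; only the choice of the next output owner varies.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    slot: int   # which parallel process slot runs the task
    start: int  # tick at which the task starts (fixed schedule)
    end: int    # tick at which the task finishes


def pick_round_robin(running, prev_slot, n_slots):
    """Next occupied slot after the previous owner's slot, wrapping around."""
    return min(running, key=lambda t: (t.slot - prev_slot - 1) % n_slots)


def pick_oldest(running, prev_slot, n_slots):
    """Task that has been running longest (ties broken by slot)."""
    return min(running, key=lambda t: (t.start, t.slot))


def buffered_area(tasks, pick):
    """Sum of buffered output over all ticks (a discrete integral)."""
    n_slots = max(t.slot for t in tasks) + 1
    horizon = max(t.end for t in tasks)
    buffered = {t.name: 0 for t in tasks}
    owner, prev_slot, area = None, -1, 0
    for now in range(horizon):
        # Owner finished: flush everything that is done, then free ownership.
        if owner is not None and owner.end <= now:
            for t in tasks:
                if t.end <= now:
                    buffered[t.name] = 0
            prev_slot, owner = owner.slot, None
        running = [t for t in tasks if t.start <= now < t.end]
        if owner is None and running:
            owner = pick(running, prev_slot, n_slots)
            buffered[owner.name] = 0  # its backlog is printed as it goes live
        # Every running task except the owner buffers one unit this tick.
        for t in running:
            if t is not owner:
                buffered[t.name] += 1
        area += sum(buffered.values())
    return area


# Made-up timings, loosely shaped like the A..F example; not measured data.
TASKS = [
    Task("A", 0, 0, 4), Task("E", 0, 4, 9),
    Task("B", 1, 0, 2), Task("D", 1, 2, 10),
    Task("C", 2, 0, 6), Task("F", 2, 6, 8),
]

if __name__ == "__main__":
    print("round robin:", buffered_area(TASKS, pick_round_robin))
    print("oldest first:", buffered_area(TASKS, pick_oldest))
```

With these invented numbers the "oldest first" pick buffers less in total
than round robin, but that only shows how the metric behaves on one toy
schedule. Real timings would be needed to say anything about which
strategy is actually better.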

> But I wonder: is there a worthy benefit in optimizing the
> output owner strategy?

Eventually there will be more users than just submodules for this
parallel processing machinery, I would hope. They would also benefit
from a good foundation.

> I'm not used to working with submodules, but I
> don't think that having a great number of submodules is a common thing.

(Not yet, because of the chicken and egg problem: submodule UI is not
as polished because very few people use it. And few people use it because
of confusing UI. ;)

In his GitMerge2016 talk, Lars Schneider proposed a guideline to
not use more than 25 submodules as it "doesn't scale" IIRC.
And that resentment seems to be all over the place.

> Basically: we could solve a problem, but is there a problem?
> I'm not trying to bury this NEEDSWORK, I'd be happy to look into it if
> need be!

Well, Git as a community doesn't ask you to solve any problems. ;)
So if you have fun thinking about scheduling problems (and as you do
it as part of a university project, if Matthieu is happy about this problem
also), go for it :)

If you find another more "interesting" problem (either as defined by
your personal interests or by the possible impact or by possible grading),
choose that?

Thanks,
Stefan