Re: [PATCH 3/3] docs, parallelism: Rearrange how jobserver reservations are made

Kees Cook <keescook@xxxxxxxxxxxx> · Thu, 21 Nov 2019 11:39:03 -0800

On Thu, Nov 21, 2019 at 09:09:37AM +0100, Rasmus Villemoes wrote:
> On 21/11/2019 01.03, Kees Cook wrote:
> > diff --git a/Documentation/sphinx/parallel-wrapper.sh b/Documentation/sphinx/parallel-wrapper.sh
> > new file mode 100644
> > index 000000000000..a416dbfd2025
> > --- /dev/null
> > +++ b/Documentation/sphinx/parallel-wrapper.sh
> > @@ -0,0 +1,25 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0+
> > +#
> > +# Figure out if we should follow a specific parallelism from the make
> > +# environment (as exported by scripts/jobserver-exec), or fall back to
> > +# the "auto" parallelism when "-jN" is not specified at the top-level
> > +# "make" invocation.
> > +
> > +sphinx="$1"
> > +shift || true
> > +
> > +parallel="${PARALLELISM:-1}"
> > +if [ ${parallel} -eq 1 ] ; then
> > +	auto=$(perl -e 'open IN,"'"$sphinx"' --version 2>&1 |";
> > +			while (<IN>) {
> > +				if (m/([\d\.]+)/) {
> > +					print "auto" if ($1 >= "1.7")
> > +				}
> > +			}
> > +			close IN')
> > +	if [ -n "$auto" ] ; then
> > +		parallel="$auto"
> > +	fi
> > +fi
> > +exec "$sphinx" "-j$parallel" "$@"
> 
> I don't understand this logic. If the parent failed to claim any tokens
> (likely because the top make and its descendants are already running 16
> gcc processes), just let sphinx run #cpus jobs in parallel? That doesn't
> make sense - it gets us back to the "we've now effectively injected K
> tokens to the jobserver that weren't there originally".

I was going to say "but jobserver-exec can't be running unless there are
available slots", but I see the case is "if there are 16 slots and
jobserver-exec gets _1_, it should not fall back to 'auto'".

> From the comment above, what you want is to use "auto" if the top
> invocation was simply "make docs". Well, I kind of disagree with falling
> back to auto in that case; the user can say "make -j8 docs" and the
> wrapper is guaranteed to claim them all. But if you really want, the
> jobserver-count script needs to detect and export the "no parallelism
> requested at top level" in some way distinct from "PARALLELISM=1",
> because that's ambiguous.

Right -- failure needs to be be distinct from "only 1 available".

> > +	# Read out as many jobserver slots as possible.
> > +	while True:
> > +		try:
> > +			slot = os.read(reader, 1)
> > +			jobs += slot
> 
> I'd just try to slurp in 8 or 16 tokens at a time, there's no reason to
> limit to 1 in each loop.

Good point. I will change that.

> > +rc = subprocess.call(sys.argv[1:])
> > +
> > +# Return all the actually reserved slots.
> > +if len(jobs):
> > +	os.write(writer, jobs)
> > +
> > +sys.exit(rc)
> 
> What happens if the child dies from a signal? Will this correctly
> forward that information?

As far as I understand, yes, signal codes are passed through via the exit
code (i.e. see WIFSIGNALED, etc).

> Similarly (and the harder problem), what happens when our parent wants
> to send its child a signal to say "stop what you're doing, return the
> tokens, brush your teeth and go to bed". We should forward that signal
> to the real job instead of just dying, losing track of both the tokens
> we've claimed as well as orphaning the child.

Hm, hm. I guess I could pass INT and TERM to the child. That seems like
the most sensible best-effort here. It seems "make" isn't only looking
at the slots to determine process management.

-- 
Kees Cook