Re: [PATCH v2 2/6] hook: allow parallel hook execution

Emily Shaffer <emilyshaffer@xxxxxxxxxx> · Mon, 16 Aug 2021 16:59:45 -0700

On Thu, Aug 12, 2021 at 10:51:01AM -0700, Junio C Hamano wrote:
> 
> Emily Shaffer <emilyshaffer@xxxxxxxxxx> writes:
> 
> > In many cases, there's no reason not to allow hooks to execute in
> > parallel. run_processes_parallel() is well-suited - it's a task queue
> > that runs its housekeeping in series, which means users don't
> > need to worry about thread safety on their callback data. True
> > multithreaded execution with the async_* functions isn't necessary here.
> > Synchronous hook execution can be achieved by only allowing 1 job to run
> > at a time.
> >
> > Teach run_hooks() to use that function for simple hooks which don't
> > require stdin or capture of stderr.
> 
> Since run_hooks() has been using run_processes_parallel_tr2()
> already in the ab/config-based-hooks-base topic, the above
> description puzzled me quite a bit.  I think you meant to say with
> this step you update the callers to pass either .jobs=1 or .jobs=0
> in the hooks_opt, so that hooks _can_ run in parallel (as opposed
> to using hardcoded jobs=1 in run_hooks() when calling the underlying
> run_processes_parallel_tr2() function).

I think you are right, and this commit message is from a much older
version of the series. Ok; I will reword it. I think it can be much
simpler.

> 
> I do not think SYNC and ASYNC are great adjectives for what you are
> doing in this step.  Wouldn't ASYNC mean "the caller tells the hooks
> to run, continues without waiting for them to finish"?  In order to
> let more than one hooks to run at the same time, the caller has to
> continue without waiting for the first one it started so that it can
> start the second one before the first one finishes (otherwise these
> two hooks will not run at the same time), but ASYNC implys a lot
> more than that.  After starting all of these hooks, the caller may
> continue doing things that are not even related to these spawned
> hooks if a hook is run asynchronously.
> 
> But I do not think that is what is happening here.  After all,
> run_processes_parallel_tr2() will not return until all of the
> subprocesses are culled.
> 
> Perhaps the distinction you are trying to express is easier to
> convey to readers if you instead used the verb TO RUN with the
> adverb SERIALLY vs IN PARALLEL (as opposed to SYNCHRONOUSLY vs
> ASYNCHRONOUSLY)?

Good point. I think you are right and I'll change the init macros to
SERIAL/PARALLEL instead.

> 
> > diff --git a/builtin/hook.c b/builtin/hook.c
> > index 4d39c9e75e..12c9126032 100644
> > --- a/builtin/hook.c
> > +++ b/builtin/hook.c
> > @@ -22,7 +22,7 @@ static const char * const builtin_hook_run_usage[] = {
> >  static int run(int argc, const char **argv, const char *prefix)
> >  {
> >  	int i;
> > -	struct run_hooks_opt opt = RUN_HOOKS_OPT_INIT;
> > +	struct run_hooks_opt opt = RUN_HOOKS_OPT_INIT_SYNC;
> >  	int ignore_missing = 0;
> >  	const char *hook_name;
> >  	struct list_head *hooks;
> > @@ -32,6 +32,8 @@ static int run(int argc, const char **argv, const char *prefix)
> >  			 N_("exit quietly with a zero exit code if the requested hook cannot be found")),
> >  		OPT_STRING(0, "to-stdin", &opt.path_to_stdin, N_("path"),
> >  			   N_("file to read into hooks' stdin")),
> > +		OPT_INTEGER('j', "jobs", &opt.jobs,
> > +			    N_("run up to <n> hooks simultaneously")),
> >  		OPT_END(),
> >  	};
> >  	int ret;
> 
> OK, so by default "git hook" runs 1 at a time but we can instruct to
> run the optimal number of jobs via "git hook -j0" etc.

Is it surprising for an explicitly provided "-j0" to still be overridden
by a config if one was provided? That is, "-j0" doesn't mean
"-j$(nproc)" as it does in many other Unixy tools, here. It means
"-j(hook.jobs OR $(nproc))". I guess you could also say that "if you
configured hook.jobs, it is because you decided nproc was less optimal
than the value you put into hook.jobs" - so maybe it is fine?

> 
> > diff --git a/hook.c b/hook.c
> > index 80e150548c..37f682c6d8 100644
> > --- a/hook.c
> > +++ b/hook.c
> > @@ -228,6 +228,28 @@ static int notify_hook_finished(int result,
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Determines how many jobs to use after we know we want to parallelize. First
> > + * priority is the config 'hook.jobs' and second priority is the number of CPUs.
> > + */
> > +static int configured_hook_jobs(void)
> > +{
> > +	/*
> > +	 * The config and the CPU count probably won't change during the process
> > +	 * lifetime, so cache the result in case we invoke multiple hooks during
> > +	 * one process.
> > +	 */
> > +	static int jobs = 0;
> > +	if (jobs)
> > +		return jobs;
> > +
> > +	if (git_config_get_int("hook.jobs", &jobs))
> > +		/* if the config isn't set, fall back to CPU count. */
> > +		jobs = online_cpus();
> > +
> > +	return jobs;
> > +}
> 
> Not a suggestion to make this improvement as a part of this series,
> but there already are more than several codepaths doing essentially
> the same thing as above, taking their own configuration or
> environment variables and falling back to online_cpus().
> 
> It may be worth introducing a shared helper function so that this
> can become
> 
> 	static int configured_hook_jobs(void)
> 	{
> 		static int jobs; /* no need for "= 0" */
> 
> 		if (!jobs)
> 			jobs = default_parallelism("hook.jobs", NULL);
> 		return jobs;
> 	}
> 
> where the new helper takes the configuration and environment
> variable names.

Sounds handy.

 - Emily