Re: [PATCH v7 15/17] hook: provide stdin by string_list or callback

Emily Shaffer <emilyshaffer@xxxxxxxxxx> · Mon, 1 Mar 2021 17:47:44 -0800

On Thu, Feb 25, 2021 at 12:56:11PM -0800, Jonathan Tan wrote:
> 
> > > When would we need per-hook state? I see in patch 14 that you give each
> > > running process little by little (in pp_buffer_stdin()), perhaps so that
> > > each hook can make progress at roughly the same pace, but I don't think
> > > we can expect all hooks to work the same, so I don't think it's worth
> > > complicating the design for all that.
> > 
> > I agree that this is a complicated way of doing it, and if you have a
> > better design I'd be really excited to hear it.
> > 
> > It seemed like this was what was necessary for hooks like
> > https://lore.kernel.org/git/20201222000435.1529768-15-emilyshaffer@xxxxxxxxxx
> > where the hook and the invoking process talk back and forth, or like
> > https://lore.kernel.org/git/20201222000435.1529768-17-emilyshaffer@xxxxxxxxxx
> > which generates stdin on the fly for hooks which cannot be parallelized
> > (and so won't run at the same pace).
> > 
> > The former example - proc-receive - does have a constraint that multiple
> > hooks can't be specified, so we could theoretically keep the old
> > implementation and just pick up the single hook's location from the new
> > hook library. But the latter example still makes me think this much
> > complexity is needed.
> 
> Ah, I see. From your explanation, in these 2 cases, only one hook
> executes at a time (in the former case, because there is only one hook,
> and in the latter case, you said that the hooks cannot be parallelized).
> So it seems to me that the global state (in struct run_hooks_opt) would
> be sufficient to keep track of what's going on. (The feed_pipe_fn
> function can use pp_cb to keep track of the last executing pp_task_cb
> and then compare it against the new pp_task_cb, I think, to keep track
> of when a new hook has started.)
> 
> Even in the case of multiple hooks run in series (as opposed to a single
> hook), I would think that the reason they can't be run in parallel is
> because the nature of execution of a hook depends on what happened
> during the execution of the previous hook, which seems to me to be even
> more reason to centralize the state in struct run_hooks_opt.
> 
> Having said that, if my suggestion of not having per-hook state makes
> certain patches more complicated, then that might be reason enough to
> have per-hook state. In that case, you should write "per-hook state,
> though strictly not necessary, makes <case> simpler" (or something like
> that) in the commit message.

Jonathan and I discussed this a little more offline and agreed to leave
the implementation as is.

Jonathan had suggested "have one callback invocation apply to all hooks
that are running now", either by having the callback iterate over the
task queue or by having the run-command lib take the result from the
callback and have *that* iterate over the task queue. The idea being,
one pointer to one copy of source material is easier to handle than
many.

I suggested that the callback's implementation of the second version of
that, where the library takes care of the "and do it for each task in
progress" part, would be pretty much identical to the callback's
implementation as it is in this patch, except that as it is here the
context pointer is per-task and as Jonathan suggests the context pointer
is per-entire-hook-invocation - so there isn't much complexity
difference between the two, from the user's perspective.

We also talked about cases where N=# of hooks > M=# of jobs, that is,
where some hooks must wait for other hooks to finish executing before
that could start. In this case, users' callback implementations would
need to be able to start over from the beginning of the source material,
and a long-running hook would block other short-running hooks from
beginning (because the long-running hook would be confused by hearing
the source material to its stdin again).

Hopefully this diagram illustrates better based on my understanding of
Jonathan's suggestion:

 A B <- "hey everyone, stdin 'foo'"
 A B <- "hey everyone, stdin 'bar'"
 A B <- "hey everyone, stdin 'baz'"
   B
   B
   B
   B
 C   <- "hey everyone, stdin 'foo'"
 C   <- "hey everyone, stdin 'bar'"
 C   <- "hey everyone, stdin 'baz'"
 C

 Anyway, since the complexity is probably about the same to the end user
 and using per-hook context means we don't have to wait like this, we
 agreed to stick with the implementation as is.

  - Emily