Re: Plugin mechanism(s) for Git?

On Fri, Jul 15, 2016 at 08:46:03AM +0200, Christian Couder wrote:

> One way to extend it for better performance is to require that the
> configured command should be able to deal with a number or a stream of
> files or objects (instead of just one object/file) that are passed to
> it. It looks like that is what Lars wants for smudge/clean filters.
> 
> Another way is to have the external command run as a daemon, like what
> Duy and David implemented for the index-helper.

Where possible, I think we should avoid daemons. They introduce all
sorts of timing complexities and robustness issues (what happens when
the daemon isn't up? What happens when it hangs? And so on).

Junio mentioned elsewhere the way remote-helpers work, which is to have
a single program that is run once per git invocation, and that can serve
multiple requests interactively by speaking a specific protocol. I think
that's what you're getting at in the first paragraph I've quoted here,
and it's something that has worked reasonably well for us. I _do_ think
we've often not paid close attention to the protocol design, and it has
ended up biting us (there are some serious warts in the remote-helper
protocol, for instance).
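
To make that concrete, such a helper is basically just a request/reply
loop over stdin/stdout, started once and kept alive until git closes
the pipe. A minimal sketch (the command vocabulary here is invented and
much cruder than the real remote-helper one):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char cmd[1024];

          while (fgets(cmd, sizeof(cmd), stdin)) {
                  if (!strcmp(cmd, "capabilities\n"))
                          printf("fetch\npush\n\n");
                  else if (!strncmp(cmd, "fetch ", 6))
                          printf("ok\n");  /* do the work, then report */
                  else
                          printf("error unknown command\n");
                  fflush(stdout);          /* git is waiting on the reply */
          }
          return 0;
  }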

I don't know if we would want to go so far as standardizing on something
like JSON for making RPC requests to any helpers. Probably the more
"git" thing would be to use something based around pkt-lines, but it's
a lot easier to find a JSON library for your helper program. :-/
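
For what it's worth, emitting pkt-lines doesn't really need a library:
each packet is a four-hex-digit length (which counts the four header
bytes themselves) followed by the payload, and "0000" is a flush
packet. A sketch:

  #include <stdio.h>
  #include <string.h>

  /* Write one pkt-line: 4 hex digits of total length, then payload. */
  static void packet_out(FILE *fh, const char *buf, size_t len)
  {
          fprintf(fh, "%04x", (unsigned int)(len + 4));
          fwrite(buf, 1, len, fh);
  }

  /*
   * So a (hypothetical) request might look like this on the wire:
   *
   *   000eversion=1\n    (0x0e = 4 header bytes + 10 payload bytes)
   *   0000               (flush packet: end of this request)
   */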

For clean/smudge filters, that kind of model seems like it would work
well. Better still if the filter can accept requests asynchronously and
return the results out of order (so it can parallelize as it likes
under the hood). I think the external-odb stuff could run this way
pretty easily, too.
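
One way to get out-of-order replies (this is just a sketch, not
anything git has today) is to tag each request with an id that the
filter echoes back, and keep a table of in-flight requests on the git
side:

  #include <stddef.h>

  /* One in-flight request; 'item' is whatever the reply will fill in. */
  struct pending {
          unsigned int id;      /* sent with the request, echoed back */
          void *item;
  };

  static struct pending *find_pending(struct pending *table, size_t n,
                                      unsigned int reply_id)
  {
          size_t i;

          for (i = 0; i < n; i++)
                  if (table[i].id == reply_id)
                          return &table[i];
          return NULL;          /* reply for something we never asked? */
  }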

Though I'm not yet convinced it wouldn't be sufficient to run each
request in its own short-lived process and simply teach git to
parallelize the invocations, letting multiple run at once. The problem
is often one of latency from hitting the helper serially, not overall
CPU time (and you'd need this parallelization anyway to make
out-of-order requests of a single program, so it seems like a useful
first step either way).
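
On the git side that could be as dumb as forking one short-lived
filter per path and throttling how many run at once; e.g. (the filter
name is invented, and error handling is omitted):

  #include <sys/wait.h>
  #include <unistd.h>

  /* Run one filter process per path, at most 'jobs' at a time. */
  static void filter_all(char **paths, int n, int jobs)
  {
          int running = 0, i;

          for (i = 0; i < n; i++) {
                  pid_t pid = fork();

                  if (!pid) {
                          execlp("my-clean-filter", "my-clean-filter",
                                 paths[i], (char *)NULL);
                          _exit(127);          /* exec failed */
                  }
                  if (pid > 0 && ++running >= jobs) {
                          wait(NULL);          /* throttle concurrency */
                          running--;
                  }
          }
          while (running-- > 0)
                  wait(NULL);                  /* reap what's left */
  }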


Some features, like the index-helper, aren't quite so easy. One reason
is that the helper's data needs to persist as a cache across multiple
git invocations. In general, I think it would be nice to solve that by
communicating via on-disk files rather than a running daemon (just
because it has fewer moving parts). But that's only half of it for the
index-helper: it needs to monitor inotify while git isn't running at
all, so it really _does_ need some kind of long-running daemon.

> And a more integrated way is to require the external code to implement
> an API and to be compiled along with Git which looks like the approach
> taken by the ref backend work.

The nice thing about an API like this is that it can be very
high-performance, and it's relatively easy to move data between the API and
the rest of Git. But I still don't think we've quite figured out how
backends are going to be compiled and linked into git. I'm not sure
anybody is really shooting for something like run-time loading of
modules. I think at this stage we're more likely to have a handful of
modules that are baked in at compile time.
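
Concretely, "baked in at compile time" probably means the usual
vtable-plus-static-table pattern: each backend fills in a struct of
function pointers, and a compiled-in table maps a config value to it.
Purely illustrative (the names are invented, not the actual
refs-backend interface):

  #include <stddef.h>

  struct store_backend {
          const char *name;
          int (*init)(const char *path);
          int (*read_key)(const char *key, void **value, size_t *len);
  };

  /* Both of these are compiled and linked into the git binary. */
  extern struct store_backend files_backend;   /* always built in */
  extern struct store_backend lmdb_backend;    /* built in if enabled */

  static struct store_backend *backends[] = {
          &files_backend,
          &lmdb_backend,
          NULL,
  };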

That works OK for the refs code, which is mostly Git-related, and mostly
works synchronously; you ask it for a ref, it looks it up and returns
it. Something like Git-LFS seems much more complicated. Besides being
written in Go and having a bunch of extra library dependencies, it's
inherently network-oriented, and needs to stay responsive across
multiple descriptors (especially if we try to do things in parallel).
That's a lot of complication to stuff into an API. It also has to make
policy decisions that shouldn't necessarily be part of git (like
managing the cache of objects).

> If people think that evolution is better than intelligent design, and
> want each current topic/work to just implement what is best for it,
> then that's ok for me. If on the other hand standardizing on some ways
> to interact with external processes could be helpful to avoid
> duplicating mechanisms/code in slightly different and incompatible
> ways, then I would be happy to discuss it in a thread that is not
> specific to one of the current work.

Those are all just my off-the-cuff thoughts. I reserve the right to
change my opinions above at any time. :)

I _do_ think each of the projects you've mentioned has its own needs,
so I don't think we'll find a one-size-fits-all solution.

-Peff


