Re: Proposal/Discussion: Turning parts of Git into libraries

On Fri, Feb 17, 2023 at 2:57 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Emily Shaffer <nasamuffin@xxxxxxxxxx> writes:
>
> > Basically, if this effort turns out not to be fruitful as a whole, I'd
> > like for us to still have left a positive impact on the codebase.
> > ...
> > So what's next? Naturally, I'm looking forward to a spirited
> > discussion about this topic - I'd like to know which concerns haven't
> > been addressed and figure out whether we can find a way around them,
> > and generally build awareness of this effort with the community.
>
> One of the gravest concerns is that the devil is in the details.
>
> For example, "die() is inconvenient to callers, let's propagate
> errors up the callchain" is an easy thing to say, but it would take
> much more than "let's propagate errors up" to libify something like
> check_connected() to do the same thing without spawning a separate
> process that is expected to exit with failure.

Because the error propagation path is complicated, you mean? Or
because the cleanup is painful?

I wonder about this idea of spawning a worker thread that can
terminate itself, though. Is it a bad idea? Is it a hacky way of
pretending that we have exceptions? I guess if we have a thread then
we still have the same concerns about memory management (which we
don't have if we use a child process). (I'll reply to demerphq's mail
in detail, but it seems like the hardest part of this is memory
cleanup, no?)

In other cases, we might want to perform some work that can be sped up
by using more threads; how do we want to expose that functionality to
the caller? Do we want to manage our own threads, or do we want to
pass off orchestrating those worker threads to the caller (who
theoretically might have a faster way to manage them, like GPU
execution or distributed execution or something, or who might be using
their own thread pool manager)?

>
> It is not clear if we can start small, work on a subset of the
> things and still reap the benefit of libification.  Is there an
> existing example where we have successfully modularized the API of
> one subsystem?  Offhand, I suspect that the refs API with its two
> implementations may be reasonably close, but is the interface into
> that subsystem the granularity of the library interface you guys
> have in mind?

I think many of our internal APIs, especially the lower level ones,
are actually quite well modularized, or close enough to it that you
can't really tell they aren't. run-command.h and config.h come to
mind. The ones that aren't tend to be frustrating to work
with anyway - is it reasonable to consider, for example, further
cleanup of cache.h as part of this effort? Is it reasonable to rework
an ugly circular dependency between two headers as a prerequisite to
doing library work around one of them?

I had a look at the refs API documentation but it seems that we don't
actually have a way for the code to use reftable. Is that what you
meant by the two implementations of refs API, or am I missing
something else? Anyway, abstracting at the "which backend do I want to
use" layer seems absolutely appropriate to me, if we're discussing
places where Git can use an alternative implementation. (For example,
this means it's also easy for Git to use some random NoSQL table as a
ref store, if that's what the caller wants.) For the most part refs.h
seems like it has things I would want to expose to external callers
(or that I would want to reimplement as a library author).

 - Emily


