Re: Proposal/Discussion: Turning parts of Git into libraries

On Fri, Feb 17, 2023 at 8:05 PM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Fri, Feb 17, 2023 at 1:45 PM Emily Shaffer <nasamuffin@xxxxxxxxxx> wrote:
> [...]
> > The good news is that for practical near-term purposes, "libification"
> > mostly means cleanups to the Git codebase, and continuing code health
> > work that the project has already cared about doing:
> >
> > - Removing references to global variables and instead piping them
> > through arguments
> > - Finding and fixing memory leaks, especially in widely-used low-level code
>
> Does removing memory leaks also mean converting UNLEAK to free()?

I suspect so - as I understand it, UNLEAK is a macro that resolves to
"don't complain to me, leak checker, I meant not to free it."

> Thinking of things in a library context probably pushes us in that
> direction (though, alternatively, it might just highlight the question
> of what is considered "low-level" instead).

I'm not sure whether use of UNLEAK has so much to do with "low-level"
or not. In cases when Git is being called as an ephemeral single-run
process, UNLEAK makes a lot of sense. In cases when Git is being
called in a long-lived process, UNLEAK is just a sign that says
"there's a leak here".  So I think the distinction is not low-level or
high-level, but more simply, within a library or not.

I do anticipate that we'll still have "non-libified" code for the
builtins, and that those builtins will invoke libraries at whatever
layer. So UNLEAKing memory allocated by the builtin seems fine to
me, even if that builtin is a "low-level" plumbing command.
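
To make the library-versus-builtin distinction concrete, here is a
minimal sketch in C. Every name in it (demo_list, demo_list_append,
demo_list_release) is hypothetical and not part of Git's API; it only
illustrates the idea that library code pairs each allocation with a
release function, so that a long-lived caller can give the memory back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library type for illustration; NOT part of Git's API. */
struct demo_list {
	char **items;
	size_t nr, alloc;
};

/* Library code: append a copy of `s`; returns 0 on success, -1 on OOM. */
int demo_list_append(struct demo_list *list, const char *s)
{
	char *copy;

	assert(list && s);
	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? 2 * list->alloc : 4;
		char **grown = realloc(list->items, new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		list->items = grown;
		list->alloc = new_alloc;
	}
	copy = malloc(strlen(s) + 1);
	if (!copy)
		return -1;
	strcpy(copy, s);
	list->items[list->nr++] = copy;
	return 0;
}

/* Library code: a long-lived caller needs a way to free everything. */
void demo_list_release(struct demo_list *list)
{
	size_t i;

	for (i = 0; i < list->nr; i++)
		free(list->items[i]);
	free(list->items);
	list->items = NULL;
	list->nr = list->alloc = 0;
}
```

A builtin that exits right after using such a list could skip the
release call (the situation UNLEAK annotates), while a long-lived
process would be expected to call demo_list_release() every time.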

>
> > - Reducing or removing `die()` invocations in low-level code, and
> > instead reporting errors back to callers in a consistent way
>
> What delineates "low-level" code?  (A "we don't know yet but we'll
> start with obvious places and plan to have good discussions on the
> appropriate boundary in the future as we submit patches" is a fine
> answer, I'm just curious if you already have a rough idea of where you
> intend that boundary to lie.)

The biggest one is our "standard library" - stuff like strbuf,
string-list, strvec, etc. etc. I'd like to expose those to callers so
that we don't end up having library interfaces passing around
unterminated buffers, which means that they'll be used in almost any
other library.
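
As a rough illustration of why a length-tracking buffer type is worth
exposing, here is a small sketch. The demo_buf type and its functions
are hypothetical stand-ins and are NOT Git's strbuf; the sketch assumes
only the general idea that the buffer carries its length and stays
NUL-terminated, so an interface never hands a caller a bare pointer of
unknown extent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a strbuf-like type; NOT Git's struct strbuf. */
struct demo_buf {
	char *data;   /* NUL-terminated whenever it is non-NULL */
	size_t len;   /* bytes in use, excluding the terminating NUL */
};

/* Append `n` bytes from `src`; returns 0 on success, -1 on OOM. */
int demo_buf_add(struct demo_buf *b, const void *src, size_t n)
{
	char *grown;

	assert(b && (src || n == 0));
	grown = realloc(b->data, b->len + n + 1);
	if (!grown)
		return -1;
	if (n)
		memcpy(grown + b->len, src, n);
	b->data = grown;
	b->len += n;
	b->data[b->len] = '\0';
	return 0;
}

/* Free the storage and return the buffer to its empty state. */
void demo_buf_release(struct demo_buf *b)
{
	free(b->data);
	b->data = NULL;
	b->len = 0;
}
```

A caller can append raw, unterminated bytes and still read the result
back as an ordinary C string, and the function reports allocation
failure through its return value instead of exiting the process.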

That sort of hints at the next criterion - stuff that's used by lots of
operations, or lots of other parts of Git code. So that means things
like run-command and config.

Past that, we're determining libification order based on internal
priorities. A request like "our VFS helper needs to do `git commit`
with this specific set of constraints, please give us library calls to
do it" would probably result in us working on library interfaces to
hook execution, index parsing, and ref manipulation, and anything
that's a dependency of those three. It's very unlikely that it would
result in something like `git_do_commit(struct git_commit_flags)`.
(That's what I meant about avoiding high-level libraries to begin
with.)

> > - Clarifying the scope and layering of existing modules, for example
> > by moving single-use helpers from the shared module's scope into the
> > single user's scope
> > - Making module interfaces more consistent and easier to understand,
> > including moving "private" functions out of headers and into source
> > files and improving in-header documentation
>
> I think these are very positive directions.  I like the fact that your
> initial plan benefits all of us, whether or not libification is
> ultimately achieved.
>
> [...]
> > So what's next? Naturally, I'm looking forward to a spirited
> > discussion about this topic - I'd like to know which concerns haven't
> > been addressed and figure out whether we can find a way around them,
> > and generally build awareness of this effort with the community.
>
> I'm curious whether clarifying scope/layering and cleaning up
> interfaces might mean you'd be interested in things like:
>   * https://github.com/newren/git/commits/header-cleanups (which was
> still WIP; I paused working on it because I figured people would see
> it as big "cleanup" patches with no practical benefit)
>   * https://github.com/gitgitgadget/git/pull/1149 (which has been
> ready to submit for a _long_ time, but I just haven't yet)
> or if these two things are orthogonal to what you have in mind.

Extremely yes. :) Even "small" stuff like the need for header cleanups
has already come up for Glen and Calvin working on config and strbuf.

>
> > I'm also planning to send a proposal for a document full of "best
> > practices" for turning Git code into libraries (and have quite a lot
> > of discussion around that document, too). My hope is that we can use
> > that document to help us during implementation as well as during
> > review, and refine it over time as we learn more about what works and
> > what doesn't. Having this kind of documentation will make it easy for
> > others to join us in moving Git's codebase towards a clean set of
> > libraries. I hope that, as a project, we can settle on some tenets
> > that we all agree would make Git nicer.
>
> I like the sound of this.
>
> > After that, we're still hoping to target low-level libraries first - I
> > certainly don't think it will make sense to ship a high-level `git
> > commit` library in the near future, if ever - in the order that
> > they're required from the VFS project we're working closely with. As
> > far as I can tell right now, that's likely to cover object store and
> > worktree access, as well as commit creation and pushing, but we'll see
> > how planning shakes out over the next month or so. But Google's
> > schedule should have no bearing on what others in the Git project feel
> > is important to clean up and libify, and if there is interest in the
> > rest of the project in converting other existing modules into
> > libraries, my team and I are excited to participate in the review.
>
> If we can't libify something like commit, does that prevent libifying
> higher level things like merge?
>
> I spent some time thinking about this a while back.  I tried to
> carefully design merge-ort to improve the odds it could be used
> elsewhere, maybe even libgit2.  (I hope it shows in the many comments
> in merge-ort.h, and I think the "priv" field in particular allowing me
> to hide the first ~300 lines of merge-ort.c declaring data structures
> from users was really nice.)  However, I still had to accept data in
> some known format.  So input parameters are things like trees and
> commits.  But tree.h and commit.h both include object.h first, which
> includes cache.h, which is basically all of Git.  And the functions I
> call to interoperate with the system are similarly entangled.  So, the
> odds of merge-ort being reused by libgit2 or otherwise used in a
> library seems essentially nil, at least without some broader
> libification effort.

Yeah, it depends a lot on the usage. For merge, it would be tricky to
get the scope just right - should the merge library be responsible for
locating the merge-base? Should it just perform the conflict
resolution? Something else?

As for "I tried to include this thing which eventually included
cache.h" - yeah, I think we will be pulling stuff out of cache.h quite
heavily. But IMO this falls under "cleanups we want to do in Git
anyway" - I think it's widely understood that cache.h is not
well-scoped and could use improvement.

>
> I'd like to make that story better, time permitting (which is much
> more of a challenge these days than it was a couple years ago), but
> I'm curious if you or others have thoughts on something like that.
>
> > Much, much later on, I'm expecting us to form a plan around allowing
> > "plugins" - that is, replacing library functionality we use today with
> > an alternative library, such as an object store relying on a
> > distributed file store like S3. Making that work well will also likely
> > involve us coming up with a solution for dependency injection, and to
> > begin using vtables for some libraries. I'm hoping that we can figure
> > out a way to do that that won't make the Git source ugly. Around this
> > time, I think it will make sense to buy into unit tests even more and
> > start using an approach like mocking to test various edge cases. And
> > at some point, it's likely that we'll want to make the interfaces to
> > various Git libraries consistent with each other, which would involve
> > some large-scale but hopefully-mechanical refactors.
>
> Would these plugins resemble the pluggable merge backends that were
> added to builtin/merge.c?  Would it replace that mechanism with a
> different one?  Would it be more like the refs backends?

I suspect that it's likely to be most similar to the refs backend
replacement, although I investigated it only a little bit just now.
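
For readers unfamiliar with the vtable approach, here is a minimal
sketch of what a swappable backend could look like in C. It is only an
illustration of the general technique: demo_store, demo_store_ops and
the toy in-memory backend are all hypothetical names, not existing Git
interfaces and not a proposal for the real design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical vtable; none of these names exist in Git. */
struct demo_store;

struct demo_store_ops {
	const char *name;
	/* Return 0 and fill in `out` and `len`, or -1 if `key` is unknown. */
	int (*read)(struct demo_store *store, const char *key,
		    const char **out, size_t *len);
};

struct demo_store {
	const struct demo_store_ops *ops;
	void *priv; /* backend-specific state */
};

/* Toy backend that holds exactly one entry in memory. */
struct demo_mem_state {
	const char *key;
	const char *value;
};

static int demo_mem_read(struct demo_store *store, const char *key,
			 const char **out, size_t *len)
{
	struct demo_mem_state *st = store->priv;

	if (strcmp(st->key, key))
		return -1;
	*out = st->value;
	*len = strlen(st->value);
	return 0;
}

const struct demo_store_ops demo_mem_ops = {
	"in-memory",
	demo_mem_read,
};

/* Callers go through the vtable and never name a concrete backend. */
int demo_store_read(struct demo_store *store, const char *key,
		    const char **out, size_t *len)
{
	assert(store && store->ops && store->ops->read);
	return store->ops->read(store, key, out, len);
}
```

Swapping in a different backend then means supplying another
demo_store_ops instance; code that calls demo_store_read() stays the
same, which is also what would make mocking in unit tests practical.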

The pluggable merge backends are an interesting thought - right now
all those alternatives are built in and we decide based on config,
right? But if we were able to easily decide which library to link
based on config during runtime, then I could see that being an
appealing use of plugins too. I wonder whether "custom" merge backends
make the story for people doing compiled asset storage in Git (like
game assets or hardware layouts, both of which famously merge
horribly) any easier.

>
> Would this plugin scheme allow us to, for example, use gitoxide[1] as
> a clone replacement to make clones 2x as fast (and with half the
> memory -- although I suspect they cheated and used sha1 instead of
> sha1dc, so maybe it wouldn't really be 2x)?

Interesting. It would depend on whether we can match the interface to
gitoxide, or write a translation layer. I could see it! I'm also a
little curious how much of that speedup is because of corner-cutting
(since you mentioned skipping the collision detection) vs. how much is
due to Rust magic. In theory, building the Git CLI out of a handful of
libraries means that we could write some of those libraries in
something besides C; in practice, I understand there's a
maintainability issue around introducing new languages into the pile
of stuff the community is expected to understand and maintain. (For
example, I think many people don't like to touch git-gui, probably
primarily because it's in Tcl.)

>
> Oh, and it's totally okay if you don't know the answers to any or all
> of my questions right now.  I'm just curious, because I've long
> thought these kinds of directions would be good.  Since I've spent
> time thinking about it, I have questions that I don't know the answers
> to, but I figured it couldn't hurt to bounce them off others who are
> thinking about this area.
>
> Anyway, it's a large pile of work that you're undertaking, and as
> Junio comments elsewhere in this thread it's unclear if libification
> can be achieved for a big enough component (and you seem to admit as
> much in your email as well), but I applaud the general direction and
> your initial plans.

Thanks for your thoughtful reply.

 - Emily

>
>
> [1] https://github.com/Byron/gitoxide/discussions/579


