On Fri, Feb 17, 2023 at 8:05 PM Elijah Newren <newren@xxxxxxxxx> wrote: > > On Fri, Feb 17, 2023 at 1:45 PM Emily Shaffer <nasamuffin@xxxxxxxxxx> wrote: > [...] > > The good news is that for practical near-term purposes, "libification" > > mostly means cleanups to the Git codebase, and continuing code health > > work that the project has already cared about doing: > > > > - Removing references to global variables and instead piping them > > through arguments > > - Finding and fixing memory leaks, especially in widely-used low-level code > > Does removing memory leaks also mean converting UNLEAK to free()? I suspect so - as I understand it, UNLEAK is a macro that resolves to "don't complain to me, compiler, I meant not to free it." > Thinking of things in a library context probably pushes us in that > direction (though, alternatively, it might just highlight the question > of what is considered "low-level" instead). I'm not sure whether use of UNLEAK has so much to do with "low-level" or not. In cases when Git is being called as an ephemeral single-run process, UNLEAK makes a lot of sense. In cases when Git is being called in a long-lived process, UNLEAK is just a sign that says "there's a leak here". So I think the distinction is not low-level or high-level, but more simply, within a library or not. I do anticipate that we'll still have "non-libified" code for the builtins, and that those builtins will invoke libraries at whatever layer. So UNLEAKing memory allocated by the builtin - seems fine to me, even if that builtin is a "low-level" plumbing command. > > > - Reducing or removing `die()` invocations in low-level code, and > > instead reporting errors back to callers in a consistent way > > What delinates "low-level" code? (A "we don't know yet but we'll > start with obvious places and plan to have good discussions on the > appropriate boundary in the future as we submit patches" is a fine > answer, I'm just curious if you already have a rough idea of where you > intend that boundary to lie.) The biggest one is our "standard library" - stuff like strbuf, string-list, strvec, etc. etc. I'd like to expose those to callers so that we don't end up having library interfaces passing around unterminated buffers, which means that they'll be used in almost any other library. That sort of hints at the next criteria - stuff that's used by lots of operations, or lots of other parts of Git code. So that means things like run-command and config. Past that, we're determining libification order based on internal priorities. A request like "our VFS helper needs to do `git commit` with this specific set of constraints, please give us library calls to do it" would probably result in us working on library interfaces to hook execution, index parsing, and ref manipulation, and anything that's a dependency of those three. It's very unlikely that it would result in something like `git_do_commit(struct git_commit_flags)`. (That's what I meant about avoiding high-level libraries to begin with.) > > - Clarifying the scope and layering of existing modules, for example > > by moving single-use helpers from the shared module's scope into the > > single user's scope > > - Making module interfaces more consistent and easier to understand, > > including moving "private" functions out of headers and into source > > files and improving in-header documentation > > I think these are very positive directions. I like the fact that your > initial plan benefits all of us, whether or not libification is > ultimately achieved. > > [...] > > So what's next? Naturally, I'm looking forward to a spirited > > discussion about this topic - I'd like to know which concerns haven't > > been addressed and figure out whether we can find a way around them, > > and generally build awareness of this effort with the community. > > I'm curious whether clarifying scope/layering and cleaning up > interfaces might mean you'd be interested in things like: > * https://github.com/newren/git/commits/header-cleanups (which was > still WIP; I paused working on it because I figured people would see > it as big "cleanup" patches with no practical benefit) > * https://github.com/gitgitgadget/git/pull/1149 (which has been > ready to submit for a _long_ time, but I just haven't yet) > or if these two things are orthogonal to what you have in mind. Extremely yes. :) Even "small" stuff like the need for header cleanups have already come up for Glen and Calvin working on config and strbuf. > > > I'm also planning to send a proposal for a document full of "best > > practices" for turning Git code into libraries (and have quite a lot > > of discussion around that document, too). My hope is that we can use > > that document to help us during implementation as well as during > > review, and refine it over time as we learn more about what works and > > what doesn't. Having this kind of documentation will make it easy for > > others to join us in moving Git's codebase towards a clean set of > > libraries. I hope that, as a project, we can settle on some tenets > > that we all agree would make Git nicer. > > I like the sound of this. > > > After that, we're still hoping to target low-level libraries first - I > > certainly don't think it will make sense to ship a high-level `git > > commit` library in the near future, if ever - in the order that > > they're required from the VFS project we're working closely with. As > > far as I can tell right now, that's likely to cover object store and > > worktree access, as well as commit creation and pushing, but we'll see > > how planning shakes out over the next month or so. But Google's > > schedule should have no bearing on what others in the Git project feel > > is important to clean up and libify, and if there is interest in the > > rest of the project in converting other existing modules into > > libraries, my team and I are excited to participate in the review. > > If we can't libify something like commit, does that prevent libifying > higher level things like merge? > > I spent some time thinking about this a while back. I tried to > carefully design merge-ort to improve the odds it could be used > elsewhere, maybe even libgit2. (I hope it shows in the many comments > in merge-ort.h, and I think the "priv" field in particular allowing me > to hide the first ~300 lines of merge-ort.c declaring data structures > from users was really nice.) However, I still had to accept data in > some known format. So input parameters are things like trees and > commits. But tree.h and commit.h both include object.h first, which > includes cache.h, which is basically all of Git. And the functions I > call to interoperate with the system are similarly entangled. So, the > odds of merge-ort being reused by libgit2 or otherwise used in a > library seems essentially nil, at least without some broader > libification effort. Yeah, it depends a lot on the usage. For merge, it would be tricky to get the scope just right - should the merge library be responsible for locating the merge-base? Should it just perform the conflict resolution? Something else? As for "I tried to include this thing which eventually included cache.h" - yeah, I think we will be pulling stuff out of cache.h quite heavily. But IMO this falls under "cleanups we want to do in Git anyway" - I think it's widely understood that cache.h is not well-scoped and could use improvement. > > I'd like to make that story better, time permitting (which is much > more of a challenge these days than it was a couple years ago), but > I'm curious if you or others have thoughts on something like that. > > > Much, much later on, I'm expecting us to form a plan around allowing > > "plugins" - that is, replacing library functionality we use today with > > an alternative library, such as an object store relying on a > > distributed file store like S3. Making that work well will also likely > > involve us coming up with a solution for dependency injection, and to > > begin using vtables for some libraries. I'm hoping that we can figure > > out a way to do that that won't make the Git source ugly. Around this > > time, I think it will make sense to buy into unit tests even more and > > start using an approach like mocking to test various edge cases. And > > at some point, it's likely that we'll want to make the interfaces to > > various Git libraries consistent with each other, which would involve > > some large-scale but hopefully-mechanical refactors. > > Would these plugins resemble the pluggable merge backends that was > added to builtin/merge.c? Would it replace that mechanism with a > different one? Would it be more like the refs backends? I suspect that it's likely to be most similar to the refs backend replacement, although I investigated it only a little bit just now. The pluggable merge backends are an interesting thought - right now all those alternatives are built in and we decide based on config, right? But if we were able to easily decide which library to link based on config during runtime, then I could see that being an appealing use of plugins too. I wonder whether "custom" merge backends make the story for people doing compiled asset storage in Git (like game assets or hardware layouts, both of which famously merge horribly) any easier. > > Would this plugin scheme allow us to, for example, use gitoxide[1] as > a clone replacement to make clones 2x as fast (and with half the > memory -- although I suspect they cheated and used sha1 instead of > sha1dc, so maybe it wouldn't really be 2x)? Interesting. It would depend on whether we can match the interface to gitoxide, or write a translation layer. I could see it! I'm also a little curious how much of that speedup is because of corner-cutting (since you mentioned skipping the collision detection) vs. how much is due to Rust magic. In theory, building the Git CLI out of a handful of libraries means that we could write some of those libraries in something besides C; in practice, I understand there's a maintainability issue around introducing new languages into the pile of stuff the community is expected to understand and maintain. (For example, I think many people don't like to touch git-gui, probably primarily because it's in Tcl.) > > Oh, and it's totally okay if you don't know the answers to any or all > of my questions right now. I'm just curious, because I've long > thought these kinds of directions would be good. Since I've spent > time thinking about it, I have questions that I don't know the answers > to, but I figured it couldn't hurt to bounce them off others who are > thinking about this area. > > Anyway, it's a large pile of work that you're undertaking, and as > Junio comments elsewhere in this thread it's unclear if libification > can be achieved for a big enough component (and you seem to admit as > much in your email as well), but I applaud the general direction and > your initial plans. Thanks for your thoughtful reply. - Emily > > > [1] https://github.com/Byron/gitoxide/discussions/579