Hi folks,

As I mentioned in standup this week[1], my colleagues and I at Google have become very interested in converting parts of Git into libraries usable by external programs. In other words, for some modules which already have clear boundaries inside of Git - like config.[ch], strbuf.[ch], etc. - we want to remove some implicit dependencies, like references to globals, and make explicit other dependencies, like references to other modules within Git. Eventually, we'd like both for an external program to be able to use Git libraries within its own process, and for Git to be given an alternative implementation of a library it uses internally (like a plugin at runtime).

This turned out pretty long-winded, so a quick summary before I dive in:

- We want to compile parts of Git as independent libraries
- We want to do it by making incremental code quality improvements to Git
- Let's avoid promising stability of the interfaces of those libraries
- We think it'll let Git do cool stuff like unit tests and allowing purpose-built plugins
- Hopefully by example we can convince the rest of the project to join in the effort

My team has spent the past year or so trying to make improvements to Git's behavior with submodules, and we found that the current structure of Git is quite tricky to work with: because Git doesn't operate well on a second repository in the same process, recursing into submodules typically involves spawning child processes, and piping new arguments through the helpers around those child processes, and then through Git's typical codepaths, is very tricky. After spending more than a year trying to make improvements, we have very little to show for it, largely as a result of the difficulty of passing information between superprojects and submodules.

It seems like being able to invoke parts of Git as a library, or Git being able to invoke custom libraries, does a lot of good for the Git project:

- Having clear, modular libraries makes it easy to find the code responsible for some behavior, or determine where to add something new.
- Operations recursing into submodules can be run in-process, and the callsite makes it very clear which repository context is being used; if we're not subprocessing to recurse, then we can lose the awkward git-submodule.sh -> builtin/submodule--helper.c -> top-level Git process codepath.
- Being able to test libraries in isolation via unit tests or mocks speeds up determining the root cause of bugs.
- If we can swap out an entire library, or just a single function of one, it's easy to experiment with the entire codebase without sweeping changes.

The ability to use Git as a library also makes it easier for other tooling to interact with a Git repository *correctly*. As an example, `repo` has a long history of abusing Git by directly manipulating the gitdir[2], but if it were written in a world where Git exists as easy-to-use libraries, it probably wouldn't have needed to, as it could have invoked Git directly or replaced the relevant modules with its own implementation. Both `repo`[3] and `git-gui`[4] have reimplemented logic from git.git. Other interfaces that cooperate with Git's filesystem storage layer, like `scalar` or `jj`[5], would be able to interop with a Git repository without having to reimplement custom logic or keep up with new Git changes.

Of course, there's a reason Google wants it, too. We've talked previously about wanting better integration between Git and something like a VFS; as we've experimented with it internally, we've found a couple of tricky areas:

- The VFS relies on running Git commands to determine the state of the repository. However, these Git commands interact with the gitdir or worktree, which is populated by the VFS.
For example, parsing a .gitattributes or .gitmodules which is already stored in the VFS requires the VFS to provide a POSIX file handle, spawn a Git subprocess, populate other files needed by that subprocess (like .git/config), and finally collect the output stream of the subprocess. As you can imagine, this interaction of VFS -> Git -> VFS [-> Git] creates all sorts of complications. The alternative is for the VFS to write its own parser (or use a library like libgit2; more on that later). But having a Git library means that a subset of Git functionality can happen in-process, and that filesystem access could be replaced by the VFS directly providing high-level objects or plain bytestreams.
- A user running `git status` in a directory controlled by the VFS will require the VFS to populate the entire (large) worktree - even though the VFS is sure that only one file has been modified. The closest we've come with an alternative is an orchestrated use of sparse-checkout - but modifying the sparse-checkout configs automatically in response to the user's filesystem operations takes us right back to the previous point.

If Git could take a plugin as a replacement for the object store that directly interacts with the VFS daemon, a layer of complexity would disappear and performance would improve.

We discussed using an existing library like libgit2 or JGit, but it's not a very exciting proposal: these libraries are already lagging behind git.git in features, and trying to use them in combination with brand-new improvements to Git (like new partial clone filters) means that we'll always get to implement those improvements twice to bring libgit2 up to speed. It also means that people using the `git` client directly won't get performance benefits derived from having Git internal libraries replaced by purpose-built ones in certain contexts.
That said, if libgit2 already provides functionality and performance equivalent to git.git's in an appropriate wrapper, I'd be excited to pursue integrating that library into git.git's codebase directly.

The good news is that for practical near-term purposes, "libification" mostly means cleanups to the Git codebase, and continuing code health work that the project has already cared about doing:

- Removing references to global variables and instead piping them through arguments
- Finding and fixing memory leaks, especially in widely-used low-level code
- Reducing or removing `die()` invocations in low-level code, and instead reporting errors back to callers in a consistent way
- Clarifying the scope and layering of existing modules, for example by moving single-use helpers from the shared module's scope into the single user's scope
- Making module interfaces more consistent and easier to understand, including moving "private" functions out of headers and into source files and improving in-header documentation

Basically, if this effort turns out not to be fruitful as a whole, I'd like for us to still have left a positive impact on the codebase.

In the longer term, if Git has libraries with easily-replaced dependencies, we get a few more benefits:

- Unit tests. We already have some in t/helper/, but if we can replace all the dependencies of a certain library with simple stubs, it's easier for us to write comprehensive unit tests, in addition to the work we already do introducing edge cases in bash integration tests.
- If our users can use plugins to improve performance in specific scenarios (like a VFS-aware object store in the VFS case I cited above), then Git works better for them without having to adopt a different workflow, such as using an alternative tool or wrapper.
- An easy-to-understand modular codebase makes it easy for new contributors to start hacking and understand the consequences of their patch.
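
To make a few of the items above concrete (state passed through arguments instead of globals, errors returned to the caller instead of `die()`, and dependencies replaced by simple stubs), here is a minimal sketch in C. It is an illustration only and not code from git.git: `struct odb_backend`, `object_size()`, `stub_read_object()` and the object name "abc123" are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface; none of these names exist in git.git. */
struct odb_backend {
	/* Copies the object's bytes into buf; returns 0 on success, -1 on error. */
	int (*read_object)(void *state, const char *oid, char *buf, size_t len);
	/* Backend-private data, passed explicitly rather than kept in a global. */
	void *state;
};

/* "Library" code: reports failure to the caller instead of calling die(). */
int object_size(const struct odb_backend *odb, const char *oid, size_t *size)
{
	char buf[256];

	if (!odb || !odb->read_object)
		return -1;
	if (odb->read_object(odb->state, oid, buf, sizeof(buf)) < 0)
		return -1;
	*size = strlen(buf);
	return 0;
}

/* Stub backend for a unit test: it knows about exactly one object. */
int stub_read_object(void *state, const char *oid, char *buf, size_t len)
{
	const char *content = state;

	if (strcmp(oid, "abc123") != 0)
		return -1;
	if (strlen(content) + 1 > len)
		return -1;
	strcpy(buf, content);
	return 0;
}

/* What a unit test against the stub might look like. */
void test_object_size_with_stub(void)
{
	struct odb_backend stub = { stub_read_object, "hello" };
	size_t size = 0;

	assert(object_size(&stub, "abc123", &size) == 0);
	assert(size == 5);
	/* An unknown object is an error the caller can handle, not a die(). */
	assert(object_size(&stub, "missing", &size) == -1);
}
```

In this sketch the test never touches the filesystem: the function under test only sees the backend it was handed, so swapping the stub for a different implementation (say, a VFS-aware one) would not require changing `object_size()` itself.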

Of course, we haven't maintained any guarantee about the consistency of our implementation between releases. I don't anticipate that we'll write the perfect library interface on our first try. So I hope that we can be very explicit about refusing to provide any compatibility guarantee whatsoever between versions for quite a long time. On Google's end, that's well-understood and accepted. As I understand, some other projects already use Git's codebase as a "library" by including it as a submodule and using the code directly[6]; even a breakable API seems like an improvement over that, too.

So what's next? Naturally, I'm looking forward to a spirited discussion about this topic - I'd like to know which concerns haven't been addressed and figure out whether we can find a way around them, and generally build awareness of this effort with the community.

I'm also planning to send a proposal for a document full of "best practices" for turning Git code into libraries (and have quite a lot of discussion around that document, too). My hope is that we can use that document to help us during implementation as well as during review, and refine it over time as we learn more about what works and what doesn't. Having this kind of documentation will make it easy for others to join us in moving Git's codebase towards a clean set of libraries. I hope that, as a project, we can settle on some tenets that we all agree would make Git nicer.

From the rest of my own team, we're planning on working first on some limited-scope, low-level libraries so that we can all see how the process works. We're starting with strbuf.[ch] (as it's used everywhere with few or no dependencies and helps us guarantee string safety at API boundaries), config.[ch] (as many external tools are probably interested in parsing Git config formatted files directly), and a subset of operations related to the object store.
These starting points are intended to have a small impact on the codebase and teach us a lot about logistics and best practices while doing these kinds of conversions.

After that, we're still hoping to target low-level libraries first - I certainly don't think it will make sense to ship a high-level `git commit` library in the near future, if ever - in the order that they're required from the VFS project we're working closely with. As far as I can tell right now, that's likely to cover object store and worktree access, as well as commit creation and pushing, but we'll see how planning shakes out over the next month or so. But Google's schedule should have no bearing on what others in the Git project feel is important to clean up and libify, and if there is interest in the rest of the project in converting other existing modules into libraries, my team and I are excited to participate in the review.

Much, much later on, I'm expecting us to form a plan around allowing "plugins" - that is, replacing library functionality we use today with an alternative library, such as an object store relying on a distributed file store like S3. Making that work well will also likely involve us coming up with a solution for dependency injection, and to begin using vtables for some libraries. I'm hoping that we can figure out a way to do that that won't make the Git source ugly. Around this time, I think it will make sense to buy into unit tests even more and start using an approach like mocking to test various edge cases. And at some point, it's likely that we'll want to make the interfaces to various Git libraries consistent with each other, which would involve some large-scale but hopefully-mechanical refactors.

I'm looking forward to the discussion!
 - Emily

1: https://colabti.org/irclogger/irclogger_log/git-devel?date=2023-02-13#l29
2: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/docs/internal-fs-layout.md
3: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/git_config.py
4: https://github.com/git/git/blob/master/git-gui/git-gui.sh#L305
5: https://github.com/martinvonz/jj
6: https://github.com/glandium/git-cinnabar