[TOPIC 02/12] Libification Goals and Progress

Taylor Blau <me@xxxxxxxxxxxx> · Mon, 2 Oct 2023 11:18:15 -0400



(Presenter: Emily Shaffer, Notetaker: Taylor Blau)

* The effort is to isolate some parts of Git into smaller, independently
  buildable libraries. Can unit test it, swap out implementations, etc.
* Calvin Wan has been working on extracting a common set of interfaces, refining
  the types, etc. This is in pursuit of a "standard library" implementation for
  Git. Close to being shippable.
* Josh Steadmon spent some time in the second half of the year suggesting a unit
  testing framework in order to test the library interfaces beyond our standard
  shell tests.
* Goals:
   * Google has a couple of ways to proceed with their libification effort.
     Community input is solicited:
      * Interfaces for VFS / callable by IDE integration to avoid shelling out
      * Target libification for the sake of Git itself. Code clean-up, making
        the code more predictable / testable. Example being submodules, which
        are messy and difficult to reason about. References backend, etc.
* Is there an appetite for libification? Is there some particular component that
  would especially benefit from clean-up, being made more testable,
  hot-swappable, etc.?
* (From Emily's comment above) If others are implementing the basic references
  backend via a different implementation, how do we make sure that we are
  building compatible parts? Goal would be to have Git's unit tests pass against
  a different API.
* (Patrick Steinhardt) For reference backends especially: would like to be able
  to split between "policy" and "mechanism". This would avoid the issue
  discussed in the last session where different e.g. refs backend
  implementations have different behavior.
   * Emily: white-box tests for the API to make sure that different
     implementations meet the policy
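
A minimal sketch of the policy/mechanism split described above, written in
Python for brevity rather than in Git's C. Every name here (`RefMechanism`,
`RefStore`, `check_policy`) is hypothetical, and the two policy rules (names
must start with `refs/`, updates may carry an expected old value) are
simplified stand-ins, not Git's actual ref rules. The point is only that policy
lives in one shared layer and each mechanism is checked by the same tests.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional, Tuple


class RefMechanism(ABC):
    """Storage only: no validation and no policy decisions (hypothetical)."""

    @abstractmethod
    def read(self, name: str) -> Optional[str]:
        """Return the stored object ID for name, or None if absent."""

    @abstractmethod
    def write(self, name: str, oid: str) -> None:
        """Store oid under name, overwriting any previous value."""


class DictMechanism(RefMechanism):
    """Keeps refs in a dictionary."""

    def __init__(self) -> None:
        self._refs: Dict[str, str] = {}

    def read(self, name: str) -> Optional[str]:
        return self._refs.get(name)

    def write(self, name: str, oid: str) -> None:
        self._refs[name] = oid


class LogMechanism(RefMechanism):
    """Keeps an append-only log; the newest entry for a name wins."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, str]] = []

    def read(self, name: str) -> Optional[str]:
        for logged_name, oid in reversed(self._log):
            if logged_name == name:
                return oid
        return None

    def write(self, name: str, oid: str) -> None:
        self._log.append((name, oid))


class RefStore:
    """Policy layer shared by every mechanism."""

    def __init__(self, mechanism: RefMechanism) -> None:
        self._mechanism = mechanism

    def resolve(self, name: str) -> Optional[str]:
        return self._mechanism.read(name)

    def update(self, name: str, new_oid: str,
               expected_old: Optional[str] = None) -> None:
        # Simplified policy rule 1: only names under refs/ are accepted.
        if not name.startswith("refs/"):
            raise ValueError(f"invalid ref name: {name}")
        # Simplified policy rule 2: an optional compare-and-swap check.
        if expected_old is not None and self.resolve(name) != expected_old:
            raise ValueError(f"{name} is not at {expected_old}")
        self._mechanism.write(name, new_oid)


def check_policy(make_mechanism: Callable[[], RefMechanism]) -> None:
    """Run the same policy assertions against a fresh mechanism."""
    store = RefStore(make_mechanism())
    store.update("refs/heads/main", "1" * 40)
    assert store.resolve("refs/heads/main") == "1" * 40
    store.update("refs/heads/main", "2" * 40, expected_old="1" * 40)
    assert store.resolve("refs/heads/main") == "2" * 40
    for bad_call in (
        lambda: store.update("HEAD~1", "3" * 40),
        lambda: store.update("refs/heads/main", "3" * 40,
                             expected_old="1" * 40),
    ):
        try:
            bad_call()
        except ValueError:
            continue
        raise AssertionError("policy violation was accepted")


if __name__ == "__main__":
    for mechanism in (DictMechanism, LogMechanism):
        check_policy(mechanism)
```

Because `check_policy` takes a factory, a new mechanism only needs to be added
to the loop to be held to the same policy.
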
* (Jonathan Nieder) For reference backends in particular, the current
  implementation has an odd "layering" scheme - packed-refs today is an
  incomplete backend using the same interface as the complete "loose and packed
  refs" backend, so it serves as a mechanism without fulfilling the policy
  requirements. The approach above seems like a positive change.
* (Emily) Are also looking into a similar project around the object store, but
  have found that it is deeply intertwined throughout the rest of the code base.
  Difficult to reason about, even without a library interface. Can we make any
  given change safely?
   * Hunch is that it is still useful to target that sort of thing, even if
     there aren't clear boundaries.
   * In the interim, can still be part of the same compilation unit, just
     creating a clearer boundary.
* (Emily) For hosting providers and others building things on top of git, are
  there parts of git functionality that you'd like to have libified so you can
  get benefits without having to wait for feature lag?
* (brian) Not interested in using Git as a library in GitHub's codebase because
  of license incompatibility. Would like to experiment with different algorithms
  for packing and delta-fication in Rust as a highly parallel system. Would be
  nice to be able to swap in something that is C-compatible. Has made changes
  in libgit2 that caused libgit2 to segfault, and doesn't want to write more
  segfaults.
* (Taylor) There's an effort going on in GitHub to reduce our dependency on
  libgit2, precisely for the feature lag reason Emily mentions. I don't think
  we're planning on using Git as a library soon; we rely on the Git
  command-line interface through fork/exec.
* (Emily) Is licensing the only obstacle to using Git as a library, or are there
  other practical concerns?
* (Jeff Hostetler) Pulled libgit2-sharp out of Visual Studio. Issues with
  crashing, but also running into classic issues with large repositories.
  Memory consumption was a real issue at the time. Safer to have memory
  segmented across multiple processes so that processes can't touch other
  processes' memory space.
* (Emily) Interesting: thinks that performance overhead would outweigh the
  memory issues.
* (Patrick) To reiterate from GitLab's point of view: we are in the same boat as
  Microsoft and GitHub. Have used libgit2 extensively in the past, but were able
  to drop support last month. No plans to use Git as a library in the future.
  Having a process boundary is useful, avoids memory leaks, bugs in Git spilling
  out to GitLab. Still have an "upstream-first" policy. Benefits everybody by
  spreading the maintenance burden and ensuring that others can benefit from
  such functionality.
* (Emily) If we had the capacity to write portions of Git's code in Rust (memory
  safety, performance, use it as a library), would we want to use it?
   * (Junio) I notice in the participant list people like Randall who work on
     NonStop. I'd worry about the effect on minority stakeholders, portability.
   * (Junio) Not fundamentally opposed to the direction.
* (Elijah) did not parallelize the C implementation of the new ORT backend.
  Wanted to rewrite it in Rust, cleaned up headers as a side-effect, and looked
  at other bits. Merge backends are already pluggable, could have a "normal" one
  in addition to a Rust backend.
* (Emily) If we already have something in C that establishes an existing API
  boundary, that makes it more tenable to rewrite it in Rust. Could say that the
  C version is deprecated and make future changes to Rust.
* (brian) Thinks they would be in favor of that; is personally happy to say that
  operating systems need to accept support for modern languages eventually. All
  of the main Debian architectures in use have Rust ports. They are portable to
  all of the main architectures. Would make it easier to do unit testing. Could
  add parallelization and optimization without worrying about race conditions,
  which would be a benefit. Is happy to implement unit tests with Rust's nicer
  ecosystem.
* (Taylor) Is it just NonStop?
* (Elijah) Randall mentioned that they have a contractual agreement that is
  supposed to expire at some point
  (https://lore.kernel.org/git/004601d8ed6b$13a2f580$3ae8e080$@nexbridge.com/).
  Could we have a transition plan that:
   * Keeps NonStop users happy until their contract expires.
   * Allows the rest of us to get up to speed with Rust.
* (Jonathan Nieder) doing this in a "self-contained module" mode with fallback C
  implementation gives us the opportunity to back out in the future (at least in
  the early periods while we're still learning).
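
The "self-contained module with a fallback C implementation" idea can be
sketched as a small backend registry. This is a Python illustration, not Git's
merge machinery: `BackendRegistry`, `reference_merge`, and the backend names are
hypothetical, and the "merge" is a toy three-way merge of single values. It
shows only the selection logic: an optional backend is used when it was built
in, and the reference implementation remains available otherwise.

```python
from typing import Callable, Dict, Tuple

# (base, ours, theirs) -> merged value; a toy stand-in for a real merge.
MergeFn = Callable[[str, str, str], str]


class MergeConflict(Exception):
    """Raised when both sides changed the value in different ways."""


def reference_merge(base: str, ours: str, theirs: str) -> str:
    """Reference backend: trivial three-way merge of one value."""
    if ours == theirs:
        return ours
    if ours == base:
        return theirs
    if theirs == base:
        return ours
    raise MergeConflict(f"{ours!r} vs {theirs!r}")


class BackendRegistry:
    """Maps backend names to implementations, with a guaranteed fallback."""

    def __init__(self, fallback_name: str, fallback: MergeFn) -> None:
        self._fallback_name = fallback_name
        self._backends: Dict[str, MergeFn] = {fallback_name: fallback}

    def register(self, name: str, backend: MergeFn) -> None:
        self._backends[name] = backend

    def select(self, preferred: str) -> Tuple[str, MergeFn]:
        """Return the preferred backend, or the fallback if it is absent."""
        if preferred in self._backends:
            return preferred, self._backends[preferred]
        return self._fallback_name, self._backends[self._fallback_name]


if __name__ == "__main__":
    registry = BackendRegistry("reference", reference_merge)
    # On a platform without the optional backend, nothing else is registered,
    # so asking for it quietly yields the reference implementation.
    name, merge = registry.select("experimental")
    print(name, merge("a", "a", "b"))
```

Keeping the fallback registered at construction time means backing out of the
optional backend is a matter of no longer registering it.
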
* (Jonathan Tan) back to process isolation: is the short lifetime of the process
  important?
* (Taylor Blau) seems like an impossible goal to be able to do multi-command
  executions in a single process, the code is just not designed for it.
* (Junio) is anybody using the `git cat-file --batch-command` mode that switches
  between batch and batch-check?
* (Patrick Steinhardt) they are longer lived, but only "middle" long-lived.
  GitLab limits the maximum runtime, on the order of ~minutes, at which point
  they are reaped.
* (Taylor Blau) lots of problems besides memory leaks would come up.
* (Jeff Hostetler) would be nice to keep memory-hungry components pinned across
  multiple command-equivalents.
* (Taylor Blau) same issue as reading configuration.
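
For readers unfamiliar with the mode Junio mentions, here is a rough Python
sketch of one long-lived `git cat-file --batch-command` process answering both
`info` and `contents` requests over a pipe. `CatFileSession` and
`parse_info_line` are hypothetical helper names; the sketch assumes a Git new
enough to support `--batch-command`, handles only the "missing" reply, and does
no error recovery.

```python
import subprocess
from typing import NamedTuple, Optional


class ObjectInfo(NamedTuple):
    oid: str
    type: str
    size: int


def parse_info_line(line: bytes) -> Optional[ObjectInfo]:
    """Parse "<oid> <type> <size>"; return None for "<name> missing"."""
    text = line.decode().rstrip("\n")
    if text.endswith(" missing"):
        return None
    oid, objtype, size = text.split(" ")
    return ObjectInfo(oid, objtype, int(size))


class CatFileSession:
    """One child process reused for many object lookups."""

    def __init__(self, repo_path: str) -> None:
        self._proc = subprocess.Popen(
            ["git", "-C", repo_path, "cat-file", "--batch-command"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def _send(self, command: str) -> None:
        self._proc.stdin.write(command.encode() + b"\n")
        self._proc.stdin.flush()

    def info(self, rev: str) -> Optional[ObjectInfo]:
        """Equivalent of a --batch-check query: header only."""
        self._send(f"info {rev}")
        return parse_info_line(self._proc.stdout.readline())

    def contents(self, rev: str) -> Optional[bytes]:
        """Equivalent of a --batch query: header, then the object body."""
        self._send(f"contents {rev}")
        header = parse_info_line(self._proc.stdout.readline())
        if header is None:
            return None
        data = self._proc.stdout.read(header.size)
        self._proc.stdout.read(1)  # consume the newline after the body
        return data

    def close(self) -> None:
        self._proc.stdin.close()
        self._proc.wait()


if __name__ == "__main__":
    session = CatFileSession(".")  # assumes the current directory is a repo
    print(session.info("HEAD"))
    session.close()
```

A caller would create one session per repository and reuse it across requests,
closing it once it has lived long enough, in the spirit of the runtime limit
Patrick describes.
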