(Presenter: Taylor Blau, Notetaker: Karthik Nayak)

* Things on my mind!
  * There's been a bunch of work from the forges over the last few years - bitmaps, commit-graphs, etc.
  * Q: What should we do next? Curious to hear from everyone, including Keanen's team.
* Boundary-based bitmap traversals, already spoke about it last year. Helps when you have lots of tips that you're excluding from the rev-list query. Backlog item to check the performance of this.
  * Patrick: Still not activated in production. Faced some issues the last time it was activated. We do plan to experiment with this (https://gitlab.com/gitlab-org/gitaly/-/issues/5537).
  * Taylor: Curious about the impact.
  * In almost all cases they perform better, in some cases equal, and in very few worse.
* (Jonathan Nieder) Two open-ended questions:
  * Different forges run into the same problems. Maybe it's worth comparing notes. Do we have a good way to do this? In the Git Discord there is a server-operator channel, but it only has two messages.
    * Taylor and Patrick have conversations about this over email.
    * Keanen: We used to have a quarterly meeting, but attendance is low.
    * From an opportunistic perspective, when people want to do this, it currently seems like 1:1 conversations take place, but there hasn't been a wider-group forum.
    * A server-operator monthly might be fun to revive.
    * The Git Contributor Summit is where this generally happens. :)
  * At the last Git Merge there was a talk by Stolee about Git as a database and how, as a user, that can guide you in scaling. Potential roadmap for how a Git server could do some of that automatically. Potential idea? For example, sharding by time? Like gc automatically generating a pack to serve shallow clones for recent history.
* Extending the cruft-pack implementation to more organically have a threshold on the number of bytes. The current scheme of rewriting the entire cruft pack might not be the best for big repos.
  * Patrick: We currently have such a mechanism for geometric repacking.
* (Taylor Blau) Geometric repacking was done a number of years ago, to more gradually compress the repository from many packfiles down to few. We still have periodic cases where the repository is reduced to 2 packs: one cruft pack and one with the reachable objects. If you had some set of packs which contained disjoint objects (no duplicates), could we extend verbatim pack reuse to work with these multiple packs? Has anyone had similar issues? (See the command sketch after this list.)
  * Jonathan: One problem is knowing whether a pack has a non-redundant reachable object or not, without worrying about things like TTLs. In Git there is "push quarantine" code: if the hook rejects a push, its objects don't get added to the repo. In JGit there is nothing similar yet, so someone could push a bunch of objects which get stored even though they're rejected by a pre-receive hook. That can end up producing packs full of unreachable objects. With history rewriting we also run into complexity around knowing which packs are "live".
  * Patrick: Deterministically pruning objects from the repository is hard to solve. In GitLab it's a problem where replicas of the repository contain objects which probably need to be deleted.
  * Jeff H: Can we have a classification of refs that makes this possible, wherein some refs are transient and some are long-term?
  * Jeff King: There are a bunch of heuristic inputs which can help with this, like how older objects have a lower chance of changing than newer ones.
  * Taylor: Order by recency, so older objects are in one bitmap and newer, more changeable ones could be one clump of bitmaps.
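For context on the traversal and repacking strategies discussed above, here is a rough sketch of the relevant stock Git commands and configuration; the ref names are placeholders, and the exact settings each forge runs are not captured in these notes.

```sh
# Boundary-based bitmap traversal: use bitmaps for the negated (boundary)
# tips instead of a full object walk.
git config pack.useBitmapBoundaryTraversal true
git rev-list --use-bitmap-index --count --objects newer-tip ^older-tip

# Geometric repacking: gradually collapse many packs into few, keeping pack
# sizes in a geometric progression instead of rewriting everything each time.
git repack --geometric=2 -d --write-midx

# Cruft packs: unreachable objects go into a separate pack with an
# expiration, rather than being exploded into loose objects.
git repack --cruft --cruft-expiration=2.weeks.ago -d

# Verbatim pack reuse, which the discussion above considers extending to
# multiple disjoint packs.
git config pack.allowPackReuse true
```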
* Minh: I have a question about Taylor's proposal of a single pack composed of multiple disjoint packs. The midx can notice duplicate objects. Does that help with knowing what can be streamed through?
  * Taylor: The pack-reuse code is a bit too naive at this point, but conceptually this would work. We already have tools for working with packs like this, but this does give more flexibility.
* Taylor: GitHub recently switched to merge-ort for test merges. Tremendous improvements, but it sometimes creates a bunch of loose objects. Could there be an option for merge-ort to side-step loose objects (write to fast-import or write a pack directly)? (See the sketch after this list.)
  * Things slow down when writing to the filesystem so much.
  * Jonathan Tan: One thing we've discussed is having support in Git for a pack handle representing a still-open packfile that you can append to and read from in the context of an operation.
  * Dscho: That sounds like the sanest thing to do. There's a robust invariant of needing an idx for the packfile in order to work with it efficiently, which requires the packfile to be closed. So there are some things to figure out there; I'm interested to follow it.
  * Junio: There was a patch sent to the list to restrict the streaming interface. I wonder if that moves in the opposite direction of what we're describing.
  * brian: In the SHA-256 work I noticed it currently only works on blobs. But I don't think adapting it to other object types would be a major departure. As long as we don't make the interop harder, I don't see a big problem with doing that. Conversion happens at pack-indexing time.
  * Elijah: Did I understand correctly that this produces a lot of cruft objects?
  * Dscho: Yes. We perform test merges and then no ref points to them.
  * Elijah: Nice. "git log --remerge-diff" similarly produces objects that don't need to be stored when it performs test merges; that code path is careful not to commit them to the object store. You might be able to reuse some of that code.
  * Dscho: Thanks! I'll take a look.
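For reference, a minimal sketch of the merge-ort pieces discussed above; the branch and commit names are placeholders, and the notes don't describe GitHub's actual setup, only the general mechanism.

```sh
# merge-ort test merge via plumbing: computes the merge without touching the
# worktree or index, but writes the resulting trees and merged blobs into the
# object store, which is where the loose-object churn discussed above comes from.
git merge-tree --write-tree topic-branch target-branch

# --remerge-diff redoes the merge in-core to show what a merge commit changed
# relative to the automatic merge result; those remerge objects are not
# committed to the main object store.
git log --remerge-diff -1 <merge-commit>
```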