(Presenter: Taylor Blau, Notetaker: Karthik Nayak)

* Things on my mind!
  * There's been a bunch of work from the forges over the last few years - bitmaps, commit-graphs, etc.
  * Q: What should we do next? Curious to hear from everyone, including Keanen's team.
* Boundary-based bitmap traversals, already spoke about it last year. Helps when you have lots of tips that you're excluding from the rev-list query. Backlog item to check the performance of this.
  * Patrick: Still not activated in production. Faced some issues the last time it was activated. We do plan to experiment with this (https://gitlab.com/gitlab-org/gitaly/-/issues/5537).
  * Taylor: Curious about the impact.
  * In almost all cases they perform better, in some cases equal, and in very few worse.
* (Jonathan Nieder) Two open-ended questions:
  * Different forges run into the same problems. Maybe it's worth comparing notes. Do we have a good way to do this? In the Git Discord there is a server-operator channel, but it only has two messages.
    * Taylor and Patrick have conversations about this over email.
    * Keanen: We used to have a quarterly meeting, but attendance is low.
    * From an opportunistic perspective, when people want to do this, it currently seems like 1:1 conversations take place, but there hasn't been a wider-group forum.
    * A server-operator monthly might be fun to revive.
    * The Git Contributor Summit is where this generally happens. :)
  * At the last Git Merge there was a talk by Stolee about Git as a database and how, as a user, that can guide you in scaling. Potential roadmap for how a Git server could do some of that automatically. Potential idea? For example, sharding by time? Like gc automatically generating a pack to serve shallow clones for recent history.
* Extending the cruft-pack implementation to more organically have a threshold on the number of bytes. The current scheme of rewriting the entire cruft pack might not be the best for big repos.
  * Patrick: We currently have such a mechanism for geometric repacking.
* (Taylor Blau) Geometric repacking was done a number of years ago, to more gradually compress the repository from many packfiles down to few. We still have periodic cases where the repository is reduced to 2 packs: one cruft pack and one with the reachable objects. If you had some set of packs which contained disjoint objects (no duplicates), could we extend verbatim pack reuse to work with these multiple packs? Has anyone had similar issues? (See the command sketch after this list.)
  * Jonathan: One problem is knowing whether a pack has a non-redundant reachable object or not, without worrying about things like TTLs. In Git there is "push quarantine" code: if the hook rejects a push, its objects don't get added to the repo. In JGit there is nothing similar yet, so someone could push a bunch of objects which get stored even though they're rejected by a pre-receive hook. That can end up producing packs full of unreachable objects. With history rewriting we also run into complexity around knowing which packs are "live".
  * Patrick: Deterministically pruning objects from the repository is hard to solve. In GitLab it's a problem where replicas of the repository contain objects which probably need to be deleted.
  * Jeff H: Can we have a classification of refs that makes this possible, wherein some refs are transient and some are long-term?
  * Jeff King: There are a bunch of heuristic inputs which can help with this, like how older objects have a lower chance of changing than newer ones.
  * Taylor: Order by recency, so older objects are in one bitmap and newer, more changeable ones could be one clump of bitmaps.
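For context on the traversal and repacking strategies discussed above, here is a rough sketch of the relevant stock Git commands and configuration; the ref names are placeholders, and the exact settings each forge runs are not captured in these notes.

```sh
# Boundary-based bitmap traversal: use bitmaps for the negated (boundary)
# tips instead of a full object walk.
git config pack.useBitmapBoundaryTraversal true
git rev-list --use-bitmap-index --count --objects newer-tip ^older-tip

# Geometric repacking: gradually collapse many packs into few, keeping pack
# sizes in a geometric progression instead of rewriting everything each time.
git repack --geometric=2 -d --write-midx

# Cruft packs: unreachable objects go into a separate pack with an
# expiration, rather than being exploded into loose objects.
git repack --cruft --cruft-expiration=2.weeks.ago -d

# Verbatim pack reuse, which the discussion above considers extending to
# multiple disjoint packs.
git config pack.allowPackReuse true
```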
* Minh: I have a question about Taylor's proposal of a single pack composed of multiple disjoint packs. The midx can notice duplicate objects. Does that help with knowing what can be streamed through?
  * Taylor: The pack-reuse code is a bit too naive at this point, but conceptually this would work. We already have tools for working with packs like this, but this does give more flexibility.
* Taylor: GitHub recently switched to merge-ort for test merges. Tremendous improvements, but it sometimes creates a bunch of loose objects. Could there be an option for merge-ort to side-step loose objects (write to fast-import or write a pack directly)? (See the sketch after this list.)
  * Things slow down when writing to the filesystem so much.
  * Jonathan Tan: One thing we've discussed is having support in Git for a pack handle representing a still-open packfile that you can append to and read from in the context of an operation.
  * Dscho: That sounds like the sanest thing to do. There's a robust invariant of needing an idx for the packfile in order to work with it efficiently, which requires the packfile to be closed. So there are some things to figure out there; I'm interested to follow it.
  * Junio: There was a patch sent to the list to restrict the streaming interface. I wonder if that moves in the opposite direction of what we're describing.
  * brian: In the SHA-256 work I noticed it currently only works on blobs. But I don't think adapting it to other object types would be a major departure. As long as we don't make the interop harder, I don't see a big problem with doing that. Conversion happens at pack-indexing time.
  * Elijah: Did I understand correctly that this produces a lot of cruft objects?
  * Dscho: Yes. We perform test merges and then no ref points to them.
  * Elijah: Nice. "git log --remerge-diff" similarly produces objects that don't need to be stored when it performs test merges; that code path is careful not to commit them to the object store. You might be able to reuse some of that code.
  * Dscho: Thanks! I'll take a look.
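For reference, a minimal sketch of the merge-ort pieces discussed above; the branch and commit names are placeholders, and the notes don't describe GitHub's actual setup, only the general mechanism.

```sh
# merge-ort test merge via plumbing: computes the merge without touching the
# worktree or index, but writes the resulting trees and merged blobs into the
# object store, which is where the loose-object churn discussed above comes from.
git merge-tree --write-tree topic-branch target-branch

# --remerge-diff redoes the merge in-core to show what a merge commit changed
# relative to the automatic merge result; those remerge objects are not
# committed to the main object store.
git log --remerge-diff -1 <merge-commit>
```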