# Ideas for speeding up object connectivity checks in git-receive-pack (Patrick Steinhardt) - Patrick: At GitLab we've noticed some repositories have become slow with a lot of references. - Initially I thought it was the object negotiation. - The connectivity checks seem to be the culprit. The connectivity check is implemented quite naively. "git rev-list <commits> --not --all" - A while ago I tried to optimize it. Stop walking when you reach an existing object. List feedback was that I had not gotten the semantics right, since that existing object is not necessarily pointed to by a ref and not everything it references is necessarily present in the repository. - Trial 2: optimize rev-list. Sped up connectivity check by 60-70%! But the motivating repositories had grown, so it was still too slow. - How do other people do it, with millions of references? - Terry: Yes, we've seen this problem. - Just the setup time for the revision walk takes a long time. Initially we weren't using reachability bitmaps; using those also helped. Then using subsets of references — first HEAD, then refs/heads/-, … - Jrnieder: Initially we did this for reachability checks; do connectivity checks do it, too? Terry: Yes, Demetr did that. - Demetr: When the connectivity check fails, we fall back to a fuller set of refs. - Patrick: Many of our references are not even visible to the public. If Git makes configurable which refs are part of the connectivity check, that would already make things faster in our case. - Taylor: Did you experiment with reachability bitmaps? Am I remembering correctly that they made things slower? - Patrick: We did some experiments with reachability bitmaps, but not specifically for this problem. In those experiments they did make some things slower. - Taylor: one thing that you could do is make the bitmap traversal by building up a complete bitmap of the boundary between haves and wants instead of a bitmap of all of the haves. Involves far less object traversal, and there are some clever ways to do this. - Taylor: As long as you can quickly determine the boundary between the haves and wants for a given request, the connectivity check should be as fast as you need. - Peff: One difference between how Git and JGit does this is JGit is building up a structure of "what is reachable". You could persist a bitmap representing "here's everything that's reachable from this repository" and subtract that out; that would help with many cases. The problem with one reachability bitmap like that is that it goes stale whenever someone pushes something. But you could make a set of "pseudo-merge bitmaps" for each group of 10,000 refs or so. Especially if you're clever about which refs you do that for, that can be a significant improvement ("these 2,000,000 refs haven't been touched for a year, so I can use this bitmap and don't even have to examine them"). - Terry: There are two ways that unreachable objects appear. One is by branches being deleted or rewound. The other is failed pushes where objects were persisted and the ref update didn't succeed. I wanted to distinguish between objects that are "applied" and "unapplied", became very thorny. And with that, on a branch rewind, we can calculate what just became unreachable. - Waleed Khan: Object negotiation with bloom filters paper - Kleppman and Howard 2020 - Blog post https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html - Paper https://arxiv.org/abs/2012.00472 - Quote: "Unlike the skipping algorithm used by Git, our algorithm never unnecessarily sends any commits that the other side already has, and the Bloom filters are very compact, even for large commit histories." Alternative ways to write to VFS-backed worktrees (e.g. write($HASH) instead of write(<bytes>)) - Google tried to use the sparse-index to control what gets materialized by in a VFS-like system. What if we replaced the write() syscall + writing bytes and write a hash instead. It could save a lot, e.g. in an online IDE. - Stolee: The "write" analog of FSMonitor for VFS-for-Git is the post-index-change hook. We suppress the writes by manipulating the index and then communicating the write back to the VFS. - Jrnieder: This is bigger than just the "write()" call; if we have a git-aware filesystem, we can be much less wasteful than we are today. E.g. FSMonitor can only make `git status` so fast because we still have to stat(), but with a VFS, we could ask the VFS what has changed. - Do we think this is useful for anything other than a VFS? Do we still want this even if it's only useful for VFS? - Stolee: We could make FSMonitor more git-aware e.g. it doesn't know about writes that we make. JeffH: We also can't say "ignore .o files, i only care about source files". That also helps greatly. Writing a hash instead of the contents is probably about as expensive, most of the savings are in avoiding the stat() calls. This also sounds racy. - Emily: Another reason to do this work is that this is a good jumping off point to libify the Git internals. Is there any reason not to do that? Jrnieder: To make this concrete: do you mean for example creating a worktree.h with a vtable of worktree operations and having things talk to that instead of the FS? Emily: Yes. Things like that are the reason why we have libgit2, so what if Git could just ship its own library. Bmc: Libgit2 is used in lots of places like the Rust package manager (Cargo). The problem is that Git is GPLv2, which is not usable by lots of folks. - Stolee: The stance of the Git project is that the API is the CLI, not individual files. But I think this is a good thing to have for the project as a whole, even just internally. - We can finally have unit tests!!??!