[TOPIC 7/8] Speeding up the connectivity check

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 29 Sep 2022 15:21:52 -0400



# Ideas for speeding up object connectivity checks in git-receive-pack
(Patrick Steinhardt)

- Patrick: At GitLab we've noticed some repositories have become slow
	with a lot of references.
- Initially I thought it was the object negotiation.
- The connectivity checks seem to be the culprit. The connectivity check
	is implemented quite naively. "git rev-list <commits> --not --all"
- A while ago I tried to optimize it. Stop walking when you reach an
	existing object. List feedback was that I had not gotten the semantics
	right, since that existing object is not necessarily pointed to by a
	ref and not everything it references is necessarily present in the
	repository.
- Trial 2: optimize rev-list. Sped up connectivity check by 60-70%! But
	the motivating repositories had grown, so it was still too slow.
- How do other people do it, with millions of references?
- Terry: Yes, we've seen this problem.
- Just the setup time for the revision walk takes a long time. Initially
	we weren't using reachability bitmaps; using those also helped. Then
	using subsets of references — first HEAD, then refs/heads/-, …
- Jrnieder: Initially we did this for reachability checks; do
	connectivity checks do it, too? Terry: Yes, Demetr did that.
- Demetr: When the connectivity check fails, we fall back to a fuller
	set of refs.
- Patrick: Many of our references are not even visible to the public. If
	Git makes configurable which refs are part of the connectivity check,
	that would already make things faster in our case.
- Taylor: Did you experiment with reachability bitmaps? Am I remembering
	correctly that they made things slower?
- Patrick: We did some experiments with reachability bitmaps, but not
	specifically for this problem. In those experiments they did make some
	things slower.
	 - Taylor: one thing that you could do is make the bitmap traversal by
		 building up a complete bitmap of the boundary between haves and
		 wants instead of a bitmap of all of the haves. Involves far less
		 object traversal, and there are some clever ways to do this.
	 - Taylor: As long as you can quickly determine the boundary between
		 the haves and wants for a given request, the connectivity check
		 should be as fast as you need.
- Peff: One difference between how Git and JGit does this is JGit is
	building up a structure of "what is reachable". You could persist a
	bitmap representing "here's everything that's reachable from this
	repository" and subtract that out; that would help with many cases.
	The problem with one reachability bitmap like that is that it goes
	stale whenever someone pushes something. But you could make a set of
	"pseudo-merge bitmaps" for each group of 10,000 refs or so. Especially
	if you're clever about which refs you do that for, that can be a
		significant improvement ("these 2,000,000 refs haven't been touched
		for a year, so I can use this bitmap and don't even have to examine
			them").
- Terry: There are two ways that unreachable objects appear. One is by
	branches being deleted or rewound. The other is failed pushes where
	objects were persisted and the ref update didn't succeed. I wanted to
	distinguish between objects that are "applied" and "unapplied", became
	very thorny. And with that, on a branch rewind, we can calculate what
	just became unreachable.
- Waleed Khan: Object negotiation with bloom filters paper
	 - Kleppman and Howard 2020
	 - Blog post
		 https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html
	 - Paper https://arxiv.org/abs/2012.00472
	 - Quote: "Unlike the skipping algorithm used by Git, our algorithm
		 never unnecessarily sends any commits that the other side already
		 has, and the Bloom filters are very compact, even for large commit
		 histories." Alternative ways to write to VFS-backed worktrees (e.g.
		 write($HASH) instead of write(<bytes>))
- Google tried to use the sparse-index to control what gets materialized
	by in a VFS-like system. What if we replaced the write() syscall +
	writing bytes and write a hash instead. It could save a lot, e.g. in
	an online IDE.
- Stolee: The "write" analog of FSMonitor for VFS-for-Git is the
	post-index-change hook. We suppress the writes by manipulating the
	index and then communicating the write back to the VFS.
- Jrnieder: This is bigger than just the "write()" call; if we have a
	git-aware filesystem, we can be much less wasteful than we are today.
	E.g. FSMonitor can only make `git status` so fast because we still
	have to stat(), but with a VFS, we could ask the VFS what has changed.
- Do we think this is useful for anything other than a VFS? Do we still
	want this even if it's only useful for VFS?
- Stolee: We could make FSMonitor more git-aware e.g. it doesn't know
	about writes that we make. JeffH: We also can't say "ignore .o files,
	i only care about source files". That also helps greatly. Writing a
	hash instead of the contents is probably about as expensive, most of
	the savings are in avoiding the stat() calls. This also sounds racy.
- Emily: Another reason to do this work is that this is a good jumping
	off point to libify the Git internals. Is there any reason not to do
	that? Jrnieder: To make this concrete: do you mean for example
	creating a worktree.h with a vtable of worktree operations and having
	things talk to that instead of the FS? Emily: Yes. Things like that
	are the reason why we have libgit2, so what if Git could just ship its
	own library. Bmc: Libgit2 is used in lots of places like the Rust
	package manager (Cargo). The problem is that Git is GPLv2, which is
	not usable by lots of folks.
- Stolee: The stance of the Git project is that the API is the CLI, not
	individual files. But I think this is a good thing to have for the
	project as a whole, even just internally.
- We can finally have unit tests!!??!