[TOPIC 4/8] Commit `--filter`s

# git clone --filter=commit:0 (jonathantanmy)

- Partial clone that can omit commits (we already support omitting trees
	and blobs)
- Pros:
	 - Don't need all commits, save network and disk I/O. There's a repo
		 at Google that grows so quickly that having just commits is too
		 much
- Cons
	 - Git assumes that all of the commits are present locally; very
		 pervasive assumption.
			- Blobs don't have outlinks, so they're not a problem. Tree depth
				is somewhat limited. Commits go all the way back to the
				beginning of the repo
			- `git bisect` can't work without commits
			- Lose out on optimizations like `fetch` skipping negotiator,
				commit graph generation numbers
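For orientation, the filters Git supports today, and the proposed one, can be sketched against a throwaway local repository (`commit:0` is only a proposal and does not work in any released Git):

```shell
# Throwaway "server" repo; real use would point at a remote URL.
git init --quiet filter-src
git -C filter-src config uploadpack.allowFilter true  # honor filters locally
git -C filter-src -c user.name=dev -c user.email=dev@example.com \
    commit --quiet --allow-empty -m "initial commit"

# Existing partial-clone filters: omit blobs, or omit trees and blobs.
git clone --quiet --no-local --filter=blob:none filter-src blobless
git clone --quiet --no-local --filter=tree:0   filter-src treeless

# The proposal under discussion would extend this to commits:
#   git clone --filter=commit:0 <url>    # not implemented
```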
- Has everyone else thought about this?
- Peff: Compare to shallow clone (create a "graft" and pretend that the
	commit has no more parents). How do we handle the continuous N + 1
	fetching? Jonathantanmy: Not a big issue, we can batch fetch. It's
	jumping around that's a problem
- Peff: What if the server sends the commit graph?
	 - Taylor: we could just send the generation number(s) of the parents
		 of the commits on the boundary of what you're sending.
	 - Emily: We can't verify it though, we'd have to just trust the
		 server
	 - Taylor: true, but that's the case even if you send the whole commit
		 graph, too
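For reference, generation numbers live in the commit-graph file, which can be written and checked locally (a sketch in a scratch repo):

```shell
# Scratch repo with a single commit
git init --quiet cg-demo
git -C cg-demo -c user.name=dev -c user.email=dev@example.com \
    commit --quiet --allow-empty -m "initial commit"

# Write the commit-graph (this is where generation numbers are stored),
# then verify its checksums and contents.
git -C cg-demo commit-graph write --reachable
git -C cg-demo commit-graph verify
```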
- Jrnieder: Partial clone - we know the server is there, so we still have a
	"full clone", but part of the "full clone" lives in the server. There
	are git services that don't need a full copy of the repo, e.g. for CI,
	we only need a view of the directories we're building.
- Two use cases for partial clone
	 - Shallow clone replacement: user only cares about a single commit
	 - Operation that involves history walking (e.g. git describe). We
		 might as well fetch all the commits (i.e., convert to tree or blob
		 filtering when we notice this operation). Are these operations
		 distinguishable?
- Rdamazio: What if all of the history walking happens only on the
	server? (e.g.  git blame). Jrnieder: For git blame specifically, that
	makes sense, but are you thinking of other things like git log?
	Rodrigo: Yes
- Johannes: That doesn't sound like it will scale. Stolee: At GitHub, we
	already run blame on the server. Rodrigo: At Google, we precompute
	that
- Terry: More and more things want "stateless" operations (don't care
	about history) - that's probably the majority of use cases. There's
	also a popular use case of "1 week/month" of history. It would be
	great to not pay the penalty of fetching all commits. Today, we only
	have shallow clone, which pretends that history is different from what
	it actually is, and it's very difficult to maintain this on the server
	(sending not enough objects, sending too many objects). But filters
	are much easier to maintain.
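Today's closest tool for the "1 week/month of history" case is a shallow clone, cut off by depth or by date (a local sketch; a real clone would use a remote URL):

```shell
# Source repo with two commits
git init --quiet shallow-src
for msg in first second; do
    git -C shallow-src -c user.name=dev -c user.email=dev@example.com \
        commit --quiet --allow-empty -m "$msg"
done

# Depth-based shallow clone: only the most recent commit is present.
git clone --quiet --no-local --depth=1 shallow-src shallow
git -C shallow rev-list --count HEAD    # prints 1

# Time-based variant, e.g. roughly one month of history:
#   git clone --shallow-since="1 month ago" <url>
```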
- Victoria: Is this a replacement for shallow clones then? Terry: Yes.
- Stolee: 2 technical areas of apprehension
	 - VFS for git tried to do this by having only the initial commit and
		 fetching later objects one-by-one. Didn't work at all, was very
		 slow.
	 - Treeless clones - when traversing the tree, we keep refetching the
		 tree when we traverse it.
- We would have to drastically rework how Git interacts with partial
	clones
- Taylor: Or we could teach the server to preempt the operations ("i'm
	going to run git log, send me the right things")
	 - Stolee: Or run it on the server
	 - Taylor: yes, that would be the other extreme approach to this
- Jrnieder: With treeless clones, we don't propagate the filter on the
	catch up fetch, and there are some code locations that assume that if
	we have a tree, we have all of its children. If anything, commit
	filters are even easier because we have nothing - so we can do "all or
	nothing"
	 - Stolee: I agree that it's simpler, but I still think it'll be
		 really slow. So either we need to do something much smarter than
		 object-by-object fetches, or to prevent users from running
		 problematic commands. We would eventually have to fix the problem
		 for treeless clones, so what if we start with full commit history,
		 but not all of the trees? We can fix that first before starting
		 on commit filters
	 - Jrnieder: I can see the need for all of the commits up until a
		 certain point in time, but I don't know if there's a need to solve
		 the general problem of omitting arbitrary commits e.g. jumping
		 around in bisect
- Rodrigo: We have some experience doing this with Mercurial at
	Google - we hide the full history; users know it exists, but they can
	refetch if they wish. Stolee/Peff: That sounds like reimplementing
	shallow clone.
- Taylor: Is there any other kind of filter other than commit:0?
	Jonathantanmy: No plans yet.
- Peff: Wouldn't you need to implement the general case to do batching
	of commits in "git log"? Jonathantanmy: Maybe not, we could e.g. reuse
	the shallow clone protocol.

# Managing ever-growing pack index sizes on servers
- Some repositories have over 15 years of history with 1000 active
	developers, so pack indices can be between 1 and 2 GB. The "GC pack"
	contains everything reachable from refs/heads/* and refs/tags/*
- Time-based slicing to allow smaller repositories: "remove" history
	from before a certain point. Done by taking a shallow clone and using
	that as the new repository.
- What about folks who are only interested in the last week's history?
- Pack repositories based on time-based slicing.  Moving back to older
	history can fall back to older packs as necessary.
- Some people, like documentation folks, don't need the entire history
	and might be fine with a more limited environment.
- Chromium repacks into three packs: one is a cruft/garbage pack, and
	the other two hold reachable objects. refs/heads is packed into one
	pack, and refs/changes (PR-equivalent) into the other.
- JGit doesn't have a reverse index yet
- Taylor: considering packing reverse index into main index.  The
	tension is that we need to make using multiple packs more flexible.
	Introduce bitmap chains when repacking to make things more stable and
	less expensive.
- Stolee: Consider older packs that are stable and only repack newer
	things.
- Peff: One reason not to have lots of packs on disk is missing out on
	deltas.  We could use thin packs on disk.
- Stolee: Future goal is to only include full delta chains in the stable
	packs.
- `git gc --aggressive` used to make really deep deltas and has been
	fixed to be less aggressive to avoid runtime performance costs. A
	depth between 10 and 50 shows real performance improvements, but the
	old default was like 200.
- The original numbers were picked randomly without measurement.
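The knob in question is `gc.aggressiveDepth` (newer releases default it to 50, alongside `gc.aggressiveWindow`); a sketch of pinning both explicitly:

```shell
git init --quiet gc-demo
git -C gc-demo -c user.name=dev -c user.email=dev@example.com \
    commit --quiet --allow-empty -m "initial commit"

# Cap the delta-chain depth and search window for this aggressive gc run.
git -C gc-demo -c gc.aggressiveDepth=50 -c gc.aggressiveWindow=250 \
    gc --aggressive --quiet
```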
- Patrick: GitLab maintenance architecture is evolving.  Each push is
	incremental (repack into one pack) or full repack (everything into one
	pack with deltas).
- Stable ordering for determining preferred objects (SHA ordering is not
	suitable).
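Those two GitLab maintenance modes roughly map onto git's stock repack invocations (a sketch; the real architecture adds scheduling and heuristics on top):

```shell
git init --quiet repack-demo
git -C repack-demo -c user.name=dev -c user.email=dev@example.com \
    commit --quiet --allow-empty -m "one"

# Incremental: pack loose objects into a new pack, keep existing packs.
git -C repack-demo repack -d

git -C repack-demo -c user.name=dev -c user.email=dev@example.com \
    commit --quiet --allow-empty -m "two"

# Full: everything into a single pack, recomputing deltas from scratch.
git -C repack-demo repack -adf
ls repack-demo/.git/objects/pack/*.pack
```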


