Hi Taylor! It was great seeing you at Review Club today :) If you'd like, the notes are available at: https://docs.google.com/document/d/14L8BAumGTpsXpjDY8VzZ4rRtpAjuGrFSRqn3stCuS_w/edit but that's optional, since a ground rule is that all important feedback should be sent to the mailing list. Taylor Blau <me@xxxxxxxxxxxx> writes: > This patch introduces a new multi-valued configuration option, > `gc.recentObjectsHook` as a means to mark certain objects as recent (and > thus exempt from garbage collection), regardless of their age. > > However, we have no convenient way to keep certain precious, unreachable > objects around in the repository, even if they have aged out and would > be pruned. Our options today consist of: > > - Point references at the reachability tips of any objects you > consider precious, which may be undesirable or infeasible. What I like about the hook is that it matches the desired use case quite well - if an object is precious but unreachable, it can't be because Git thinks it's precious. Rather, something else thinks it's precious, so the obvious fix is for Git to learn how to listen to that tool. Based off only what's in this commit message though, this feature doesn't seem sufficiently justified yet. In particular, it's not immediately clear how this fits in with the alternatives, especially pointing refs at precious objects (which is probably the most popular way people have been doing this). It would be useful to flesh out why keeping these extra refs are either "undesirable" or "infeasible". Presumably, you already have some idea of why this is the case for the GitHub 'audit log'. Another potential use case I can think of is client-side third party Git tools that implement custom workflows on top of a Git repo, e.g. git-branchless (https://github.com/arxanas/git-branchless) and jj (https://github.com/martinvonz/jj/, btw I contribute a little to jj too). Both of those tools reference commits that Git can consider unreachable (e.g. they support anonymous branches and "undo" operations), and so they both create custom refs to keep those commits alive. The number of refs isn't huge, though it is more than what a typical repo would see (maybe in the 100s), and ISTR that some users have reported that it's enough to noticeably slow down the "git fetch" negotiation. I don't think managing these refs is "infeasible", and there are workarounds to shrink the number of refs (e.g. creating an octopus merge of the commits you want and pointing a ref at it), but it would certainly be a lot easier to use the hook instead. I've cc-ed the maintainers of both projects here in case they have more to share. > +gc.recentObjectsHook:: I have a small preference to use "command" instead of "hook" to avoid confusion with Git hooks (I've already observed some of this confusion in off-list conversations). There's some precedent for "hook" in `uploadpack.packObjectsHook`, but that's a one-off (and it used to confuse me a lot too :P). > + When considering the recency of an object (e.g., when generating > + a cruft pack or storing unreachable objects as loose), use the > + shell to execute the specified command(s). Interpret their > + output as object IDs which Git will consider as "recent", > + regardless of their age. >From a potential user's POV, two things seem unclear: - What does "recenct" mean in this context? Does it just mean "precious"? - What objects do I need to list? Is it every object I want to keep or just the reachable tips? Documentation/git-gc.txt has some pointers: . Any object with modification time newer than the `--prune` date is kept, along with everything reachable from it. but it seems quite hard to discover this and put the pieces together. I wonder if we could make this easier by: - Replacing "recent" with "precious" (if this is really what we mean), and the fact that we use the recency check is an implementation detail. - Explicitly saying that everything reachable from the object is also kept. In the code changes, I noticed a few out-of-date references to "cruft tips", but everything else looked reasonable to me. > diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh > index 303f7a5d84..3ae61ca995 100755 > --- a/t/t5329-pack-objects-cruft.sh > +++ b/t/t5329-pack-objects-cruft.sh > @@ -739,4 +739,175 @@ test_expect_success 'cruft objects are freshend via loose' ' > ) > ' > > +test_expect_success 'gc.recentObjectsHook' ' > + git init repo && > + test_when_finished "rm -fr repo" && > + ( > + cd repo && > + > + # Create a handful of objects. > + # > + # - one reachable commit, "base", designated for the reachable > + # pack > + # - one unreachable commit, "cruft.discard", which is marked > + # for deletion > + # - one unreachable commit, "cruft.old", which would be marked > + # for deletion, but is rescued as an extra cruft tip > + # - one unreachable commit, "cruft.new", which is not marked > + # for deletion > + test_commit base && > + git branch -M main && > + > + git checkout --orphan discard && > + git rm -fr . && > + test_commit --no-tag cruft.discard && > + > + git checkout --orphan old && > + git rm -fr . && > + test_commit --no-tag cruft.old && > + cruft_old="$(git rev-parse HEAD)" && > + > + git checkout --orphan new && > + git rm -fr . && > + test_commit --no-tag cruft.new && > + cruft_new="$(git rev-parse HEAD)" && > + > + git checkout main && > + git branch -D discard old new && > + git reflog expire --all --expire=all && > + > + # mark cruft.old with an mtime that is many minutes > + # older than the expiration period, and mark cruft.new > + # with an mtime that is in the future (and thus not > + # eligible for pruning). > + test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" && > + test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" && > + > + # Write the list of cruft objects we expect to > + # accumulate, which is comprised of everything reachable > + # from cruft.old and cruft.new, but not cruft.discard. > + git rev-list --objects --no-object-names \ > + $cruft_old $cruft_new >cruft.raw && > + sort cruft.raw >cruft.expect && > + > + # Write the script to list extra tips, which are limited > + # to cruft.old, in this case. > + write_script extra-tips <<-EOF && > + echo $cruft_old > + EOF > + git config gc.recentObjectsHook ./extra-tips && > + > + git repack --cruft --cruft-expiration=now -d && > + > + mtimes="$(ls .git/objects/pack/pack-*.mtimes)" && > + git show-index <${mtimes%.mtimes}.idx >cruft && > + cut -d" " -f2 cruft | sort >cruft.actual && > + test_cmp cruft.expect cruft.actual && > + > + # Ensure that the "old" objects are removed after > + # dropping the gc.recentObjectsHook hook. > + git config --unset gc.recentObjectsHook && > + git repack --cruft --cruft-expiration=now -d && > + > + mtimes="$(ls .git/objects/pack/pack-*.mtimes)" && > + git show-index <${mtimes%.mtimes}.idx >cruft && > + cut -d" " -f2 cruft | sort >cruft.actual && > + > + git rev-list --objects --no-object-names $cruft_new >cruft.raw && > + cp cruft.expect cruft.old && > + sort cruft.raw >cruft.expect && > + test_cmp cruft.expect cruft.actual && > + > + # ensure objects which are no longer in the cruft pack were > + # removed from the repository > + for object in $(comm -13 cruft.expect cruft.old) > + do > + test_must_fail git cat-file -t $object || return 1 > + done > + ) Thanks for the comments. The tests are quite verbose, and I wouldn't be able to understand them otherwise. I thought it would be nice to have a test helper to check that the cruft pack contains the expected objects (replacing the "git show-index" + "cut | sort" + "test_cmp" recipe), but this test fits in with the existing style, so I don't consider that a blocker.