Re: Consider adding pruning of refs to git maintenance

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 18 Dec 2024 07:35:01 -0800

Shubham Kanodia <shubham.kanodia10@xxxxxxxxx> writes:

>> ...
>> Thanks.

[administrivia: respond inline, trim out parts that do not have to
be read by bystanders to understand your response].

>> In any case, stepping back a bit, for the population of users who
>> benefit from enabling the prune-remote-refs task, wouldn't it be an
>> even better solution for them to set fetch.prune?  You can tell them
>> to run "git remote prune" just once, set that configuration
>> variable, and then the remote-tracking branches will stay clean from
>> then on.  Any future interactions with the remote make sure stale
>> remote-tracking branches will be removed automatically.  Wouldn't
>> that be a much better option?  I am sure I must be missing a use
>> case where fetch.prune (or remote.<name>.prune) is not a good idea
>> but background prune-remote-refs task works better.
>
> Let me expand on the context for suggesting this change:
>
> I work with a large repository that has over 50k refs, with about 4k
> new ones added weekly.
> We have maintenance scripts on our git server that clean up stale refs
> (unused older than N months).

> Using `fetch.prune` with a normal git fetch isn't ideal because it
> would cause git fetch to unnecessarily download many new refs that
> users don't need. So we actively discourage that.

This is what I did not quite understand.  What do your users
normally do to bring their repositories in sync with the remote, if
they are not running "git fetch"?

	Side note: it is very likely that your users are not
	directly running "git fetch", but using various
	front-ends like "git pull", "git pull --rebase", or even
	"repo", but they all at some point call "git fetch" to get
	the new objects and update refs.

Ah, are they using "git fetch origin +foo:refs/remotes/origin/foo",
i.e., selectively fetching only what they use and nothing
else (again, their wrappers may supply the refspec to do the
limiting)?  Now it slowly starts to make sense to me (sorry, I am
slow, especially without caffeine in the morning).

Am I following / guessing your set-up more or less correctly so far?

In any case, if your users are doing selective fetching, 50k refs or
4k ref turnover per week on the other side does not really matter.
Your users' desktop repositories won't see remote-tracking refs that
they didn't use and ask for.

But you are right that these selectively fetched refs will
accumulate unless pruned, and fetch.prune would not prune anything
when you run

	git fetch origin +foo:refs/remotes/origin/foo

because it will not prune what is outside the hierarchy the refspec
covers.  This is a deliberate design decision.

For "git fetch origin '+refs/heads/*:refs/remotes/origin/*'", which
is pretty much how "git clone" sets up the remotes, anything we have
in the refs/remotes/origin/ hierarchy that does not appear in their
current refs/heads/ hierarchy is pruned with fetch.prune=true.
But if you fetch selectively, either 'foo' exists (in which case it
won't be pruned), or 'foo' went away (in which case the fetch itself
fails before even pruning what is on our end), so fetch.prune may
not help.
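
To illustrate the difference, here is a minimal sketch.  The
throwaway repositories "server" and "desktop" and the branch 'bar'
are made up for the illustration; only the fetch command lines come
from the discussion above.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# Hypothetical "server" with two branches, foo and bar.
git -c init.defaultBranch=main init -q "$tmp/server"
git -C "$tmp/server" -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m initial
git -C "$tmp/server" branch foo
git -C "$tmp/server" branch bar

# Hypothetical "desktop" that fetches selectively, with fetch.prune set.
git -c init.defaultBranch=main init -q "$tmp/desktop"
git -C "$tmp/desktop" remote add origin "$tmp/server"
git -C "$tmp/desktop" config fetch.prune true
git -C "$tmp/desktop" fetch -q origin +refs/heads/foo:refs/remotes/origin/foo
git -C "$tmp/desktop" fetch -q origin +refs/heads/bar:refs/remotes/origin/bar

# Report whether a ref still exists in the desktop repository.
has_ref () {
	if git -C "$tmp/desktop" show-ref --verify --quiet "$1"
	then echo present
	else echo gone
	fi
}

# The server retires 'bar'.
git -C "$tmp/server" branch -q -D bar

# Selective fetch of 'foo': 'bar' is outside the refspec, so it stays.
git -C "$tmp/desktop" fetch -q origin +refs/heads/foo:refs/remotes/origin/foo
bar_after_selective=$(has_ref refs/remotes/origin/bar)

# Full-hierarchy fetch: the stale 'bar' is now within the refspec.
git -C "$tmp/desktop" fetch -q origin '+refs/heads/*:refs/remotes/origin/*'
bar_after_full=$(has_ref refs/remotes/origin/bar)

echo "after selective fetch: $bar_after_selective"
echo "after full fetch:      $bar_after_full"
```

The stale remote-tracking ref should survive the selective fetch and
disappear only after the fetch whose refspec covers the whole
hierarchy.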

And at least in the short term, periodically running "remote prune"
would be an acceptable workaround for such a workflow.
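
As a rough sketch of what such a periodic job would do (again with
made-up throwaway repositories "server" and "desktop" and a made-up
branch 'bar'; how the job gets scheduled is left out):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# Hypothetical set-up: the server has 'foo' and 'bar'.
git -c init.defaultBranch=main init -q "$tmp/server"
git -C "$tmp/server" -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m initial
git -C "$tmp/server" branch foo
git -C "$tmp/server" branch bar

# The desktop repository selectively fetched both at some point.
git -c init.defaultBranch=main init -q "$tmp/desktop"
git -C "$tmp/desktop" remote add origin "$tmp/server"
git -C "$tmp/desktop" fetch -q origin \
	+refs/heads/foo:refs/remotes/origin/foo \
	+refs/heads/bar:refs/remotes/origin/bar

# 'bar' is later retired on the server.
git -C "$tmp/server" branch -q -D bar

# What the periodic job would run.  It works from the fetch refspec
# configured for the remote, not from what was last fetched.
git -C "$tmp/desktop" remote prune origin

# Record which of the two remote-tracking refs are left.
foo_state=gone bar_state=gone
if git -C "$tmp/desktop" show-ref --verify --quiet refs/remotes/origin/foo
then foo_state=present
fi
if git -C "$tmp/desktop" show-ref --verify --quiet refs/remotes/origin/bar
then bar_state=present
fi
echo "foo: $foo_state, bar: $bar_state"
```

Running "git remote prune --dry-run origin" first would only list
what would be removed, which may be useful when trying this out on a
real repository.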

In the longer term, I suspect that a new option that lets you prune
your remote-tracking refs more aggressively, telling the tool
something like

  git fetch --prune-aggressive origin +refs/heads/foo:refs/remotes/origin/foo

to mean "I am only interested in getting the objects to complete
their current 'foo' branch, and get my remote-tracking ref for that
branch updated, BUT if you notice some ref in my refs/remotes/origin/*
that they do not have in their refs/heads/*, please prune it, even when
it is not 'foo' (which means normal --prune would not prune it)",
would not be a terrible idea.

It would be more involved than running "remote prune" periodically,
of course.

> In theory, users could just run `git remote prune` once and carefully
> avoid full fetches to keep their local ref count low.
> However, in practice, we've found that full fetches happen through
> various indirect means:
>
> - Shell plugins like zsh/pure
> - Git GUIs like Sourcetree
> - Code editors like VSCode
>
> among others.

And do any of these bypass the underlying "git fetch"?  If not, then one
easier solution is to accept that somebody will do the regular

	refs/heads/*:refs/remotes/origin/*

full-fetch *anyway*.  Once we accept that this will happen, we can
tell "git fetch" to always prune.  Then when these "various indirect
means" attempt their full fetch, the "git fetch" they invoke would
still honor fetch.prune.  They would try to keep these 50k refs in
sync with the remote, which means you may see 4k new refs per week,
but the refs that got retired would be seen as stale and get removed
from the users' repositories.

> - If full `git fetch` is completely avoided, this will gradually
> reduce the local ref count from tens of thousands to just a few
> hundred active refs (even if the remote has 50k+ active refs) as old
> branches on the remote expire with time.

Yes.  You'd somehow need to arrange these third-party tools not to
fetch too much unneeded cruft.

> - Even if not —say, if an errant tool or the developer executes `git
> fetch` mistakenly, then the maintenance job ensures this doesn't
> become their permanent state until the next manual remote prune.

For the latter case, fetch.prune=true would be the ideal solution, I
would imagine.  The "errant tool"'s 'git fetch' would prune the
stale ones.
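
Setting that up is a one-time configuration change.  A minimal
sketch, using a scratch repository so that it can be run anywhere
(the remote name "origin" in the per-remote variant is an
assumption):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
git -c init.defaultBranch=main init -q "$repo"

# Prune on every fetch, whichever remote is being fetched from.
git -C "$repo" config fetch.prune true

# Or limit the behaviour to one remote; this variable takes
# precedence over fetch.prune for that remote.
git -C "$repo" config remote.origin.prune true

# Show what "git fetch" will see.
git -C "$repo" config --get fetch.prune
git -C "$repo" config --get remote.origin.prune
```

In a real set-up you would run the "git config" commands in the
users' existing repositories, or use "git config --global" to cover
all of them.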

Now, the documentation should explain when this "periodically running
remote prune" is an acceptable workaround and/or a useful solution,
relative to setting fetch.prune, as most parts of the existing
documentation assume that the users, the intended audience of the
document, are using the bog-standard "git clone" result, which copies
all their branches to remote-tracking branches.

Thanks.