Re: git gc & deleted branches

Jeremy Maitin-Shepard <jbms@xxxxxxx> · Fri, 09 May 2008 20:07:44 -0400

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Nicolas Pitre <nico@xxxxxxx> writes:
>> On Fri, 9 May 2008, Brandon Casey wrote:
>> 
>>> Unreferenced objects are sometimes used by other repositories which have
>>> this repository listed as an alternate. So it may not be a good idea to
>>> make the unreferenced objects inaccessible.
>> 
>> Nah.  If this is really the case then you shouldn't be running gc at all 
>> in the first place.

> True.

> I think the true motivation behind --keep-unreachable is not about the
> shared object store (aka "alternates") but about races between gc and
> push (or fetch).  Before push (or fetch) finishes and updates refs, the
> new objects they create would be dangling _and_ the objects these dangling
> objects refer to may be packed but unreferenced.  Repacking unreferenced
> packed objects was a way to avoid losing them.

I feel like the current approach of (not very well) keeping track of
which objects are still needed is very messy, not very well defined or
based on specific solid principles, and prone to errors and losing
objects.

Things like git clone --shared can only really be used in extremely
specialized setups, or if pruning of unreferenced objects is completely
disabled in the source repository, or if specialized scripts are used to
do the garbage collection that take into account the references of the
"child" repository.  It is my impression that even repo.or.cz, while it
has some safeguards, does not handle garbage collection completely
safely.  It would probably be very useful to have examples of such
scripts in contrib.

I think that ultimately, some general purpose and reliable solution
needs to be found to handle the cases of (1) a repository having its
objects referenced by another via info/alternates; (2) a repository with
multiple working directories (presumably this should warn/error out
unless given a force option/detach head and warn if you try to switch
HEAD for some working directory to the same branch as some other working
directory).  It seems, btw, that a third type of clone, one which merely
symlinks the objects directory, would also be useful, once there is a
solution to the robustness issue.  This would be a case (3) that needs
to be handled as well.

It seems clear that ultimately, to handle these three cases, every
repository needs to know about every other repository, probably via a
symlink to the other repository's .git directory.  Git gc would then also
examine any refs in this directory, making sure to avoid circular
references that might result from following the symlinks.  It should
also probably error out if it finds a symlink that doesn't point to a
valid git repository, because such a symlink either refers to a
now-deleted repository for which the symlink needs to be cleaned up, or
it refers to a repository that was moved and therefore the symlink needs
to be updated.  Simply ignoring invalid symlinks could result in pruning
objects that need to be kept for repositories that have moved.
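
A minimal sketch of the traversal described above, in Python.  It is
only an illustration: the directory name "linked-repos", the validity
check, and every function name here are assumptions, not anything git
implements.  It follows the symlinks, uses a visited set to avoid
circular references, and errors out on a link that does not point to
something that looks like a git directory.

```python
import os

LINK_DIR = "linked-repos"  # hypothetical directory name, not part of git


class BrokenLinkError(Exception):
    """Raised when a link does not point to a valid git directory."""


def looks_like_git_dir(path):
    # Rough validity check (an assumption): HEAD, objects/ and refs/ exist.
    return (os.path.isfile(os.path.join(path, "HEAD"))
            and os.path.isdir(os.path.join(path, "objects"))
            and os.path.isdir(os.path.join(path, "refs")))


def collect_linked_repos(git_dir):
    """Return the set of git directories reachable from git_dir via links.

    gc would then read the refs of every directory in the returned set.
    """
    seen = set()
    pending = [os.path.realpath(git_dir)]
    while pending:
        current = pending.pop()
        if current in seen:
            continue  # already visited, so circular links terminate here
        seen.add(current)
        link_dir = os.path.join(current, LINK_DIR)
        if not os.path.isdir(link_dir):
            continue
        for name in sorted(os.listdir(link_dir)):
            link = os.path.join(link_dir, name)
            target = os.path.realpath(link)
            if not looks_like_git_dir(target):
                # Refuse to continue rather than silently ignore the link:
                # the repository may have moved and still need its objects.
                raise BrokenLinkError("%s -> %s is not a git directory"
                                      % (link, target))
            pending.append(target)
    return seen
```

Raising instead of skipping is the point of the last argument above: a
gc built on this would refuse to prune until the stale link is removed
or fixed.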

It is extremely cumbersome to have to worry about whether there are
other concurrent accesses to the repository when running e.g. git gc.
For servers, you may never be able to guarantee that nothing else is
accessing the repository concurrently.  Here is a possible solution:

Each git process creates a log file of the references that it has
created.  The log file should be named in some way with e.g. the process
id and start time of the process, and simply consist of a list of
20-byte sha1 hashes to be considered additional in-use references for
the purpose of garbage collection.  The log file would be cleaned up
when the process exits, and would also be deleted by any instance of git
gc that notices a stale log file that doesn't correspond to a running
process.  To handle shell scripts that need to deal with git-hash-object
directly, git hash-object could be passed maybe a file descriptor or
filename of a log file to use instead of creating one.  Maybe the log
file format could be more complicated, and also support paths to
e.g. alternate index files to also consider for references.  Things
would need to be done so that race conditions do not occur, but I think
something like this would work.
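
To make the log file idea concrete, here is a rough Python sketch.  All
names are hypothetical (the "gc-inuse" directory, the "<pid>-<start
time>" file name, the functions), and it stores 40-character hex object
names, one per line, rather than raw 20-byte hashes, purely for
readability.  It is POSIX-only, it does not address the race conditions
mentioned above, and it does not check the start time against the
running process, so a reused pid would keep a stale log alive.

```python
import os
import time

LOG_DIR = "gc-inuse"  # hypothetical directory inside the git directory


def open_process_log(git_dir):
    """Create this process's log, named after its pid and start time."""
    log_dir = os.path.join(git_dir, LOG_DIR)
    os.makedirs(log_dir, exist_ok=True)
    name = "%d-%d" % (os.getpid(), int(time.time()))
    return open(os.path.join(log_dir, name), "a")


def record_object(log, hex_sha1):
    """Record an object as in use before it is reachable from any ref."""
    log.write(hex_sha1 + "\n")
    log.flush()
    os.fsync(log.fileno())  # make the entry visible to a concurrent gc


def process_is_running(pid):
    # POSIX only: signal 0 checks for existence without sending a signal.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def collect_logged_objects(git_dir):
    """gc side: return logged object names and delete stale log files."""
    log_dir = os.path.join(git_dir, LOG_DIR)
    in_use = set()
    if not os.path.isdir(log_dir):
        return in_use
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        try:
            pid = int(name.split("-")[0])
        except ValueError:
            continue  # not one of our log files, leave it alone
        if not process_is_running(pid):
            os.remove(path)  # stale log left behind by a dead process
            continue
        with open(path) as f:
            in_use.update(line.strip() for line in f if line.strip())
    return in_use
```

A gc built on this would treat the returned set as extra roots, in
addition to the refs, when deciding what may be pruned.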

-- 
Jeremy Maitin-Shepard