Re: git gc & deleted branches

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> writes:

> Jeremy Maitin-Shepard <jbms@xxxxxxx> wrote:
>> It is extremely cumbersome to have to worry about whether there are
>> other concurrent accesses to the repository when running e.g. git gc.
>> For servers, you may never be able to guarantee that nothing else is
>> accessing the repository concurrently.  Here is a possible solution:
>> 
>> Each git process creates a log file of the references that it has
>> created.  The log file should be named in some way with e.g. the process
>> id and start time of the process, and simply consist of a list of
>> 20-byte sha1 hashes to be considered additional in-use references for
>> the purpose of garbage collection.

> I believe we partially considered that in the past and discarded it
> as far too complex implementation-wise for the benefit it gives us.

It doesn't seem all that complex, and I'd say that fundamentally it is
the _correct_ way to do things.  Being sloppy is always easier in the
short run, but it either leaves the system permanently broken or
results in a lot of "fixing up" work later.  I think almost all of the
work of handling these log files could be done without impacting much
of the code that calls the relevant APIs that would actually use the
log files.  I think the biggest impact would be on non-C code, but even
for that code, appropriate wrappers could be used to avoid having to
make many changes.
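To make the proposal concrete, here is a minimal sketch of the
log-file scheme in Python.  Every name in it (the "gc-logs" directory,
the "<pid>-<start_time>.log" file name, the function names) is a
hypothetical illustration of the idea from the quoted message, not an
actual git format:

```python
import os
import time

def write_gc_log(repo_dir, sha1_hex_list):
    """Record the refs a running process depends on, so that gc can
    treat them as additional roots.  The file name encodes the process
    id and start time, as proposed; the body is a concatenation of
    20-byte binary SHA-1 hashes."""
    log_dir = os.path.join(repo_dir, "gc-logs")   # hypothetical location
    os.makedirs(log_dir, exist_ok=True)
    name = "%d-%d.log" % (os.getpid(), int(time.time()))
    path = os.path.join(log_dir, name)
    with open(path, "wb") as f:
        for h in sha1_hex_list:
            f.write(bytes.fromhex(h))   # exactly 20 bytes per hash
    return path

def read_gc_roots(log_path):
    """Return the hex SHA-1s recorded in one log file; gc would union
    these with its normal reachability roots before pruning."""
    with open(log_path, "rb") as f:
        data = f.read()
    return [data[i:i + 20].hex() for i in range(0, len(data), 20)]
```

A gc run would scan the log directory, read every non-stale file, and
add the listed hashes to its set of reachable objects before pruning
anything.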

> The current approach of leaving unreachable loose objects around
> for 2 weeks is good enough.  Any Git process that has been running
> for 2 weeks while still not linking everything it needs into the
> reachable refs of that repository is already braindamaged and
> shouldn't be running anymore.

This sort of reasoning just leads to an inherently unreliable system.
Sure, two weeks might seem good enough for nearly all cases, but why
_shouldn't_ I be able to leave my editor open for two weeks before
typing in my commit message and finishing the commit, or wait for two
weeks in the middle of a rebase?  (It seems that in the new
implementation, temporary refs are created to do basically the same
thing as the log file I described.)  I could easily be typing up my
commit message, switch to something else, and happen not to come back
to it for two weeks.

Because such a "timeout"-based approach works most of the time without
being the correct solution, its potential problems won't be noticed
during testing.

Another significant issue is that this timeout means that unreferenced
junk has to stay around in the repository for two weeks for no (good)
reason.

> If we are dealing with a pack file, those are protected by .keep
> "lock files" between the time they are created on disk and the
> time that the git-fetch or git-receive-pack process has finished
> updating the refs to anchor the pack's contents as reachable.
> Every once in a while a stale .keep file gets left behind when a
> process gets killed by the OS, and its damn annoying to clean up.

> I'd hate to clean up logs from every little git-add or git-commit
> that aborted in the middle uncleanly.

First of all, merely exiting due to an error should not leave log
files behind; the only things that should are a kill -9 or a system
crash.  Second, by storing the process id and a timestamp of when the
log file was created, it is possible to reliably determine whether a
log file is stale.
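One way that staleness check could work is sketched below, assuming
log files are named "<pid>-<create_time>.log" (a hypothetical
convention for illustration, not an actual git mechanism):

```python
import errno
import os

def log_is_stale(log_path):
    """Decide whether a log file was left behind by a dead process.
    Assumes the file is named "<pid>-<create_time>.log" (a hypothetical
    convention).  Signal 0 tests for process existence without actually
    sending a signal."""
    stem = os.path.basename(log_path).rsplit(".", 1)[0]
    pid_s, created_s = stem.split("-")
    pid = int(pid_s)
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return True    # no such process: definitely stale
        # EPERM: the process exists but belongs to another user
    # The pid is live, but pids get recycled.  The recorded creation
    # timestamp (created_s) lets a more careful check compare it against
    # the live process's actual start time (e.g. from /proc/<pid>/stat
    # on Linux) to detect reuse; here we conservatively treat the file
    # as still in use.
    return False
```

A gc run would delete any log file for which this returns true before
computing reachability, so stale files never accumulate the way
leftover .keep files do.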

-- 
Jeremy Maitin-Shepard
