Fixing the git-repack replacement gap?

Martin Fick <mfick@xxxxxxxxxxxxxx> · Tue, 18 Jun 2013 10:52:50 -0600

I have been trying to think of ways to fix git-repack so 
that it no longer momentarily makes the objects in a repo 
inaccessible to all processes when it replaces packfiles 
with the same objects in them as an already existing pack 
file.  To be more explicit, I am talking about the way it 
moves the existing pack file (and index) to old-<sha1>.pack 
before moving the new packfile in place.  During this moment 
in time the objects in that packfile are simply not 
available to anyone using the repo.  This can be 
particularly problematic for busy servers.
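
To make the gap concrete, here is a rough Python simulation of the
sequence described above.  It is only a sketch, not git-repack's
actual code: the function name, the temporary directory layout and
the all-zero pack name are made up for illustration, and only the
.pack file is handled (the index would be moved the same way).

```python
import os
import tempfile


def replace_with_gap(pack_dir, name, staged_pack):
    """Mimic the move-aside-then-move-in sequence described above.

    Returns what a concurrent reader would have seen between the
    two renames (True if the live pack name existed, else False).
    """
    live = os.path.join(pack_dir, name + ".pack")
    old = os.path.join(pack_dir, "old-" + name + ".pack")

    os.rename(live, old)                       # existing pack moved aside
    visible_during_gap = os.path.exists(live)  # a reader looking right now
    os.rename(staged_pack, live)               # new pack moved into place
    os.remove(old)
    return visible_during_gap


with tempfile.TemporaryDirectory() as d:
    name = "pack-" + "0" * 40                  # hypothetical pack name
    with open(os.path.join(d, name + ".pack"), "wb") as f:
        f.write(b"old")
    staged = os.path.join(d, "tmp_pack")
    with open(staged, "wb") as f:
        f.write(b"new")

    gap_visible = replace_with_gap(d, name, staged)
    with open(os.path.join(d, name + ".pack"), "rb") as f:
        final_bytes = f.read()
    print("pack visible between the renames:", gap_visible)
```

Between the two renames nothing exists under the live name, which
is the window I am worried about.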

There are likely at least 2 ways in which the fundamental 
design of packfiles, their indexes, and their names has led 
to this issue.  If the packfile and index were stored in a 
single file, they could have been replaced atomically and 
thus it would potentially avoid the issue of them being 
temporarily inaccessible (although admittedly that might not 
work anyway on some filesystems).  Alternatively, if the 
pack file were named after the sha1 of the packed contents 
of the file instead of the sha1 of the objects in the pack, 
then the replacement would never need to happen since it 
makes no sense to replace a file with another file with the 
exact same contents (unless, of course, the first one is 
corrupt, but then you aren't likely making the repo 
temporarily worse, you are fixing a broken repo).
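
A small Python sketch of the two naming schemes may help here.  It
assumes, as described above, that the current name is a sha1 over
the sorted object ids; the helper names are hypothetical and the
"packs" are just concatenated zlib streams, not real pack files.

```python
import hashlib
import zlib


def name_from_object_ids(object_ids):
    """sha1 over the sorted object ids (the scheme described above)."""
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(bytes.fromhex(oid))
    return h.hexdigest()


def name_from_contents(pack_bytes):
    """sha1 over the bytes of the file itself (the alternative)."""
    return hashlib.sha1(pack_bytes).hexdigest()


blobs = [b"hello\n", b"world\n"]
# Object ids computed the way git hashes blobs: "blob <size>\0<data>".
ids = [hashlib.sha1(b"blob %d\0" % len(b) + b).hexdigest() for b in blobs]

# Toy stand-ins for two packs holding the same objects but written
# with different compression settings.
pack_a = b"".join(zlib.compress(b, 1) for b in blobs)
pack_b = b"".join(zlib.compress(b, 9) for b in blobs)

same_object_name = name_from_object_ids(ids) == name_from_object_ids(ids[::-1])
same_content_name = name_from_contents(pack_a) == name_from_contents(pack_b)
print("object-based names collide: ", same_object_name)
print("content-based names collide:", same_content_name)
```

With object-based names the two files compete for one filename even
though their bytes differ; with content-based names they would only
share a name if they were byte-for-byte identical.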

I suspect these 2 ideas have been discussed before, but 
since they are fundamental changes to the way pack files 
work (and thus would not be backwards compatible), they are 
not likely to get implemented soon.  This got me wondering 
if there wasn't an easier backwards compatible solution to 
avoid making the objects inaccessible?

It seems like the problem could be avoided if we could 
simply change the name of the pack file when a replacement 
would be needed?  Of course, if we just changed the name, 
then the name would not match the sha1 of the contained 
objects and would likely be considered bad by git?  So, what 
if we could simply add a dummy object to the file to cause 
it to deserve a name change?

So the idea would be, have git-repack detect the conflict in 
filenames and have it repack the new file with an additional 
dummy (unused) object in it, and then deliver the new file 
which no longer conflicts.  Would this be possible?  If so, 
what sort of other problems would this cause?  It would 
likely leave an unreferenced object in the pack, which the 
next git-repack would likely want to prune.  Is that OK?  
Maybe you even want it to get pruned, because then the pack 
file will get repacked once again later without the dummy 
object, avoiding the temporarily inaccessible period for 
objects in the file.

Hmm, but then maybe that could even be done in a single 
git-repack run (at the expense of extra disk space)?  

1) Detect the conflict
2) Save the replacement file 
3) Create a new packfile with the dummy object
4) Put the new file with the dummy object into service
5) Remove the old conflicting file (no gap)
6) Place the new conflicting file in service (no dummy)
7) Remove the new file with dummy object (no gap again)

done?  Would it work?
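
To make the seven steps concrete, here is a rough Python simulation
on plain files.  Everything in it is hypothetical: the function
names, the staging directory and the fake pack names are mine, the
dummy pack is assumed to have been built already (how to build it is
the open question), and I am assuming readers find packs via the
.idx file, so the .pack is installed first and the .idx is removed
first.  The staging area must be on the same filesystem for the
renames to be atomic.

```python
import os
import tempfile

EXTS = (".pack", ".idx")  # install .pack first; retire .idx first


def install(src_base, dst_base):
    for ext in EXTS:
        os.rename(src_base + ext, dst_base + ext)


def retire(base):
    for ext in reversed(EXTS):
        os.remove(base + ext)


def complete_packs(pack_dir):
    """Base names that have both an .idx and a .pack in pack_dir."""
    files = set(os.listdir(pack_dir))
    return [f[:-4] for f in files
            if f.endswith(".idx") and f[:-4] + ".pack" in files]


def swap_without_gap(pack_dir, name, staged, dummy_staged, dummy_name, probe):
    live = os.path.join(pack_dir, name)
    dummy_live = os.path.join(pack_dir, dummy_name)
    if not os.path.exists(live + ".pack"):   # 1) no conflict, plain install
        install(staged, live)
        return
    # 2) the replacement stays in staging; 3) dummy pack built by caller
    install(dummy_staged, dummy_live)        # 4) dummy pack into service
    probe()
    retire(live)                             # 5) remove old conflicting pack
    probe()
    install(staged, live)                    # 6) replacement into service
    probe()
    retire(dummy_live)                       # 7) remove the dummy pack
    probe()


def write_pair(base, payload):
    for ext in EXTS:
        with open(base + ext, "wb") as f:
            f.write(payload)


with tempfile.TemporaryDirectory() as root:
    pack_dir = os.path.join(root, "pack")
    stage = os.path.join(root, "stage")
    os.mkdir(pack_dir)
    os.mkdir(stage)
    name, dummy_name = "pack-" + "a" * 40, "pack-" + "b" * 40
    write_pair(os.path.join(pack_dir, name), b"old")
    write_pair(os.path.join(stage, "new"), b"new")
    write_pair(os.path.join(stage, "dummy"), b"new+dummy")

    counts = []  # number of complete packs visible after steps 4 to 7
    swap_without_gap(pack_dir, name, os.path.join(stage, "new"),
                     os.path.join(stage, "dummy"), dummy_name,
                     lambda: counts.append(len(complete_packs(pack_dir))))
    final = sorted(os.listdir(pack_dir))
    with open(os.path.join(pack_dir, name + ".pack"), "rb") as f:
        final_payload = f.read()
    print(counts, final_payload)
```

The probe only counts complete .pack/.idx pairs after each step, so
it shows that some pack holding the objects is always in place in
this toy model; it says nothing about how a real git process that
already has the old pack open would behave.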

If so, is there an easy way to create the dummy file?  Can 
any object simply be added at the end of a pack file after 
the fact (and then added to the index too)?  Also, what 
should the dummy object be?  Is there some sort of null 
object that would be tiny and that would never already be in 
the pack?

Thanks for any thoughts,

-Martin