On Mon, Jan 09, 2017 at 01:21:37AM -0500, Jeff King wrote:
> On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote:
> 
> > I am replying to this email across lists because I wanted to
> > highlight to the git community this jgit change to repacking
> > that we have up for review
> > 
> > https://git.eclipse.org/r/#/c/87969/
> > 
> > This change introduces a new convention for how to preserve
> > old pack files in a staging area
> > (.git/objects/pack/preserved) before deleting them. I
> > wanted to ensure that the new proposed convention would be
> > done in a way that would be satisfactory to the git
> > community as a whole so that it would be easier to
> > provide the same behavior in git eventually. The preserved
> > pack files (and accompanying index and bitmap files) are not
> > only moved, but they are also renamed so that they no longer
> > will match recursive finds looking for pack files.
> 
> It looks like objects/pack/pack-123.pack becomes
> objects/pack/preserved/pack-123.old-pack, and so forth.
> Which seems reasonable, and I'm happy that:
> 
>   find objects/pack -name '*.pack'
> 
> would not find it. :)
> 
> I suspect the name-change will break a few tools that you might want to
> use to look at a preserved pack (like verify-pack). I know that's not
> your primary use case, but it seems plausible that somebody may one day
> want to use a preserved pack to try to recover from corruption. I think
> "git index-pack --stdin <objects/pack/preserved/pack-123.old-pack"
> could always be a last resort for re-admitting the objects to the
> repository.
> 
> I notice this doesn't do anything for loose objects. I think they
> technically suffer the same issue, though the race window is much
> shorter (we mmap them and zlib inflate immediately, whereas packfiles
> may stay mapped across many object requests).
> 
> I have one other thought that's tangentially related.
> 
> I've wondered if we could make object pruning more atomic by
> speculatively moving items to be deleted into some kind of "outgoing"
> object area. Right now you can have a case like:
> 
>   0. We have a pack that has commit X, which is reachable, and commit
>      Y, which is not.
> 
>   1. Process A is repacking. It walks the object graph and finds that
>      X is reachable. It begins creating a new pack with X and its
>      dependent objects.
> 
>   2. Meanwhile, process B pushes up a merge of X and Y, and updates a
>      ref to point to it.
> 
>   3. Process A finishes writing the new pack, and deletes the old one,
>      removing Y. The repository is now corrupt.
> 
> I don't have a solution here. I don't think we want to solve it by
> locking the repository for updates during a repack. I have a vague
> sense that a solution could be crafted around moving the old pack into
> a holding area instead of deleting (during which time nobody else
> would see the objects, and thus not reference them), while the
> repacking process checks to see if the actual deletion would break any
> references (and rolls back the deletion if it would).
> 
> That's _way_ more complicated than your problem, and as I said, I do
> not have a finished solution. But it seems like they touch on a
> similar concept (a post-delete holding area for objects). So I thought
> I'd mention it in case it spurs any brilliance.

Something that is kind of in the same family of problems is the
"loosening" of objects on repacks, before they can be pruned.
When you have a large repository and do a large rewrite operation (in
the extreme case, a filter-branch over several hundred thousand
commits), the first gc afterwards can create a *lot* of loose objects,
each of which consumes an inode and at least one file system block. In
the extreme case, you can end up with git gc filling up multiple extra
gigabytes on your disk.
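Something like this shows the effect (from memory, so take the exact
commands with a grain of salt; and --prune=now is of course only safe
when nothing else still needs the old objects):

  # After a whole-history filter-branch, drop the backup refs and the
  # reflog entries that still reference the pre-rewrite commits:
  git for-each-ref --format='%(refname)' refs/original |
      xargs -n 1 git update-ref -d
  git reflog expire --expire=now --all

  # The first gc then writes every unreachable object that is too young
  # to be pruned (gc.pruneExpire, two weeks by default) out as a loose
  # object, one file each, instead of deleting it:
  git gc
  git count-objects -v    # loose object count and size balloon here

  # Pruning immediately makes gc delete the unreachable objects rather
  # than loosening them:
  git gc --prune=now

Mike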