On Mon, Jan 09, 2017 at 01:21:37AM -0500, Jeff King wrote:
> On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote:
> 
> > I am replying to this email across lists because I wanted to
> > highlight to the git community this jgit change to repacking
> > that we have up for review
> > 
> > https://git.eclipse.org/r/#/c/87969/
> > 
> > This change introduces a new convention for how to preserve
> > old pack files in a staging area
> > (.git/objects/pack/preserved) before deleting them. I
> > wanted to ensure that the new proposed convention would be
> > done in a way that would be satisfactory to the git
> > community as a whole so that it would be easier to
> > provide the same behavior in git eventually. The preserved
> > pack files (and accompanying index and bitmap files) are not
> > only moved, but they are also renamed so that they no longer
> > will match recursive finds looking for pack files.
> 
> It looks like objects/pack/pack-123.pack becomes
> objects/pack/preserved/pack-123.old-pack, and so forth.
> Which seems reasonable, and I'm happy that:
> 
>   find objects/pack -name '*.pack'
> 
> would not find it. :)
> 
> I suspect the name-change will break a few tools that you might want to
> use to look at a preserved pack (like verify-pack). I know that's not
> your primary use case, but it seems plausible that somebody may one day
> want to use a preserved pack to try to recover from corruption. I think
> "git index-pack --stdin <objects/pack/preserved/pack-123.old-pack"
> could always be a last resort for re-admitting the objects to the
> repository.
> 
> I notice this doesn't do anything for loose objects. I think they
> technically suffer the same issue, though the race window is much
> shorter (we mmap them and zlib inflate immediately, whereas packfiles
> may stay mapped across many object requests).
> 
> I have one other thought that's tangentially related.
> 
> I've wondered if we could make object pruning more atomic by
> speculatively moving items to be deleted into some kind of "outgoing"
> object area. Right now you can have a case like:
> 
>   0. We have a pack that has commit X, which is reachable, and commit
>      Y, which is not.
> 
>   1. Process A is repacking. It walks the object graph and finds that
>      X is reachable. It begins creating a new pack with X and its
>      dependent objects.
> 
>   2. Meanwhile, process B pushes up a merge of X and Y, and updates a
>      ref to point to it.
> 
>   3. Process A finishes writing the new pack, and deletes the old one,
>      removing Y. The repository is now corrupt.
> 
> I don't have a solution here. I don't think we want to solve it by
> locking the repository for updates during a repack. I have a vague
> sense that a solution could be crafted around moving the old pack into
> a holding area instead of deleting (during which time nobody else
> would see the objects, and thus not reference them), while the
> repacking process checks to see if the actual deletion would break any
> references (and rolls back the deletion if it would).
> 
> That's _way_ more complicated than your problem, and as I said, I do
> not have a finished solution. But it seems like they touch on a
> similar concept (a post-delete holding area for objects). So I thought
> I'd mention it in case it spurs any brilliance.

Something that is kind of in the same family of problems is the
"loosening" of objects on repacks, before they can be pruned.
When you have a large repository and do a large rewrite operation (in
the extreme case, a filter-branch over several hundred thousand
commits), the first gc afterwards can create a *lot* of loose objects,
each of which consumes an inode and at least one file system block. In
the extreme case, you can end up with git gc filling up multiple extra
gigabytes on your disk.
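Something like this shows the effect (from memory, so take the exact
commands with a grain of salt; and --prune=now is of course only safe
when nothing else still needs the old objects):

  # After a whole-history filter-branch, drop the backup refs and the
  # reflog entries that still reference the pre-rewrite commits:
  git for-each-ref --format='%(refname)' refs/original |
      xargs -n 1 git update-ref -d
  git reflog expire --expire=now --all

  # The first gc then writes every unreachable object that is too young
  # to be pruned (gc.pruneExpire, two weeks by default) out as a loose
  # object, one file each, instead of deleting it:
  git gc
  git count-objects -v    # loose object count and size balloon here

  # Pruning immediately makes gc delete the unreachable objects rather
  # than loosening them:
  git gc --prune=now

Mike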