RE: [PATCH] repack: respect gc.pid lock

David Turner <David.Turner@xxxxxxxxxxxx> · Tue, 18 Apr 2017 17:08:14 +0000

> -----Original Message-----
> From: Jeff King [mailto:peff@xxxxxxxx]
> Sent: Monday, April 17, 2017 11:42 PM
> To: David Turner <David.Turner@xxxxxxxxxxxx>
> Cc: git@xxxxxxxxxxxxxxx; christian.couder@xxxxxxxxx; mfick@xxxxxxxxxxxxxx;
> jacob.keller@xxxxxxxxx
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> 
> > We saw this failure in the logs multiple times (with three different
> > shas, while a gc was running):
> > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects' in [repo] failed:
> > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > Possibly some other repack was running at the time as well.
> >
> > My colleague also saw it while manually doing gc (again while repacks
> > were likely to be running):
> 
> This is sort of a side question, but...why are you running other repacks alongside
> git-gc? It seems like you ought to be doing one or the other.
>
> I don't begrudge anybody with a complicated setup running their own set of gc
> commands, but I'd think you would want to do locking there, and disable auto-
> gc entirely. Otherwise you're going to get different results depending on who
> gc'd last.

That's what GitLab does, so you'll have to ask them why they do it that way.
From https://gitlab.com/gitlab-org/gitlab-ce/issues/30939#note_27487981
it looks like they may have intended to have a lock but not quite succeeded.

> > $ git gc --aggressive
> > Counting objects: 13800073, done.
> > Delta compression using up to 8 threads.
> > Compressing objects:  99% (11465846/11465971)
> > Compressing objects: 100% (11465971/11465971), done.
> > fatal: packfile [repo]/objects/pack/pack-[sha].pack cannot be accessed
> 
> OK, so this presumably happened during the writing phase. Which seems like the
> "a pack was closed, and we couldn't re-open it" problem we've seen before.
> 
> > We have a reasonable rlimit (64k soft limit), so that failure mode is
> > pretty unlikely.  I think we should have had 20 or so packs -- not tens of
> > thousands.
> > [...]
> > Do you have any idea why this would be happening other than the rlimit thing?
> 
> Yeah, that should be enough (you could double check the return of
> get_max_fd_limit() on your system if you wanted to be paranoid).
> 
> We also keep only a limited number of bytes mmap'd at one time. Normally we
> don't actually close packfiles when we release their mmap windows.
> But I think there is one path that might. When use_pack() maps a pack, if the
> entire pack fits in a single window, then we close it; this is due to d131b7afe
> (sha1_file.c: Don't retain open fds on small packs, 2011-03-02).
> 
> But if we ever unmap that window, now we have no handle to the pack.
> Normally on a 64-bit system this wouldn't happen at all, since the default
> core.packedGitLimit is 8GB there.

Aha, I missed that limit while messing around with the code.  That must be it.
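
To convince myself of the sequence, here is a minimal sketch in Python
(not git's C, and not taken from git's code).  It maps a small file in one
window, closes the descriptor, deletes the file the way a concurrent
"repack -d" might, then drops the mapping and tries to re-open.  The
function name and the temporary file are made up for illustration, and
CPython's mmap object keeps its own duplicate descriptor internally, so
this only approximates the state git ends up in.

```python
import mmap
import os
import tempfile


def demo_unmapped_pack_is_gone():
    """Hypothetical helper: returns (readable_while_mapped, reopen_succeeded)."""
    fd, path = tempfile.mkstemp(suffix=".pack")
    os.write(fd, b"PACK" + b"\0" * 4092)

    # Map the whole file in one window, then close our descriptor,
    # roughly what the small-pack path from d131b7afe is described as doing.
    window = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    os.close(fd)

    # Stand-in for a concurrent "repack -d" removing the old pack.
    os.unlink(path)

    # The existing mapping is still usable after the unlink.
    readable_while_mapped = window[:4] == b"PACK"

    # Stand-in for the window being dropped to stay under the mmap limit.
    window.close()

    try:
        os.close(os.open(path, os.O_RDONLY))
        reopen_succeeded = True
    except FileNotFoundError:
        reopen_succeeded = False
    return readable_while_mapped, reopen_succeeded
```

On a POSIX filesystem this should return (True, False): the data stays
readable for as long as the window is mapped, and the path is gone once
we need to open it again.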

> So if you have a few small packs and one very large pack (over 8GB), I think this
> could trigger. We may do the small-pack thing for some of them, and then the
> large pack forces us to drop the mmaps for some of the others. When we go
> back to access the small pack, we find it's gone.
> 
> One solution would be to bump core.packedGitLimit to something much higher
> (it's an mmap, so we're really just chewing up address space; it's up to the OS to
> decide when to load pages from disk and when to drop them).
>
> The other alternative is to disable the small-pack closing from d131b7afe. It
> might need to be configurable, or perhaps auto-tuned based on the fd limit.
> Linux systems tend to have generous descriptor limits, but I'm not sure we can
> rely on that. OTOH, it seems like the code to close descriptors when needed
> would take care of things. So maybe we should just revert d131b7afe entirely.

I definitely remember running into fd limits when processing very large numbers
of packs at Twitter, but I don't recall the exact details.  Presumably, d131b7afe
was supposed to help with this, but in fact it did not totally solve it.  Perhaps
we were doing something funny.  Adjusting the fd limits was the easy fix.
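
For anyone who wants to check the limit their processes actually get, a
small sketch in Python follows.  It is not get_max_fd_limit() itself; it
only reads RLIMIT_NOFILE, which I am assuming is the number that matters
here.  The helper names are made up.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def describe(limit):
    """Render a limit, spelling out the 'no limit' sentinel."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft:", describe(soft), "hard:", describe(hard))
```

Run it as the same user and from the same service manager that launches
the repacks, since limits are inherited per process and may differ from
an interactive shell.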

On 64-bit systems, I think core.packedGitLimit doesn't make a 
lot of sense. There is plenty of address space.  Why not use it?

For 32-bit systems, of course, address space is more precious.

I'll ask our git server administrator to adjust core.packedGitLimit
and turn repacks back on to see if that fixes the issue.

> The final thing I'd ask is whether you might be on a networked filesystem that
> would foil our usual "open descriptors mean packs don't go away" logic. But
> after having dug into the details above, I have a feeling the answer is simply that
> you have repositories >8GB.

Yes, our repo is >8GB, and no, it's not on a networked filesystem.
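
A rough way to check whether a given repository is in that territory is
sketched below in Python.  It assumes the 8GB figure quoted above means
8 GiB, and it simply compares the combined size of the *.pack files
against that number.  The function names and the constant are
hypothetical, and a configured core.packedGitLimit would have to be
passed in by hand.

```python
from pathlib import Path

# 8 GiB, the 64-bit default for core.packedGitLimit quoted in this thread.
ASSUMED_LIMIT = 8 * 1024 ** 3


def pack_sizes(git_dir):
    """Map each *.pack under <git_dir>/objects/pack to its size in bytes."""
    pack_dir = Path(git_dir) / "objects" / "pack"
    return {p.name: p.stat().st_size for p in sorted(pack_dir.glob("*.pack"))}


def exceeds_limit(sizes, limit=ASSUMED_LIMIT):
    """True if the combined pack size is larger than `limit` bytes."""
    return sum(sizes.values()) > limit
```

Calling exceeds_limit(pack_sizes(git_dir)) on a bare repository path
gives a quick yes/no; the per-pack sizes also show whether there is one
very large pack next to a few small ones.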

> And if that is the case, then yeah, your locking patch is definitely a band-aid. If
> you fetch and repack at the same time, you'll eventually see a racy failed fetch.

Fair enough.