On Wed, Feb 08, 2017 at 05:14:03PM -0500, David Turner wrote:

> > I wonder if you'd want to either bump the auto-gc object limit, or
> > possibly reduce the gc.pruneExpire limit to keep this situation from
> > coming up in the first place (or at least mitigating the amount of time
> > it's the case).
>
> Auto-gc might not succeed in pruning objects, but it will at least
> reduce the number of packs, which is super-important for performance.

Right, I mean to bump the loose-object limit but keep the
gc.autoPackLimit at 50. If you couple that with setting
transfer.unpackLimit, then each push creates a single pack, and you
repack after 50 pushes. You don't have to care about loose objects,
because you know you only get them when a "gc" ejects loose objects (so
they're not as efficient, but nothing actually accesses them; they just
hang around until their mtime grace period is up).

> I think the intent of automatic gc is to have a git repository be
> relatively low-maintenance from a server-operator perspective. (Side
> note: it's fairly trivial for a user with push access to mess with the
> check simply by pushing a bunch of objects whose shas start with 17).
> It seems odd that git gets itself into a state where it refuses to do
> any maintenance just because at some point some piece of the maintenance
> didn't make progress.

In my experience, auto-gc has never been a low-maintenance operation on
the server side (and I do think it was primarily designed with clients
in mind). At GitHub we disable it entirely, and do our own gc based on a
throttled job queue (one reason to throttle is that repacking is memory-
and I/O-intensive, so you really don't want a bunch of repacks kicking
off all at once). So basically any repo that gets pushed to goes on the
queue, and then we pick the worst cases from the queue based on how
badly they need packing[1].

I wish regular Git were more turn-key in that respect. Maybe it is for
smaller sites, but we certainly didn't find it so. And I don't know that
it's feasible to really share the solution. It's entangled with our
database (to store last-pushed and last-maintenance values for repos)
and our job scheduler.

[1] The "how bad" thing is a heuristic, and we found it's generally
    proportional to the number of bytes stored in objects _outside_ of
    the big "main" pack. So 2 big pushes may need maintenance more than
    10 tiny pushes, because they have more objects (and our goal with
    maintenance isn't just saving disk space or avoiding the linear pack
    search, but having up-to-date bitmaps and good on-disk deltas to
    make serving fetches as cheap as possible).

> Sure, I could change my configuration, but that doesn't help the other
> folks (e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=813084 )
> who run into this.

Yeah, I'm certainly open to improving Git's defaults. If it's not clear
from the above, I mostly just gave up for a site the size of GitHub. :)

> Idea 1: when gc --auto would issue this message, instead it could create
> a file named gc.too-much-garbage (instead of gc.log), with this message.
> If that file exists, and it is less than one day (?) old, then we don't
> attempt to do a full gc; instead we just run git repack -A -d. (If it's
> more than one day old, we just delete it and continue anyway).

I kind of wonder if this should apply to _any_ error. I.e., just check
the mtime of gc.log and forcibly remove it when it's older than a day.
You never want to get into a state that will fail to resolve itself
eventually. That might still happen (e.g., corrupt repo), but at the
very least it won't be because Git is too dumb to try again.
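To make that concrete, here is a rough sketch of the retry behavior as a
shell wrapper around "git gc --auto" (the real change would presumably
live in builtin/gc.c, and the one-day threshold is only illustrative):

  log=.git/gc.log
  if test -e "$log" && test -n "$(find "$log" -mmin +1440)"
  then
          # the last failure report is over a day old; clear it so the
          # next auto-gc gets another chance instead of refusing forever
          rm -f "$log"
  fi
  # if a fresh gc.log is still present, this just prints it and skips
  git gc --auto

The point is only that the skip-gc state expires on its own, whatever
kind of error produced it.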
> Idea 2 : Like idea 1, but instead of repacking, just smash the existing
> packs together into one big pack. In other words, don't consider
> dangling objects, or recompute deltas. Twitter has a tool called "git
> combine-pack" that does this:
> https://github.com/dturner-tw/git/blob/dturner/journal/builtin/combine-pack.c

We wrote something similar at GitHub, too, but we never ended up using
it in production. We found that with a sane scheduler, it's not too big
a deal to just do maintenance once in a while. Also, our original
problem was that repos which have gotten out of hand (say, 5000 packs)
repacked _very_ slowly with a normal repack. So a "fast pack" followed
by a real pack was a viable way out of that. In the end, I just made
pack-objects handle this case better, and we don't need the fast-pack.

> That's less space-efficient than a true repack, but it's no worse than
> having the packs separate, and it's a win for read performance because
> there's no need to do a linear search over N packs to find an object.

Over the long term you may end up with worse packs, because the true
repack will drop some delta opportunities between objects in the same
pack (reasoning that they weren't made into deltas last time, so it's
not worth trying again). You'd probably need to use "-f" periodically.
This is all speculation, though. We never did it in production, so I
was never able to measure the real impact over time.

> Idea 3: As I suggested last time, separate fatal and non-fatal errors.
> If gc fails because of EIO or something, we probably don't want to touch
> the disk anymore. But here, the worst consequence is that we waste some
> processing power. And it's better to occasionally waste processing power
> in a non-interactive setting than it is to do so when a user will be
> blocked on it. So non-fatal warnings should go to gc.log, and fatal
> errors should go to gc.fatal. gc.log won't block gc from running. I
> think this is my preferred option.

This seems like your (1), except that it handles more than one type of
non-fatal error. So I like it much better.

I'm still not sure if it's worth making the fatal/non-fatal distinction.
Doing so is perhaps safer, but it does mean that somebody has to decide
which errors are important enough to block a retry totally, and which
are not. In theory, it would be safe to always _try_ and then the gc
process can decide when something is broken and abort. And all you've
wasted is some processing power each day.

-Peff