Jeff King <peff@xxxxxxxx> writes:

> In my experience, auto-gc has never been a low-maintenance operation on
> the server side (and I do think it was primarily designed with clients
> in mind).

I do not think auto-gc has ever been tweaked to help server usage in
its history; it was invented strictly to help end-users (mostly new
ones).

> At GitHub we disable it entirely, and do our own gc based on a throttled
> job queue ...
> I wish regular Git were more turn-key in that respect. Maybe it is for
> smaller sites, but we certainly didn't find it so. And I don't know that
> it's feasible to really share the solution. It's entangled with our
> database (to store last-pushed and last-maintenance values for repos)
> and our job scheduler.

Thanks for sharing the insights from the trenches ;-)

> Yeah, I'm certainly open to improving Git's defaults. If it's not clear
> from the above, I mostly just gave up for a site the size of GitHub. :)
>
>> Idea 1: when gc --auto would issue this message, instead it could create
>> a file named gc.too-much-garbage (instead of gc.log), with this message.
>> If that file exists, and it is less than one day (?) old, then we don't
>> attempt to do a full gc; instead we just run git repack -A -d. (If it's
>> more than one day old, we just delete it and continue anyway).
>
> I kind of wonder if this should apply to _any_ error. I.e., just check
> the mtime of gc.log and forcibly remove it when it's older than a day.
> You never want to get into a state that will fail to resolve itself
> eventually. That might still happen (e.g., corrupt repo), but at the
> very least it won't be because Git is too dumb to try again. ;-)

>> Idea 2: Like idea 1, but instead of repacking, just smash the existing
>> packs together into one big pack. In other words, don't consider
>> dangling objects, or recompute deltas. Twitter has a tool called "git
>> combine-pack" that does this:
>> https://github.com/dturner-tw/git/blob/dturner/journal/builtin/combine-pack.c
>
> We wrote something similar at GitHub, too, but we never ended up using
> it in production. We found that with a sane scheduler, it's not too big
> a deal to just do maintenance once in a while.

Thanks again for this. I've also been wondering how effective a
"concatenate packs without paying the reachability penalty" approach
would be (rough sketches of both that and the gc.log expiry check
appear at the end of this message).

> I'm still not sure if it's worth making the fatal/non-fatal distinction.
> Doing so is perhaps safer, but it does mean that somebody has to decide
> which errors are important enough to block a retry totally, and which
> are not. In theory, it would be safe to always _try_ and then the gc
> process can decide when something is broken and abort. And all you've
> wasted is some processing power each day.

Yup, and somebody or something needs to monitor so that repeated
failures can be dealt with.
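
For the "forcibly remove gc.log when it is older than a day" behavior
discussed above, a minimal shell sketch; the exact expiry and the use
of find(1) to compare mtimes are my assumptions, not something git
does for you today:

    # Remove a stale gc.log so the next "git gc --auto" can retry.
    # find(1) prints the file only if its mtime is within the last
    # day, so empty output means the log is stale.
    log=.git/gc.log
    if test -f "$log" && test -z "$(find "$log" -mtime -1)"
    then
        rm -f "$log"
    fi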
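
And for the pack-concatenation idea, a rough approximation with stock
plumbing; this is not what combine-pack does internally, just an
illustration of the effect. We list every object in every existing
pack and feed the names to pack-objects, which reuses the on-disk
deltas by default, so nothing is recomputed and no reachability walk
is made:

    # Fold all packed objects, reachable or not, into one new pack.
    for idx in .git/objects/pack/pack-*.idx
    do
        git show-index <"$idx"
    done |
    awk '{print $2}' |
    git pack-objects .git/objects/pack/pack

The old packs would then have to be removed by hand, and a real tool
would also want to honor .keep files, which this sketch ignores.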