On Thu, Mar 15 2018, Duy Nguyen jotted:

> On Mon, Mar 12, 2018 at 8:30 PM, Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>> We already have pack.packSizeLimit, perhaps we could call this
>> e.g. gc.keepPacksSize=2GB?
>
> I'm OK either way. The "base pack" concept comes from the
> "--keep-base-pack" option where we do keep _one_ base pack. But the gc
> config var has slightly different semantics when it can keep multiple
> packs.

I see, yeah, it would be great to generalize it to N packs.

>> Finally I wonder if there should be something equivalent to
>> gc.autoPackLimit for this. I.e. with my proposed semantics above it's
>> possible that we end up growing forever, i.e. I could have 1000 2GB
>> packs and then 50 very small packs per gc.autoPackLimit.
>>
>> Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
>> gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick
>> the two smallest ones and not issue a --keep-pack for those, although
>> then maybe our memory use would spike past the limit.
>>
>> I don't know, maybe we can leave that for later, but I'm quite keen to
>> turn the top-level config variable into something that just considers
>> size instead of "base" if possible, and it seems we're >95% of the way
>> to that already with this patch.
>
> At least I will try to ignore gc.keepPacksSize if all packs are kept
> because of it. That repack run will hurt. But then we're back to one
> giant pack and plenty of small packs that will take some time to grow
> up to 2GB again.

I think that semantic really should have its own option. The usefulness
of this is significantly diminished if it's not a guarantee on the
resource use of git-gc.

Consider a very large repo where we clone and get a 4GB pack. Then as
time goes on we end up with lots of loose objects and small packs from
pulling, and eventually end up with say 4GB + 2x 500MB packs (if our
limit is 500MB).

If I understand you correctly, then if we ever match the gc --auto
requirements because we have *just* the big packs plus a bunch of loose
objects (say we rebased a lot), we'll try to create a giant 5GB pack
(+ loose objects).

>> Finally, I don't like the way the current implementation conflates a
>> "size" variable with auto-detecting the size from memory, leaving no
>> way to fall back to the auto-detection if you set it manually.
>>
>> I think we should split out the auto-memory behavior into another
>> variable, and also make the currently hardcoded 50% of memory
>> configurable.
>>
>> That way you could e.g. say you'd always like to keep 2GB packs, but
>> if you happen to have ended up with a 1GB pack and it's time to
>> repack, and you only have 500MB free memory on that system, it would
>> keep the 1GB one until such time as we have more memory.
>
> I don't calculate based on free memory (it's tricky to get that right
> on linux) but physical memory. If you don't have enough memory
> according to this formula, you won't until you add more memory sticks.

Ah, thanks for the clarification.

>> Actually maybe that should be a "if we're that low on memory, forget
>> about GC for now" config, but urgh, there's a lot of potential
>> complexity to be handled here...
>
> Yeah I think what you want is a hook. You can make custom rules then.
> We already have the pre-auto-gc hook and could pretty much do what you
> want without pack-objects memory estimation. But if you want it, maybe
> we can export the info to the hook somehow.
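Just so we're talking about the same thing, I take it you mean something
along these lines as a pre-auto-gc hook (only a sketch; the 2GB cut-off
and reading MemAvailable out of /proc/meminfo are made up for
illustration, and Linux-only):

    #!/bin/sh
    # pre-auto-gc: exit non-zero to make "git gc --auto" do nothing
    # when the machine looks low on memory.
    free_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
    if test -n "$free_kb" && test "$free_kb" -lt 2097152
    then
        echo "pre-auto-gc: low on memory, skipping auto gc" >&2
        exit 1
    fi
    exit 0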
I can do away with that particular thing, but I'd really like to do
without the hook. I can automate it on some machines, but then we also
have un-managed laptops run by users who clone big repos. It's much
easier to tell them to set a few git config variables than have them
install & keep some hook up-to-date.
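I.e. just telling them to run something like this once (gc.keepPacksSize
being the name floated above, so still hypothetical at this point;
gc.autoPackLimit already exists and defaults to 50):

    git config gc.autoPackLimit 50
    git config gc.keepPacksSize 2g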