Re: [RFC/PATCH] gc: run more pre-detach operations under lock

On Thu, Jun 20 2019, Duy Nguyen wrote:

> On Thu, Jun 20, 2019 at 5:49 AM Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>>
>>
>> On Wed, Jun 19 2019, Jeff King wrote:
>>
>> > On Wed, Jun 19, 2019 at 08:01:55PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >
>> >> > You could sort of avoid the problem here too with
>> >> >
>> >> > parallel 'git fetch --no-auto-gc {}' ::: $(git remote)
>> >> > git gc --auto
>> >> >
>> >> > It's definitely simpler, but of course we have to manually add
>> >> > --no-auto-gc in everywhere we need, so not quite as elegant.
>> >> >
>> >> > Actually you could already do that with 'git -c gc.auto=false fetch', I guess.
>> >>
>> >> The point of the 'parallel' example is to show disconnected git
>> >> commands, think trying to run 'git' in a terminal while your editor
>> >> asynchronously runs a polling 'fetch', or a server with multiple
>> >> concurrent clients running 'gc --auto'.
>> >>
>> >> That's the question my RFC patch raises. As far as I can tell the
>> >> approach in your patch is only needed because our locking for gc is
>> buggy, rather than introduce the caveat that a fetch(N) operation won't
>> >> do "gc" until it's finished (we may have hundreds, thousands of remotes,
>> >> I use that for some more obscure use-cases) shouldn't we just fix the
>> >> locking?
>> >
>> > I think there may be room for both approaches. Yours fixes the repeated
>> > message in the more general case, but Duy's suggestion is the most
>> > efficient thing.
>> >
>> > I agree that the "thousands of remotes" case means we might want to gc
>> > in the interim. But we probably ought to do that deterministically
>> > rather than hoping that the pattern of lock contention makes sense.
>>
>> We do it deterministically, when gc.auto thresholds et al are exceeded
>> we kick one off without waiting for other stuff, if we can get the lock.
>>
>> I don't think this desire to just wait a bit until all the fetches are
>> complete makes sense as a special-case.
>>
>> If, as you noted in <20190619190845.GD28145@xxxxxxxxxxxxxxxxxxxxx>, the
>> desire is to reduce GC CPU use then you're better off just tweaking the
>> limits upwards. Then you get that with everything, like when you run
>> "commit" in a for-loop, not just this one special case of "fetch".
>>
>> We have existing potentially long-running operations like "fetch",
>> "rebase" and "git svn fetch" that run "gc --auto" for their incremental
>> steps, and that's a feature.
>
> gc --auto is added at arbitrary points to help garbage collection. I
> don't think it's ever intended to "do gc at this and that exact
> moment", just "hey this command has taken a lot of time already (i.e.
> no instant response needed) and it may have added a bit more garbage,
> let's just check real quick".

I don't mean we can't ever change the algorithm, but that we've
documented:

    When common porcelain operations that create objects are run, they
    will check whether the repository has grown substantially since the
    last maintenance[...]

The "fetch" command is a common porcelain operation, when it fetches
from N remotes it just runs an invocation of itself, so thus far it's
both worked & been intuitive that if we needed (potentially multiple)
gc's while doing that we'd just go ahead and run it then, even if
something concurrent was happening.

No, that's not optimal in many cases, but at least it doesn't create
caveats we don't have now, like runaway object growth.

>> It keeps "gc --auto" dumb enough to avoid a pathological case where
>> we'll have a ballooning objects dir because we figure we can run
>> something "at the end", when "the end" could be hours away, and we're
>> adding a new pack or hundreds of loose objects every second.
>
> Are we optimizing for a rare (large scale) case? Such setup requires
> tuning regardless to me.

At least for me it doesn't require custom tuning before this patch of
yours.

I.e. now "gc --auto" is dumb enough that you can run it on everything
from stuff that just does "commit" from cron, user's laptops, massive
rebases that take forever, and e.g. "stats" like jobs where for
<reasons> I'll add thousands of repos and "fetch --all" them (so e.g. I
can run "log --author=<x> --all").

Yeah, of course, I'm an advanced user and I can just grumble and
manually invoke fetch; actually I'll probably submit a follow-up patch
to add a gc.* config to disable this thing.

But I think even if the use-case is rather obscure it's important that
if at all possible we keep "gc" elastic enough to work for pretty much
all combinations of object-adding porcelain commands, and I think in
this case we're better off doing things differently...

>> So I don't think Duy's patch is a good way to go.
>
> This reminds me of being perfect is the enemy of the good. A normal
> user has a couple remotes at most, finishing fast (enough) and in such
> case it's a good idea to wait until everything is in before running
> gc.

Even a user with two remotes will run into issues with your patch where
"gc" will print things twice, or outright error out due to concurrent
access by another process, which, as has been discussed here on-list, is
*very* common, e.g. with editor integration.

So the extent of my complaints about this is:

 1) The edge case with runaway object growth (obscure)

 2) It doesn't really fix the bug except for the narrow case of users
    who invoke things one-terminal-at-a-time and don't e.g. have an
    editor with "git" integration & a terminal (not obscure).

Maybe I should have led with #2 :)

Anyway, I'm not *just* complaining. I have patches too, but so far I'm
up to 8 patches on what's probably just the first one-third of
it. *Sigh*.

But two-thirds of that, if you want to dig through my crappy WIP code on
GitHub, is instrumenting the test suite to demonstrate that with an
approach like your patch we still get these GC race conditions, because
our locking still sucks. Which brings me to...

> Of course making git-gc more robust wrt. parallel access is great, but
> it's hard work. Dealing with locks is always tricky, especially when
> new locks can come up any time.

I've poked at it a bit now, and it's really not hard; I think it's just
that nobody has looked at it hard enough before.

The issue is that currently we do:

    1. parent: do_we_need_gc();
    2. parent: say_we_will_gc();
    3. parent: lock();
    4. parent: do_a_bit_of_work();
    5. parent: unlock();
    6. parent: fork();
    7. child: lock();
    8. child: do_a_lot_of_work();
    9. child: unlock();

My RFC patch currently changes that to:

    1. parent: do_we_need_gc();
    2. parent: lock_or_silently_exit();
    3. parent: say_we_will_gc();
    4. parent: do_a_bit_of_work();
    5. parent: unlock();
    6. parent: fork();
    7. child: lock();
    8. child: do_a_lot_of_work();
    9. child: unlock();

I.e. we won't duplicate the message, but we *do* introduce the caveat
that it's in principle possible nobody gc's. In practice, though, when
we fail to get the lock in lock_or_silently_exit() it's because we lost
the race to a sister process that's going to do the actual GC, so all is
well.
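
To make that concrete, here's a minimal POSIX-flavored sketch of
lock_or_silently_exit() (*not* the locking code gc.c actually uses; the
lock path and mode below are just for illustration):

    #include <fcntl.h>
    #include <stdlib.h>

    /*
     * Step 2: try to take the lock *before* saying anything.
     * O_CREAT|O_EXCL is atomic, so exactly one process wins; every
     * loser exits silently, assuming the winner will do the gc.
     */
    static int lock_or_silently_exit(const char *lock_path)
    {
        int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0666);
        if (fd < 0)
            exit(0); /* lost the race to a sister process */
        return fd;   /* we hold the lock; caller gc's */
    }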

But we are left with the brief race when we fork. My RFC patch proposed
some elaborate PID hand-over dance to deal with this. But having looked
at it again I think we can easily get rid of that race too with just:

    1. parent: do_we_need_gc();
    2. parent: lock_or_silently_exit();
    3. parent: say_we_will_gc();
    4. parent: do_a_bit_of_work();
    5. parent: fork();
    6. parent: <writes child pid to gc.lock>
    7. parent: <exits without unlocking gc.lock>
    8. child: do_a_lot_of_work();
    9. child: unlock();
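
Here's that whole flow as a stand-alone sketch (plain POSIX, *not* the
real gc.c code; do_a_bit_of_work()/do_a_lot_of_work() are the
placeholders from the lists above):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void do_a_bit_of_work(void) { /* e.g. pack refs */ }
    static void do_a_lot_of_work(void) { /* e.g. repack, prune */ }

    int main(void)
    {
        const char *lock_path = ".git/gc.lock"; /* illustration only */

        /* 2. lock_or_silently_exit() */
        int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0666);
        if (fd < 0)
            return 0; /* lost the race; the winner will gc */

        /* 3. say_we_will_gc() -- only the lock holder prints */
        fprintf(stderr, "Auto packing the repository...\n");

        /* 4. */
        do_a_bit_of_work();

        /* 5.-7. fork & hand the lock over to the child */
        pid_t pid = fork();
        if (pid > 0) {
            dprintf(fd, "%d\n", (int)pid); /* 6. child pid -> gc.lock */
            return 0; /* 7. exit WITHOUT unlocking */
        }

        /* 8.-9. child: if we die before the unlink() the lock
         * leaks, which is the failure mode discussed next */
        do_a_lot_of_work();
        unlink(lock_path); /* 9. unlock */
        return 0;
    }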

This can fail in cases where the child segfaults, or manages to exit
earlier than the parent, etc., hence my earlier proposed elaborate pid
hand-over dance.

But looking at it again, we only usurp an existing gc.lock if its mtime
is >12hrs old, so it's OK if we have very rare cases where the PID info
gets corrupted: we can still back ourselves out of it, which is what I
was paranoid about.
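
That usurp rule sketched out (again not the exact code; IIRC the real
check also probes whether the recorded pid is still alive, this only
shows the mtime part):

    #include <sys/stat.h>
    #include <time.h>

    /*
     * Only honor an existing gc.lock if it's fresh; anything older
     * than 12 hours is assumed stale (crashed gc, corrupt pid
     * info, ...) and gets taken over.
     */
    static int lock_is_stale(const char *lock_path)
    {
        struct stat st;
        if (stat(lock_path, &st) < 0)
            return 1; /* no lock at all */
        return time(NULL) - st.st_mtime > 12 * 3600;
    }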

Furthermore, the gc.lock contains the <hostname> <pid> of the working
process, but we can add new entries to that file in a
backwards-compatible way, i.e. list both the child & parent pid. Older
clients will just read whichever one comes first, but if we make new
versions check both we can be more paranoid going forward.
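
E.g. with one "<hostname> <pid>" entry per line, child first, the
new-style reader is just a loop (hypothetical format; is_alive() stands
in for the existing same-host kill(pid, 0)-style check):

    #include <stdio.h>

    /* hypothetical: same-host liveness probe */
    extern int is_alive(const char *host, int pid);

    /*
     * Old clients only parse the first "<hostname> <pid>" entry
     * (the child); new clients check every listed pid, so the lock
     * still counts as held while *either* process is alive.
     */
    static int any_locker_alive(FILE *fp)
    {
        char host[256];
        int pid;

        while (fscanf(fp, "%255s %d", host, &pid) == 2)
            if (is_alive(host, pid))
                return 1;
        return 0;
    }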

> Having said that, I don't mind if my patch gets dropped. It was just a
> "hey that multiple gc output looks strange, hah the fix is quite
> simple" moment for me.



