Hi Greg,

I've tested the patch below on top of the 0.94.5 hammer sources, and it works
beautifully. No more active+clean+replay stuck PGs. Thanks!

Andras


On 10/27/15, 4:46 PM, "Andras Pataki" <apataki@xxxxxxxxxxxxxxxxxxxx> wrote:

>Yes, this definitely sounds plausible (the peering/activating process does
>take a long time). At the moment I'm trying to get our cluster back to a
>more working state. Once everything works, I could try building a patched
>set of ceph processes from source (currently I'm using the pre-built CentOS
>RPMs) before a planned larger rebalance.
>
>Andras
>
>
>On 10/27/15, 2:36 PM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:
>
>>On Tue, Oct 27, 2015 at 11:22 AM, Andras Pataki
>><apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> Hi Greg,
>>>
>>> No, unfortunately I haven't found any resolution to it. We are using
>>> cephfs, and the whole installation is on 0.94.4. What I did notice is
>>> that performance is extremely poor while backfilling is happening. I
>>> wonder if timeouts of some kind could cause PGs to get stuck in replay.
>>> I lowered the 'osd max backfills' parameter today from the default of 10
>>> all the way down to 1 to see if it improves things. Client read/write
>>> performance has definitely improved since then; whether this also fixes
>>> the 'stuck-in-replay' situation, I'm still waiting to see.
>>
>>Argh. Looks like known bug http://tracker.ceph.com/issues/13116. I've
>>pushed a new branch, hammer-pg-replay, to the gitbuilders; it backports
>>that patch and ought to improve things if you're able to install it to
>>test. (It's untested, but I don't foresee any issues arising.) I've also
>>added it to the backport queue.
>>-Greg
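
As an aside, here is a minimal sketch of the backfill throttling discussed
above, assuming a hammer-era cluster. The exact commands are illustrative,
not quoted from the thread:

    # Throttle concurrent backfills per OSD at runtime (no daemon restart needed):
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # To make the change persist across OSD restarts, set it in the [osd]
    # section of ceph.conf on the OSD hosts:
    [osd]
    osd max backfills = 1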
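
Similarly, a rough sketch of rebuilding patched hammer sources, as Andras
describes doing. The branch name comes from Greg's mail; the commit
placeholder and the autotools build steps are assumptions on my part, not
steps confirmed in the thread:

    git clone https://github.com/ceph/ceph.git && cd ceph
    git checkout v0.94.5                               # the hammer release being patched
    git submodule update --init --recursive            # ceph vendors several submodules
    git cherry-pick <commit-from-hammer-pg-replay>     # placeholder: the backported fix for issue 13116
    ./autogen.sh && ./configure && make -j"$(nproc)"   # hammer still builds with autotools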