Yes, this definitely sounds plausible (the peering/activating process does take
a long time). At the moment I'm trying to get our cluster back to a working
state. Once everything works, I could try building a patched set of Ceph
processes from source (currently I'm using the pre-built CentOS RPMs) before a
planned larger rebalance.

Andras

On 10/27/15, 2:36 PM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:

>On Tue, Oct 27, 2015 at 11:22 AM, Andras Pataki
><apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>> Hi Greg,
>>
>> No, unfortunately I haven't found any resolution to it. We are using
>> cephfs, the whole installation is on 0.94.4. What I did notice is that
>> performance is extremely poor while backfilling is happening. I wonder if
>> timeouts of some kind could cause PGs to get stuck in replay. I lowered
>> the 'osd max backfills' parameter today from the default 10 all the way
>> down to 1 to see if it improves things. Client read/write performance has
>> definitely improved since then; whether this improves the
>> 'stuck-in-replay' situation, I'm still waiting to see.
>
>Argh. Looks like known bug http://tracker.ceph.com/issues/13116. I've
>pushed a new branch hammer-pg-replay to the gitbuilders which
>backports that patch and ought to improve things if you're able to
>install that to test. (It's untested but I don't foresee any issues
>arising.) I've also added it to the backport queue.
>-Greg
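
For anyone following along who wants to try the gitbuilder packages Greg
mentions: a rough sketch of how that could be pulled onto a CentOS node is
below. The repository URL layout here is only an assumption based on the usual
gitbuilder naming scheme, so verify the actual path on gitbuilder.ceph.com
before relying on it.

    # Hypothetical .repo entry for the hammer-pg-replay branch; the exact
    # gitbuilder baseurl is an assumption and should be checked first.
    cat > /etc/yum.repos.d/ceph-gitbuilder.repo <<'EOF'
    [ceph-gitbuilder]
    name=Ceph gitbuilder (hammer-pg-replay branch)
    baseurl=http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer-pg-replay/x86_64/
    enabled=1
    gpgcheck=0
    EOF

    # Refresh metadata and upgrade the ceph packages from the new repo,
    # then restart the OSD daemons so the patched code takes effect.
    yum clean metadata
    yum update ceph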
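
And on the backfill throttling mentioned above: a minimal sketch of how the
'osd max backfills' setting can be lowered, assuming an admin node with the
client.admin keyring. The injectargs change takes effect immediately but does
not survive OSD restarts, so ceph.conf should be updated as well.

    # Lower the backfill throttle on all OSDs at runtime:
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # To make the change persistent, also set it in ceph.conf on the
    # OSD hosts:
    [osd]
        osd max backfills = 1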