Hi Greg,

I've tested the patch below on top of the 0.94.5 hammer sources, and it works
beautifully. No more active+clean+replay stuck PGs. Thanks!

Andras


On 10/27/15, 4:46 PM, "Andras Pataki" <apataki@xxxxxxxxxxxxxxxxxxxx> wrote:

>Yes, this definitely sounds plausible (the peering/activating process does
>take a long time). At the moment I'm trying to get our cluster back to a
>more working state. Once everything works, I could try building a patched
>set of ceph processes from source (currently I'm using the pre-built CentOS
>RPMs) before a planned larger rebalance.
>
>Andras
>
>
>On 10/27/15, 2:36 PM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:
>
>>On Tue, Oct 27, 2015 at 11:22 AM, Andras Pataki
>><apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> Hi Greg,
>>>
>>> No, unfortunately I haven't found any resolution to it. We are using
>>> cephfs, and the whole installation is on 0.94.4. What I did notice is
>>> that performance is extremely poor while backfilling is happening. I
>>> wonder if timeouts of some kind could cause PGs to get stuck in replay.
>>> I lowered the 'osd max backfills' parameter today from the default of 10
>>> all the way down to 1 to see if it improves things. Client read/write
>>> performance has definitely improved since then; whether this also fixes
>>> the 'stuck-in-replay' situation, I'm still waiting to see.
>>
>>Argh. Looks like known bug http://tracker.ceph.com/issues/13116. I've
>>pushed a new branch, hammer-pg-replay, to the gitbuilders; it backports
>>that patch and ought to improve things if you're able to install it to
>>test. (It's untested, but I don't foresee any issues arising.) I've also
>>added it to the backport queue.
>>-Greg
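
As an aside, here is a minimal sketch of the backfill throttling discussed
above, assuming a hammer-era cluster. The exact commands are illustrative,
not quoted from the thread:

    # Throttle concurrent backfills per OSD at runtime (no daemon restart needed):
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # To make the change persist across OSD restarts, set it in the [osd]
    # section of ceph.conf on the OSD hosts:
    [osd]
    osd max backfills = 1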
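
Similarly, a rough sketch of rebuilding patched hammer sources, as Andras
describes doing. The branch name comes from Greg's mail; the commit
placeholder and the autotools build steps are assumptions on my part, not
steps confirmed in the thread:

    git clone https://github.com/ceph/ceph.git && cd ceph
    git checkout v0.94.5                               # the hammer release being patched
    git submodule update --init --recursive            # ceph vendors several submodules
    git cherry-pick <commit-from-hammer-pg-replay>     # placeholder: the backported fix for issue 13116
    ./autogen.sh && ./configure && make -j"$(nproc)"   # hammer still builds with autotools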