Yes, this definitely sounds plausible (the peering/activating process does take
a long time). At the moment I'm trying to get our cluster back to a working
state. Once everything works, I could try building a patched set of Ceph
processes from source (currently I'm using the pre-built CentOS RPMs) before a
planned larger rebalance.

Andras

On 10/27/15, 2:36 PM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:

>On Tue, Oct 27, 2015 at 11:22 AM, Andras Pataki
><apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>> Hi Greg,
>>
>> No, unfortunately I haven't found any resolution to it. We are using
>> cephfs, the whole installation is on 0.94.4. What I did notice is that
>> performance is extremely poor while backfilling is happening. I wonder if
>> timeouts of some kind could cause PGs to get stuck in replay. I lowered
>> the 'osd max backfills' parameter today from the default 10 all the way
>> down to 1 to see if it improves things. Client read/write performance has
>> definitely improved since then; whether this improves the
>> 'stuck-in-replay' situation, I'm still waiting to see.
>
>Argh. Looks like known bug http://tracker.ceph.com/issues/13116. I've
>pushed a new branch hammer-pg-replay to the gitbuilders which
>backports that patch and ought to improve things if you're able to
>install that to test. (It's untested but I don't foresee any issues
>arising.) I've also added it to the backport queue.
>-Greg
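
For anyone following along who wants to try the gitbuilder packages Greg
mentions: a rough sketch of how that could be pulled onto a CentOS node is
below. The repository URL layout here is only an assumption based on the usual
gitbuilder naming scheme, so verify the actual path on gitbuilder.ceph.com
before relying on it.

    # Hypothetical .repo entry for the hammer-pg-replay branch; the exact
    # gitbuilder baseurl is an assumption and should be checked first.
    cat > /etc/yum.repos.d/ceph-gitbuilder.repo <<'EOF'
    [ceph-gitbuilder]
    name=Ceph gitbuilder (hammer-pg-replay branch)
    baseurl=http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/hammer-pg-replay/x86_64/
    enabled=1
    gpgcheck=0
    EOF

    # Refresh metadata and upgrade the ceph packages from the new repo,
    # then restart the OSD daemons so the patched code takes effect.
    yum clean metadata
    yum update ceph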
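
And on the backfill throttling mentioned above: a minimal sketch of how the
'osd max backfills' setting can be lowered, assuming an admin node with the
client.admin keyring. The injectargs change takes effect immediately but does
not survive OSD restarts, so ceph.conf should be updated as well.

    # Lower the backfill throttle on all OSDs at runtime:
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # To make the change persistent, also set it in ceph.conf on the
    # OSD hosts:
    [osd]
        osd max backfills = 1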