Can you please provide the output of `ceph status`, `ceph osd tree`, and `ceph health detail`? Thank you.
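If the cluster is still in that state, output from the stuck-PG queries would help as well. A minimal set of commands to collect everything (run from any node with the admin keyring; <pgid> is a placeholder for one of the PG IDs that ceph health detail reports as stuck) would be:

    ceph status
    ceph osd tree
    ceph health detail
    ceph pg dump_stuck unclean
    ceph pg <pgid> query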
On Tue, Sep 19, 2017 at 2:59 PM Jonas Jaszkowic <jonasjaszkowic.work@xxxxxxxxx> wrote:
Hi all,

I have set up a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD of 320GB per host) and 16 clients which are reading from and writing to the cluster. I have one erasure-coded pool (shec plugin) with k=8, m=4, c=3 and pg_num=256. The failure domain is host. I am able to reach a HEALTH_OK state and everything is working as expected. The pool was populated with 114048 files of different sizes ranging from 1kB to 4GB. The total amount of data in the pool was around 3TB, and the capacity of the pool was around 10TB.

I want to evaluate how Ceph rebalances data in case of an OSD loss while clients are still reading. To do so, I am taking one OSD out on purpose via ceph osd out <osd-id> without adding a new one, i.e. I have 31 OSDs left. Ceph notices this failure and starts to rebalance data, which I can observe with the ceph -w command.

However, Ceph failed to rebalance the data: the recovery process seemed to be stuck at a random point. I waited more than 12h, but the number of degraded objects did not decrease and some PGs remained stuck. Why is this happening? Based on the number of OSDs and the k, m, c values, shouldn't there be enough hosts and OSDs to recover from a single OSD failure?

Thank you in advance!
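For context, a shec pool like the one described is typically created along the following lines; the profile and pool names here are placeholders, not necessarily the ones actually used (and on pre-Luminous releases the failure-domain option is spelled ruleset-failure-domain rather than crush-failure-domain):

    ceph osd erasure-code-profile set shec_profile \
        plugin=shec k=8 m=4 c=3 crush-failure-domain=host
    ceph osd pool create ecpool 256 256 erasure shec_profile

With k=8 and m=4 each PG spans 12 distinct hosts, and c=3 means up to 3 lost chunks per PG should be recoverable, so 31 remaining hosts ought to be enough to remap and recover after a single OSD failure.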
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com