Not recovering completely on OSD failure

Hi guys

This is probably a configuration error, but I just can't find it.
The following happens reproducibly on my cluster [1].

15:52:15 On Host1, one disk is removed via the RAID controller (to Ceph it looks as if the disk died)
15:52:52 OSD reported missing (osd.47)
15:52:53 osdmap eXXX: 60 osds: 59 up, 60 in; 1.781% degraded, 436 PGs stuck unclean, 436 PGs degraded; not recovering yet
15:57:54 osdmap eXXX: 60 osds: 59 up, 59 in; recovery starts
15:58:00 2.502% degraded
15:58:01 3.413% degraded; recovery runs at about 1 GB/s, then the recovery speed drops to about 40 MB/s
17:02:10 10 PGs active+remapped, 218 PGs active+degraded, 0.898% degraded; recovery has stopped
18:12    Still not recovering
A few days later: OSD removed [2]; only then does the cluster recover completely
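
For reference, this is roughly how I watched whether recovery was still making progress. It is only a sketch: it shells out to the standard ceph CLI ("ceph pg dump_stuck unclean"), assumes the admin keyring is readable on the node it runs on, and the line counting is deliberately naive because the exact dump_stuck output format differs between Ceph releases.

#!/usr/bin/env python
# Rough sketch only: poll the cluster and report when the number of stuck
# unclean PGs stops shrinking. Assumes the "ceph" CLI and an admin keyring
# are available on this node; the parsing below is approximate because the
# exact dump_stuck output format differs between Ceph releases.
import subprocess
import time

POLL_INTERVAL = 60  # seconds between samples


def stuck_unclean_pgs():
    """Return an approximate count of PGs listed as stuck unclean."""
    out = subprocess.check_output(
        ["ceph", "pg", "dump_stuck", "unclean"]).decode()
    # PG ids start with a digit (e.g. "3.1f4"); this skips the header line
    # and any trailing status line such as "ok".
    rows = [line for line in out.splitlines()
            if line.strip() and line[0].isdigit()]
    return len(rows)


if __name__ == "__main__":
    previous = None
    while True:
        stuck = stuck_unclean_pgs()
        if previous is not None and stuck > 0 and stuck >= previous:
            print("recovery looks stalled: %d PGs still stuck unclean" % stuck)
        previous = stuck
        time.sleep(POLL_INTERVAL)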

I would like my cluster to recover completely without my intervention. Can anyone give an educated guess as to what went wrong here? I can't find any reason why the cluster would simply stop recovering.

Thank you for any hints!
Niklas


[1] 4 OSD hosts with 15 disks each; on each of the 60 identical disks there is one OSD. I have one large pool with 6000 PGs and a replica size of 4, and 3 (default) pools with 64 PGs each (see the quick arithmetic below).
[2] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
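
Quick back-of-the-envelope check of the PG load implied by [1]; only the large pool is counted because I haven't stated the replica size of the three default pools, so the small pools add a little on top of this.

# Rough arithmetic on the cluster layout from [1]; only the large pool is
# counted since the replica size of the three default pools isn't given.
osds = 4 * 15            # 4 hosts with 15 one-OSD disks each
large_pool_pgs = 6000    # PGs in the large pool
replica_size = 4         # replica size of the large pool

pg_copies = large_pool_pgs * replica_size   # 24000 placement-group copies
print(pg_copies / float(osds))              # ~400 PG copies per OSD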



