Hi guys
This is probably a configuration error, but I just can't find it.
The following happens reproducibly on my cluster [1].
15:52:15 On Host1, one disk is removed via the RAID controller (to
Ceph it looks as if the disk died)
15:52:52 OSD reported missing (osd.47)
15:52:53 osdmap eXXX: 60 osds: 59 up, 60 in; 1.781% degraded, 436 PGs
stuck unclean, 436 PGs degraded; not recovering yet
15:57:54 osdmap eXXX: 60 osds: 59 up, 59 in; recovery starts
15:58:00 2.502% degraded
15:58:01 3.413% degraded; recovering at about 1 GB/s --> recovery
speed drops to about 40 MB/s
17:02:10 10 PGs active+remapped, 218 PGs active+degraded, 0.898%
degraded, stopped recovering
18:12 Still not recovering
a few days later: OSD removed [2]; only then does the cluster recover
completely
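
(Side note on the degraded percentage: a rough sanity check of my own,
assuming CRUSH spreads object replicas about evenly over all 60 OSDs,
so that one lost OSD holds roughly 1/60 of all replicas. The Python
below is just my illustration, not cluster output.)

    # One of 60 OSDs down: expected share of degraded object replicas.
    total_osds = 60
    failed_osds = 1
    expected_degraded = failed_osds / total_osds
    print(f"expected degraded: {expected_degraded:.3%}")  # ~1.667%
    print("reported degraded: 1.781%")  # from the osdmap line above

The small gap to the reported 1.781% is expected, since the
distribution is not perfectly uniform.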
I would like my cluster to recover completely without my intervention.
Can anyone give an educated guess as to what went wrong here? I can't
find a reason why the cluster would simply stop recovering.
Thank you for any hints!
Niklas
[1] 4 OSD hosts with 15 disks each; each of the 60 identical disks
carries one OSD. There is one large pool with 6000 PGs and a replica
size of 4, plus 3 (default) pools with 64 PGs each.
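
(For context, a quick PG-copies-per-OSD calculation for this layout;
the replica size of the 3 default pools is not stated above, so size 3
is an assumption on my part.)

    # PG replica placements per OSD for the layout in [1].
    osds = 60
    large_pool = 6000 * 4       # 6000 PGs, replica size 4
    default_pools = 3 * 64 * 3  # 3 pools x 64 PGs, size 3 assumed
    print((large_pool + default_pools) / osds)  # ~410 PG copies per OSD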
[2]
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
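
(For completeness, a minimal sketch of the removal sequence described
in [2], wrapped in Python only for illustration; normally these are
plain shell commands. It assumes the dead OSD is osd.47 and that its
daemon is already stopped because the disk is gone. The linked
documentation is the authoritative source.)

    import subprocess

    osd, osd_id = "osd.47", "47"
    for cmd in (
        ["ceph", "osd", "out", osd_id],           # mark it out of the cluster
        ["ceph", "osd", "crush", "remove", osd],  # drop it from the CRUSH map
        ["ceph", "auth", "del", osd],             # remove its auth key
        ["ceph", "osd", "rm", osd_id],            # remove it from the OSD map
    ):
        subprocess.check_call(cmd)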