Not recovering completely on OSD failure

Hi guys

This is probably a configuration error, but I just can't find it.
The following happens reproducibly on my cluster [1]:

15:52:15  On Host1, one disk is pulled via the RAID controller (to Ceph it looks as if the disk died)
15:52:52  osd.47 is reported down
15:52:53  osdmap eXXX: 60 osds: 59 up, 60 in; 1.781% degraded, 436 PGs stuck unclean, 436 PGs degraded; not recovering yet
15:57:54  osdmap eXXX: 60 osds: 59 up, 59 in; recovery starts
15:58:00  2.502% degraded
15:58:01  3.413% degraded; recovery runs at about 1 GB/s, then slows to about 40 MB/s
17:02:10  10 PGs active+remapped, 218 PGs active+degraded, 0.898% degraded; recovery has stopped
18:12     Still not recovering
A few days later: osd.47 removed manually [2]; only then does the cluster recover completely.
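
For completeness, that last step simply followed the documented procedure in [2]; roughly the following, with the exact command for stopping the daemon depending on the distro/init system:

    ceph osd out 47              # mark the OSD out so its data gets rebalanced away
    # on Host1: stop the ceph-osd daemon for id 47 (init command varies by distro)
    ceph osd crush remove osd.47 # remove it from the CRUSH map
    ceph auth del osd.47         # delete its authentication key
    ceph osd rm 47               # remove the OSD from the cluster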

I would like my cluster to recover completely without me interfering. Can anyone give an educated guess as to what went wrong here? I can't find any reason why the cluster would just stop recovering.
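
For reference, the degraded/stuck figures in the timeline come from the usual status commands, i.e. roughly:

    ceph -s                     # overall cluster status, including the degraded percentage
    ceph health detail          # lists the stuck unclean / degraded PGs
    ceph pg dump_stuck unclean  # dumps only the PGs that are stuck unclean
    ceph osd tree               # shows osd.47 as down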

Thank you for any hints!
Niklas


[1] 4 OSD hosts with 15 disks each. On each of the 60 identical disks there is one OSD. I have one large pool with 6000 PGs and a replica size of 4, and 3 (default) pools with 64 PGs each (a sketch of that pool setup follows below).
[2] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
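
For anyone who wants to approximate the pool layout from [1], the large pool would have been created with something like the following; "bigpool" is just a placeholder, not the real pool name:

    ceph osd pool create bigpool 6000 6000   # pg_num and pgp_num of 6000
    ceph osd pool set bigpool size 4         # replica size 4, as described in [1]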