Recovering from multiple OSD failures

Hi Cephers,

I recently had a power problem and the entire cluster went down, came up, went down, and came up again. Afterward, 3 OSDs were mostly dead (HDD failures). Luckily (I think) the drives were alive enough that I could copy the data off and leave the journals alone.

Since my "data" pool has size 3... of course a couple of placement groups were only on those three drives.
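
(For what it's worth, something along these lines should list the pgs whose up/acting set was entirely on those three OSDs; the grep pattern is just a rough filter of my own, so take it with a grain of salt:)

  # match acting/up sets like [5,14,23] made up only of the failed OSDs
  ceph pg dump pgs_brief | grep -E '\[(5|14|23),(5|14|23),(5|14|23)\]'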

Now I've added 4 new OSDs, and everything has recovered except pg 0.f3. When I query that pg, I see the cluster is looking for osd.14 or osd.23 because one of them maybe_went_rw. (osd.5, osd.14, and osd.23 are now kaput and have been marked lost with "ceph osd lost --yes-i-really-mean-it".)
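
(Concretely, marking them lost and inspecting the pg was along the lines of:)

  ceph osd lost 5 --yes-i-really-mean-it
  ceph osd lost 14 --yes-i-really-mean-it
  ceph osd lost 23 --yes-i-really-mean-it

  # peering history, including the maybe_went_rw intervals
  ceph pg 0.f3 query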

Ceph indicates osd.29 is now the primary for pg 0.f3. I copied all the data into the appropriate PG directory, started osd.29 again, and here is where my question comes in:
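
(In case the exact steps matter: the copy was essentially of the following form. The paths assume the default FileStore layout, /mnt/failed-osd is just a placeholder for wherever I had the old disk mounted, and I tried to preserve xattrs; I'm not certain that's sufficient.)

  service ceph stop osd.29            # or: systemctl stop ceph-osd@29
  # -a archive, -H hard links, -X xattrs; /mnt/failed-osd is a placeholder
  rsync -aHX /mnt/failed-osd/current/0.f3_head/ /var/lib/ceph/osd/ceph-29/current/0.f3_head/
  service ceph start osd.29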

How do I convince the cluster that it's okay to bring 0.f3 'up' and backfill from osd.29 to the other OSDs? (I could even manually backfill osd.15 and osd.22, but I suspect the cluster will still think there's a problem.)
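
(In case it helps, these are the parts of the query output I keep staring at; field names are from memory, so forgive me if they're slightly off:)

  ceph pg 0.f3 query | grep -A 10 '"peering_blocked_by"'
  ceph pg 0.f3 query | grep -A 10 '"down_osds_we_would_probe"'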

'ceph health detail' shows this about 0.f3:
pg 0.f3 is incomplete, acting [29,22,15] (reducing pool data min_size from 2 may help; search ceph.com/docs for 'incomplete')
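
I assume that suggestion amounts to something like the following, which I haven't tried yet (and I'd set min_size back to 2 once the pg recovers):

  ceph osd pool get data min_size
  ceph osd pool set data min_size 1
  # ...wait for 0.f3 to peer and backfill...
  ceph osd pool set data min_size 2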

Thanks in advance!
-Aaron

