> [SNIP - bad drives]
Generally, when a disk is showing bad blocks to the OS, the drive has already
been remapping blocks in the background for ages, and the disk is really on
its last legs. It is a bit unlikely that you get so many disks dying at the
same time, though; the problem may have been silently worsening and was simply
not noticed until the OSDs had to restart due to the power loss.
If this is _very_ important data, I would recommend you start by taking the
bad drives out of operation and cloning each bad drive block by block onto a
good one using dd_rescue. It is also a good idea to keep an image of the disk
so you can try the different rescue methods several times. In the very worst
case, send the disk to a professional data recovery company.
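As a minimal sketch of the cloning step, using GNU ddrescue (device names and
paths below are placeholders; Garloff's dd_rescue has different syntax, so
check the man page of whichever tool you have installed):

    # first pass: copy everything that reads cleanly, keep a map file
    ddrescue -n /dev/sdX /mnt/rescue/bad-drive.img /mnt/rescue/bad-drive.map

    # second pass: retry the bad areas a few times with direct disc access
    ddrescue -d -r3 /dev/sdX /mnt/rescue/bad-drive.img /mnt/rescue/bad-drive.map

The map file lets you stop and resume, and the image can then be written to a
good drive or loop-mounted for the later recovery attempts.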
Once that is done, you have two options:

Try to make the OSD run again: xfs_repair, plus manually finding corrupt
objects (find + md5sum, looking for read errors) and deleting them has helped
me in the past. If you manage to get the OSD running, drain it by setting its
crush weight to 0, and eventually remove the disk from the cluster.
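A rough sketch of that approach, assuming a filestore OSD with id N mounted at
/var/lib/ceph/osd/ceph-N (the osd id, device and paths are placeholders; verify
the syntax against your release before running anything):

    # repair the filesystem while the osd is stopped and unmounted
    xfs_repair /dev/sdX1

    # read every object file once; unreadable objects will show up as
    # read errors from md5sum and in dmesg
    find /var/lib/ceph/osd/ceph-N/current -type f -exec md5sum {} \; > /dev/null

    # after removing the unreadable objects and getting the osd to start,
    # drain it by setting its crush weight to 0
    ceph osd crush reweight osd.N 0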
Alternatively, if you cannot get the OSD running again:

Use ceph-objectstore-tool to extract objects and inject them using a clean
node and OSD, as described in
http://ceph.com/geen-categorie/incomplete-pgs-oh-my/. Read the man page and
the help output for the tool; I think the arguments have changed slightly
since that blog post.
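Something along these lines, assuming filestore OSDs; the pg id, osd ids and
paths are placeholders, and the exact flags may differ between releases, so
check ceph-objectstore-tool --help first:

    # on the failed osd (stopped), export one pg to a file
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-OLD \
        --journal-path /var/lib/ceph/osd/ceph-OLD/journal \
        --pgid 1.2f --op export --file /mnt/rescue/1.2f.export

    # on the clean, temporary osd (also stopped), import it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NEW \
        --journal-path /var/lib/ceph/osd/ceph-NEW/journal \
        --op import --file /mnt/rescue/1.2f.export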
You may also run into read errors on corrupt objects, which will stop your
export. In that case, rm the offending object and rerun the export.
Repeat for all bad drives.
When doing the inject it is important that your cluster is operational and
able to accept objects from the draining drive, so either set the minimal
replication type (the crush failure domain) to OSD, or even better, add more
OSD nodes so you have an operational cluster (with missing objects).
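If you go the failure-domain route, one way to do it on recent releases
(Luminous and later; the rule and pool names are placeholders) is to create a
replicated crush rule with osd as the failure domain and point the pool at it:

    ceph osd crush rule create-replicated rescue-osd-rule default osd
    ceph osd pool set <poolname> crush_rule rescue-osd-rule

On older releases you would instead edit the crush map and change
"step chooseleaf firstn 0 type host" to "type osd" in the pool's rule.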
Also, I see in your log that you have os-prober testing all partitions. I tend
to remove os-prober on machines that do not dual-boot with another OS.
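On an apt-based system (assuming Debian/Ubuntu) that is just:

    apt-get purge os-prober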
Rules of thumb for future ceph clusters:

min_size = 2 is there for a reason; it should never be 1 unless data loss is
wanted. size = 3 if you need the cluster to keep operating with a drive or
node in an error state; size = 2 gives you more space, but the cluster will
block on errors until recovery is done. Better to be blocking than losing data.
If you have size = 3 and only 3 nodes and you lose a node, your cluster cannot
self-heal. You should have more nodes than you have set size to.
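For example, checking and setting these on an existing pool (the pool name is
a placeholder):

    ceph osd pool get <poolname> size
    ceph osd pool set <poolname> size 3
    ceph osd pool set <poolname> min_size 2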
Have free space on the drives; this is where data is replicated to in case of
a down node. If you have 4 nodes and you want to be able to lose one and still
operate, you need leftover room on the 3 remaining nodes to cover for the lost
one. The more nodes you have, the smaller the impact of a node failure is, and
the less spare room is needed. For a 4-node cluster you should not fill more
than about 66% if you want to be able to self-heal and keep operating.
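A rough worked example, assuming 4 equal nodes of 10 TB each (the numbers are
made up):

    raw capacity:               4 x 10 TB = 40 TB
    capacity with 1 node down:  3 x 10 TB = 30 TB
    30 / 40 = 75%

So 75% is the hard ceiling for surviving one node loss, and staying around 66%
leaves some working headroom for the recovery traffic and near-full warnings.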
Good luck
Ronny Aasen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com