> [SNIP - bad drives]
Generally, when a disk is showing bad blocks to the OS, the drive has already
been remapping blocks in the background for ages, and the disk is really on
its last legs. It is a bit unlikely that you get so many disks dying at the
same time, though; the problem may have been silently worsening and was simply
not noticed until the OSDs had to restart due to the power loss.
If this is _very_ important data, I would recommend you start by taking the
bad drives out of operation and cloning each bad drive block by block onto a
good one using dd_rescue. It is also a good idea to keep an image of the disk
so you can try the different rescue methods several times. In the very worst
case, send the disk to a professional data recovery company.
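As a minimal sketch of the cloning step, using GNU ddrescue (device names and
paths below are placeholders; Garloff's dd_rescue has different syntax, so
check the man page of whichever tool you have installed):

    # first pass: copy everything that reads cleanly, keep a map file
    ddrescue -n /dev/sdX /mnt/rescue/bad-drive.img /mnt/rescue/bad-drive.map

    # second pass: retry the bad areas a few times with direct disc access
    ddrescue -d -r3 /dev/sdX /mnt/rescue/bad-drive.img /mnt/rescue/bad-drive.map

The map file lets you stop and resume, and the image can then be written to a
good drive or loop-mounted for the later recovery attempts.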
Once that is done, you have two options:

Try to make the OSD run again: xfs_repair, plus manually finding corrupt
objects (find + md5sum, looking for read errors) and deleting them has helped
me in the past. If you manage to get the OSD running, drain it by setting its
crush weight to 0, and eventually remove the disk from the cluster.
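A rough sketch of that approach, assuming a filestore OSD with id N mounted at
/var/lib/ceph/osd/ceph-N (the osd id, device and paths are placeholders; verify
the syntax against your release before running anything):

    # repair the filesystem while the osd is stopped and unmounted
    xfs_repair /dev/sdX1

    # read every object file once; unreadable objects will show up as
    # read errors from md5sum and in dmesg
    find /var/lib/ceph/osd/ceph-N/current -type f -exec md5sum {} \; > /dev/null

    # after removing the unreadable objects and getting the osd to start,
    # drain it by setting its crush weight to 0
    ceph osd crush reweight osd.N 0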
Alternatively, if you cannot get the OSD running again:

Use ceph-objectstore-tool to extract objects and inject them using a clean
node and OSD, as described in
http://ceph.com/geen-categorie/incomplete-pgs-oh-my/. Read the man page and
the help output for the tool; I think the arguments have changed slightly
since that blog post.
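Something along these lines, assuming filestore OSDs; the pg id, osd ids and
paths are placeholders, and the exact flags may differ between releases, so
check ceph-objectstore-tool --help first:

    # on the failed osd (stopped), export one pg to a file
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-OLD \
        --journal-path /var/lib/ceph/osd/ceph-OLD/journal \
        --pgid 1.2f --op export --file /mnt/rescue/1.2f.export

    # on the clean, temporary osd (also stopped), import it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NEW \
        --journal-path /var/lib/ceph/osd/ceph-NEW/journal \
        --op import --file /mnt/rescue/1.2f.export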
You may also run into read errors on corrupt objects, which will stop your
export. In that case, rm the offending object and rerun the export.
Repeat for all bad drives.
When doing the inject it is important that your cluster is operational and
able to accept objects from the draining drive, so either set the minimal
replication type (the crush failure domain) to OSD, or even better, add more
OSD nodes so you have an operational cluster (with missing objects).
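If you go the failure-domain route, one way to do it on recent releases
(Luminous and later; the rule and pool names are placeholders) is to create a
replicated crush rule with osd as the failure domain and point the pool at it:

    ceph osd crush rule create-replicated rescue-osd-rule default osd
    ceph osd pool set <poolname> crush_rule rescue-osd-rule

On older releases you would instead edit the crush map and change
"step chooseleaf firstn 0 type host" to "type osd" in the pool's rule.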
Also, I see in your log that you have os-prober testing all partitions. I tend
to remove os-prober on machines that do not dual-boot with another OS.
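On an apt-based system (assuming Debian/Ubuntu) that is just:

    apt-get purge os-prober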
Rules of thumb for future ceph clusters:

min_size = 2 is there for a reason; it should never be 1 unless data loss is
wanted. size = 3 if you need the cluster to keep operating with a drive or
node in an error state; size = 2 gives you more space, but the cluster will
block on errors until recovery is done. Better to be blocking than losing data.
If you have size = 3 and only 3 nodes and you lose a node, your cluster cannot
self-heal. You should have more nodes than you have set size to.
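For example, checking and setting these on an existing pool (the pool name is
a placeholder):

    ceph osd pool get <poolname> size
    ceph osd pool set <poolname> size 3
    ceph osd pool set <poolname> min_size 2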
Have free space on the drives; this is where data is replicated to in case of
a down node. If you have 4 nodes and you want to be able to lose one and still
operate, you need leftover room on the 3 remaining nodes to cover for the lost
one. The more nodes you have, the smaller the impact of a node failure is, and
the less spare room is needed. For a 4-node cluster you should not fill more
than about 66% if you want to be able to self-heal and keep operating.
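A rough worked example, assuming 4 equal nodes of 10 TB each (the numbers are
made up):

    raw capacity:               4 x 10 TB = 40 TB
    capacity with 1 node down:  3 x 10 TB = 30 TB
    30 / 40 = 75%

So 75% is the hard ceiling for surviving one node loss, and staying around 66%
leaves some working headroom for the recovery traffic and near-full warnings.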
Good luck
Ronny Aasen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com