lots of files deleted from OSD

Linux Chips <linux.chips@xxxxxxxxx> · Wed, 30 May 2018 12:42:54 +0300

Hi,

Our production cluster has been sabotaged. Someone ran

# rm -rf /var/lib/ceph/*

on all our cluster servers :(

We got the command killed, but it had already deleted lots of files from
one OSD on each node. On some nodes it finished the first OSD and started
on the second.
All in all, we have about 50 OSDs holding anywhere from no data to
partial data.

We replaced all the touched disks with new disks, in the hope that we can
recover some data from the old ones, and we let the cluster recover.

All disks use the XFS file system. We have replicated and erasure-coded
pools. Almost all the affected disks are missing the "omap" dir. Lots of
PGs on those disks have correct file counts and overall size, but we are
unable to export them due to the missing omap.
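For triage across the pulled disks, a rough sketch like the one below
could report whether the omap dir is present and how many files and KiB
each PG directory still holds. It assumes the usual FileStore layout
(PG data under <osd>/current/<pgid>_head, omap under <osd>/current/omap);
the function name summarize_osd is made up for illustration, and it only
reads the disk.

```bash
#!/usr/bin/env bash
# Summarise a mounted FileStore OSD data directory (read-only).
# Assumption: PG data lives in <osd>/current/<pgid>_head and the
# omap store in <osd>/current/omap.
summarize_osd() {
  local osd=$1 pgdir files kib
  if [ -d "$osd/current/omap" ]; then
    echo "omap: present"
  else
    echo "omap: MISSING"
  fi
  for pgdir in "$osd"/current/*_head; do
    [ -d "$pgdir" ] || continue            # glob matched nothing
    # Count regular files at any depth (PG dirs nest into DIR_* subdirs).
    files=$(find "$pgdir" -type f | wc -l | tr -d ' ')
    # Disk usage of the whole PG directory in KiB.
    kib=$(du -sk "$pgdir" | cut -f1)
    printf '%s files=%d kib=%d\n' "$(basename "$pgdir" _head)" "$files" "$kib"
  done
}
```

Running it as "summarize_osd /mnt/old-osd-12" (a hypothetical mount
point) prints one line per PG, which can then be compared against the
counts the cluster expects.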

We had 43 replicated PGs in the "stale" state. We recreated them, and we
plan to extract the objects from the disks and do "rados put". Not sure
if that is going to be flawless or not: I have only tried this in the
past on failed disks with a few missing objects, not on an empty PG.
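The extraction step could look roughly like the dry run below, which only
prints the "rados put" commands it would run. It rests on assumptions
about FileStore file naming that should be checked against a healthy OSD
first: that head objects are stored as <name>__head_<hash>__<pool>, and
that the name field escapes "_" as "\u", "/" as "\s", "\" as "\\" and a
leading "." as "\.". Objects with a locator key or namespace, snapshot
clones, and very long (hashed) names do not match that pattern and are
skipped. The function names and the pool name are made up.

```bash
#!/usr/bin/env bash
# Decode the name field of a FileStore file name into a RADOS object name.
# Assumption: the name field ends at the first raw '_' because real
# underscores inside the name are stored escaped as "\u".
decode_object_name() {
  local field=${1%%_*} out="" i ch nxt
  for ((i = 0; i < ${#field}; i++)); do
    ch=${field:i:1}
    if [[ $ch == "\\" ]]; then
      i=$((i + 1))
      nxt=${field:i:1}
      case $nxt in
        u) out+="_" ;;
        s) out+="/" ;;
        "\\") out+="\\" ;;
        .) out+="." ;;
        *) out+="\\$nxt" ;;   # unknown escape: keep it verbatim
      esac
    else
      out+=$ch
    fi
  done
  printf '%s\n' "$out"
}

# Print (do not run) one "rados put" per head object found under a PG dir.
emit_rados_puts() {
  local pool=$1 pgdir=$2 f name
  find "$pgdir" -type f -name '*__head_*' -print0 |
    while IFS= read -r -d '' f; do
      name=$(decode_object_name "$(basename "$f")")
      printf 'rados -p %q put %q %q\n' "$pool" "$name" "$f"
    done
}
```

For example, "emit_rados_puts mypool /mnt/old-osd-12/current/3.1f_head"
(hypothetical pool and path) would list the commands for review. As far
as I know "rados put" writes only the object data, so xattrs and omap
entries would not come back this way.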

We also have 156 incomplete PGs; those are erasure-coded PGs. For most of
them the missing shards are intact on the extracted disks, but I have not
managed to find a proper way to put them back in the new OSDs.

Using recovery tools, we were unable to recover the omap files.
Unfortunately, as I understand it, XFS is not the friendliest fs in terms
of undelete.

Any ideas are appreciated.

And I was wondering, can we recreate those omaps, at least for the PGs
we need to export? Or put them in some form that the objectstore tool
can import?

Thanks