Re: xfs corruption, data disaster!

Christopher Kunz <chrislist@xxxxxxxxxxx> · Mon, 04 May 2015 15:11:46 +0200

On 04.05.15 at 09:00, Yujian Peng wrote:
> Hi,
> I'm encountering a data disaster. I have a ceph cluster with 145 OSDs. The
> data center had a power problem yesterday, and all of the ceph nodes went down.
> Now I find that 6 disks (xfs) in 4 nodes have data corruption. Some disks
> cannot be mounted, and some disks have IO errors in syslog.
> 	mount: Structure needs cleaning
> 	xfs_log_force: error 5 returned
> I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd
> reported a leveldb error:
> 	Error initializing leveldb: Corruption: checksum mismatch
> I cannot start the 6 osds and 22 pgs are down.
> This is really a tragedy for me. Can you give me some ideas on how to recover
> the xfs? Thanks very much!

We had a similar issue last year. We ended up building a new Ceph
cluster and manually importing all objects in a tedious, one-week process.

The folks at Inktank were invaluable, providing us with the tools to
recover every object from the broken cluster (we did not lose one single
object due to corruption!), but without a support contract, we would
have been lost.

I know this is a community list and commercial offers would usually be
frowned upon, but this is the best advice I can give: if what you are
running is a production cluster, you should contact an Inktank/Red Hat
representative and work out whether and how they can assist you with
recovery. I am not sure that there are a lot of other options to get
your data back.

On the upside, you can be fairly sure that, although your cluster is
totally lost now, most if not all objects will be recoverable.

Sorry I couldn't be of more help, but that's how we experienced this issue.

Regards,

--ck
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com