Re: How can we repair OSD leveldb?

We have a hammer cluster that experienced a similar power failure, which ended up corrupting our monitors' leveldb stores. I am still trying to repair ours, but I can give you a few tips that seem to help.

1.)  I would copy the database off to somewhere safe right away. Just opening it seems to change it.  

2.) Check out the ceph-test tools (ceph-objectstore-tool, ceph-kvstore-tool, ceph-osdmap-tool, etc.). They let you list the keys/data in your OSD leveldb, possibly export them, and get some bearings on what you need to do to recover your map (see the example commands after this list).

3.) I am making a few assumptions here: a.) you are using replication for your pools, and b.) you are using either S3 or RBD, not CephFS. From here, worst case, chances are your data is recoverable without the OSD and monitor leveldb stores, so long as the rest of the data is okay (the actual RADOS objects spread across each OSD in /var/lib/ceph/osd/ceph-*/current/blah_head).
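
For 1.) and 2.), here is roughly what I mean. The paths assume the standard /var/lib/ceph layout and osd.1, and the exact tool arguments vary a bit between releases, so check --help on your versions first:

# 1.) with the OSD stopped, copy the omap leveldb somewhere safe before
#     touching it (just opening it with the tools can modify it)
cp -a /var/lib/ceph/osd/ceph-1/current/omap /root/osd-1-omap-backup-$(date +%F)

# 2.) poke at the stores with the ceph-test tools.
# List the keys in the OSD omap leveldb (newer releases also want the backend
# type, e.g. "leveldb", as the first argument):
ceph-kvstore-tool /var/lib/ceph/osd/ceph-1/current/omap list | head

# List the objects the filestore knows about:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --journal-path /var/lib/ceph/osd/ceph-1/journal --op list | head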

If you use RBD, there is a tool out there that lets you recover your RBD images: https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool
We only use S3 but this seems to be doable as well:

As an example, we have a 9 MB file that was stored in Ceph.
I ran a find across all of the OSDs in my cluster and compiled a list of files:

find /var/lib/ceph/osd/ceph-*/current/ -type f -iname \*this_is_my_File\.gzip\*

From there I ended up with a list that looks like the following:

This is the head. It's usually the bucket.id\file__head__

default.20283.1\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam__head_CA57D598__1
[__A________]\[_B________________________________________________________________________].[__C______________]

default.20283.1\u\umultipart\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1__head_C338075C__1
[__A________]\[_D_______]\[__B______________________________________________________________________].[__C__________________________________________________]

And for each of those you'll have matching shadow files:
default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u1__head_02F05634__1
[__A________]\[_E______]\[__B_______________________________________________________________________].[__C____________________________________________________]

Here is another part of the multipart (this file only had 1 multipart piece; we use multipart for every file larger than 5 MB, irrespective of how much larger):

default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u2__head_1EA07BDF__1
[__A________]\[_E______]\[__B_______________________________________________________________________].[__C____________________________________________________]
^^ notice the different part number here (\u2 instead of \u1).

A is the bucket.id and is the same for every object in the same bucket. Even if you don't know what the bucket id for your bucket is, you should be able to tell with good certainty which is which after you review your list.

B is our object name. We generate UUIDs for each object, so I cannot be certain how much of this is Ceph and how much is us, but the tail of your object name should exist and be the same across all of your parts.

C is the suffix for each object. From here you may have suffixes like the above.

D is your upload chunks.

E is your shadow chunks for each part of the multipart (I think).
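
To make sense of those names: as far as I can tell the filestore escapes '_' as '\u' and '/' as '\s' in the on-disk file names, so you can decode the find output and sort it to see which head/multipart/shadow pieces belong to the same object. A rough sketch (the object name pattern is a placeholder again):

# decode the filestore escaping in the find output (\u -> _, \s -> /) and
# sort, so the head, multipart and shadow pieces of one object group together.
# This is just for eyeballing what belongs where; keep the raw find output
# around, since those are the paths you'll actually copy from.
find /var/lib/ceph/osd/ceph-*/current/ -type f -iname '*this_is_my_File.gzip*' \
    | sed -e 's/\\u/_/g' -e 's|\\s|/|g' \
    | sort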

I'm sure it's much more complicated than that, but that's what worked for me. From here I just scanned through all of my OSDs, slowly pulled all of the individual parts via ssh, and concatenated them into their respective files. So far the md5 sums match the md5s we recorded before uploading the files to Ceph in the first place.
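
The pulling/concatenating step looked roughly like the following. The hostname, paths, and part file names are all placeholders here (take the real ones from your find output), and you need to keep the parts in upload order:

# grab the pieces of one object from an OSD node; repeat per node/OSD that
# holds parts ('osd-node-1', the PG dir and the glob are placeholders)
mkdir -p /recovery/my_object
scp 'osd-node-1:/var/lib/ceph/osd/ceph-3/current/11.2a_head/*my_object*' /recovery/my_object/

# concatenate the head plus the multipart/shadow pieces in part order
# (part.head, part.1, part.2 are placeholder names), then compare against the
# md5 recorded before the object was uploaded
cat part.head part.1 part.2 > my_object.recovered
md5sum my_object.recovered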

We have a python tool to do this but it's kind of specific to us. I can ask the author and see if I can post a gist of the code if that helps. Please let me know. 



I can't speak for CephFS, unfortunately, as we do not use it, but I wouldn't be surprised if it is similar. If you set up SSH keys across all of your OSD nodes, you should be able to export all of the data to another server/cluster/etc.
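
For that kind of bulk export, with the SSH keys in place something as simple as an rsync per OSD works; 'backup-host' and the destination path here are placeholders:

# copy an entire OSD's object store off to another machine for safekeeping
# (run for each OSD on each node; adjust the OSD id and destination)
rsync -a --progress /var/lib/ceph/osd/ceph-1/current/ backup-host:/backup/node1-osd-1-current/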


I am working on trying to rebuild the leveldb for our monitors with the correct keys/values, but I have a feeling this is going to be a long way off. I wouldn't be surprised if the leveldb structure for the mon database is similar to the OSD omap database.

On Wed, Aug 17, 2016 at 4:54 PM, Dan Jakubiec <dan.jakubiec@xxxxxxxxx> wrote:
Hi Wido,

Thank you for the response:

> On Aug 17, 2016, at 16:25, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
>> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec <dan.jakubiec@xxxxxxxxx>:
>>
>>
>> Hello, we have a Ceph cluster with 8 OSD that recently lost power to all 8 machines.  We've managed to recover the XFS filesystems on 7 of the machines, but the OSD service is only starting on 1 of them.
>>
>> The other 5 machines all have complaints similar to the following:
>>
>>      2016-08-17 09:32:15.549588 7fa2f4666800 -1 filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb : Corruption: 6 missing files; e.g.: /var/lib/ceph/osd/ceph-1/current/omap/042421.ldb
>>
>> How can we repair the leveldb to allow the OSDs to startup?
>>
>
> My first question would be: How did this happen?
>
> What hardware are you using underneath? Is there a RAID controller which is not flushing properly? Since this should not happen during a power failure.
>

Each OSD drive is connected to an onboard hardware RAID controller and configured in RAID 0 mode as individual virtual disks.  The RAID controller is an LSI 3108.

I agree -- I am finding it bizarre that 7 of our 8 OSDs (one per machine) did not survive the power outage.

We did have some problems with the stock Ubuntu xfs_repair (3.1.9) seg faulting, which we eventually overcame by building a newer version of xfs_repair (4.7.0). But it did finally repair clean.

We actually have some different errors on other OSDs.  A few of them are failing with "Missing map in load_pgs" errors.  But generally speaking it appears to be missing files of various types causing different kinds of failures.

I'm really nervous now about the OSD's inability to start with any inconsistencies and no repair utilities (that I can find).  Any advice on how to recover?

> I don't know the answer to your question, but lost files are not good.
>
> You might find them in a lost+found directory if XFS repair worked?
>

Sadly this directory is empty.

-- Dan

> Wido
>
>> Thanks,
>>
>> -- Dan J
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
