It is no more than a three-line script. You will also need leveldb's code (the Python bindings) in your working directory:

```
#!/usr/bin/python2
import leveldb
# run from the OSD's data directory: repairs the omap leveldb in ./omap
leveldb.RepairDB('./omap')
```

I totally agree that we need more repair tools to be officially
available, and also tools that provide better insight into components
that are a "black box" for the operator right now, i.e. the journal.
(See the PS at the bottom for a rough sketch of the object-vs-replica
comparison mentioned further down.)

On 24 October 2016 at 19:36, Dan Jakubiec <dan.jakubiec@xxxxxxxxx> wrote:
> Thanks Kostis, great read.
>
> We also had a Ceph disaster back in August, and a lot of this experience looked familiar. Sadly, in the end we were not able to recover our cluster, but glad to hear that you were successful.
>
> LevelDB corruptions were one of our big problems. Your note below about running RepairDB from Python is interesting. At the time we were looking for a Ceph tool to run LevelDB repairs in order to get our OSDs back up and couldn't find one. I felt this was something that should be in the standard toolkit.
>
> Would be great to see this added some day, but in the meantime I will remember this option exists. If you still have the Python script, perhaps you could post it as an example?
>
> Thanks!
>
> -- Dan
>
>
>> On Oct 20, 2016, at 01:42, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>
>> We pulled leveldb from upstream and fired leveldb.RepairDB against the
>> OSD omap directory using a simple python script. Ultimately, that
>> didn't move things forward. We resorted to checking every object's
>> timestamp/md5sum/attributes on the crashed OSD against the replicas in
>> the cluster, and in the end we discarded the journal, once we had
>> concluded with as much confidence as possible that we would not
>> lose data.
>>
>> It would have been really useful at that moment to have a tool to inspect
>> the contents of the crashed OSD's journal and limit the scope of the
>> verification process.
>>
>> On 20 October 2016 at 08:15, Goncalo Borges
>> <goncalo.borges@xxxxxxxxxxxxx> wrote:
>>> Hi Kostis...
>>> That is a tale from the dark side. Glad you recovered it and that you were willing to doc it all up and share it. Thank you for that.
>>> Can I also ask which tool you used to recover the leveldb?
>>> Cheers
>>> Goncalo
>>> ________________________________________
>>> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
>>> Sent: 20 October 2016 09:09
>>> To: ceph-users
>>> Subject: Surviving a ceph cluster outage: the hard way
>>>
>>> Hello cephers,
>>> this is the blog post on the Ceph cluster outage we experienced some
>>> weeks ago and on how we managed to revive the cluster and our
>>> clients' data.
>>>
>>> I hope it will prove useful for anyone who finds himself/herself
>>> in a similar position. Thanks to everyone on the ceph-users and
>>> ceph-devel lists who contributed to our inquiries during
>>> troubleshooting.
>>>
>>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>>>
>>> Regards,
>>> Kostis
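
PS: since the object-by-object verification against replicas comes up below, here is a rough sketch of the kind of comparison meant there. This is not the exact script we used; the paths and PG directory names are placeholders, and the extended-attribute check is left out (it would need the xattr module or a call out to getfattr). It simply walks two copies of the same PG directory and flags objects whose md5sum or mtime differ:

```
#!/usr/bin/python2
# Rough sketch only: compare the objects of one PG as stored on the crashed
# OSD against the same PG on a healthy replica. Both paths are placeholders.
import hashlib
import os

CRASHED = '/var/lib/ceph/osd/ceph-12/current/1.2f_head'  # hypothetical path
REPLICA = '/mnt/osd-7-copy/current/1.2f_head'             # hypothetical path

def md5sum(path, blocksize=4 * 1024 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

def index(root):
    # map each object's path relative to the PG dir -> absolute path
    files = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = full
    return files

crashed, replica = index(CRASHED), index(REPLICA)

for rel in sorted(set(crashed) | set(replica)):
    if rel not in crashed:
        print('only on replica: %s' % rel)
    elif rel not in replica:
        print('only on crashed OSD: %s' % rel)
    else:
        a, b = crashed[rel], replica[rel]
        if md5sum(a) != md5sum(b):
            print('md5 mismatch: %s' % rel)
        elif int(os.stat(a).st_mtime) != int(os.stat(b).st_mtime):
            print('mtime mismatch: %s' % rel)
```

Anything such a script flags is what you would then have to reason about before deciding that discarding a journal is safe.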