On Thu, Oct 27, 2016 at 1:26 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> It is no more than a three-line script. You will also need leveldb's
> code (the Python bindings) in your working directory:
>
> ```
> #!/usr/bin/python2
>
> import leveldb
>
> # Repair the leveldb database in ./omap (the OSD's omap directory)
> leveldb.RepairDB('./omap')
> ```
>
> I totally agree that we need more repair tools to be officially
> available, and also tools that provide better insight into components
> that are a "black box" for the operator right now, i.e. the journal.
>
> On 24 October 2016 at 19:36, Dan Jakubiec <dan.jakubiec@xxxxxxxxx> wrote:
>> Thanks Kostis, great read.
>>
>> We also had a Ceph disaster back in August, and a lot of this experience looked familiar. Sadly, in the end we were not able to recover our cluster, but I am glad to hear that you were successful.
>>
>> LevelDB corruptions were one of our big problems. Your note below about running RepairDB from Python is interesting. At the time we were looking for a Ceph tool to run LevelDB repairs in order to get our OSDs back up and couldn't find one. I felt this is something that should be in the standard toolkit.
>>
>> It would be great to see this added some day, but in the meantime I will remember this option exists. If you still have the Python script, perhaps you could post it as an example? I just logged this feature request at http://tracker.ceph.com/issues/17730, so we don't forget it!
>>
>> Thanks!
>>
>> -- Dan
>>
>>
>>> On Oct 20, 2016, at 01:42, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>
>>> We pulled leveldb from upstream and fired leveldb.RepairDB against the
>>> OSD omap directory using a simple Python script. Ultimately, that
>>> didn't move things forward. We resorted to checking every object's
>>> timestamp/md5sum/attributes on the crashed OSD against the replicas in
>>> the cluster, and in the end chose to discard the journal, once we had
>>> concluded with as much confidence as possible that we would not lose
>>> data.
>>>
>>> It would have been really useful at that moment to have a tool to
>>> inspect the contents of the crashed OSD's journal and limit the scope
>>> of the verification process.
>>>
>>> On 20 October 2016 at 08:15, Goncalo Borges
>>> <goncalo.borges@xxxxxxxxxxxxx> wrote:
>>>> Hi Kostis...
>>>> That is a tale from the dark side. Glad you recovered it and that you were willing to write it all up and share it. Thank you for that.
>>>> Can I also ask which tool you used to recover the leveldb?
>>>> Cheers
>>>> Goncalo
>>>> ________________________________________
>>>> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
>>>> Sent: 20 October 2016 09:09
>>>> To: ceph-users
>>>> Subject: Surviving a ceph cluster outage: the hard way
>>>>
>>>> Hello cephers,
>>>> this is the blog post on the Ceph cluster outage we experienced some
>>>> weeks ago and on how we managed to revive the cluster and our
>>>> clients' data.
>>>>
>>>> I hope it will prove useful to anyone who finds himself/herself
>>>> in a similar position. Thanks to everyone on the ceph-users and
>>>> ceph-devel lists who contributed to our inquiries during
>>>> troubleshooting.
>>>>
>>>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>>>>
>>>> Regards,
>>>> Kostis

--
Regards
Kefu Chai
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
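Kostis mentions above that, when RepairDB did not help, they compared every object's timestamp/md5sum/attributes on the crashed OSD against the replicas before deciding to discard the journal. The sketch below illustrates that kind of comparison for a FileStore OSD whose objects are plain files under the OSD's current/ directory; it is not the tool used in the thread, the two paths are hypothetical examples, and a real verification would also have to cover xattrs and omap data.

```
#!/usr/bin/python2
# Hedged sketch: walk a crashed OSD's data directory and compare each
# object file's md5sum and mtime against the same relative path on a
# replica. The two paths below are hypothetical examples.

import hashlib
import os

CRASHED = '/var/lib/ceph/osd/ceph-12/current'   # crashed OSD (example path)
REPLICA = '/mnt/replica-osd/current'            # replica copy (example path)

def md5(path, chunk=1 << 20):
    # Stream the file in 1 MiB chunks to avoid loading large objects in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

for root, _, files in os.walk(CRASHED):
    for name in files:
        src = os.path.join(root, name)
        rel = os.path.relpath(src, CRASHED)
        dst = os.path.join(REPLICA, rel)
        if not os.path.exists(dst):
            print('MISSING on replica: %s' % rel)
        elif md5(src) != md5(dst):
            print('MD5 MISMATCH:       %s' % rel)
        elif int(os.stat(src).st_mtime) != int(os.stat(dst).st_mtime):
            print('MTIME differs:      %s' % rel)
```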