We pulled leveldb from upstream and ran leveldb.RepairDB against the OSD omap directory using a simple Python script (a rough sketch of such a RepairDB call appears after this message). Ultimately, that did not move things forward. We then resorted to checking every object's timestamp/md5sum/attributes on the crashed OSD against its replicas in the cluster, and finally chose to discard the journal, once we had concluded with as much confidence as possible that we would not lose data. A tool to inspect the contents of the crashed OSD's journal would have been really useful at that point to limit the scope of the verification process.

On 20 October 2016 at 08:15, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi Kostis...
> That is a tale from the dark side. Glad you recovered it and that you were willing to doc it all up and share it. Thank you for that.
> Can I also ask which tool you used to recover the leveldb?
> Cheers
> Goncalo
> ________________________________________
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
> Sent: 20 October 2016 09:09
> To: ceph-users
> Subject: Surviving a ceph cluster outage: the hard way
>
> Hello cephers,
> this is the blog post on the outage our Ceph cluster experienced some
> weeks ago and on how we managed to revive the cluster and our
> clients' data.
>
> I hope it will prove useful for anyone who finds himself/herself
> in a similar position. Thanks to everyone on the ceph-users and
> ceph-devel lists who contributed to our inquiries during
> troubleshooting.
>
> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>
> Regards,
> Kostis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
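
For reference, the RepairDB step described above boils down to something like the minimal sketch below. It assumes the py-leveldb Python bindings and a default FileStore omap path; both the binding choice and the path are assumptions, not the exact script that was used. Run it only against a copy of the omap directory, with the OSD stopped.

    #!/usr/bin/env python
    # Minimal sketch: attempt a LevelDB repair of an OSD's omap store.
    # The default path below is only an example for a FileStore OSD (osd.0);
    # pass the real omap directory as the first argument.
    import sys

    import leveldb  # py-leveldb bindings (assumed)

    omap_dir = sys.argv[1] if len(sys.argv) > 1 else '/var/lib/ceph/osd/ceph-0/current/omap'

    # RepairDB tries to rebuild the database's metadata from the SST/log
    # files so the store can be opened again; some data may be lost.
    leveldb.RepairDB(omap_dir)
    print('RepairDB finished for %s' % omap_dir)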