Thanks Kostis, great read. We also had a Ceph disaster back in August,
and a lot of this experience looked familiar. Sadly, in the end we were
not able to recover our cluster, but I'm glad to hear that you were
successful.

LevelDB corruptions were one of our big problems. Your note below about
running RepairDB from Python is interesting. At the time we were looking
for a Ceph tool to run LevelDB repairs in order to get our OSDs back up
and couldn't find one. I felt this was something that should be in the
standard toolkit. It would be great to see it added some day, but in the
meantime I will remember that this option exists. If you still have the
Python script, perhaps you could post it as an example?

Thanks!

-- Dan

> On Oct 20, 2016, at 01:42, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>
> We pulled leveldb from upstream and ran leveldb.RepairDB against the
> OSD omap directory using a simple Python script. Ultimately, that
> didn't move things forward. We resorted to checking every object's
> timestamp/md5sum/attributes on the crashed OSD against the replicas in
> the cluster, and in the end took the route of discarding the journal,
> once we had concluded with as much confidence as possible that we
> would not lose data.
>
> It would have been really useful at that point to have a tool to
> inspect the crashed OSD's journal contents and limit the scope of the
> verification process.
>
> On 20 October 2016 at 08:15, Goncalo Borges
> <goncalo.borges@xxxxxxxxxxxxx> wrote:
>> Hi Kostis...
>>
>> That is a tale from the dark side. I'm glad you recovered it and that
>> you were willing to write it all up and share it. Thank you for that.
>>
>> Can I also ask which tool you used to recover the leveldb?
>>
>> Cheers
>> Goncalo
>>
>> ________________________________________
>> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
>> Sent: 20 October 2016 09:09
>> To: ceph-users
>> Subject: Surviving a ceph cluster outage: the hard way
>>
>> Hello cephers,
>> this is a blog post about the outage our Ceph cluster experienced some
>> weeks ago and how we managed to revive the cluster and our clients'
>> data.
>>
>> I hope it proves useful for anyone who finds themselves in a similar
>> position. Thanks to everyone on the ceph-users and ceph-devel lists
>> who contributed to our inquiries during troubleshooting.
>>
>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>>
>> Regards,
>> Kostis
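
For reference, a script along the lines Kostis describes needs only a
few lines. Below is a minimal sketch, assuming the py-leveldb bindings
(the "leveldb" package on PyPI) and the standard filestore omap path;
Kostis's actual script was never posted, so the path and structure here
are illustrative only:

    #!/usr/bin/env python
    # Minimal sketch: run a LevelDB repair against an OSD omap directory
    # using the py-leveldb bindings. Stop the OSD daemon and back up the
    # omap directory before attempting anything like this.
    import sys

    import leveldb

    if len(sys.argv) != 2:
        # The path below assumes the filestore layout; adjust for your setup.
        sys.exit('usage: %s /var/lib/ceph/osd/ceph-N/current/omap' % sys.argv[0])

    omap_dir = sys.argv[1]
    leveldb.RepairDB(omap_dir)  # rebuilds the SST/MANIFEST files in place
    print('RepairDB finished for %s' % omap_dir)

Note that, as Kostis found, a RepairDB run that completes without error
does not guarantee the recovered omap contents are consistent with the
rest of the cluster, so the cross-replica verification step he
describes still applies.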