Re: ceph-osd not starting after network related issues

Paul Emmerich <paul.emmerich@xxxxxxxx> · Wed, 3 Jul 2019 15:47:32 +0200

For anyone reading this in the future from a Google search: please don't set osd_find_best_info_ignore_history_les unless you know exactly what you are doing.
That's a really dangerous option and should be a last resort. It will almost definitely lead to some data loss or inconsistencies (lost writes).

However, it is unfortunately sometimes required to do that when running with min_size 1 (which you also should never do if you care about your data).

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 3, 2019 at 8:52 AM Ian Coetzee <ceph@xxxxxxxxxxxxxxxxx> wrote:
Hi All,

Some feedback from my end. I managed to recover the "lost data" from one of the other OSDs. It seems my initial summary was a bit off: the PGs were replicated, and Ceph just wanted to confirm that the objects were still relevant.

For future reference, I basically marked the OSD as lost:

> ceph osd lost <id>

Then the PGs went into an incomplete state.

After that I temporarily set an option on the OSDs to ignore the history (osd_find_best_info_ignore_history_les). I got the info from http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/017270.html

After that Ceph was happy and started to rebalance the cluster. Phew, crisis averted.
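
The steps above can be sketched as a dry run that only prints command lines and executes nothing. The OSD IDs are hypothetical, and the use of `ceph config set` / `ceph config rm` (available from Mimic onwards) is an assumption: the thread does not say how the option was applied or whether an OSD restart or re-peer was needed for it to take effect. Note the warning at the top of this thread before using the option at all.

```python
import shlex
from typing import List

# Option name as given in the thread.
OPTION = "osd_find_best_info_ignore_history_les"


def recovery_commands(lost_osd: int, peer_osds: List[int]) -> List[List[str]]:
    """Build (but do not run) the command lines for the steps described above."""
    cmds = [
        # Step 1: declare the dead OSD lost (the CLI asks for the confirmation flag).
        ["ceph", "osd", "lost", str(lost_osd), "--yes-i-really-mean-it"],
    ]
    # Step 2: temporarily set the override on the surviving OSDs.
    # ASSUMPTION: `ceph config set` is used; the thread does not say how
    # the option was actually applied.
    for osd in peer_osds:
        cmds.append(["ceph", "config", "set", f"osd.{osd}", OPTION, "true"])
    # Step 3: remove the override again once the PGs have recovered.
    for osd in peer_osds:
        cmds.append(["ceph", "config", "rm", f"osd.{osd}", OPTION])
    return cmds


def render(cmds: List[List[str]]) -> List[str]:
    """Turn argument lists into shell-safe strings for review."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in cmds]


if __name__ == "__main__":
    # Hypothetical IDs: OSD 7 is the lost one, OSDs 1 and 2 hold the incomplete PGs.
    for line in render(recovery_commands(7, [1, 2])):
        print(line)
```

Printing the commands first makes it easy to review the sequence, and in particular to check that the override is removed again afterwards, before anything touches the cluster.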

This failure did, however, convince me to increase our replication from 2:1 to 3:2 (size:min_size), sacrificing usable space for reliability.
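
To put rough numbers on that trade-off, here is a small sketch that reads "2:1" and "3:2" as size:min_size for a replicated pool. The 12 TB raw capacity is a made-up figure, and the calculation ignores filesystem overhead and the usual free-space headroom.

```python
from fractions import Fraction


def usable_fraction(size: int) -> Fraction:
    """Fraction of raw capacity left for data when every object is stored `size` times."""
    return Fraction(1, size)


def replicas_down_with_io(size: int, min_size: int) -> int:
    """How many replicas of a PG can be down while it still serves I/O."""
    return size - min_size


if __name__ == "__main__":
    RAW_TB = 12  # hypothetical raw capacity, for illustration only
    for size, min_size in [(2, 1), (3, 2)]:
        usable = float(RAW_TB * usable_fraction(size))
        down = replicas_down_with_io(size, min_size)
        print(f"{size}:{min_size} -> {usable:.1f} TB usable, "
              f"I/O continues with {down} replica(s) down, "
              f"writes are accepted with as few as {min_size} copy/copies")
```

Both settings keep serving I/O with one replica down. The difference is that with 3:2 a degraded PG still holds two copies of every acknowledged write, whereas with 2:1 it may hold only one. On the cluster itself the change is made per pool with `ceph osd pool set <pool> size 3` and `ceph osd pool set <pool> min_size 2`.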

Now I need to give feedback on what happened. This is what I am still not sure about, as SMART does not show any sector errors. I might as well start a badblocks run and see if it detects anything.

As always, I am open to other suggestions as to where to look for clues on what went wrong.

Kind regards

On Mon, 1 Jul 2019 at 09:31, Ian Coetzee <ceph@xxxxxxxxxxxxxxxxx> wrote:
Hi Guys,

This is a cross-post from the proxmox ML.

This morning I have a bit of a big boo-boo on our production system.

After a very sudden network outage somewhere during the night, one of my ceph-osds is no longer starting up.

If I try and start it manually, I get a very spectacular failure, see link.
https://www.jacklin.co.za/zerobin/?04e2dcd13ab8dfc8#zKCISUvAm4o/6mnLmyu+8fSS1VumC65XaETt/dD7rn0=

As near as I can tell, it seems to be asserting whether a file exists; I have yet to determine which file that would be. Any pointers are welcome, as well as any other ideas to get the OSD back. For some reason there is data on the OSD that was not replicated to my other OSDs, so I cannot just re-init this OSD as some of the posts I could find suggest.

Kind regards

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
