Re: ceph-osd not starting after network related issues

For anyone reading this in the future from a Google search: please don't set osd_find_best_info_ignore_history_les unless you know exactly what you are doing.
That's a really dangerous option and should be a last resort. It will almost certainly lead to some data loss or inconsistencies (lost writes).

However, it is unfortunately sometimes required to do that when running with min_size 1 (which you also should never do if you care about your data).


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 3, 2019 at 8:52 AM Ian Coetzee <ceph@xxxxxxxxxxxxxxxxx> wrote:
Hi All,

Some feedback from my end. I managed to recover the "lost data" from one of the other OSDs. It seems my initial summary was a bit off: the PGs were replicated, and Ceph just wanted to confirm that the objects were still relevant.

For future reference, I basically marked the OSD as lost

> ceph osd lost <id>

Then the PGs went into an incomplete state

After that I temporarily set an option on the OSDs to ignore the history (osd_find_best_info_ignore_history_les). Got the info from http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/017270.html

After that Ceph was happy and started to rebalance the cluster. Phew, crisis averted.
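
The sequence described above can be sketched as a small dry-run shell script. This is an illustration under assumptions, not the exact commands used in this incident: the helper names, the RUN variable and the OSD ids 7 and 3 are hypothetical. It also assumes the option is injected at runtime with injectargs and that marking the OSD down is enough to make its PGs re-peer; setting the option in ceph.conf and restarting the OSD is the other commonly described route. By default the script only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: RUN defaults to "echo", so commands are printed, not executed.
# Export RUN as an empty string to really run them, and only as a last resort.
RUN=${RUN-echo}

# Mark a dead OSD as lost. $1 = OSD id (hypothetical helper).
mark_osd_lost() {
    $RUN ceph osd lost "$1" --yes-i-really-mean-it
}

# Toggle osd_find_best_info_ignore_history_les on one OSD, then mark that OSD
# down so its PGs re-peer with the new value. $1 = OSD id, $2 = true|false
set_ignore_history_les() {
    $RUN ceph tell "osd.$1" injectargs "--osd_find_best_info_ignore_history_les=$2"
    $RUN ceph osd down "$1"
}

mark_osd_lost 7                 # 7 = the OSD that no longer starts (assumed id)
$RUN ceph pg ls incomplete      # list the PGs that went incomplete
set_ignore_history_les 3 true   # 3 = a surviving OSD holding those PGs (assumed id)
# Wait until the PGs are active again, then switch the option back off:
# set_ignore_history_les 3 false
```

Keeping the toggle in one helper makes it harder to forget the second call that turns the option off again once the PGs have peered.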

This failure did, however, convince me to increase our replication from 2:1 to 3:2 (size:min_size), sacrificing usable space for reliability.
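
Assuming the 2:1 and 3:2 above refer to a pool's size and min_size, the change could look roughly like the sketch below. The pool name "rbd", the helper name and the RUN variable are placeholders of mine, not taken from this cluster; as written the script only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: RUN defaults to "echo", so commands are printed, not executed.
RUN=${RUN-echo}

# Set replica count and minimum replicas for one pool.
# $1 = pool name, $2 = size, $3 = min_size (hypothetical helper)
set_pool_replication() {
    $RUN ceph osd pool set "$1" size "$2"
    $RUN ceph osd pool set "$1" min_size "$3"
}

set_pool_replication rbd 3 2    # "rbd" is an assumed pool name
```

Raising size triggers backfill while the extra replicas are created, so the cluster will be busy for a while after the change.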

Now I need to give feedback on what happened, which I am still not sure about, as SMART does not show any sector errors. I might as well run badblocks and see whether it detects anything.

As always, I am open to other suggestions as to where to look for clues on what went wrong.

Kind regards

On Mon, 1 Jul 2019 at 09:31, Ian Coetzee <ceph@xxxxxxxxxxxxxxxxx> wrote:
Hi Guys,

This is a cross-post from the proxmox ML.

This morning I had a bit of a big boo-boo on our production system.

After a very sudden network outage somewhere during the night, one of my ceph-osd daemons is no longer starting up.

If I try and start it manually, I get a very spectacular failure, see link.

As near as I can tell, it seems to be asserting whether a file exists; I have yet to determine which file that would be. Any pointers are welcome, as well as any other ideas to get the OSD back. For some reason there is data on the OSD that was not replicated to my other OSDs, so I cannot just re-initialize this OSD as some of the posts I could find suggest.

Kind regards
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
