Hi All,
Some feedback on my end. I managed to recover the "lost data" from one of the other OSDs. It seems my initial summary was a bit off, in that the PGs were in fact replicated; Ceph just wanted to confirm that the objects were still relevant.
For future reference, I basically marked the OSD as lost
> ceph osd lost <id>
Then the PGs went into an incomplete state.
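For anyone hitting the same thing, the incomplete PGs show up with the usual commands, roughly:
> ceph health detail
> ceph pg dump_stuck inactive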
After that I temporarily set an option on the OSDs to ignore the history (osd_find_best_info_ignore_history_les). Got the info from http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/017270.html
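For reference, one way to set it temporarily (not necessarily the exact mechanism I used; <id> is the affected OSD) is in ceph.conf on the OSD host, followed by a restart:
# /etc/ceph/ceph.conf, [osd] section, temporary only
osd_find_best_info_ignore_history_les = true
> systemctl restart ceph-osd@<id>
Remember to remove the option again once the PGs have peered.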
After that Ceph was happy and started to rebalance the cluster. Phew, crisis averted.
This failure did, however, convince me to increase our pool replication (size/min_size) from 2/1 to 3/2, sacrificing usable space for reliability.
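For anyone wanting to do the same, it is a per-pool setting, something along the lines of (<pool> being your pool name):
> ceph osd pool set <pool> size 3
> ceph osd pool set <pool> min_size 2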
Now I need to give feedback on what happened. This is what I am still not sure about, as SMART does not show any sector errors. I might as well start a badblocks run and see if it detects anything.
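Probably something like this, read-only so it does not touch the data (/dev/sdX being a placeholder for the disk behind the failed OSD):
> badblocks -sv /dev/sdX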
As always, I am open to other suggestions as to where to look for clues on what went wrong.
Kind regards
On Mon, 1 Jul 2019 at 09:31, Ian Coetzee <ceph@xxxxxxxxxxxxxxxxx> wrote:
Hi Guys,

This is a cross-post from the proxmox ML.

This morning I had a bit of a big boo-boo on our production system. After a very sudden network outage somewhere during the night, one of my ceph-osds is no longer starting up.

If I try and start it manually, I get a very spectacular failure, see link. As near as I can tell, it seems to be asserting whether a file exists; I have yet to determine which file that would be. Any pointers are welcome, as well as any other ideas to get the OSD back. For some reason there is data on the OSD that was not replicated to my other OSDs, so I cannot just re-init this OSD as some of the posts I could find suggest.

Kind regards