Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]


 



While I'm still unsure how this happened, here's what was done to resolve it.

I started the OSD in the foreground with debug level 10 and watched for the most recent osdmap epoch mentioned before the abort(). For example, if the log showed that it had just tried to load epoch 80896 before crashing:

# ceph osd getmap -o osdmap.80896 80896
# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.80896

Then I restarted the OSD in the foreground with debug, and repeated for the next osdmap epoch until it got past the first few seconds of startup. This process worked for all but two OSDs. For the ones that succeeded, I'd ^C the foreground process and then start the OSD via systemd.
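The fetch-and-inject pair above can be wrapped in a small helper so the same step is easy to repeat per epoch. This is only a sketch under my assumptions (OSD id 77, and a DRY_RUN switch that just prints the commands instead of running them); adjust the data path per OSD:

```shell
# Hedged sketch: wrap the getmap/set-osdmap pair in a helper so one epoch can
# be injected per iteration. With DRY_RUN=1 it only prints the commands, so it
# can be sanity-checked without touching a cluster. OSD id/path are examples.
inject_epoch() {
  epoch=$1
  osd_path=/var/lib/ceph/osd/ceph-77/
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "ceph osd getmap -o osdmap.$epoch $epoch"
    echo "ceph-objectstore-tool --op set-osdmap --data-path $osd_path --file osdmap.$epoch"
  else
    # The OSD must be stopped before ceph-objectstore-tool touches its store.
    ceph osd getmap -o "osdmap.$epoch" "$epoch" &&
      ceph-objectstore-tool --op set-osdmap --data-path "$osd_path" --file "osdmap.$epoch"
  fi
}

DRY_RUN=1
inject_epoch 80896
```

Note that ceph-objectstore-tool operates on the offline store, so the OSD daemon has to be down while injecting.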

For the remaining two, the OSD would try loading the incremental map and then crash. I had the presence of mind to make dd images of every OSD before starting this process, so I reverted these two to their state before injecting any osdmaps.

I then injected the last 15 or so epochs of the osdmap in sequential order before starting the osd, with success.
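The sequential injection can be sketched as a simple loop. The epoch range below is an assumption for illustration (roughly 15 epochs ending at the one the OSD crashed on); it only prints the commands, so it can be checked as a dry run first:

```shell
# Dry-run sketch of the batch injection, assuming epochs 80882..80896 (~15
# maps ending at the crash epoch) and the OSD path from the example above.
# It only prints the commands; run them with the OSD stopped.
first=80882
last=80896
osd_path=/var/lib/ceph/osd/ceph-77/
for e in $(seq "$first" "$last"); do
  echo "ceph osd getmap -o osdmap.$e $e"
  echo "ceph-objectstore-tool --op set-osdmap --data-path $osd_path --file osdmap.$e"
done
```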

This leads me to believe that the step-wise injection didn't work on those two because the OSD had more subtle corruption: it got past loading the injected map, but became confused when it requested the next incremental delta.

Thanks again to Brad/badone for the guidance!

Tracker issue updated.

Here's the closing IRC dialogue regarding this issue (timestamps UTC-0700):

2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out yesterday, you've helped a ton, twice now :) I'm still concerned because I don't know how this happened. I'll feel better once everything's active+clean, but it's all at least active.

2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with Josh earlier and he shares my opinion this is likely somehow related to these drives or perhaps controllers, or at least specific to these machines

2019-08-19 16:31:04 < badone> however, there is a possibility you are seeing some extremely rare race that no one up to this point has seen before

2019-08-19 16:31:20 < badone> that is less likely though

2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire successfully but wrote it out to disk in a format that it could not then read back in (unlikely) or...

2019-08-19 16:33:21 < badone> the map "changed" after it had been written to disk

2019-08-19 16:33:46 < badone> the second is considered most likely by us but I recognise you may not share that opinion
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


