Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

Troy Ablan <tablan@xxxxxxxxx> · Thu, 20 Feb 2020 14:28:57 -0700

I hope I don't sound too happy to hear that you've run into this same 
problem, but still I'm glad to see that it's not just a one-off problem 
with us. :)

We're still running Mimic.  I haven't yet deployed Nautilus anywhere.

Thanks
-Troy

On 2/20/20 2:14 PM, Dan van der Ster wrote:
Thanks Troy for the quick response.
Are you still running mimic on that cluster? Seeing the crashes in nautilus too?

Our cluster is also quite old -- so it could very well be memory or
network gremlins.

Cheers, Dan

On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan@xxxxxxxxx> wrote:

Dan,

Yes, I have had this happen several times since, but fortunately the
last couple of times it has only happened to one or two OSDs at a time, so
it didn't take down entire pools.  The remedy has been the same.

I had been holding off on too much further investigation because I
thought the source of the issue may have been some old hardware
gremlins, and we're waiting on some new hardware.

-Troy

On 2/20/20 1:40 PM, Dan van der Ster wrote:
Hi Troy,

Looks like we hit the same today -- Sage posted some observations
here: https://tracker.ceph.com/issues/39525#note-6

Did it happen again in your cluster?

Cheers, Dan

On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan@xxxxxxxxx> wrote:

While I'm still unsure how this happened, this is what was done to solve
this.

I started the OSD in the foreground with debug 10 and watched for the most
recent osdmap epoch mentioned before abort().  For example, if it mentioned
that it had just tried to load 80896 and then crashed:

# ceph osd getmap -o osdmap.80896 80896
# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.80896

Then I restarted the osd in foreground/debug, and repeated for the next
osdmap epoch until it got past the first few seconds.  This process
worked for all but two OSDs.  For the ones that succeeded, I'd ^C and
then start the osd via systemd.

For the remaining two, it would try loading the incremental map and then
crash.  I had the presence of mind to make dd images of every OSD before
starting this process, so I reverted these two to the state before
injecting the osdmaps.

I then injected the last 15 or so epochs of the osdmap in sequential
order before starting the osd, with success.
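
The sequential injection described above could be scripted roughly as
follows.  This is only a sketch built from the two commands shown earlier;
the function name inject_osdmaps and the RUN variable are hypothetical
helpers, and the OSD id and epoch range are placeholders.  It assumes the
OSD is stopped while ceph-objectstore-tool runs and that an image of the
OSD has already been taken.

```shell
#!/bin/sh
# Sketch only: inject a consecutive range of osdmap epochs into a stopped OSD.
# inject_osdmaps and RUN are hypothetical names, not part of Ceph.
# Set RUN=echo to print the commands instead of executing them.
inject_osdmaps() {
    osd_id=$1    # e.g. 77
    first=$2     # first epoch to inject
    last=$3      # last epoch to inject
    data_path="/var/lib/ceph/osd/ceph-${osd_id}/"

    epoch=$first
    while [ "$epoch" -le "$last" ]; do
        # Fetch the full osdmap for this epoch from the monitors.
        ${RUN:-} ceph osd getmap -o "osdmap.${epoch}" "$epoch" || return 1
        # Write that map into the (offline) OSD's object store.
        ${RUN:-} ceph-objectstore-tool --op set-osdmap \
            --data-path "$data_path" --file "osdmap.${epoch}" || return 1
        epoch=$((epoch + 1))
    done
}
```

For example, "RUN=echo inject_osdmaps 77 80882 80896" would print the
commands for 15 epochs without touching the OSD; dropping RUN=echo runs
them for real.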

This leads me to believe that the step-wise injection didn't work
because the osd had more subtle corruption that it got past, but it was
confused when it requested the next incremental delta.

Thanks again to Brad/badone for the guidance!

Tracker issue updated.

Here's the closing IRC dialogue re this issue (UTC-0700)

2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out
yesterday, you've helped a ton, twice now :)  I'm still concerned
because I don't know how this happened.  I'll feel better once
everything's active+clean, but it's all at least active.

2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with
Josh earlier and he shares my opinion this is likely somehow related to
these drives or perhaps controllers, or at least specific to these machines

2019-08-19 16:31:04 < badone> however, there is a possibility you are
seeing some extremely rare race that no one up to this point has seen before

2019-08-19 16:31:20 < badone> that is less likely though

2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
successfully but wrote it out to disk in a format that it could not then
read back in (unlikely) or...

2019-08-19 16:33:21 < badone> the map "changed" after it had been
written to disk

2019-08-19 16:33:46 < badone> the second is considered most likely by us
but I recognise you may not share that opinion
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com