Another thing... in your thread you said that only the *SSDs* in your
cluster had crashed, but not the HDDs. Were both the SSDs and HDDs
bluestore? Did the HDDs ever crash subsequently? Which OS/kernel do you
run? We're on CentOS 7 with quite some uptime.

On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan <tablan@xxxxxxxxx> wrote:
>
> I hope I don't sound too happy to hear that you've run into this same
> problem, but still I'm glad to see that it's not just a one-off problem
> with us. :)
>
> We're still running Mimic. I haven't yet deployed Nautilus anywhere.
>
> Thanks
> -Troy
>
> On 2/20/20 2:14 PM, Dan van der Ster wrote:
> > Thanks Troy for the quick response. Are you still running Mimic on
> > that cluster? Are you seeing the crashes in Nautilus too?
> >
> > Our cluster is also quite old -- so it could very well be memory or
> > network gremlins.
> >
> > Cheers, Dan
> >
> > On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan@xxxxxxxxx> wrote:
> >>
> >> Dan,
> >>
> >> Yes, I have had this happen several times since, but fortunately the
> >> last couple of times it has only happened to one or two OSDs at a
> >> time, so it didn't take down entire pools. The remedy has been the
> >> same.
> >>
> >> I had been holding off on much further investigation because I
> >> thought the source of the issue may have been some old hardware
> >> gremlins, and we're waiting on some new hardware.
> >>
> >> -Troy
> >>
> >> On 2/20/20 1:40 PM, Dan van der Ster wrote:
> >>> Hi Troy,
> >>>
> >>> Looks like we hit the same today -- Sage posted some observations
> >>> here: https://tracker.ceph.com/issues/39525#note-6
> >>>
> >>> Did it happen again in your cluster?
> >>>
> >>> Cheers, Dan
> >>>
> >>> On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan@xxxxxxxxx> wrote:
> >>>>
> >>>> While I'm still unsure how this happened, this is what was done to
> >>>> solve it.
> >>>>
> >>>> I started the OSD in the foreground with debug 10 and watched for
> >>>> the most recent osdmap epoch mentioned before the abort(). For
> >>>> example, if it mentioned that it had just tried to load 80896 and
> >>>> then crashed:
> >>>>
> >>>> # ceph osd getmap -o osdmap.80896 80896
> >>>> # ceph-objectstore-tool --op set-osdmap --data-path
> >>>> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
> >>>>
> >>>> Then I restarted the OSD in foreground/debug and repeated this for
> >>>> the next osdmap epoch until it got past the first few seconds. This
> >>>> process worked for all but two OSDs. For the ones that succeeded,
> >>>> I'd ^C and then start the OSD via systemd.
> >>>>
> >>>> For the remaining two, the OSD would try loading the incremental
> >>>> map and then crash. I had the presence of mind to make dd images of
> >>>> every OSD before starting this process, so I reverted these two to
> >>>> the state before injecting the osdmaps.
> >>>>
> >>>> I then injected the last 15 or so epochs of the osdmap in
> >>>> sequential order before starting the OSD, with success.
> >>>>
> >>>> This leads me to believe that the step-wise injection didn't work
> >>>> because the OSD had more subtle corruption that it got past, but it
> >>>> was confused when it requested the next incremental delta.
> >>>>
> >>>> Thanks again to Brad/badone for the guidance!
> >>>>
> >>>> Tracker issue updated.
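
The "last 15 or so epochs" injection described above is easy to script
once you know which epochs the OSD is asking for. A minimal sketch,
assuming osd.77, epochs 80882 through 80896 and the default data path
(all placeholders -- only the single-epoch commands above come from the
thread); the OSD must be stopped while ceph-objectstore-tool works
against its store:

OSD=77
DATA=/var/lib/ceph/osd/ceph-$OSD

for e in $(seq 80882 80896); do
    # fetch the full map for this epoch from the monitors
    ceph osd getmap -o osdmap.$e $e
    # write it into the stopped OSD's store, as with the single epoch above
    ceph-objectstore-tool --op set-osdmap \
        --data-path $DATA --file osdmap.$e
done

Then start the OSD in the foreground again and watch whether it gets
past the osdmap load.
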
> >>>>
> >>>> Here's the closing IRC dialogue re this issue (UTC-0700):
> >>>>
> >>>> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you
> >>>> reaching out yesterday, you've helped a ton, twice now :) I'm still
> >>>> concerned because I don't know how this happened. I'll feel better
> >>>> once everything's active+clean, but it's all at least active.
> >>>>
> >>>> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion
> >>>> with Josh earlier and he shares my opinion this is likely somehow
> >>>> related to these drives or perhaps controllers, or at least
> >>>> specific to these machines
> >>>>
> >>>> 2019-08-19 16:31:04 < badone> however, there is a possibility you
> >>>> are seeing some extremely rare race that no one up to this point
> >>>> has seen before
> >>>>
> >>>> 2019-08-19 16:31:20 < badone> that is less likely though
> >>>>
> >>>> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
> >>>> successfully but wrote it out to disk in a format that it could not
> >>>> then read back in (unlikely) or...
> >>>>
> >>>> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
> >>>> written to disk
> >>>>
> >>>> 2019-08-19 16:33:46 < badone> the second is considered most likely
> >>>> by us but I recognise you may not share that opinion
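
One way to check badone's second theory -- whether the on-disk map still
matches what the mons handed out -- is to export the suspect epoch from
the stopped OSD's store and compare it against the monitor's copy. This
is only a sketch; the OSD id, epoch and data path are placeholders, not
values from this thread:

OSD=77
E=80896
DATA=/var/lib/ceph/osd/ceph-$OSD

# with the OSD stopped, export its on-disk copy of the full map for epoch E
ceph-objectstore-tool --data-path $DATA --op get-osdmap \
    --epoch $E --file osdmap.$E.osd

# fetch the same epoch from the monitors
ceph osd getmap -o osdmap.$E.mon $E

# identical files mean the on-disk map matches what the mons distributed
cmp osdmap.$E.osd osdmap.$E.mon \
    && echo "epoch $E matches" || echo "epoch $E differs"

If the files differ, that points at the store or the hardware underneath
it rather than at anything that came over the wire.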