Another thing... in your thread you said that only the *SSDs* in your
cluster had crashed, but not the HDDs. Were both the SSDs and HDDs
bluestore? Did the HDDs ever crash subsequently? Which OS/kernel do you
run? We're on CentOS 7 with quite some uptime.

On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan <tablan@xxxxxxxxx> wrote:
>
> I hope I don't sound too happy to hear that you've run into this same
> problem, but still I'm glad to see that it's not just a one-off problem
> with us. :)
>
> We're still running Mimic. I haven't yet deployed Nautilus anywhere.
>
> Thanks
> -Troy
>
> On 2/20/20 2:14 PM, Dan van der Ster wrote:
> > Thanks Troy for the quick response. Are you still running Mimic on
> > that cluster? Are you seeing the crashes in Nautilus too?
> >
> > Our cluster is also quite old -- so it could very well be memory or
> > network gremlins.
> >
> > Cheers, Dan
> >
> > On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan@xxxxxxxxx> wrote:
> >>
> >> Dan,
> >>
> >> Yes, I have had this happen several times since, but fortunately the
> >> last couple of times it has only happened to one or two OSDs at a
> >> time, so it didn't take down entire pools. The remedy has been the
> >> same.
> >>
> >> I had been holding off on much further investigation because I
> >> thought the source of the issue may have been some old hardware
> >> gremlins, and we're waiting on some new hardware.
> >>
> >> -Troy
> >>
> >> On 2/20/20 1:40 PM, Dan van der Ster wrote:
> >>> Hi Troy,
> >>>
> >>> Looks like we hit the same today -- Sage posted some observations
> >>> here: https://tracker.ceph.com/issues/39525#note-6
> >>>
> >>> Did it happen again in your cluster?
> >>>
> >>> Cheers, Dan
> >>>
> >>> On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan@xxxxxxxxx> wrote:
> >>>>
> >>>> While I'm still unsure how this happened, this is what was done to
> >>>> solve it.
> >>>>
> >>>> I started the OSD in the foreground with debug 10 and watched for
> >>>> the most recent osdmap epoch mentioned before the abort(). For
> >>>> example, if it mentioned that it had just tried to load 80896 and
> >>>> then crashed:
> >>>>
> >>>> # ceph osd getmap -o osdmap.80896 80896
> >>>> # ceph-objectstore-tool --op set-osdmap --data-path
> >>>> /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
> >>>>
> >>>> Then I restarted the OSD in foreground/debug and repeated this for
> >>>> the next osdmap epoch until it got past the first few seconds. This
> >>>> process worked for all but two OSDs. For the ones that succeeded,
> >>>> I'd ^C and then start the OSD via systemd.
> >>>>
> >>>> For the remaining two, the OSD would try loading the incremental
> >>>> map and then crash. I had the presence of mind to make dd images of
> >>>> every OSD before starting this process, so I reverted these two to
> >>>> the state before injecting the osdmaps.
> >>>>
> >>>> I then injected the last 15 or so epochs of the osdmap in
> >>>> sequential order before starting the OSD, with success.
> >>>>
> >>>> This leads me to believe that the step-wise injection didn't work
> >>>> because the OSD had more subtle corruption that it got past, but it
> >>>> was confused when it requested the next incremental delta.
> >>>>
> >>>> Thanks again to Brad/badone for the guidance!
> >>>>
> >>>> Tracker issue updated.
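
The "last 15 or so epochs" injection described above is easy to script
once you know which epochs the OSD is asking for. A minimal sketch,
assuming osd.77, epochs 80882 through 80896 and the default data path
(all placeholders -- only the single-epoch commands above come from the
thread); the OSD must be stopped while ceph-objectstore-tool works
against its store:

OSD=77
DATA=/var/lib/ceph/osd/ceph-$OSD

for e in $(seq 80882 80896); do
    # fetch the full map for this epoch from the monitors
    ceph osd getmap -o osdmap.$e $e
    # write it into the stopped OSD's store, as with the single epoch above
    ceph-objectstore-tool --op set-osdmap \
        --data-path $DATA --file osdmap.$e
done

Then start the OSD in the foreground again and watch whether it gets
past the osdmap load.
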
> >>>>
> >>>> Here's the closing IRC dialogue re this issue (UTC-0700):
> >>>>
> >>>> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you
> >>>> reaching out yesterday, you've helped a ton, twice now :) I'm still
> >>>> concerned because I don't know how this happened. I'll feel better
> >>>> once everything's active+clean, but it's all at least active.
> >>>>
> >>>> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion
> >>>> with Josh earlier and he shares my opinion this is likely somehow
> >>>> related to these drives or perhaps controllers, or at least
> >>>> specific to these machines
> >>>>
> >>>> 2019-08-19 16:31:04 < badone> however, there is a possibility you
> >>>> are seeing some extremely rare race that no one up to this point
> >>>> has seen before
> >>>>
> >>>> 2019-08-19 16:31:20 < badone> that is less likely though
> >>>>
> >>>> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
> >>>> successfully but wrote it out to disk in a format that it could not
> >>>> then read back in (unlikely) or...
> >>>>
> >>>> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
> >>>> written to disk
> >>>>
> >>>> 2019-08-19 16:33:46 < badone> the second is considered most likely
> >>>> by us but I recognise you may not share that opinion
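
One way to check badone's second theory -- whether the on-disk map still
matches what the mons handed out -- is to export the suspect epoch from
the stopped OSD's store and compare it against the monitor's copy. This
is only a sketch; the OSD id, epoch and data path are placeholders, not
values from this thread:

OSD=77
E=80896
DATA=/var/lib/ceph/osd/ceph-$OSD

# with the OSD stopped, export its on-disk copy of the full map for epoch E
ceph-objectstore-tool --data-path $DATA --op get-osdmap \
    --epoch $E --file osdmap.$E.osd

# fetch the same epoch from the monitors
ceph osd getmap -o osdmap.$E.mon $E

# identical files mean the on-disk map matches what the mons distributed
cmp osdmap.$E.osd osdmap.$E.mon \
    && echo "epoch $E matches" || echo "epoch $E differs"

If the files differ, that points at the store or the hardware underneath
it rather than at anything that came over the wire.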