Thanks Troy for the quick response.  Are you still running Mimic on that
cluster, or are you seeing the crashes in Nautilus too?  Our cluster is
also quite old -- so it could very well be memory or network gremlins.

Cheers, Dan

On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan <tablan@xxxxxxxxx> wrote:
>
> Dan,
>
> Yes, I have had this happen several times since, but fortunately the
> last couple of times it has only happened to one or two OSDs at a time,
> so it didn't take down entire pools.  The remedy has been the same.
>
> I had been holding off on much further investigation because I thought
> the source of the issue might have been some old hardware gremlins, and
> we're waiting on some new hardware.
>
> -Troy
>
>
> On 2/20/20 1:40 PM, Dan van der Ster wrote:
> > Hi Troy,
> >
> > Looks like we hit the same today -- Sage posted some observations
> > here: https://tracker.ceph.com/issues/39525#note-6
> >
> > Did it happen again in your cluster?
> >
> > Cheers, Dan
> >
> >
> > On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan <tablan@xxxxxxxxx> wrote:
> >>
> >> While I'm still unsure how this happened, this is what was done to
> >> solve it.
> >>
> >> I started the OSD in the foreground with debug 10 and watched for the
> >> most recent osdmap epoch mentioned before the abort().  For example,
> >> if it had just tried to load epoch 80896 and then crashed:
> >>
> >> # ceph osd getmap -o osdmap.80896 80896
> >> # ceph-objectstore-tool --op set-osdmap \
> >>       --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.80896
> >>
> >> Then I restarted the OSD in foreground/debug and repeated for the
> >> next osdmap epoch until it got past the first few seconds.  This
> >> process worked for all but two OSDs.  For the ones that succeeded,
> >> I'd ^C and then start the OSD via systemd.
> >>
> >> The remaining two would try loading the incremental map and then
> >> crash.  I had the presence of mind to make dd images of every OSD
> >> before starting this process, so I reverted these two to the state
> >> before injecting the osdmaps.
> >>
> >> I then injected the last 15 or so epochs of the osdmap in sequential
> >> order before starting the OSD, with success.
> >>
> >> This leads me to believe that the step-wise injection didn't work
> >> because the OSD had more subtle corruption that it got past, but it
> >> was confused when it requested the next incremental delta.
> >>
> >> Thanks again to Brad/badone for the guidance!
> >>
> >> Tracker issue updated.
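For anyone hitting the same crash: here is a rough, untested sketch of how
that last step (injecting a range of recent osdmap epochs into a stopped
OSD) could be scripted.  The OSD id (77) and epoch range (80882-80896) are
placeholder assumptions -- take the real values from the foreground/debug
output and adjust the data path to match your deployment.

#!/bin/bash
# Placeholder values -- substitute the OSD id and the epoch range
# reported by the crashing OSD's foreground/debug output.
OSD=77
FIRST=80882
LAST=80896

# Make sure the OSD is stopped before touching its store.
systemctl stop ceph-osd@${OSD}

for EPOCH in $(seq ${FIRST} ${LAST}); do
    # Fetch the full map for this epoch from the monitors...
    ceph osd getmap -o osdmap.${EPOCH} ${EPOCH}
    # ...and write it into the OSD's local store.
    ceph-objectstore-tool --op set-osdmap \
        --data-path /var/lib/ceph/osd/ceph-${OSD}/ \
        --file osdmap.${EPOCH}
done

systemctl start ceph-osd@${OSD}

As Troy notes, taking dd images of the OSDs before injecting anything
makes it easy to roll back if the injection doesn't help.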
> >> Here's the closing IRC dialogue re this issue (UTC-0700):
> >>
> >> 2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching
> >> out yesterday, you've helped a ton, twice now :)  I'm still concerned
> >> because I don't know how this happened.  I'll feel better once
> >> everything's active+clean, but it's all at least active.
> >>
> >> 2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion
> >> with Josh earlier and he shares my opinion this is likely somehow
> >> related to these drives or perhaps controllers, or at least specific
> >> to these machines
> >>
> >> 2019-08-19 16:31:04 < badone> however, there is a possibility you are
> >> seeing some extremely rare race that no one up to this point has seen
> >> before
> >>
> >> 2019-08-19 16:31:20 < badone> that is less likely though
> >>
> >> 2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire
> >> successfully but wrote it out to disk in a format that it could not
> >> then read back in (unlikely) or...
> >>
> >> 2019-08-19 16:33:21 < badone> the map "changed" after it had been
> >> written to disk
> >>
> >> 2019-08-19 16:33:46 < badone> the second is considered most likely by
> >> us but I recognise you may not share that opinion