FTR, the root cause is now understood:
https://tracker.ceph.com/issues/39525#note-21

-- dan

On Thu, Feb 20, 2020 at 9:24 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> > > On 20 Feb 2020, at 19:54, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > For those following along, the issue is here:
> > > https://tracker.ceph.com/issues/39525#note-6
> > >
> > > Somehow single bits are getting flipped in the osdmaps -- maybe
> > > network, maybe memory, maybe a bug.
> > >
> >
> > Weird!
> >
> > But I did see things like this happen before. This was under Hammer and Jewel, where I needed to do this kind of thing. Crashes looked very similar.
> >
> > > To get an osd starting, we have to extract the full osdmap from the
> > > mon, and set it into the crashing osd. So for the osd.666:
> > >
> > > # ceph osd getmap 2982809 -o 2982809
> > > # ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file 2982809
> > >
> > > Some osds had multiple corrupted osdmaps -- so we scriptified the above.
> >
> > Were those corrupted ones in sequence?
>
> Yes, usually 1 to 3 osdmaps corrupted in sequence.
>
> There's a theory that this might be related
> (https://tracker.ceph.com/issues/43903)
> but the backports to mimic or even nautilus look challenging.
>
> -- dan
>
> >
> > > As of now our PGs are all active, but we're not confident that this
> > >
> >
> > Awesome!
> >
> > Wido
> >
> > > won't happen again (without knowing why the maps were corrupting).
> > >
> > > Thanks to all who helped!
> > >
> > > dan
> > >
> > >
> > >> On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >>
> > >> 680 is epoch 2983572
> > >> 666 crashes at 2982809 or 2982808
> > >>
> > >> -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl 2982809 612069 bytes
> > >> -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
> > >> in thread 7f4d931b5b80 thread_name:ceph-osd
> > >>
> > >> So I grabbed 2982809 and 2982808 and am setting them.
> > >>
> > >> Checking if the osds will start with that.
> > >>
> > >> -- dan
> > >>
> > >>
> > >>> On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> > >>> On 2/20/20 12:40 PM, Dan van der Ster wrote:
> > >>>> Hi,
> > >>>>
> > >>>> My turn.
> > >>>> We suddenly have a big outage which is similar/identical to
> > >>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
> > >>>>
> > >>>> Some of the osds are runnable, but most crash when they start -- crc
> > >>>> error in osdmap::decode.
> > >>>> I'm able to extract an osd map from a good osd and it decodes well
> > >>>> with osdmaptool:
> > >>>>
> > >>>> # ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-680/ --file osd.680.map
> > >>>>
> > >>>> But when I try on one of the bad osds I get:
> > >>>>
> > >>>> # ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.666.map
> > >>>> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
> > >>>> what(): buffer::malformed_input: bad crc, actual 822724616 != expected 2334082500
> > >>>> *** Caught signal (Aborted) **
> > >>>> in thread 7f600aa42d00 thread_name:ceph-objectstor
> > >>>> ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
> > >>>> 1: (()+0xf5f0) [0x7f5ffefc45f0]
> > >>>> 2: (gsignal()+0x37) [0x7f5ffdbae337]
> > >>>> 3: (abort()+0x148) [0x7f5ffdbafa28]
> > >>>> 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
> > >>>> 5: (()+0x5e746) [0x7f5ffe4bc746]
> > >>>> 6: (()+0x5e773) [0x7f5ffe4bc773]
> > >>>> 7: (()+0x5e993) [0x7f5ffe4bc993]
> > >>>> 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
> > >>>> 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
> > >>>> 10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1d0) [0x55d30a489190]
> > >>>> 11: (main()+0x5340) [0x55d30a3aae70]
> > >>>> 12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
> > >>>> 13: (()+0x3a0f40) [0x55d30a483f40]
> > >>>> Aborted (core dumped)
> > >>>>
> > >>>> I think I want to inject the osdmap, but can't:
> > >>>>
> > >>>> # ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.680.map
> > >>>> osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
> > >>>>
> > >>>
> > >>> Have you tried to list which epoch osd.680 is at and which one osd.666
> > >>> is at? And which one the MONs are at?
> > >>>
> > >>> Maybe there is a difference there?
> > >>>
> > >>> Wido
> > >>>
> > >>>>
> > >>>> How do I do this?
> > >>>>
> > >>>> Thanks for any help!
> > >>>>
> > >>>> dan
> > >>>> _______________________________________________
> > >>>> ceph-users mailing list -- ceph-users@xxxxxxx
> > >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>>>
> >
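
The thread describes a two-step recovery (fetch the full osdmap for an epoch from the mon, then set it into the crashing osd) and mentions that it was "scriptified" for osds with several corrupted maps in sequence. The script itself was not posted, so the following is only a rough sketch of what such a loop might look like. The function name inject_osdmaps, the DRY_RUN and WORKDIR variables, and the /tmp/osdmaps default are assumptions made for illustration; only the two ceph commands and the data path layout come from the thread. It also assumes the osd daemon is stopped while ceph-objectstore-tool runs.

```shell
#!/bin/sh
# Sketch only: re-inject a consecutive range of full osdmaps into one osd.
# Hypothetical helper; names and defaults below are not from the thread.
#
# usage: inject_osdmaps <osd-id> <first-epoch> <last-epoch>
#   DRY_RUN=1   print the commands instead of running them (assumed knob)
#   WORKDIR     where fetched maps are stored (assumed default /tmp/osdmaps)
inject_osdmaps() {
    if [ "$#" -ne 3 ]; then
        echo "usage: inject_osdmaps <osd-id> <first-epoch> <last-epoch>" >&2
        return 2
    fi
    osd_id=$1
    epoch=$2
    last=$3
    data_path="/var/lib/ceph/osd/ceph-${osd_id}/"
    workdir=${WORKDIR:-/tmp/osdmaps}
    # Expands to "echo" when DRY_RUN is set and non-empty, otherwise to nothing.
    run=${DRY_RUN:+echo}

    $run mkdir -p "$workdir" || return 1
    while [ "$epoch" -le "$last" ]; do
        map="$workdir/$epoch"
        # Step 1: extract the full osdmap for this epoch from the mon.
        $run ceph osd getmap "$epoch" -o "$map" || return 1
        # Step 2: set it into the (stopped) osd's object store.
        $run ceph-objectstore-tool --op set-osdmap \
            --data-path "$data_path" --file "$map" || return 1
        epoch=$((epoch + 1))
    done
}
```

For the osd.666 case above, `DRY_RUN=1 inject_osdmaps 666 2982808 2982809` would print the four commands for the two epochs mentioned; dropping DRY_RUN would run them. The loop stops at the first failing command rather than continuing, so a partially fixed osd is easy to spot.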