For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6

Somehow single bits are getting flipped in the osdmaps -- maybe
network, maybe memory, maybe a bug.

To get an osd starting, we have to extract the full osdmap from the
mon, and set it into the crashing osd. So for the osd.666:

# ceph osd getmap 2982809 -o 2982809
# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file 2982809

Some osds had multiple corrupted osdmaps -- so we scriptified the above.
As of now our PGs are all active, but we're not confident that this
won't happen again (without knowing why the maps were corrupting).

Thanks to all who helped!

dan

On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> 680 is epoch 2983572
> 666 crashes at 2982809 or 2982808
>
> -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl 2982809 612069 bytes
> -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
> in thread 7f4d931b5b80 thread_name:ceph-osd
>
> So I grabbed 2982809 and 2982808 and am setting them.
>
> Checking if the osds will start with that.
>
> -- dan
>
>
>
> On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> > On 2/20/20 12:40 PM, Dan van der Ster wrote:
> > > Hi,
> > >
> > > My turn.
> > > We suddenly have a big outage which is similar/identical to
> > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
> > >
> > > Some of the osds are runnable, but most crash when they start -- crc
> > > error in osdmap::decode.
> > > I'm able to extract an osd map from a good osd and it decodes well
> > > with osdmaptool:
> > >
> > > # ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-680/ --file osd.680.map
> > >
> > > But when I try on one of the bad osds I get:
> > >
> > > # ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.666.map
> > > terminate called after throwing an instance of 'ceph::buffer::malformed_input'
> > > what(): buffer::malformed_input: bad crc, actual 822724616 != expected 2334082500
> > > *** Caught signal (Aborted) **
> > > in thread 7f600aa42d00 thread_name:ceph-objectstor
> > > ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
> > > 1: (()+0xf5f0) [0x7f5ffefc45f0]
> > > 2: (gsignal()+0x37) [0x7f5ffdbae337]
> > > 3: (abort()+0x148) [0x7f5ffdbafa28]
> > > 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
> > > 5: (()+0x5e746) [0x7f5ffe4bc746]
> > > 6: (()+0x5e773) [0x7f5ffe4bc773]
> > > 7: (()+0x5e993) [0x7f5ffe4bc993]
> > > 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
> > > 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
> > > 10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1d0) [0x55d30a489190]
> > > 11: (main()+0x5340) [0x55d30a3aae70]
> > > 12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
> > > 13: (()+0x3a0f40) [0x55d30a483f40]
> > > Aborted (core dumped)
> > >
> > >
> > >
> > > I think I want to inject the osdmap, but can't:
> > >
> > > # ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-666/ --file osd.680.map
> > > osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
> > >
> >
> > Have you tried to list which epoch osd.680 is at and which one osd.666
> > is at? And which one the MONs are at?
> >
> > Maybe there is a difference there?
> >
> > Wido
> >
> > >
> > > How do I do this?
> > >
> > > Thanks for any help!
> > >
> > > dan
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
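
A note on the "scriptified" recovery mentioned at the top of this thread: the actual script was not posted, so the following is only a minimal sketch of how the two commands could be looped over several epochs for one osd. Only the `ceph osd getmap` and `ceph-objectstore-tool --op set-osdmap` invocations come from the message itself. The function names `run` and `fix_osdmaps`, the `DRY_RUN` variable, the `osdmap.<epoch>` file names, and the use of a contiguous epoch range are assumptions made for illustration. It also assumes the osd daemon is not running while its store is modified.

```shell
#!/bin/sh
# Hypothetical helper, not the script used in this thread.

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fix_osdmaps OSD_ID FIRST_EPOCH LAST_EPOCH
# For each epoch in the (assumed contiguous) range, fetch the full osdmap
# from the mon and set it into the stopped osd's object store.
fix_osdmaps() {
    osd_id=$1
    epoch=$2
    last=$3
    while [ "$epoch" -le "$last" ]; do
        # File name osdmap.<epoch> is an arbitrary choice for this sketch.
        run ceph osd getmap "$epoch" -o "osdmap.$epoch" || return 1
        run ceph-objectstore-tool --op set-osdmap \
            --data-path "/var/lib/ceph/osd/ceph-$osd_id/" \
            --file "osdmap.$epoch" || return 1
        epoch=$((epoch + 1))
    done
}
```

For the osd discussed above, `fix_osdmaps 666 2982808 2982809` would print the four commands for review. Setting `DRY_RUN=0` would execute them instead, stopping at the first failure.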