Re: osdmap::decode crc error -- 13.2.7 -- most osds down

For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6

Somehow single bits are getting flipped in the osdmaps -- maybe
network, maybe memory, maybe a bug.

To get a crashing osd to start, we have to extract the full osdmap from
the mon and set it into that osd. So for osd.666:

# ceph osd getmap 2982809 -o 2982809
# ceph-objectstore-tool --op set-osdmap \
    --data-path /var/lib/ceph/osd/ceph-666/ --file 2982809

Some osds had multiple corrupted osdmaps -- so we scripted the above.
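
For anyone who needs to do the same, here is a minimal sketch of what such
a loop could look like. It is not the exact script we ran. The function
names (run, fix_osdmaps), the DRY_RUN switch and the osdmap.<epoch> file
names are made up for illustration, and it assumes the osd is stopped and
that you already know the range of bad epochs from the osd log.

```shell
#!/bin/sh
# Sketch only: re-inject a range of full osdmaps from the mons into one osd.
# Assumptions (not from the original mail): the osd is stopped, and the
# first/last bad epochs were read from the osd's crash log.

# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: fix_osdmaps <osd id> <first epoch> <last epoch>
fix_osdmaps() {
    osd_id=$1
    epoch=$2
    last=$3
    while [ "$epoch" -le "$last" ]; do
        # Fetch the full map for this epoch from the mons.
        run ceph osd getmap "$epoch" -o "osdmap.$epoch" || return 1
        # Overwrite the osd's local copy of that epoch.
        run ceph-objectstore-tool --op set-osdmap \
            --data-path "/var/lib/ceph/osd/ceph-$osd_id/" \
            --file "osdmap.$epoch" || return 1
        epoch=$((epoch + 1))
    done
}
```

For example, after sourcing the above, "DRY_RUN=1; fix_osdmaps 666 2982808 2982809"
prints the four commands for osd.666 without touching anything; unset
DRY_RUN to actually run them. The loop stops at the first failing command
so a bad fetch is never injected.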

As of now our PGs are all active, but we're not confident that this
won't happen again, since we still don't know why the maps were
getting corrupted.

Thanks to all who helped!

dan



On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> osd.680 is at epoch 2983572
> osd.666 crashes at 2982809 or 2982808
>
>   -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl
> 2982809 612069 bytes
>   -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
>  in thread 7f4d931b5b80 thread_name:ceph-osd
>
> So I grabbed 2982809 and 2982808 and am setting them.
>
> Checking if the osds will start with that.
>
> -- dan
>
>
>
> On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> > On 2/20/20 12:40 PM, Dan van der Ster wrote:
> > > Hi,
> > >
> > > My turn.
> > > We suddenly have a big outage which is similar/identical to
> > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
> > >
> > > Some of the osds are runnable, but most crash when they start -- crc
> > > error in osdmap::decode.
> > > I'm able to extract an osd map from a good osd and it decodes well
> > > with osdmaptool:
> > >
> > > # ceph-objectstore-tool --op get-osdmap \
> > >     --data-path /var/lib/ceph/osd/ceph-680/ --file osd.680.map
> > >
> > > But when I try on one of the bad osds I get:
> > >
> > > # ceph-objectstore-tool --op get-osdmap \
> > >     --data-path /var/lib/ceph/osd/ceph-666/ --file osd.666.map
> > > terminate called after throwing an instance of 'ceph::buffer::malformed_input'
> > >   what():  buffer::malformed_input: bad crc, actual 822724616 !=
> > > expected 2334082500
> > > *** Caught signal (Aborted) **
> > >  in thread 7f600aa42d00 thread_name:ceph-objectstor
> > >  ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
> > >  1: (()+0xf5f0) [0x7f5ffefc45f0]
> > >  2: (gsignal()+0x37) [0x7f5ffdbae337]
> > >  3: (abort()+0x148) [0x7f5ffdbafa28]
> > >  4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
> > >  5: (()+0x5e746) [0x7f5ffe4bc746]
> > >  6: (()+0x5e773) [0x7f5ffe4bc773]
> > >  7: (()+0x5e993) [0x7f5ffe4bc993]
> > >  8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
> > >  9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
> > >  10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
> > > ceph::buffer::list&)+0x1d0) [0x55d30a489190]
> > >  11: (main()+0x5340) [0x55d30a3aae70]
> > >  12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
> > >  13: (()+0x3a0f40) [0x55d30a483f40]
> > > Aborted (core dumped)
> > >
> > >
> > >
> > > I think I want to inject the osdmap, but can't:
> > >
> > > # ceph-objectstore-tool --op set-osdmap \
> > >     --data-path /var/lib/ceph/osd/ceph-666/ --file osd.680.map
> > > osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
> > >
> >
> > Have you tried to list which epoch osd.680 is at and which one osd.666
> > is at? And which one the MONs are at?
> >
> > Maybe there is a difference there?
> >
> > Wido
> >
> > >
> > > How do I do this?
> > >
> > > Thanks for any help!
> > >
> > > dan
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >


