Re: osdmap::decode crc error -- 13.2.7 -- most osds down

> On 20 Feb 2020, at 19:54, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> 
> For those following along, the issue is here:
> https://tracker.ceph.com/issues/39525#note-6
> 
> Somehow single bits are getting flipped in the osdmaps -- maybe
> network, maybe memory, maybe a bug.
> 
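Assuming a good and a corrupt copy of the same epoch are both in hand, a plain byte-level diff pinpoints the flip, e.g.:

# cmp -l good.map bad.map

cmp -l prints, for every differing byte, its offset plus the octal value from each file; a single flipped bit shows up as one line whose two values differ in exactly one bit position. (good.map/bad.map are placeholder names.)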

Weird!

But I did see things like this happen before, under Hammer and Jewel, where I needed to do these kinds of repairs. The crashes looked very similar.

> To get an osd starting, we have to extract the full osdmap from the
> mon, and set it into the crashing osd. So for the osd.666:
> 
> # ceph osd getmap 2982809 -o 2982809
> # ceph-objectstore-tool --op set-osdmap --data-path
> /var/lib/ceph/osd/ceph-666/ --file 2982809
> 
> Some osds had multiple corrupted osdmaps -- so we scriptified the above.
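A sketch of such a loop (the epoch list and data path here are just the ones from the example above; run it with the osd stopped, since ceph-objectstore-tool needs exclusive access to the store):

for e in 2982808 2982809; do
    ceph osd getmap $e -o $e                     # fetch the good map from the mons
    ceph-objectstore-tool --op set-osdmap \
        --data-path /var/lib/ceph/osd/ceph-666/ \
        --file $e                                # overwrite the corrupt copy
done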

Were those corrupted ones in sequence?

> As of now our PGs are all active, but we're not confident that this
> won't happen again (without knowing why the maps were corrupting).

Awesome!

Wido

> Thanks to all who helped!
>
> dan
> 
> 
>> On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> 
>> osd.680 is at epoch 2983572.
>> osd.666 crashes at 2982809 or 2982808.
>> 
>>  -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl
>> 2982809 612069 bytes
>>  -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
>> in thread 7f4d931b5b80 thread_name:ceph-osd
>> 
>> So I grabbed 2982809 and 2982808 and am setting them.
>> 
>> Checking if the osds will start with that.
>> 
>> -- dan
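In general the failing epoch can be read straight off the crash log, as the last add_map_bl line before the abort; e.g. something like:

# grep add_map_bl /var/log/ceph/ceph-osd.666.log | tail -1

(assuming the default log path for osd.666).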
>> 
>> 
>> 
>>> On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>> On 2/20/20 12:40 PM, Dan van der Ster wrote:
>>>> Hi,
>>>> 
>>>> My turn.
>>>> We suddenly have a big outage which is similar/identical to
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
>>>> 
>>>> Some of the osds are runnable, but most crash when they start -- crc
>>>> error in osdmap::decode.
>>>> I'm able to extract an osd map from a good osd and it decodes well
>>>> with osdmaptool:
>>>> 
>>>> # ceph-objectstore-tool --op get-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-680/ --file osd.680.map
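The osdmaptool check mentioned above is just a decode-and-dump, e.g.:

# osdmaptool osd.680.map --print | head

which fails loudly if the map does not decode.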
>>>> 
>>>> But when I try on one of the bad osds I get:
>>>> 
>>>> # ceph-objectstore-tool --op get-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-666/ --file osd.666.map
>>>> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>>>>  what():  buffer::malformed_input: bad crc, actual 822724616 !=
>>>> expected 2334082500
>>>> *** Caught signal (Aborted) **
>>>> in thread 7f600aa42d00 thread_name:ceph-objectstor
>>>> ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
>>>> 1: (()+0xf5f0) [0x7f5ffefc45f0]
>>>> 2: (gsignal()+0x37) [0x7f5ffdbae337]
>>>> 3: (abort()+0x148) [0x7f5ffdbafa28]
>>>> 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
>>>> 5: (()+0x5e746) [0x7f5ffe4bc746]
>>>> 6: (()+0x5e773) [0x7f5ffe4bc773]
>>>> 7: (()+0x5e993) [0x7f5ffe4bc993]
>>>> 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
>>>> 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
>>>> 10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
>>>> ceph::buffer::list&)+0x1d0) [0x55d30a489190]
>>>> 11: (main()+0x5340) [0x55d30a3aae70]
>>>> 12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
>>>> 13: (()+0x3a0f40) [0x55d30a483f40]
>>>> Aborted (core dumped)
>>>> 
>>>> 
>>>> 
>>>> I think I want to inject the osdmap, but can't:
>>>> 
>>>> # ceph-objectstore-tool --op set-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-666/ --file osd.680.map
>>>> osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
>>>> 
>>> 
>>> Have you tried to list which epoch osd.680 is at and which one osd.666
>>> is at? And which one the MONs are at?
>>> 
>>> Maybe there is a difference there?
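For the mons, e.g.:

# ceph osd dump | head -1

prints the cluster's current osdmap epoch; the epoch baked into the extracted file is also visible in the set-osdmap error above (osdmap.2983572).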
>>> 
>>> Wido
>>> 
>>>> 
>>>> How do I do this?
>>>> 
>>>> Thanks for any help!
>>>> 
>>>> dan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx