Re: OSD crash with end_of_buffer + bad crc

Some more information about our issue (I work with Wissem).

Since the OSDs are crashing on only one node, we focused on it.
We found that it is also the only node where we see this kind of error in the OSD logs:

2022-04-08T11:38:26.464+0200 7fadaf877700 0 bad crc in data 3052515915 != exp 3884508842 from v1:100.69.103.56:0/1469415910
2022-04-08T11:38:26.468+0200 7fadaf877700 0 bad crc in data 1163505564 != exp 3884508842 from v1:100.69.103.56:0/1469415910
2022-04-08T11:39:19.265+0200 7fadaf877700 0 bad crc in data 496783366 != exp 3355185897 from v1:100.69.103.52:0/108609770

Corrupted network packets could certainly mangle the authentication challenge that we see failing in the stack trace below.
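(For context: as far as I understand, these messages come from the messenger computing a crc32c over the received data segment and comparing it with the crc carried in the message footer. Below is a minimal sketch of that kind of check in Python, assuming the third-party 'crc32c' package is installed; the payload and values are made up for illustration.)

# Minimal sketch of a payload CRC check like the one behind "bad crc in data".
# Assumes the third-party 'crc32c' package (pip install crc32c); Ceph uses
# crc32c (Castagnoli) over the message data segment. Names are illustrative.
import crc32c

def check_data_crc(data: bytes, expected_crc: int) -> bool:
    computed = crc32c.crc32c(data)
    if computed != expected_crc:
        print(f"bad crc in data {computed} != exp {expected_crc}")
        return False
    return True

# Example: a single flipped bit is enough to produce a mismatch.
payload = b"some rbd write payload"
good = crc32c.crc32c(payload)
corrupted = bytearray(payload)
corrupted[3] ^= 0x01
check_data_crc(bytes(corrupted), good)   # prints a "bad crc" style message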

These errors always come from client machines (gateway VMware VMs for Veeam Backup, using RBD to access Ceph).


The problem is that I can't find any sign of a network card problem anywhere (kernel logs, interface stats), nor on the switch...

Could a fibre or SFP+ problem corrupt packets while remaining invisible to the switch and the kernel driver?

The server is brand new, and the hardware is identical on all 10 of our nodes.

Does anyone know how I can run network tests that will show me when packets are corrupted? iperf only shows me retries, and I also see some of those between non-impacted nodes.
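One thing I can think of (a rough sketch only, not a polished tool): an application-level end-to-end check that sends random payloads between two hosts and verifies a strong checksum on the receiving side, since the TCP checksum is only 16 bits and checksum offload can let corruption through. The port, the payload size, and the file name crc_probe.py are arbitrary choices here:

# Rough end-to-end corruption test: the sender transmits random payloads
# prefixed with their SHA-256; the receiver recomputes it and reports mismatches.
# Run "python3 crc_probe.py server" on one node and
# "python3 crc_probe.py client <server-ip>" on the other.
import hashlib, os, socket, struct, sys

PORT = 5999
PAYLOAD = 1 << 20  # 1 MiB per round

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, addr = srv.accept()
        print("connection from", addr)
        rounds = bad = 0
        with conn:
            while True:
                hdr = recv_exact(conn, 32 + 4)          # sha256 digest + length
                digest, length = hdr[:32], struct.unpack("!I", hdr[32:])[0]
                data = recv_exact(conn, length)
                rounds += 1
                if hashlib.sha256(data).digest() != digest:
                    bad += 1
                    print(f"round {rounds}: CHECKSUM MISMATCH ({bad} so far)")
                if rounds % 100 == 0:
                    print(f"{rounds} rounds, {bad} corrupted")

def client(host):
    with socket.create_connection((host, PORT)) as conn:
        while True:
            data = os.urandom(PAYLOAD)
            digest = hashlib.sha256(data).digest()
            conn.sendall(digest + struct.pack("!I", len(data)) + data)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: crc_probe.py server | crc_probe.py client <host>")
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])

If the corruption is rare, it may need to run for a long time and at sustained throughput before a mismatch shows up.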



On 2022-03-30 17:49, Wissem MIMOUNA wrote:
Dear all,

We noticed that the issue we encounter happens exclusively on one host
out of 10 hosts (almost all of the 8 OSDs on this host crash
periodically, ~3 times a week).

Is there any idea or suggestion?

Thanks



Hi,

I found more information in the OSD logs about this assertion; maybe
it could help:

terminate called after throwing an instance of
'ceph::buffer::v15_2_0::end_of_buffer'
what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f8002357700 thread_name:msgr-worker-2
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
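(That exception type is thrown when a decode tries to read past the end of the received buffer, which would be consistent with a corrupted or truncated cephx authorizer payload. A toy illustration of the failure mode in Python follows; this is not Ceph code, and the field layout is made up.)

# Toy illustration (not Ceph code): decoding a length-prefixed field from a
# buffer that is shorter than the length claims, which is the kind of
# situation that raises end_of_buffer during a decode.
import struct

class EndOfBuffer(Exception):
    pass

def decode_key(buf: bytes) -> bytes:
    # Hypothetical layout: 4-byte big-endian length, then that many key bytes.
    if len(buf) < 4:
        raise EndOfBuffer("not enough bytes for length field")
    (length,) = struct.unpack("!I", buf[:4])
    body = buf[4:4 + length]
    if len(body) < length:
        # A corrupted length (or a truncated payload) lands here.
        raise EndOfBuffer(f"need {length} bytes, only {len(body)} available")
    return body

good = struct.pack("!I", 5) + b"12345"
print(decode_key(good))                          # b'12345'

corrupted = struct.pack("!I", 5000) + b"12345"   # bogus length from a bad packet
try:
    decode_key(corrupted)
except EndOfBuffer as e:
    print("decode failed:", e)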

Thanks for your help

Subject: OSD crash on a new ceph cluster

Dear All,

We recently installed a new ceph cluster with ceph-ansible.
Everything works fine, except that over the last few days we noticed
that some OSDs crashed.

Please find the log below for more information.
Thanks for your help.

"crash_id": "2022-03-23T08:27:05.085966Z_xxxxxx",
    "timestamp": "2022-03-23T08:27:05.085966Z",
    "process_name": "ceph-osd",
    "entity_name": "osd.xx",
    "ceph_version": "15.2.16",
    "utsname_hostname": "xxxx",
    "utsname_sysname": "Linux",
    "utsname_release": "4.15.0-169-generic",
    "utsname_version": "#177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022",
    "utsname_machine": "x86_64",
    "os_name": "Ubuntu",
    "os_id": "ubuntu",
    "os_version_id": "18.04",
    "os_version": "18.04.6 LTS (Bionic Beaver)",
    "backtrace": [
        "(()+0x12980) [0x7f557c3f8980]",
        "(gsignal()+0xc7) [0x7f557b0aae87]",
        "(abort()+0x141) [0x7f557b0ac7f1]",
        "(()+0x8c957) [0x7f557ba9f957]",
        "(()+0x92ae6) [0x7f557baa5ae6]",
        "(()+0x92b21) [0x7f557baa5b21]",
        "(()+0x92d54) [0x7f557baa5d54]",
        "(()+0x964eda) [0x555f1a9e9eda]",
        "(()+0x11f3e87) [0x555f1b278e87]",
"(ceph::buffer::v15_2_0::list::iterator_impl<true>::copy_deep(unsigned
int, ceph::buffer::v15_2_0::ptr&)+0x77) [0x555f1b2799d7]",
"(CryptoKey::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x7a)
[0x555f1b07e52a]",
        "(void
decode_decrypt_enc_bl<CephXServiceTicketInfo>(ceph::common::CephContext*,
CephXServiceTicketInfo&, CryptoKey, ceph::buffer::v15_2_0::list
const&, std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >&)+0x7ed) [0x555f1b3e364d]",
        "(cephx_verify_authorizer(ceph::common::CephContext*, KeyStore
const&, ceph::buffer::v15_2_0::list::iterator_impl<true>&, unsigned
long, CephXServiceTicketInfo&,
std::unique_ptr<AuthAuthorizerChallenge,
std::default_delete<AuthAuthorizerChallenge> >*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >*, ceph::buffer::v15_2_0::list*)+0x519)
[0x555f1b3ddaa9]",
"(CephxAuthorizeHandler::verify_authorizer(ceph::common::CephContext*,
KeyStore const&, ceph::buffer::v15_2_0::list const&, unsigned long,
ceph::buffer::v15_2_0::list*, EntityName*, unsigned long*,
AuthCapsInfo*, CryptoKey*, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >*,
std::unique_ptr<AuthAuthorizerChallenge,
std::default_delete<AuthAuthorizerChallenge> >*)+0x74b)
[0x555f1b3d1ccb]",
        "(MonClient::handle_auth_request(Connection*,
AuthConnectionMeta*, bool, unsigned int, ceph::buffer::v15_2_0::list
const&, ceph::buffer::v15_2_0::list*)+0x284) [0x555f1b2a02e4]",
"(ProtocolV1::handle_connect_message_2()+0x7d7) [0x555f1b426167]",
        "(ProtocolV1::handle_connect_message_auth(char*, int)+0x80)
[0x555f1b429430]",
        "(()+0x138869d) [0x555f1b40d69d]",
        "(AsyncConnection::process()+0x5fc) [0x555f1b40a4bc]",
        "(EventCenter::process_events(unsigned int,
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l>
>*)+0x7dd) [0x555f1b25a6dd]",
        "(()+0x11db258) [0x555f1b260258]",
        "(()+0xbd6df) [0x7f557bad06df]",
        "(()+0x76db) [0x7f557c3ed6db]",
        "(clone()+0x3f) [0x7f557b18d61f]"
    ]

Best Regards
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



