Re: OSD crash with end_of_buffer + bad crc

Just a follow-up.

I've found that a specific network interface is causing this.
We have three bonds:
- bond0: management
- bond1: storage access
- bond2: storage replication

As the crc errors all involve clients on the storage access network, I focused on bond1. I set one of its interfaces down, and immediately got many errors and some OSD crashes! Then I brought it back up and set the other one down: no errors. A week later, still no errors and no OSD crashes.

I now have to understand what is going on with that interface, because I see no errors reported anywhere. I will first try to change the AOC cable (SFP+ and fibre).
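For what it's worth, per-interface error counters can also be read straight from sysfs, which makes it easy to compare the two bond members side by side. A minimal sketch (the interface names at the bottom are placeholders, not our real ones):

```python
import os

def iface_error_counters(iface):
    """Read standard error counters for a network interface from sysfs."""
    stats_dir = f"/sys/class/net/{iface}/statistics"
    counters = {}
    for name in ("rx_errors", "rx_crc_errors", "rx_dropped", "tx_errors"):
        with open(os.path.join(stats_dir, name)) as f:
            counters[name] = int(f.read().strip())
    return counters

# Example: compare the members of bond1 (interface names are assumptions)
for iface in ("ens2f0", "ens2f1"):
    if os.path.isdir(f"/sys/class/net/{iface}"):
        print(iface, iface_error_counters(iface))
```

In our case all these counters stayed at zero, which is exactly the frustrating part: the corruption is invisible below the application layer.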

But it's not a Ceph problem, just a hardware one that only Ceph has caught!


Le 2022-04-08 11:53, Gilles Mocellin a écrit :
Some more information about our issue (I work with Wissem).

As the OSDs are crashing only on one node, we focused on it.
We found that it's the only node where we also see that kind of error
in the OSD logs:

2022-04-08T11:38:26.464+0200 7fadaf877700  0 bad crc in data
3052515915 != exp 3884508842 from v1:100.69.103.56:0/1469415910
2022-04-08T11:38:26.468+0200 7fadaf877700  0 bad crc in data
1163505564 != exp 3884508842 from v1:100.69.103.56:0/1469415910
2022-04-08T11:39:19.265+0200 7fadaf877700  0 bad crc in data 496783366
!= exp 3355185897 from v1:100.69.103.52:0/108609770
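To see which peers the corruption correlates with, one can tally these "bad crc" lines per source address; a small sketch (the log format is taken from the lines above):

```python
import re
from collections import Counter

# Matches e.g. "bad crc in data 123 != exp 456 from v1:100.69.103.56:0/..."
BAD_CRC = re.compile(r"bad crc in data \d+ != exp \d+ from v1:([\d.]+):")

def count_bad_crc_sources(lines):
    """Count 'bad crc' occurrences per peer IP in OSD log lines."""
    counts = Counter()
    for line in lines:
        m = BAD_CRC.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

log = [
    "2022-04-08T11:38:26.464+0200 7fadaf877700  0 bad crc in data "
    "3052515915 != exp 3884508842 from v1:100.69.103.56:0/1469415910",
    "2022-04-08T11:39:19.265+0200 7fadaf877700  0 bad crc in data "
    "496783366 != exp 3355185897 from v1:100.69.103.52:0/108609770",
]
print(count_bad_crc_sources(log).most_common())
# [('100.69.103.56', 1), ('100.69.103.52', 1)]
```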

Corrupted network packets could certainly mess up the authentication
challenge we see in the stack trace below.

These errors always come from client machines (gateway VMware VMs for
Veeam Backup, using RBD to access Ceph).


The problem is that I can't see any sign of a network card problem
anywhere (kernel logs, interface stats), nor on the switch...

I don't know whether a fibre or SFP+ problem can go unseen by the
switch and kernel driver and still corrupt packets?

The server is brand new, and everything is identical across our 10 nodes.

Does someone know how I can run network tests that will show me when
packets are corrupted?
iperf just shows me retries, but I also see some between non-impacted nodes.
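One thing worth knowing is that TCP's own checksum is only 16 bits, so with enough traffic a corrupted frame can occasionally pass it, which would explain Ceph's stronger crc catching what the NIC, switch, and iperf never report. A rough end-to-end test is to stream data with an application-level checksum and have the receiver verify each chunk; a minimal sketch (demoed over loopback here; in practice run the two ends on two hosts):

```python
import hashlib
import os
import socket
import threading

CHUNK = 65536
DIGEST = hashlib.sha256().digest_size  # 32 bytes

def sender(host, port, total_chunks):
    """Stream random chunks, each followed by its SHA-256 digest."""
    with socket.create_connection((host, port)) as s:
        for _ in range(total_chunks):
            payload = os.urandom(CHUNK)
            s.sendall(payload + hashlib.sha256(payload).digest())

def receiver(sock, total_chunks):
    """Verify each chunk's digest; return the number of corrupted chunks."""
    conn, _ = sock.accept()
    bad = 0
    with conn:
        for _ in range(total_chunks):
            buf = b""
            while len(buf) < CHUNK + DIGEST:
                data = conn.recv(CHUNK + DIGEST - len(buf))
                if not data:
                    raise ConnectionError("peer closed early")
                buf += data
            payload, digest = buf[:CHUNK], buf[CHUNK:]
            if hashlib.sha256(payload).digest() != digest:
                bad += 1
    return bad

# Loopback demo; on real hardware, point the sender at the suspect path.
srv = socket.create_server(("127.0.0.1", 0))
port = srv.getsockname()[1]
t = threading.Thread(target=sender, args=("127.0.0.1", port, 16))
t.start()
print("corrupted chunks:", receiver(srv, 16))
t.join()
srv.close()
```

Any non-zero count over a long run would point at corruption that slipped past the link-layer and TCP checks, without needing Ceph in the loop.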



Le 2022-03-30 17:49, Wissem MIMOUNA a écrit :
Dear all,

We noticed that the issue we encounter happens exclusively on one host
out of 10 (almost all of the 8 OSDs on this host crash
periodically, ~3 times a week).

Is there any idea or suggestion?

Thanks



Hi,

I found more information in the OSD logs about this assertion; maybe
it can help:

ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
in thread 7f8002357700 thread_name:msgr-worker-2
*** Caught signal (Aborted) **
what():  buffer::end_of_buffer
terminate called after throwing an instance of
'ceph::buffer::v15_2_0::end_of_buffer'

Thanks for your help

Subject: OSD crash on a new ceph cluster

Dear All,

We recently installed a new Ceph cluster with ceph-ansible.
Everything works fine, except we noticed in the last few days that
some OSDs crashed.

Here below the log for more information.
Thanks for your help.

    "crash_id": "2022-03-23T08:27:05.085966Z_xxxxxx",
    "timestamp": "2022-03-23T08:27:05.085966Z",
    "process_name": "ceph-osd",
    "entity_name": "osd.xx",
    "ceph_version": "15.2.16",
    "utsname_hostname": "xxxx",
    "utsname_sysname": "Linux",
    "utsname_release": "4.15.0-169-generic",
    "utsname_version": "#177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022",
    "utsname_machine": "x86_64",
    "os_name": "Ubuntu",
    "os_id": "ubuntu",
    "os_version_id": "18.04",
    "os_version": "18.04.6 LTS (Bionic Beaver)",
    "backtrace": [
        "(()+0x12980) [0x7f557c3f8980]",
        "(gsignal()+0xc7) [0x7f557b0aae87]",
        "(abort()+0x141) [0x7f557b0ac7f1]",
        "(()+0x8c957) [0x7f557ba9f957]",
        "(()+0x92ae6) [0x7f557baa5ae6]",
        "(()+0x92b21) [0x7f557baa5b21]",
        "(()+0x92d54) [0x7f557baa5d54]",
        "(()+0x964eda) [0x555f1a9e9eda]",
        "(()+0x11f3e87) [0x555f1b278e87]",
"(ceph::buffer::v15_2_0::list::iterator_impl<true>::copy_deep(unsigned
int, ceph::buffer::v15_2_0::ptr&)+0x77) [0x555f1b2799d7]",
"(CryptoKey::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x7a)
[0x555f1b07e52a]",
        "(void
decode_decrypt_enc_bl<CephXServiceTicketInfo>(ceph::common::CephContext*,
CephXServiceTicketInfo&, CryptoKey, ceph::buffer::v15_2_0::list
const&, std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >&)+0x7ed) [0x555f1b3e364d]",
        "(cephx_verify_authorizer(ceph::common::CephContext*, KeyStore
const&, ceph::buffer::v15_2_0::list::iterator_impl<true>&, unsigned
long, CephXServiceTicketInfo&,
std::unique_ptr<AuthAuthorizerChallenge,
std::default_delete<AuthAuthorizerChallenge> >*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >*, ceph::buffer::v15_2_0::list*)+0x519)
[0x555f1b3ddaa9]",
"(CephxAuthorizeHandler::verify_authorizer(ceph::common::CephContext*,
KeyStore const&, ceph::buffer::v15_2_0::list const&, unsigned long,
ceph::buffer::v15_2_0::list*, EntityName*, unsigned long*,
AuthCapsInfo*, CryptoKey*, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >*,
std::unique_ptr<AuthAuthorizerChallenge,
std::default_delete<AuthAuthorizerChallenge> >*)+0x74b)
[0x555f1b3d1ccb]",
        "(MonClient::handle_auth_request(Connection*,
AuthConnectionMeta*, bool, unsigned int, ceph::buffer::v15_2_0::list
const&, ceph::buffer::v15_2_0::list*)+0x284) [0x555f1b2a02e4]",
"(ProtocolV1::handle_connect_message_2()+0x7d7) [0x555f1b426167]",
        "(ProtocolV1::handle_connect_message_auth(char*, int)+0x80)
[0x555f1b429430]",
        "(()+0x138869d) [0x555f1b40d69d]",
        "(AsyncConnection::process()+0x5fc) [0x555f1b40a4bc]",
        "(EventCenter::process_events(unsigned int,
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l>
>*)+0x7dd) [0x555f1b25a6dd]",
        "(()+0x11db258) [0x555f1b260258]",
        "(()+0xbd6df) [0x7f557bad06df]",
        "(()+0x76db) [0x7f557c3ed6db]",
        "(clone()+0x3f) [0x7f557b18d61f]"
    ]

Best Regards
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



