Re: Sporadic mgr segmentation fault

Turns out those patches are not in 14.2.9, sorry.

On Sun, Apr 26, 2020 at 10:53 AM XuYun <yunxu@xxxxxx> wrote:
>
> Hi Brad,
>
> We got the same crash even after upgrading to 14.2.9. The crash log is:
>
>    -13> 2020-04-26 05:56:02.642 7f8cea975700  4 mgr send_beacon active
>    -12> 2020-04-26 05:56:02.643 7f8cea975700 10 monclient: _send_mon_message to mon.111.111.121.2 at v2:111.111.121.2:3300/0
>    -11> 2020-04-26 05:56:03.238 7f8cdd690700  0 log_channel(cluster) log [DBG] : pgmap v5548: 256 pgs: 1 active+clean+scrubbing, 255 active+clean; 162 GiB data, 921 GiB used, 32 TiB / 33 TiB avail; 194 KiB/s rd, 840 KiB/s wr, 375 op/s
>    -10> 2020-04-26 05:56:03.238 7f8cdd690700 10 monclient: _send_mon_message to mon.111.111.121.2 at v2:111.111.121.2:3300/0
>     -9> 2020-04-26 05:56:03.296 7f8cee17c700  4 mgr ms_dispatch active service_map(e4936 1 svc) v1
>     -8> 2020-04-26 05:56:03.296 7f8cee17c700  4 mgr ms_dispatch service_map(e4936 1 svc) v1
>     -7> 2020-04-26 05:56:03.391 7f8cde692700  4 mgr.server handle_open from mon,111.111.121.1 0x55d5d96c2000
>     -6> 2020-04-26 05:56:03.391 7f8cde692700  4 mgr.server handle_report from 0x55d5d96c2000 mon,111.111.121.1
>     -5> 2020-04-26 05:56:03.393 7f8cde692700  4 mgr.server handle_open from mgr,control 0x55d5d96c2400
>     -4> 2020-04-26 05:56:03.393 7f8cde692700  4 mgr.server handle_report from 0x55d5d96c2400 mgr,control
>     -3> 2020-04-26 05:56:03.397 7f8cde692700  4 mgr.server handle_open from mon,111.111.121.3 0x55d5d96c2800
>     -2> 2020-04-26 05:56:03.398 7f8cde692700  4 mgr.server handle_open from mgr,computer01 0x55d5d96c2c00
>     -1> 2020-04-26 05:56:03.399 7f8cde692700  4 mgr.server handle_report from 0x55d5d96c2c00 mgr,computer01
>      0> 2020-04-26 05:56:03.400 7f8cf3186700 -1 *** Caught signal (Segmentation fault) **
>  in thread 7f8cf3186700 thread_name:msgr-worker-1
>
>  ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
>  1: (()+0xf5f0) [0x7f8cf74d45f0]
>  2: (bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x1fc) [0x7f8cf9ef785c]
>  3: (ProtocolV2::write_message(Message*, bool)+0x4d9) [0x7f8cf9edb929]
>  4: (ProtocolV2::write_event()+0x37d) [0x7f8cf9ef025d]
>  5: (AsyncConnection::handle_write()+0x40) [0x7f8cf9eb27e0]
>  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1397) [0x7f8cf9f02ab7]
>  7: (()+0x57fa97) [0x7f8cf9f08a97]
>  8: (()+0x80f12f) [0x7f8cfa19812f]
>  9: (()+0x7e65) [0x7f8cf74cce65]
>  10: (clone()+0x6d) [0x7f8cf617a88d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Is there an issue opened for it?
>
> BR,
> Xu Yun
>
> On Apr 23, 2020, at 10:28 AM, XuYun <yunxu@xxxxxx> wrote:
>
> Thank you, Brad. We’ll try to upgrade to 14.2.9 today.
>
> On Apr 23, 2020, at 7:21 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Tue, Apr 21, 2020 at 11:39 PM XuYun <yunxu@xxxxxx> wrote:
>
>
> Dear ceph users,
>
> We are experiencing sporadic mgr crashes in all three of our Ceph clusters (versions 14.2.6 and 14.2.8). The crash log is:
>
> 2020-04-17 23:10:08.986 7fed7fe07700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc: In function 'const char* ceph::buffer::v14_2_0::ptr::c_str() const' thread 7fed7fe07700 time 2020-04-17 23:10:08.984887
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc: 578: FAILED ceph_assert(_raw)
>
> ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7fed8605c325]
> 2: (()+0x2534ed) [0x7fed8605c4ed]
> 3: (()+0x5a21ed) [0x7fed863ab1ed]
> 4: (PosixConnectedSocketImpl::send(ceph::buffer::v14_2_0::list&, bool)+0xbd) [0x7fed863840ed]
> 5: (AsyncConnection::_try_send(bool)+0xb6) [0x7fed8632fc76]
> 6: (ProtocolV2::write_message(Message*, bool)+0x832) [0x7fed8635bf52]
> 7: (ProtocolV2::write_event()+0x175) [0x7fed863718c5]
> 8: (AsyncConnection::handle_write()+0x40) [0x7fed86332600]
> 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1397) [0x7fed8637f997]
> 10: (()+0x57c977) [0x7fed86385977]
> 11: (()+0x80bdaf) [0x7fed86614daf]
> 12: (()+0x7e65) [0x7fed8394ce65]
> 13: (clone()+0x6d) [0x7fed825fa88d]
>
> 2020-04-17 23:10:08.990 7fed7ee05700 -1 *** Caught signal (Segmentation fault) **
> in thread 7fed7ee05700 thread_name:msgr-worker-2
>
> ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
> 1: (()+0xf5f0) [0x7fed839545f0]
> 2: (ceph::buffer::v14_2_0::ptr::release()+0x8) [0x7fed863aafd8]
> 3: (ceph::crypto::onwire::AES128GCM_OnWireTxHandler::~AES128GCM_OnWireTxHandler()+0x59) [0x7fed86388669]
> 4: (ProtocolV2::reset_recv_state()+0x11f) [0x7fed8635f5af]
> 5: (ProtocolV2::stop()+0x77) [0x7fed8635f857]
> 6: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection>)+0x5ef) [0x7fed86374f8f]
> 7: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xd9c) [0x7fed8637673c]
> 8: (ProtocolV2::handle_frame_payload()+0x1fb) [0x7fed86376c1b]
> 9: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x7fed86376e70]
> 10: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x44d) [0x7fed863773cd]
> 11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x7fed86360534]
> 12: (AsyncConnection::process()+0x186) [0x7fed86330656]
> 13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa15) [0x7fed8637f015]
> 14: (()+0x57c977) [0x7fed86385977]
> 15: (()+0x80bdaf) [0x7fed86614daf]
> 16: (()+0x7e65) [0x7fed8394ce65]
> 17: (clone()+0x6d) [0x7fed825fa88d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Any thoughts about this issue?
>
>
> Looks like https://tracker.ceph.com/issues/42026 which was recently
> backported to the Nautilus branch via
> https://github.com/ceph/ceph/pull/33820
>
> You could try a build with those patches or wait for 14.2.9
>
> --
> Cheers,
> Brad
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
>


-- 
Cheers,
Brad