Re: The ceph monitor crashes every few days

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 10 Oct 2024 09:17:02 -0700

On Wed, Oct 9, 2024 at 7:28 AM 李明 <limingzju@xxxxxxxxx> wrote:

> Hello,
>
> The Ceph version is 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable).
>
> The rbd info command is also slow; it sometimes takes 6 seconds, and the
> rbd snap create command takes 17 seconds. Another cluster with the same
> configuration takes less than 1 second.
>
>
> crash log
>     -2> 2024-10-09 09:52:50.782 7f30ddcb4700  5 prioritycache tune_memory
> target: 2147483648 mapped: 1286012928 unmapped: 1695842304 heap: 2981855232
> old mem: 1020054731 new mem: 1020054731
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/mon/OSDMonitor.cc:
> 4112: ceph_abort_msg("abort() called")
>
>  ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus
> (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::string
> const&)+0xdd) [0x7f30ea6c2a66]
>  2: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned
> long)+0x89f) [0x563b0175382f]
>  3: (OSDMonitor::send_incremental(unsigned int, MonSession*, bool,
> boost::intrusive_ptr<MonOpRequest>)+0x11d) [0x563b01753e6d]
>  4: (OSDMonitor::check_osdmap_sub(Subscription*)+0xca) [0x563b0175a53a]
>  5: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xbbe)
> [0x563b0162c9ee]
>  6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x525)
> [0x563b0164b0a5]
>  7: (Monitor::_ms_dispatch(Message*)+0xcdb) [0x563b0164cd9b]
>  8: (Monitor::ms_dispatch(Message*)+0x26) [0x563b0167ad96]
>  9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26)
> [0x563b01677776]
>  10: (DispatchQueue::entry()+0x1699) [0x7f30ea8e7699]
>  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f30ea99564d]
>  12: (()+0x7ea5) [0x7f30e74acea5]
>  13: (clone()+0x6d) [0x7f30e636fb0d]
>
>      0> 2024-10-09 09:52:55.260 7f30db4af700 -1 *** Caught signal (Aborted)
> **
>  in thread 7f30db4af700 thread_name:ms_dispatch
>

A quick skim of the code suggests that OSDMonitor::build_incremental()
only aborts if it can't find an OSDMap that should be present on disk.
That's... that's very bad. Barring bugs in this code (which is old, stable,
and critical, and I don't recall hearing of any), that suggests you're
somehow losing entries out of the monitor's rocksdb store.
-Greg
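
To make the failure mode above concrete, here is a rough Python model of the
lookup described: for every epoch in the requested range, use the incremental
map if it exists, fall back to the full map, and give up if neither is there.
This is only a sketch and not Ceph's actual C++ code. The dict-based "store",
the MissingOSDMap exception, and the missing_epochs() helper are all
hypothetical names invented for illustration.

```python
# Toy model of the monitor's per-epoch map lookup; NOT Ceph's real code.
# `incrementals` and `fulls` are hypothetical stand-ins for the osdmap
# entries the monitor keeps in its key-value store, keyed by epoch.


class MissingOSDMap(Exception):
    """Stands in for the abort seen in the crash log."""


def build_incremental(incrementals, fulls, first, last):
    """Collect one map per epoch in [first, last], walking newest to oldest."""
    message = {}
    for epoch in range(last, first - 1, -1):
        if epoch in incrementals:
            message[epoch] = ("incremental", incrementals[epoch])
        elif epoch in fulls:
            # No incremental for this epoch: fall back to the full map.
            message[epoch] = ("full", fulls[epoch])
        else:
            # Neither form exists, so the store has lost an entry.
            raise MissingOSDMap(f"no osdmap for epoch {epoch}")
    return message


def missing_epochs(incrementals, fulls, first, last):
    """List the epochs in [first, last] that have no map of either kind."""
    return [e for e in range(first, last + 1)
            if e not in incrementals and e not in fulls]


if __name__ == "__main__":
    incrementals = {10: b"inc10", 11: b"inc11", 13: b"inc13"}
    fulls = {10: b"full10"}
    print(missing_epochs(incrementals, fulls, 10, 13))  # [12]
```

In this model a single hole in the epoch range is enough to trigger the
abort, which fits a crash that only shows up when a subscriber asks for a
range covering the lost epoch.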

>
>  ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus
> (stable)
>  1: (()+0xf630) [0x7f30e74b4630]
>  2: (gsignal()+0x37) [0x7f30e62a7387]
>  3: (abort()+0x148) [0x7f30e62a8a78]
>  4: (ceph::__ceph_abort(char const*, int, char const*, std::string
> const&)+0x1a5) [0x7f30ea6c2b2e]
>  5: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned
> long)+0x89f) [0x563b0175382f]
>  6: (OSDMonitor::send_incremental(unsigned int, MonSession*, bool,
> boost::intrusive_ptr<MonOpRequest>)+0x11d) [0x563b01753e6d]
>  7: (OSDMonitor::check_osdmap_sub(Subscription*)+0xca) [0x563b0175a53a]
>  8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xbbe)
> [0x563b0162c9ee]
>  9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x525)
> [0x563b0164b0a5]
>  10: (Monitor::_ms_dispatch(Message*)+0xcdb) [0x563b0164cd9b]
>  11: (Monitor::ms_dispatch(Message*)+0x26) [0x563b0167ad96]
>  12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26)
> [0x563b01677776]
>  13: (DispatchQueue::entry()+0x1699) [0x7f30ea8e7699]
>  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f30ea99564d]
>  15: (()+0x7ea5) [0x7f30e74acea5]
>  16: (clone()+0x6d) [0x7f30e636fb0d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this
>
> ceph -s
>   cluster:
>     id:     6e36443f-326b-4344-b013-c03a4b7bb8d7
>     health: HEALTH_WARN
>             noout flag(s) set
>             314 pgs not deep-scrubbed in time
>             660 pgs not scrubbed in time
>             23 daemons have recently crashed
>
>   services:
>     mon: 3 daemons, quorum rbd02,rbd01,rbd03 (age 81m)
>     mgr: rbd03(active, since 9d), standbys: rbd01, rbd02
>     osd: 360 osds: 360 up (since 9d), 360 in (since 12M)
>          flags noout
>
>   data:
>     pools:   1 pools, 8192 pgs
>     objects: 109.45M objects, 274 TiB
>     usage:   488 TiB used, 3.4 PiB / 3.8 PiB avail
>     pgs:     8116 active+clean
>              76   active+clean+scrubbing+deep
>
>   io:
>     client:   61 MiB/s rd, 34 MiB/s wr, 74.51k op/s rd, 2.06k op/s wr
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>