Hi,

Thanks for the pointer. I'll definitely look into upgrading our cluster and patching it.

As a temporary fix: the client 'client.96913903:2156912' shown at event -3 of the dump below was causing the crash. When we evicted it, then connected to the machine running that client and rebooted it, the problem disappeared. It appears that the client was automatically reconnecting and retrying the operation that had crashed the MDS in the first place. I was surprised to see that a client could crash the server.

Thanks for your help,

Emmanuel
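In case it helps someone hitting the same assert, the eviction itself went roughly like this (a sketch rather than a transcript: the client id comes from the dump quoted below, rank 4 is the MDS that was crashing for us, and both need to be adapted to your cluster):

```
# List the sessions on the affected MDS rank and find the offender:
ceph tell mds.4 client ls

# Evict by client id (the first number in client.96913903:2156912):
ceph tell mds.4 client evict id=96913903

# Eviction also blocklists the client's address by default; verify with:
ceph osd blocklist ls
```

Eviction alone was not enough for us, since the client kept reconnecting and replaying the crashing request; rebooting the client machine is what finally cleared it.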
On Fri, May 5, 2023 at 3:01 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

> Hi Emmanuel,
>
> This should be a known issue, tracked in https://tracker.ceph.com/issues/58392,
> and there is a fix in https://github.com/ceph/ceph/pull/49652.
>
> Could you stop all the clients first, then set 'max_mds' to 1, and then
> restart the MDS daemons?
>
> Thanks
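For anyone finding this thread later, that workaround maps to roughly the following (a sketch: `<fsname>` is a placeholder for the filesystem name reported by `ceph fs ls`, and `icadmin006` is just our MDS id taken from the log below; substitute your own):

```
# With all clients stopped (CephFS unmounted everywhere), go down to
# a single active MDS:
ceph fs set <fsname> max_mds 1

# Once the extra ranks have stopped, restart the MDS daemons, e.g.:
systemctl restart ceph-mds@icadmin006
```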
> On 5/3/23 16:01, Emmanuel Jaep wrote:
> > Hi,
> >
> > I just inherited a Ceph cluster, so my level of confidence with the
> > tool is certainly less than ideal.
> >
> > We currently have an MDS that refuses to come back online. While
> > reviewing the logs, I can see that, upon MDS start, recovery goes well:
> > ```
> >    -10> 2023-05-03T08:12:43.632+0200 7f345d00b700  1 mds.4.2638711 cluster recovered.
> > ```
> >
> > However, right after this message, the MDS handles a couple of client requests:
> > ```
> >     -9> 2023-05-03T08:12:43.632+0200 7f345d00b700  4 mds.4.2638711 set_osd_epoch_barrier: epoch=249241
> >     -8> 2023-05-03T08:12:43.632+0200 7f3459003700  2 mds.4.cache Memory usage: total 2739784, rss 2321188, heap 348412, baseline 315644, 0 / 765023 inodes have caps, 0 caps, 0 caps per inode
> >     -7> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.108396030:57271 lookup #0x70001516236/012385530.npy 2023-05-02T20:37:19.675666+0200 RETRY=6 caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
> >     -6> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.104073212:5109945 readdir #0x70001516236 2023-05-02T20:36:29.517066+0200 RETRY=6 caller_uid=180090, caller_gid=11157{0,4,27,11157,}) v5
> >     -5> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.104288735:3008344 readdir #0x70001516236 2023-05-02T20:36:29.520801+0200 RETRY=6 caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
> >     -4> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.8558540:46306346 readdir #0x700019ba15e 2023-05-01T21:26:34.303697+0200 RETRY=49 caller_uid=0, caller_gid=0{}) v2
> >     -3> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.96913903:2156912 create #0x1000b37db9a/street-photo-3.png 2023-05-01T17:27:37.454042+0200 RETRY=59 caller_uid=271932, caller_gid=30034{}) v2
> >     -2> 2023-05-03T08:12:43.688+0200 7f345d00b700  5 mds.icadmin006 handle_mds_map old map epoch 2638715 <= 2638715, discarding
> > ```
> >
> > and crashes:
> > ```
> >     -1> 2023-05-03T08:12:43.692+0200 7f345d00b700 -1 /build/ceph-16.2.10/src/mds/Server.cc: In function 'void Server::handle_client_open(MDRequestRef&)' thread 7f345d00b700 time 2023-05-03T08:12:43.694660+0200
> > /build/ceph-16.2.10/src/mds/Server.cc: 4240: FAILED ceph_assert(cur->is_auth())
> >
> >  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3462533d65]
> >  2: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
> >  3: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) [0x558323c89f04]
> >  4: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) [0x558323c925ef]
> >  5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) [0x558323cc3575]
> >  6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) [0x558323d7460d]
> >  7: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >  8: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, bool, int)+0x3e) [0x558323d3edce]
> >  9: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
> >  10: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >  11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0xcf) [0x558323d5ff2f]
> >  12: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> >  13: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >  14: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
> >  15: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1fa) [0x558323c24a1a]
> >  16: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5e) [0x558323c254fe]
> >  17: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x558323bfd906]
> >  18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f34627854e0]
> >  19: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
> >  20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
> >  21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
> >  22: clone()
> >
> >      0> 2023-05-03T08:12:43.700+0200 7f345d00b700 -1 *** Caught signal (Aborted) **
> >  in thread 7f345d00b700 thread_name:ms_dispatch
> >
> >  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0) [0x7f34622843c0]
> >  2: gsignal()
> >  3: abort()
> >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ad) [0x7f3462533dc0]
> >  5: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
> >  6: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) [0x558323c89f04]
> >  7: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) [0x558323c925ef]
> >  8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) [0x558323cc3575]
> >  9: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) [0x558323d7460d]
> >  10: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >  11: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, bool, int)+0x3e) [0x558323d3edce]
> >  12: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
> >  13: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >  14: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0xcf) [0x558323d5ff2f]
> >  15: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> >  16: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >  17: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
> >  18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1fa) [0x558323c24a1a]
> >  19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5e) [0x558323c254fe]
> >  20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x558323bfd906]
> >  21: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f34627854e0]
> >  22: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
> >  23: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
> >  24: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
> >  25: clone()
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >
> > --- logging levels ---
> >    0/ 5 none
> >    0/ 1 lockdep
> >    0/ 1 context
> >    1/ 1 crush
> >    1/ 5 mds
> >    1/ 5 mds_balancer
> >    1/ 5 mds_locker
> >    1/ 5 mds_log
> >    1/ 5 mds_log_expire
> >    1/ 5 mds_migrator
> >    0/ 1 buffer
> >    0/ 1 timer
> >    0/ 1 filer
> >    0/ 1 striper
> >    0/ 1 objecter
> >    0/ 5 rados
> >    0/ 5 rbd
> >    0/ 5 rbd_mirror
> >    0/ 5 rbd_replay
> >    0/ 5 rbd_pwl
> >    0/ 5 journaler
> >    0/ 5 objectcacher
> >    0/ 5 immutable_obj_cache
> >    0/ 5 client
> >    1/ 5 osd
> >    0/ 5 optracker
> >    0/ 5 objclass
> >    1/ 3 filestore
> >    1/ 3 journal
> >    0/ 0 ms
> >    1/ 5 mon
> >    0/10 monc
> >    1/ 5 paxos
> >    0/ 5 tp
> >    1/ 5 auth
> >    1/ 5 crypto
> >    1/ 1 finisher
> >    1/ 1 reserver
> >    1/ 5 heartbeatmap
> >    1/ 5 perfcounter
> >    1/ 5 rgw
> >    1/ 5 rgw_sync
> >    1/10 civetweb
> >    1/ 5 javaclient
> >    1/ 5 asok
> >    1/ 1 throttle
> >    0/ 0 refs
> >    1/ 5 compressor
> >    1/ 5 bluestore
> >    1/ 5 bluefs
> >    1/ 3 bdev
> >    1/ 5 kstore
> >    4/ 5 rocksdb
> >    4/ 5 leveldb
> >    4/ 5 memdb
> >    1/ 5 fuse
> >    2/ 5 mgr
> >    1/ 5 mgrc
> >    1/ 5 dpdk
> >    1/ 5 eventtrace
> >    1/ 5 prioritycache
> >    0/ 5 test
> >    0/ 5 cephfs_mirror
> >    0/ 5 cephsqlite
> >   -2/-2 (syslog threshold)
> >   -1/-1 (stderr threshold)
> >   --- pthread ID / name mapping for recent threads ---
> >   139862749464320 /
> >   139862757857024 / md_submit
> >   139862766249728 /
> >   139862774642432 / MR_Finisher
> >   139862791427840 / PQ_Finisher
> >   139862799820544 / mds_rank_progr
> >   139862808213248 / ms_dispatch
> >   139862841784064 / ceph-mds
> >   139862858569472 / safe_timer
> >   139862875354880 / ms_dispatch
> >   139862892140288 / io_context_pool
> >   139862908925696 / admin_socket
> >   139862917318400 / msgr-worker-2
> >   139862925711104 / msgr-worker-1
> >   139862934103808 / msgr-worker-0
> >   139862951257984 / ceph-mds
> >   max_recent 10000
> >   max_new 10000
> >   log_file /var/log/ceph/floki-mds.icadmin006.log
> > --- end dump of recent events ---
> > ```
> >
> > How could I troubleshoot this further?
> >
> > Thanks in advance for your help,
> >
> > Emmanuel
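P.S. To close the loop on the "how could I troubleshoot this further" question above: a generic way to get more detail out of a crashing MDS is to raise its debug level while reproducing, along these lines (a sketch; `icadmin006` and the log path come from the dump above, so adjust to your daemon):

```
# Crank up MDS logging before the next restart (very verbose; revert after):
ceph config set mds.icadmin006 debug_mds 20
ceph config set mds.icadmin006 debug_ms 1

# Follow the daemon log while it replays and handles the retried requests:
tail -f /var/log/ceph/floki-mds.icadmin006.log

# Restore the defaults once done:
ceph config rm mds.icadmin006 debug_mds
ceph config rm mds.icadmin006 debug_ms
```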