Apart from the PR mentioned by Xiubo, #49691 <https://github.com/ceph/ceph/pull/49691> also contains a good fix for this issue.
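For reference, Xiubo's suggestion below (stop the clients, bring 'max_mds' down to 1, and restart the MDS daemons) would look roughly like the following. This is only a sketch: the filesystem name 'cephfs', the mount point '/mnt/cephfs', and the systemd unit are placeholders, so adjust them to your deployment (the daemon name 'icadmin006' is simply the one visible in your log).

```
# On every client, stop CephFS access first, e.g. by unmounting:
umount /mnt/cephfs

# Reduce the number of active MDS ranks to 1:
ceph fs set cephfs max_mds 1

# Watch the extra ranks wind down:
ceph fs status cephfs

# Then restart the MDS daemons (example for a package/systemd deployment):
systemctl restart ceph-mds@icadmin006
```

Once a single rank is active and healthy again, 'max_mds' can be raised back to its previous value.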
- Dhairya

On Fri, May 5, 2023 at 6:32 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

> Hi Emmanuel,
>
> This should be a known issue, tracked at https://tracker.ceph.com/issues/58392,
> and there is a fix in https://github.com/ceph/ceph/pull/49652.
>
> Could you stop all the clients first, then set 'max_mds' to 1, and then
> restart the MDS daemons?
>
> Thanks
>
> On 5/3/23 16:01, Emmanuel Jaep wrote:
> > Hi,
> >
> > I just inherited a Ceph storage cluster, so my level of confidence with
> > the tool is certainly less than ideal.
> >
> > We currently have an MDS daemon that refuses to come back online. While
> > reviewing the logs, I can see that, upon MDS start, the recovery goes well:
> > ```
> > -10> 2023-05-03T08:12:43.632+0200 7f345d00b700 1 mds.4.2638711 cluster recovered.
> > 12: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> > ```
> >
> > However, right after this message, Ceph handles a couple of client requests:
> > ```
> > -9> 2023-05-03T08:12:43.632+0200 7f345d00b700 4 mds.4.2638711 set_osd_epoch_barrier: epoch=249241
> > -8> 2023-05-03T08:12:43.632+0200 7f3459003700 2 mds.4.cache Memory usage: total 2739784, rss 2321188, heap 348412, baseline 315644, 0 / 765023 inodes have caps, 0 caps, 0 caps per inode
> > -7> 2023-05-03T08:12:43.688+0200 7f3458802700 4 mds.4.server handle_client_request client_request(client.108396030:57271 lookup #0x70001516236/012385530.npy 2023-05-02T20:37:19.675666+0200 RETRY=6 caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
> > -6> 2023-05-03T08:12:43.688+0200 7f3458802700 4 mds.4.server handle_client_request client_request(client.104073212:5109945 readdir #0x70001516236 2023-05-02T20:36:29.517066+0200 RETRY=6 caller_uid=180090, caller_gid=11157{0,4,27,11157,}) v5
> > -5> 2023-05-03T08:12:43.688+0200 7f3458802700 4 mds.4.server handle_client_request client_request(client.104288735:3008344 readdir #0x70001516236 2023-05-02T20:36:29.520801+0200 RETRY=6 caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
> > -4> 2023-05-03T08:12:43.688+0200 7f3458802700 4 mds.4.server handle_client_request client_request(client.8558540:46306346 readdir #0x700019ba15e 2023-05-01T21:26:34.303697+0200 RETRY=49 caller_uid=0, caller_gid=0{}) v2
> > -3> 2023-05-03T08:12:43.688+0200 7f3458802700 4 mds.4.server handle_client_request client_request(client.96913903:2156912 create #0x1000b37db9a/street-photo-3.png 2023-05-01T17:27:37.454042+0200 RETRY=59 caller_uid=271932, caller_gid=30034{}) v2
> > -2> 2023-05-03T08:12:43.688+0200 7f345d00b700 5 mds.icadmin006 handle_mds_map old map epoch 2638715 <= 2638715, discarding
> > ```
> >
> > and crashes:
> > ```
> > -1> 2023-05-03T08:12:43.692+0200 7f345d00b700 -1 /build/ceph-16.2.10/src/mds/Server.cc: In function 'void Server::handle_client_open(MDRequestRef&)' thread 7f345d00b700 time 2023-05-03T08:12:43.694660+0200
> > /build/ceph-16.2.10/src/mds/Server.cc: 4240: FAILED ceph_assert(cur->is_auth())
> >
> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3462533d65]
> > 2: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
> > 3: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) [0x558323c89f04]
> > 4: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) [0x558323c925ef]
> > 5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) [0x558323cc3575]
> > 6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) [0x558323d7460d]
> > 7: (MDSContext::complete(int)+0x61) [0x558323f68681]
> > 8: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, bool, int)+0x3e) [0x558323d3edce]
> > 9: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
> > 10: (MDSContext::complete(int)+0x61) [0x558323f68681]
> > 11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0xcf) [0x558323d5ff2f]
> > 12: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> > 13: (MDSContext::complete(int)+0x61) [0x558323f68681]
> > 14: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
> > 15: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1fa) [0x558323c24a1a]
> > 16: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5e) [0x558323c254fe]
> > 17: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x558323bfd906]
> > 18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f34627854e0]
> > 19: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
> > 20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
> > 21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
> > 22: clone()
> >
> > 0> 2023-05-03T08:12:43.700+0200 7f345d00b700 -1 *** Caught signal (Aborted) **
> > in thread 7f345d00b700 thread_name:ms_dispatch
> >
> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> > 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0) [0x7f34622843c0]
> > 2: gsignal()
> > 3: abort()
> > 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ad) [0x7f3462533dc0]
> > 5: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
> > 6: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) [0x558323c89f04]
> > 7: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) [0x558323c925ef]
> > 8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) [0x558323cc3575]
> > 9: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) [0x558323d7460d]
> > 10: (MDSContext::complete(int)+0x61) [0x558323f68681]
> > 11: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, bool, int)+0x3e) [0x558323d3edce]
> > 12: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
> > 13: (MDSContext::complete(int)+0x61) [0x558323f68681]
> > 14: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0xcf) [0x558323d5ff2f]
> > 15: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> > 16: (MDSContext::complete(int)+0x61) [0x558323f68681]
> > 17: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
> > 18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1fa) [0x558323c24a1a]
> > 19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5e) [0x558323c254fe]
> > 20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x558323bfd906]
> > 21: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f34627854e0]
> > 22: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
> > 23: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
> > 24: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
> > 25: clone()
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >
> > --- logging levels ---
> > 0/ 5 none
> > 0/ 1 lockdep
> > 0/ 1 context
> > 1/ 1 crush
> > 1/ 5 mds
> > 1/ 5 mds_balancer
> > 1/ 5 mds_locker
> > 1/ 5 mds_log
> > 1/ 5 mds_log_expire
> > 1/ 5 mds_migrator
> > 0/ 1 buffer
> > 0/ 1 timer
> > 0/ 1 filer
> > 0/ 1 striper
> > 0/ 1 objecter
> > 0/ 5 rados
> > 0/ 5 rbd
> > 0/ 5 rbd_mirror
> > 0/ 5 rbd_replay
> > 0/ 5 rbd_pwl
> > 0/ 5 journaler
> > 0/ 5 objectcacher
> > 0/ 5 immutable_obj_cache
> > 0/ 5 client
> > 1/ 5 osd
> > 0/ 5 optracker
> > 0/ 5 objclass
> > 1/ 3 filestore
> > 1/ 3 journal
> > 0/ 0 ms
> > 1/ 5 mon
> > 0/10 monc
> > 1/ 5 paxos
> > 0/ 5 tp
> > 1/ 5 auth
> > 1/ 5 crypto
> > 1/ 1 finisher
> > 1/ 1 reserver
> > 1/ 5 heartbeatmap
> > 1/ 5 perfcounter
> > 1/ 5 rgw
> > 1/ 5 rgw_sync
> > 1/10 civetweb
> > 1/ 5 javaclient
> > 1/ 5 asok
> > 1/ 1 throttle
> > 0/ 0 refs
> > 1/ 5 compressor
> > 1/ 5 bluestore
> > 1/ 5 bluefs
> > 1/ 3 bdev
> > 1/ 5 kstore
> > 4/ 5 rocksdb
> > 4/ 5 leveldb
> > 4/ 5 memdb
> > 1/ 5 fuse
> > 2/ 5 mgr
> > 1/ 5 mgrc
> > 1/ 5 dpdk
> > 1/ 5 eventtrace
> > 1/ 5 prioritycache
> > 0/ 5 test
> > 0/ 5 cephfs_mirror
> > 0/ 5 cephsqlite
> > -2/-2 (syslog threshold)
> > -1/-1 (stderr threshold)
> > --- pthread ID / name mapping for recent threads ---
> > 139862749464320 /
> > 139862757857024 / md_submit
> > 139862766249728 /
> > 139862774642432 / MR_Finisher
> > 139862791427840 / PQ_Finisher
> > 139862799820544 / mds_rank_progr
> > 139862808213248 / ms_dispatch
> > 139862841784064 / ceph-mds
> > 139862858569472 / safe_timer
> > 139862875354880 / ms_dispatch
> > 139862892140288 / io_context_pool
> > 139862908925696 / admin_socket
> > 139862917318400 / msgr-worker-2
> > 139862925711104 / msgr-worker-1
> > 139862934103808 / msgr-worker-0
> > 139862951257984 / ceph-mds
> > max_recent 10000
> > max_new 10000
> > log_file /var/log/ceph/floki-mds.icadmin006.log
> > --- end dump of recent events ---
> > ```
> >
> > How could I troubleshoot that further?
> >
> > Thanks in advance for your help,
> >
> > Emmanuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx