Re: Unable to restart mds - mds crashes almost immediately after finishing recovery

Apart from the PR mentioned by Xiubo, #49691
<https://github.com/ceph/ceph/pull/49691> also contains a good fix for this
issue.
- Dhairya


On Fri, May 5, 2023 at 6:32 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

> Hi Emmanuel,
>
> This should be the known issue https://tracker.ceph.com/issues/58392,
> and there is a fix in https://github.com/ceph/ceph/pull/49652.
>
> Could you stop all the clients first, then set 'max_mds' to 1, and then
> restart the MDS daemons?
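>
> A minimal sketch of those steps (the filesystem name 'cephfs' and the
> mount point are assumptions, and a systemd-based deployment is assumed;
> adjust to your cluster):
>
> ```
> # on each client host, stop the workload and unmount (example mount point)
> umount /mnt/cephfs
>
> # drop back to a single active MDS rank
> ceph fs set cephfs max_mds 1
>
> # restart the MDS daemons
> systemctl restart ceph-mds.target
>
> # watch the ranks come back
> ceph fs status
> ```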
>
> Thanks
>
> On 5/3/23 16:01, Emmanuel Jaep wrote:
> > Hi,
> >
> > I just inherited a Ceph storage cluster. Therefore, my level of
> confidence with the tool is certainly less than ideal.
> >
> > We currently have an mds server that refuses to come back online. While
> reviewing the logs, I can see that, upon mds start, the recovery goes well:
> > ```
> >     -10> 2023-05-03T08:12:43.632+0200 7f345d00b700  1 mds.4.2638711
> cluster recovered.
> > ```
> >
> > However, right after this message, ceph handles a couple of client
> requests:
> > ```
> >      -9> 2023-05-03T08:12:43.632+0200 7f345d00b700  4 mds.4.2638711
> set_osd_epoch_barrier: epoch=249241
> >      -8> 2023-05-03T08:12:43.632+0200 7f3459003700  2 mds.4.cache Memory
> usage:  total 2739784, rss 2321188, heap 348412, baseline 315644, 0 /
> 765023 inodes have caps, 0 caps, 0 caps per inode
> >      -7> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server
> handle_client_request client_request(client.108396030:57271 lookup
> #0x70001516236/012385530.npy 2023-05-02T20:37:19.675666+0200 RETRY=6
> caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
> >      -6> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server
> handle_client_request client_request(client.104073212:5109945 readdir
> #0x70001516236 2023-05-02T20:36:29.517066+0200 RETRY=6 caller_uid=180090,
> caller_gid=11157{0,4,27,11157,}) v5
> >      -5> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server
> handle_client_request client_request(client.104288735:3008344 readdir
> #0x70001516236 2023-05-02T20:36:29.520801+0200 RETRY=6 caller_uid=135551,
> caller_gid=11157{0,4,27,11157,}) v5
> >      -4> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server
> handle_client_request client_request(client.8558540:46306346 readdir
> #0x700019ba15e 2023-05-01T21:26:34.303697+0200 RETRY=49 caller_uid=0,
> caller_gid=0{}) v2
> >      -3> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server
> handle_client_request client_request(client.96913903:2156912 create
> #0x1000b37db9a/street-photo-3.png 2023-05-01T17:27:37.454042+0200 RETRY=59
> caller_uid=271932, caller_gid=30034{}) v2
> >      -2> 2023-05-03T08:12:43.688+0200 7f345d00b700  5 mds.icadmin006
> handle_mds_map old map epoch 2638715 <= 2638715, discarding
> > ```
> >
> > and crashes:
> > ```
> >      -1> 2023-05-03T08:12:43.692+0200 7f345d00b700 -1
> /build/ceph-16.2.10/src/mds/Server.cc: In function 'void
> Server::handle_client_open(MDRequestRef&)' thread 7f345d00b700 time
> 2023-05-03T08:12:43.694660+0200
> > /build/ceph-16.2.10/src/mds/Server.cc: 4240: FAILED
> ceph_assert(cur->is_auth())
> >
> >   ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
> pacific (stable)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x152) [0x7f3462533d65]
> >   2: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
> >   3:
> (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834)
> [0x558323c89f04]
> >   4:
> (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f)
> [0x558323c925ef]
> >   5:
> (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45)
> [0x558323cc3575]
> >   6:
> (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d)
> [0x558323d7460d]
> >   7: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >   8: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t,
> MDSContext*, bool, int)+0x3e) [0x558323d3edce]
> >   9: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
> >   10: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >   11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&,
> int)+0xcf) [0x558323d5ff2f]
> >   12: (MDCache::_open_ino_traverse_dir(inodeno_t,
> MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> >   13: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >   14: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
> >   15: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&,
> bool)+0x1fa) [0x558323c24a1a]
> >   16: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message
> const> const&)+0x5e) [0x558323c254fe]
> >   17: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message>
> const&)+0x1d6) [0x558323bfd906]
> >   18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message>
> const&)+0x460) [0x7f34627854e0]
> >   19: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
> >   20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
> >   21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
> >   22: clone()
> >
> >       0> 2023-05-03T08:12:43.700+0200 7f345d00b700 -1 *** Caught signal
> (Aborted) **
> >   in thread 7f345d00b700 thread_name:ms_dispatch
> >
> >   ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
> pacific (stable)
> >   1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0) [0x7f34622843c0]
> >   2: gsignal()
> >   3: abort()
> >   4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1ad) [0x7f3462533dc0]
> >   5: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
> >   6:
> (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834)
> [0x558323c89f04]
> >   7:
> (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f)
> [0x558323c925ef]
> >   8:
> (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45)
> [0x558323cc3575]
> >   9:
> (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d)
> [0x558323d7460d]
> >   10: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >   11: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t,
> MDSContext*, bool, int)+0x3e) [0x558323d3edce]
> >   12: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
> >   13: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >   14: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&,
> int)+0xcf) [0x558323d5ff2f]
> >   15: (MDCache::_open_ino_traverse_dir(inodeno_t,
> MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
> >   16: (MDSContext::complete(int)+0x61) [0x558323f68681]
> >   17: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
> >   18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&,
> bool)+0x1fa) [0x558323c24a1a]
> >   19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message
> const> const&)+0x5e) [0x558323c254fe]
> >   20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message>
> const&)+0x1d6) [0x558323bfd906]
> >   21: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message>
> const&)+0x460) [0x7f34627854e0]
> >   22: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
> >   23: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
> >   24: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
> >   25: clone()
> >   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> >
> > --- logging levels ---
> >     0/ 5 none
> >     0/ 1 lockdep
> >     0/ 1 context
> >     1/ 1 crush
> >     1/ 5 mds
> >     1/ 5 mds_balancer
> >     1/ 5 mds_locker
> >     1/ 5 mds_log
> >     1/ 5 mds_log_expire
> >     1/ 5 mds_migrator
> >     0/ 1 buffer
> >     0/ 1 timer
> >     0/ 1 filer
> >     0/ 1 striper
> >     0/ 1 objecter
> >     0/ 5 rados
> >     0/ 5 rbd
> >     0/ 5 rbd_mirror
> >     0/ 5 rbd_replay
> >     0/ 5 rbd_pwl
> >     0/ 5 journaler
> >     0/ 5 objectcacher
> >     0/ 5 immutable_obj_cache
> >     0/ 5 client
> >     1/ 5 osd
> >     0/ 5 optracker
> >     0/ 5 objclass
> >     1/ 3 filestore
> >     1/ 3 journal
> >     0/ 0 ms
> >     1/ 5 mon
> >     0/10 monc
> >     1/ 5 paxos
> >     0/ 5 tp
> >     1/ 5 auth
> >     1/ 5 crypto
> >     1/ 1 finisher
> >     1/ 1 reserver
> >     1/ 5 heartbeatmap
> >     1/ 5 perfcounter
> >     1/ 5 rgw
> >     1/ 5 rgw_sync
> >     1/10 civetweb
> >     1/ 5 javaclient
> >     1/ 5 asok
> >     1/ 1 throttle
> >     0/ 0 refs
> >     1/ 5 compressor
> >     1/ 5 bluestore
> >     1/ 5 bluefs
> >     1/ 3 bdev
> >     1/ 5 kstore
> >     4/ 5 rocksdb
> >     4/ 5 leveldb
> >     4/ 5 memdb
> >     1/ 5 fuse
> >     2/ 5 mgr
> >     1/ 5 mgrc
> >     1/ 5 dpdk
> >     1/ 5 eventtrace
> >     1/ 5 prioritycache
> >     0/ 5 test
> >     0/ 5 cephfs_mirror
> >     0/ 5 cephsqlite
> >    -2/-2 (syslog threshold)
> >    -1/-1 (stderr threshold)
> > --- pthread ID / name mapping for recent threads ---
> >    139862749464320 /
> >    139862757857024 / md_submit
> >    139862766249728 /
> >    139862774642432 / MR_Finisher
> >    139862791427840 / PQ_Finisher
> >    139862799820544 / mds_rank_progr
> >    139862808213248 / ms_dispatch
> >    139862841784064 / ceph-mds
> >    139862858569472 / safe_timer
> >    139862875354880 / ms_dispatch
> >    139862892140288 / io_context_pool
> >    139862908925696 / admin_socket
> >    139862917318400 / msgr-worker-2
> >    139862925711104 / msgr-worker-1
> >    139862934103808 / msgr-worker-0
> >    139862951257984 / ceph-mds
> >    max_recent     10000
> >    max_new        10000
> >    log_file /var/log/ceph/floki-mds.icadmin006.log
> > --- end dump of recent events ---
> >
> > ```
> >
> > How could I troubleshoot that further?
> >
> > Thanks in advance for your help,
> >
> > Emmanuel
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



