Re: Unable to restart mds - mds crashes almost immediately after finishing recovery

Hi Emmanuel,

This should be a known issue, tracked at https://tracker.ceph.com/issues/58392, and there is a fix in https://github.com/ceph/ceph/pull/49652.

Could you stop all the clients first, then set 'max_mds' to 1, and then restart the MDS daemons?
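For reference, a minimal sketch of those steps. The filesystem name 'floki' is only inferred from the log path in your dump and the MDS id 'icadmin006' from the mds lines; please substitute the actual names from `ceph fs ls` and your MDS map:
```
# Verify the filesystem name and the current MDS map first:
ceph fs ls
ceph fs status

# After unmounting / stopping all clients, reduce to a single active rank.
# ('floki' is an assumption inferred from the log file path.)
ceph fs set floki max_mds 1

# Once the extra ranks have stopped, restart the MDS daemon(s)
# ('icadmin006' is the mds id seen in your log):
systemctl restart ceph-mds@icadmin006
```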

Thanks

On 5/3/23 16:01, Emmanuel Jaep wrote:
Hi,

I just inherited a Ceph storage cluster, so my level of confidence with the tool is certainly less than ideal.

We currently have an MDS that refuses to come back online. While reviewing the logs, I can see that, upon MDS start, the recovery goes well:
```
    -10> 2023-05-03T08:12:43.632+0200 7f345d00b700  1 mds.4.2638711 cluster recovered.
```

However, right after this message, Ceph handles a couple of client requests:
```
     -9> 2023-05-03T08:12:43.632+0200 7f345d00b700  4 mds.4.2638711 set_osd_epoch_barrier: epoch=249241
     -8> 2023-05-03T08:12:43.632+0200 7f3459003700  2 mds.4.cache Memory usage:  total 2739784, rss 2321188, heap 348412, baseline 315644, 0 / 765023 inodes have caps, 0 caps, 0 caps per inode
     -7> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.108396030:57271 lookup #0x70001516236/012385530.npy 2023-05-02T20:37:19.675666+0200 RETRY=6 caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
     -6> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.104073212:5109945 readdir #0x70001516236 2023-05-02T20:36:29.517066+0200 RETRY=6 caller_uid=180090, caller_gid=11157{0,4,27,11157,}) v5
     -5> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.104288735:3008344 readdir #0x70001516236 2023-05-02T20:36:29.520801+0200 RETRY=6 caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
     -4> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.8558540:46306346 readdir #0x700019ba15e 2023-05-01T21:26:34.303697+0200 RETRY=49 caller_uid=0, caller_gid=0{}) v2
     -3> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server handle_client_request client_request(client.96913903:2156912 create #0x1000b37db9a/street-photo-3.png 2023-05-01T17:27:37.454042+0200 RETRY=59 caller_uid=271932, caller_gid=30034{}) v2
     -2> 2023-05-03T08:12:43.688+0200 7f345d00b700  5 mds.icadmin006 handle_mds_map old map epoch 2638715 <= 2638715, discarding
```

and crashes:
```
     -1> 2023-05-03T08:12:43.692+0200 7f345d00b700 -1 /build/ceph-16.2.10/src/mds/Server.cc: In function 'void Server::handle_client_open(MDRequestRef&)' thread 7f345d00b700 time 2023-05-03T08:12:43.694660+0200
/build/ceph-16.2.10/src/mds/Server.cc: 4240: FAILED ceph_assert(cur->is_auth())

  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3462533d65]
  2: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
  3: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) [0x558323c89f04]
  4: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) [0x558323c925ef]
  5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) [0x558323cc3575]
  6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) [0x558323d7460d]
  7: (MDSContext::complete(int)+0x61) [0x558323f68681]
  8: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, bool, int)+0x3e) [0x558323d3edce]
  9: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
  10: (MDSContext::complete(int)+0x61) [0x558323f68681]
  11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0xcf) [0x558323d5ff2f]
  12: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
  13: (MDSContext::complete(int)+0x61) [0x558323f68681]
  14: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
  15: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1fa) [0x558323c24a1a]
  16: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5e) [0x558323c254fe]
  17: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x558323bfd906]
  18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f34627854e0]
  19: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
  20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
  21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
  22: clone()

      0> 2023-05-03T08:12:43.700+0200 7f345d00b700 -1 *** Caught signal (Aborted) **
  in thread 7f345d00b700 thread_name:ms_dispatch

  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0) [0x7f34622843c0]
  2: gsignal()
  3: abort()
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ad) [0x7f3462533dc0]
  5: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
  6: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) [0x558323c89f04]
  7: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) [0x558323c925ef]
  8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) [0x558323cc3575]
  9: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) [0x558323d7460d]
  10: (MDSContext::complete(int)+0x61) [0x558323f68681]
  11: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, bool, int)+0x3e) [0x558323d3edce]
  12: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
  13: (MDSContext::complete(int)+0x61) [0x558323f68681]
  14: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0xcf) [0x558323d5ff2f]
  15: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, int)+0xbf) [0x558323d602df]
  16: (MDSContext::complete(int)+0x61) [0x558323f68681]
  17: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
  18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1fa) [0x558323c24a1a]
  19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5e) [0x558323c254fe]
  20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x558323bfd906]
  21: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f34627854e0]
  22: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
  23: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
  24: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
  25: clone()
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 1 lockdep
    0/ 1 context
    1/ 1 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 1 buffer
    0/ 1 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 5 rbd_pwl
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 immutable_obj_cache
    0/ 5 client
    1/ 5 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 3 filestore
    1/ 3 journal
    0/ 0 ms
    1/ 5 mon
    0/10 monc
    1/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 crypto
    1/ 1 finisher
    1/ 1 reserver
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/ 5 rgw_sync
    1/10 civetweb
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
    0/ 0 refs
    1/ 5 compressor
    1/ 5 bluestore
    1/ 5 bluefs
    1/ 3 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 fuse
    2/ 5 mgr
    1/ 5 mgrc
    1/ 5 dpdk
    1/ 5 eventtrace
    1/ 5 prioritycache
    0/ 5 test
    0/ 5 cephfs_mirror
    0/ 5 cephsqlite
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
   139862749464320 /
   139862757857024 / md_submit
   139862766249728 /
   139862774642432 / MR_Finisher
   139862791427840 / PQ_Finisher
   139862799820544 / mds_rank_progr
   139862808213248 / ms_dispatch
   139862841784064 / ceph-mds
   139862858569472 / safe_timer
   139862875354880 / ms_dispatch
   139862892140288 / io_context_pool
   139862908925696 / admin_socket
   139862917318400 / msgr-worker-2
   139862925711104 / msgr-worker-1
   139862934103808 / msgr-worker-0
   139862951257984 / ceph-mds
   max_recent     10000
   max_new        10000
   log_file /var/log/ceph/floki-mds.icadmin006.log
--- end dump of recent events ---

```

How can I troubleshoot this further?

Thanks in advance for your help,

Emmanuel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
