MDS "newly corrupt dentry" after patch version upgrade

Hi,

After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS fails to start. After replaying the journal, it just crashes with

[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/storage [2,head] auth (dversion lock)

Immediately after the upgrade, the daemon ran for a short while, but then it crashed for reasons unknown to me and I cannot get it back up.

We have five ranks in total; the other four seem to be fine. I backed up the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0 event recover_dentries summary, but it never finishes and only eats up a lot of RAM. I stopped it after an hour, by which point it had consumed 50 GB.
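For reference, this is roughly what I ran (the backup file name is just an example):

    # export the rank 0 journal to a file as a backup before touching anything
    cephfs-journal-tool --rank=cephfs.storage:0 journal export backup.bin

    # then attempt to write the dentries from the journal back into the metadata store
    cephfs-journal-tool --rank=cephfs.storage:0 event recover_dentries summary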

Resetting the journal makes the MDS crash with a missing inode error on another top-level directory, so I re-imported the backed-up journal (commands below). Is there any way to recover from this without rebuilding the whole file system?
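For completeness, the reset and re-import were roughly (again, the file name is an example):

    # reset the rank 0 journal -- this is what led to the missing inode crash
    cephfs-journal-tool --rank=cephfs.storage:0 journal reset

    # re-import the journal backup exported earlier
    cephfs-journal-tool --rank=cephfs.storage:0 journal import backup.bin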

Thanks
Janek


Here's the full crash log:


May 02 16:16:53 xxx077 ceph-mds[3047358]:    -29> 2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 Finished replaying journal
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -28> 2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 making mds journal writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -27> 2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.journaler.mdlog(ro) set_writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -26> 2023-05-02T16:16:52.761+0200 7f51f878b700  2 mds.0.1711712 i am not alone, moving to state resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -25> 2023-05-02T16:16:52.761+0200 7f51f878b700  3 mds.0.1711712 request_state up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -24> 2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 set_want_state: up:replay -> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -23> 2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 Sending beacon up:resolve seq 15
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -22> 2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient: _send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -21> 2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -20> 2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-05-02T16:16:23.118186+0200)
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -19> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.xxx077 Updating MDS map to version 1711713 from mon.1
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -18> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 handle_mds_map i am now mds.0.1711712
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -17> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 handle_mds_map state change up:replay --> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -16> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 resolve_start
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -15> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 reopen_log
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -14> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -13> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -12> 2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request con 0x5574fe74c400 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -11> 2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request con 0x5574fe40fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -10> 2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request con 0x5574f932fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -9> 2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request con 0x5574ffce2000 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -8> 2023-05-02T16:16:53.377+0200 7f5202fa0700  5 mds.beacon.xxx077 received beacon reply up:resolve seq 15 rtt 0.616008
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -7> 2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -6> 2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -5> 2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -4> 2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -3> 2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 mds.0.cache.den(0x1 storage) newly corrupt dentry to be committed: [dentry #0x1/storage [2,head] auth (dversion lock) v=78956500 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -2> 2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/storage [2,head] auth (dversion lock) v=78956500 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -1> 2023-05-02T16:16:53.549+0200 7f51fff9a700 -1 /build/ceph-16.2.12/src/mds/CDentry.cc: In function 'bool CDentry::check_corruption(bool)' thread 7f51fff9a700 time 2023-05-02T16:16:53.549536+0200
/build/ceph-16.2.12/src/mds/CDentry.cc: 697: ceph_abort_msg("abort() called")

 ceph version 16.2.12 (5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7f52054c3495]
 2: (CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
 3: (EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
 4: (MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
 5: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
 6: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
 7: (MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
 8: (MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
 9: (MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
 10: (MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)+0x1e02) [0x5574f7d454c2]
 11: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144) [0x5574f7d47bd4]
 12: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x733) [0x5574f7bac793]
 13: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x643) [0x5574f7bcacb3]
 14: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5574f7bcb34c]
 15: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x5574f7b9f226]
 16: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f5205714020]
 17: (DispatchQueue::entry()+0x58f) [0x7f52057118bf]
 18: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
 19: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
 20: clone()
May 02 16:16:53 xxx077 ceph-mds[3047358]:      0> 2023-05-02T16:16:53.553+0200 7f51fff9a700 -1 *** Caught signal (Aborted) **
 in thread 7f51fff9a700 thread_name:ms_dispatch

 ceph version 16.2.12 (5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f5205211420]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1af) [0x7f52054c3564]
 5: (CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
 6: (EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
 7: (MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
 8: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
 9: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
 10: (MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
 11: (MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
 12: (MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
 13: (MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)+0x1e02) [0x5574f7d454c2]
 14: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144) [0x5574f7d47bd4]
 15: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x733) [0x5574f7bac793]
 16: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x643) [0x5574f7bcacb3]
 17: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5574f7bcb34c]
 18: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x5574f7b9f226]
 19: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f5205714020]
 20: (DispatchQueue::entry()+0x58f) [0x7f52057118bf]
 21: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
 22: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
 23: clone()
                                                NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- logging levels ---
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 none
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 lockdep
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 context
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 crush
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_balancer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_locker
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_log
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_log_expire
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_migrator
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 buffer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 timer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 filer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 striper
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 objecter
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rados
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd_mirror
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd_replay
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd_pwl
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 journaler
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 objectcacher
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 immutable_obj_cache
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 client
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 osd
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 optracker
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 objclass
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 3 filestore
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 3 journal
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 0 ms
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mon
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/10 monc
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 paxos
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 tp
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 auth
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 crypto
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 reserver
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 heartbeatmap
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 perfcounter
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 rgw
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 rgw_sync
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/10 civetweb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 javaclient
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 asok
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 throttle
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 0 refs
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 compressor
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 bluestore
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 bluefs
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 3 bdev
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 kstore
May 02 16:16:53 xxx077 ceph-mds[3047358]:    4/ 5 rocksdb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    4/ 5 leveldb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    4/ 5 memdb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 fuse
May 02 16:16:53 xxx077 ceph-mds[3047358]:    2/ 5 mgr
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mgrc
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 dpdk
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 eventtrace
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 prioritycache
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 test
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 cephfs_mirror
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 cephsqlite
May 02 16:16:53 xxx077 ceph-mds[3047358]:   99/99 (syslog threshold)
May 02 16:16:53 xxx077 ceph-mds[3047358]:   -2/-2 (stderr threshold)
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- pthread ID / name mapping for recent threads ---
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990037739264 /
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990054524672 /
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990062917376 / MR_Finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990079702784 / PQ_Finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990096488192 / ms_dispatch
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990130059008 / ceph-mds
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990146844416 / safe_timer
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990163629824 / ms_dispatch
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990180415232 / io_context_pool
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990197200640 / admin_socket
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990205593344 / msgr-worker-2
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990213986048 / msgr-worker-1
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990222378752 / msgr-worker-0
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990239524736 / ceph-mds
May 02 16:16:53 xxx077 ceph-mds[3047358]:   max_recent     10000
May 02 16:16:53 xxx077 ceph-mds[3047358]:   max_new        10000
May 02 16:16:53 xxx077 ceph-mds[3047358]:   log_file /var/lib/ceph/crash/2023-05-02T14:16:53.555508Z_0b05f5cb-130c-4979-95ca-1ba7f31cf7e5/log
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- end dump of recent events ---



--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



