Hi Janek,

That assert is part of a new corruption check added in 16.2.12 -- see
https://github.com/ceph/ceph/commit/1771aae8e79b577acde749a292d9965264f20202

The abort is controlled by a new option:

+ Option("mds_abort_on_newly_corrupt_dentry", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
+ .set_default(true)
+ .set_description("MDS will abort if dentry is detected newly corrupted."),

So in theory you could switch that off (a sketch of the command is appended at the
end of this message), but it is concerning that the metadata is already corrupted.
I'm cc'ing Patrick, who has been working on this issue.

Cheers,
Dan

______________________________
Clyso GmbH | https://www.clyso.com


On Tue, May 2, 2023 at 7:32 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS
> fails to start. After replaying the journal, it just crashes with
>
> [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
> #0x1/storage [2,head] auth (dversion lock)
>
> Immediately after the upgrade, I had it running briefly, but then it
> crashed for unknown reasons and I cannot get it back up.
>
> We have five ranks in total; the other four seem to be fine. I backed up
> the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0
> event recover_dentries summary, but it never finishes and only eats up a
> lot of RAM. I stopped it after an hour and 50 GB of RAM.
>
> Resetting the journal makes the MDS crash with a missing inode error on
> another top-level directory, so I re-imported the backed-up journal. Is
> there any way to recover from this without rebuilding the whole file
> system?
>
> Thanks
> Janek
>
>
> Here's the full crash log:
>
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -29> 2023-05-02T16:16:52.761+0200 7f51f878b700 1 mds.0.1711712 Finished replaying journal
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -28> 2023-05-02T16:16:52.761+0200 7f51f878b700 1 mds.0.1711712 making mds journal writeable
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -27> 2023-05-02T16:16:52.761+0200 7f51f878b700 1 mds.0.journaler.mdlog(ro) set_writeable
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -26> 2023-05-02T16:16:52.761+0200 7f51f878b700 2 mds.0.1711712 i am not alone, moving to state resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -25> 2023-05-02T16:16:52.761+0200 7f51f878b700 3 mds.0.1711712 request_state up:resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -24> 2023-05-02T16:16:52.761+0200 7f51f878b700 5 mds.beacon.xxx077 set_want_state: up:replay -> up:resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -23> 2023-05-02T16:16:52.761+0200 7f51f878b700 5 mds.beacon.xxx077 Sending beacon up:resolve seq 15
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -22> 2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient: _send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -21> 2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -20> 2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-05-02T16:16:23.118186+0200)
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -19> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.xxx077 Updating MDS map to version 1711713 from mon.1
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -18> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 handle_mds_map i am now mds.0.1711712
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -17> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 handle_mds_map state change up:replay --> up:resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -16> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 resolve_start
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -15> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 reopen_log
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -14> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 recovery set is 1,2,3,4
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -13> 2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 recovery set is 1,2,3,4
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -12> 2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request con 0x5574fe74c400 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -11> 2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request con 0x5574fe40fc00 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -10> 2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request con 0x5574f932fc00 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -9> 2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request con 0x5574ffce2000 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -8> 2023-05-02T16:16:53.377+0200 7f5202fa0700 5 mds.beacon.xxx077 received beacon reply up:resolve seq 15 rtt 0.616008
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -7> 2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -6> 2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -5> 2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -4> 2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map old map epoch 1711713 <= 1711713, discarding
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -3> 2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 mds.0.cache.den(0x1 storage) newly corrupt dentry to be committed: [dentry #0x1/storage [2,head] auth (dversion lock) v=78956500 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -2> 2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/storage [2,head] auth (dversion lock) v=78956500 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -1> 2023-05-02T16:16:53.549+0200 7f51fff9a700 -1 /build/ceph-16.2.12/src/mds/CDentry.cc: In function 'bool CDentry::check_corruption(bool)' thread 7f51fff9a700 time 2023-05-02T16:16:53.549536+0200
> /build/ceph-16.2.12/src/mds/CDentry.cc: 697: ceph_abort_msg("abort() called")
>
> ceph version 16.2.12 (5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
> 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7f52054c3495]
> 2: (CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
> 3: (EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
> 4: (MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
> 5: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
> 6: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
> 7: (MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
> 8: (MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
> 9: (MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
> 10: (MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)+0x1e02) [0x5574f7d454c2]
> 11: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144) [0x5574f7d47bd4]
> 12: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x733) [0x5574f7bac793]
> 13: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x643) [0x5574f7bcacb3]
> 14: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5574f7bcb34c]
> 15: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x5574f7b9f226]
> 16: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f5205714020]
> 17: (DispatchQueue::entry()+0x58f) [0x7f52057118bf]
> 18: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
> 19: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
> 20: clone()
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0> 2023-05-02T16:16:53.553+0200 7f51fff9a700 -1 *** Caught signal (Aborted) ** in thread 7f51fff9a700 thread_name:ms_dispatch
>
> ceph version 16.2.12 (5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
> 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f5205211420]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1af) [0x7f52054c3564]
> 5: (CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
> 6: (EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
> 7: (MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
> 8: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
> 9: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
> 10: (MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
> 11: (MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
> 12: (MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
> 13: (MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)+0x1e02) [0x5574f7d454c2]
> 14: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144) [0x5574f7d47bd4]
> 15: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x733) [0x5574f7bac793]
> 16: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x643) [0x5574f7bcacb3]
> 17: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5574f7bcb34c]
> 18: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x5574f7b9f226]
> 19: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f5205714020]
> 20: (DispatchQueue::entry()+0x58f) [0x7f52057118bf]
> 21: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
> 22: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
> 23: clone()
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> May 02 16:16:53 xxx077 ceph-mds[3047358]: --- logging levels ---
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 none
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 lockdep
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 context
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 crush
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_balancer
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_locker
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_log
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_log_expire
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_migrator
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 buffer
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 timer
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 filer
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 striper
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 objecter
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rados
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd_mirror
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd_replay
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd_pwl
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 journaler
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 objectcacher
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 immutable_obj_cache
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 client
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 osd
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 optracker
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 objclass
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 3 filestore
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 3 journal
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 0 ms
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mon
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/10 monc
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 paxos
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 tp
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 auth
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 crypto
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 finisher
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 reserver
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 heartbeatmap
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 perfcounter
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 rgw
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 rgw_sync
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/10 civetweb
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 javaclient
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 asok
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 throttle
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 0 refs
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 compressor
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 bluestore
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 bluefs
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 3 bdev
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 kstore
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 4/ 5 rocksdb
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 4/ 5 leveldb
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 4/ 5 memdb
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 fuse
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 2/ 5 mgr
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mgrc
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 dpdk
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 eventtrace
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 prioritycache
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 test
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 cephfs_mirror
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 cephsqlite
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 99/99 (syslog threshold)
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -2/-2 (stderr threshold)
> May 02 16:16:53 xxx077 ceph-mds[3047358]: --- pthread ID / name mapping for recent threads ---
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990037739264 /
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990054524672 /
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990062917376 / MR_Finisher
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990079702784 / PQ_Finisher
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990096488192 / ms_dispatch
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990130059008 / ceph-mds
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990146844416 / safe_timer
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990163629824 / ms_dispatch
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990180415232 / io_context_pool
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990197200640 / admin_socket
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990205593344 / msgr-worker-2
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990213986048 / msgr-worker-1
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990222378752 / msgr-worker-0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990239524736 / ceph-mds
> May 02 16:16:53 xxx077 ceph-mds[3047358]: max_recent 10000
> May 02 16:16:53 xxx077 ceph-mds[3047358]: max_new 10000
> May 02 16:16:53 xxx077 ceph-mds[3047358]: log_file /var/lib/ceph/crash/2023-05-02T14:16:53.555508Z_0b05f5cb-130c-4979-95ca-1ba7f31cf7e5/log
> May 02 16:16:53 xxx077 ceph-mds[3047358]: --- end dump of recent events ---
>
>
> --
>
> Bauhaus-Universität Weimar
> Bauhausstr. 9a, R308
> 99423 Weimar, Germany
>
> Phone: +49 3643 58 3577
> www.webis.de
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
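
Appended sketch referenced above: one way the check could be switched off while the
corruption is investigated, assuming you accept the risk of committing a possibly
bad dentry. The option name comes from the commit linked in the reply; the
`ceph config set` invocation is shown only as an illustrative example, not a
recommendation.

    # Example: persist the override in the mon config store; the MDS picks it up at startup.
    ceph config set mds mds_abort_on_newly_corrupt_dentry false

If you do go this route, set the option back to true once the corrupt dentry has been
dealt with, so the check is active again.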