Hi,
After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS
fails to start. After replaying the journal, it crashes with
[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
#0x1/storage [2,head] auth (dversion lock) ...]
Immediately after the upgrade, it ran for a short while, but then it
crashed for unknown reasons and I cannot get it back up.
We have five ranks in total; the other four seem to be fine. I backed up
the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0
event recover_dentries summary, but it never finishes and only eats up a
lot of RAM. I stopped it after an hour and 50 GB of RAM.
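For reference, the backup and recovery attempt looked roughly like this
(the backup file name is just what I picked):

    # export the rank 0 journal to a local file as a backup
    cephfs-journal-tool --rank=cephfs.storage:0 journal export backup.rank0.bin
    # try to write the journalled dentries back into the metadata pool
    cephfs-journal-tool --rank=cephfs.storage:0 event recover_dentries summary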
Resetting the journal makes the MDS crash with a missing inode error on
another top-level directory, so I re-imported the backed-up journal. Is
there any way to recover from this without rebuilding the whole file system?
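(For completeness, the reset and re-import above were done with roughly
these commands, using the backup file from before:)

    # reset the rank 0 journal -- this is what led to the missing inode crash
    cephfs-journal-tool --rank=cephfs.storage:0 journal reset
    # re-import the previously exported journal backup
    cephfs-journal-tool --rank=cephfs.storage:0 journal import backup.rank0.bin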
Thanks
Janek
Here's the full crash log:
May 02 16:16:53 xxx077 ceph-mds[3047358]: -29>
2023-05-02T16:16:52.761+0200 7f51f878b700 1 mds.0.1711712 Finished
replaying journal
May 02 16:16:53 xxx077 ceph-mds[3047358]: -28>
2023-05-02T16:16:52.761+0200 7f51f878b700 1 mds.0.1711712 making mds
journal writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]: -27>
2023-05-02T16:16:52.761+0200 7f51f878b700 1 mds.0.journaler.mdlog(ro)
set_writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]: -26>
2023-05-02T16:16:52.761+0200 7f51f878b700 2 mds.0.1711712 i am not
alone, moving to state resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]: -25>
2023-05-02T16:16:52.761+0200 7f51f878b700 3 mds.0.1711712 request_state
up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]: -24>
2023-05-02T16:16:52.761+0200 7f51f878b700 5 mds.beacon.xxx077
set_want_state: up:replay -> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]: -23>
2023-05-02T16:16:52.761+0200 7f51f878b700 5 mds.beacon.xxx077 Sending
beacon up:resolve seq 15
May 02 16:16:53 xxx077 ceph-mds[3047358]: -22>
2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient:
_send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -21>
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
May 02 16:16:53 xxx077 ceph-mds[3047358]: -20>
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2023-05-02T16:16:23.118186+0200)
May 02 16:16:53 xxx077 ceph-mds[3047358]: -19>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.xxx077 Updating MDS map
to version 1711713 from mon.1
May 02 16:16:53 xxx077 ceph-mds[3047358]: -18>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712
handle_mds_map i am now mds.0.1711712
May 02 16:16:53 xxx077 ceph-mds[3047358]: -17>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712
handle_mds_map state change up:replay --> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]: -16>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 resolve_start
May 02 16:16:53 xxx077 ceph-mds[3047358]: -15>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 reopen_log
May 02 16:16:53 xxx077 ceph-mds[3047358]: -14>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 recovery set
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]: -13>
2023-05-02T16:16:53.373+0200 7f51fff9a700 1 mds.0.1711712 recovery set
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]: -12>
2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request
con 0x5574fe74c400 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -11>
2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request
con 0x5574fe40fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -10>
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
con 0x5574f932fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -9>
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
con 0x5574ffce2000 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -8>
2023-05-02T16:16:53.377+0200 7f5202fa0700 5 mds.beacon.xxx077 received
beacon reply up:resolve seq 15 rtt 0.616008
May 02 16:16:53 xxx077 ceph-mds[3047358]: -7>
2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -6>
2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -5>
2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -4>
2023-05-02T16:16:53.393+0200 7f51fff9a700 5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -3>
2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 mds.0.cache.den(0x1
storage) newly corrupt dentry to be committed: [dentry #0x1/storage
[2,head] auth (dversion lock) v=78956500 ino=0x10000000000
state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
May 02 16:16:53 xxx077 ceph-mds[3047358]: -2>
2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 log_channel(cluster) log
[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
#0x1/storage [2,head] auth (dversion lock) v=78956500 ino=0x10000000000
state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
May 02 16:16:53 xxx077 ceph-mds[3047358]: -1>
2023-05-02T16:16:53.549+0200 7f51fff9a700 -1
/build/ceph-16.2.12/src/mds/CDentry.cc: In function 'bool
CDentry::check_corruption(bool)' thread 7f51fff9a700 time
2023-05-02T16:16:53.549536+0200
/build/ceph-16.2.12/src/mds/CDentry.cc: 697: ceph_abort_msg("abort()
called")
ceph version 16.2.12 (5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7f52054c3495]
2: (CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
3: (EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
4: (MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
5: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
6: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
7: (MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
8: (MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
9: (MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
10: (MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)+0x1e02) [0x5574f7d454c2]
11: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144) [0x5574f7d47bd4]
12: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x733) [0x5574f7bac793]
13: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x643) [0x5574f7bcacb3]
14: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5574f7bcb34c]
15: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x5574f7b9f226]
16: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f5205714020]
17: (DispatchQueue::entry()+0x58f) [0x7f52057118bf]
18: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
19: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
20: clone()
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0>
2023-05-02T16:16:53.553+0200 7f51fff9a700 -1 *** Caught signal (Aborted) **
in thread 7f51fff9a700
thread_name:ms_dispatch
ceph version 16.2.12 (5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f5205211420]
2: gsignal()
3: abort()
4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1af) [0x7f52054c3564]
5: (CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
6: (EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
7: (MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
8: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
9: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
10: (MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
11: (MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
12: (MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
13: (MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)+0x1e02) [0x5574f7d454c2]
14: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144) [0x5574f7d47bd4]
15: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x733) [0x5574f7bac793]
16: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x643) [0x5574f7bcacb3]
17: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5574f7bcb34c]
18: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) [0x5574f7b9f226]
19: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f5205714020]
20: (DispatchQueue::entry()+0x58f) [0x7f52057118bf]
21: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
22: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
23: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- logging levels ---
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 none
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 lockdep
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 context
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 crush
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_balancer
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_locker
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_log
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_log_expire
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mds_migrator
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 buffer
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 timer
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 filer
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 striper
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 1 objecter
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rados
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd_mirror
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd_replay
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 rbd_pwl
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 journaler
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 objectcacher
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 immutable_obj_cache
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 client
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 osd
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 optracker
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 objclass
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 3 filestore
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 3 journal
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 0 ms
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mon
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/10 monc
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 paxos
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 tp
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 auth
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 crypto
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 reserver
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 heartbeatmap
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 perfcounter
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 rgw
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 rgw_sync
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/10 civetweb
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 javaclient
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 asok
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 1 throttle
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 0 refs
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 compressor
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 bluestore
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 bluefs
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 3 bdev
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 kstore
May 02 16:16:53 xxx077 ceph-mds[3047358]: 4/ 5 rocksdb
May 02 16:16:53 xxx077 ceph-mds[3047358]: 4/ 5 leveldb
May 02 16:16:53 xxx077 ceph-mds[3047358]: 4/ 5 memdb
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 fuse
May 02 16:16:53 xxx077 ceph-mds[3047358]: 2/ 5 mgr
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 mgrc
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 dpdk
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 eventtrace
May 02 16:16:53 xxx077 ceph-mds[3047358]: 1/ 5 prioritycache
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 test
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 cephfs_mirror
May 02 16:16:53 xxx077 ceph-mds[3047358]: 0/ 5 cephsqlite
May 02 16:16:53 xxx077 ceph-mds[3047358]: 99/99 (syslog threshold)
May 02 16:16:53 xxx077 ceph-mds[3047358]: -2/-2 (stderr threshold)
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- pthread ID / name mapping
for recent threads ---
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990037739264 /
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990054524672 /
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990062917376 / MR_Finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990079702784 / PQ_Finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990096488192 / ms_dispatch
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990130059008 / ceph-mds
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990146844416 / safe_timer
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990163629824 / ms_dispatch
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990180415232 /
io_context_pool
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990197200640 / admin_socket
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990205593344 / msgr-worker-2
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990213986048 / msgr-worker-1
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990222378752 / msgr-worker-0
May 02 16:16:53 xxx077 ceph-mds[3047358]: 139990239524736 / ceph-mds
May 02 16:16:53 xxx077 ceph-mds[3047358]: max_recent 10000
May 02 16:16:53 xxx077 ceph-mds[3047358]: max_new 10000
May 02 16:16:53 xxx077 ceph-mds[3047358]: log_file
/var/lib/ceph/crash/2023-05-02T14:16:53.555508Z_0b05f5cb-130c-4979-95ca-1ba7f31cf7e5/log
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- end dump of recent events ---
--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany
Phone: +49 3643 58 3577
www.webis.de