Re: MDS "newly corrupt dentry" after patch version upgrade

Thanks!

I tried downgrading to 16.2.10 and was able to get it running again, but after a reboot I got a warning that two of the OSDs on that host had broken BlueStore compression. Restarting the two OSDs again cleared the warning, but that's still a bit concerning.
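
In case anyone hits the same thing, checking and clearing it was along these lines (OSD IDs are placeholders; use the cephadm/orch equivalent if your OSDs are container-managed):

    ceph health detail                  # shows which OSDs raised the compression warning
    systemctl restart ceph-osd@<id>     # restart each affected OSD on that host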


On 02/05/2023 16:48, Dan van der Ster wrote:
Hi Janek,

That assert is part of a new corruption check added in 16.2.12 -- see
https://github.com/ceph/ceph/commit/1771aae8e79b577acde749a292d9965264f20202

The abort is controlled by a new option:

+    Option("mds_abort_on_newly_corrupt_dentry", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
+    .set_default(true)
+    .set_description("MDS will abort if dentry is detected newly corrupted."),

So in theory you could switch that off, but it is concerning that the
metadata is corrupted already.
I'm cc'ing Patrick who has been working on this issue.
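
If switching it off becomes necessary to bring the rank up, something like this should do it (untested here, and it should be set back to true once the corruption is understood):

    ceph config set mds mds_abort_on_newly_corrupt_dentry false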

Cheers, Dan

______________________________
Clyso GmbH | https://www.clyso.com

On Tue, May 2, 2023 at 7:32 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
Hi,

After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS
fails to start. After replaying the journal, it just crashes with

[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
#0x1/storage [2,head] auth (dversion lock)

Immediately after the upgrade, it was running briefly, but then it
crashed for unknown reasons and I cannot get it back up.

We have five ranks in total; the other four seem to be fine. I backed up
the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0
event recover_dentries summary, but it never finishes and only eats up a
lot of RAM. I stopped it after an hour and 50 GB of RAM.

Resetting the journal makes the MDS crash with a missing inode error on
another top-level directory, so I re-imported the backed-up journal. Is
there any way to recover from this without rebuilding the whole file system?
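
For reference, the journal handling above was along these lines (the backup path is just a placeholder):

    # back up the rank 0 journal
    cephfs-journal-tool --rank=cephfs.storage:0 journal export /root/mds.0.journal.bin
    # attempt dentry recovery (this is the step that never finished)
    cephfs-journal-tool --rank=cephfs.storage:0 event recover_dentries summary
    # reset the journal (led to the missing-inode crash), then restore the backup
    cephfs-journal-tool --rank=cephfs.storage:0 journal reset
    cephfs-journal-tool --rank=cephfs.storage:0 journal import /root/mds.0.journal.bin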

Thanks
Janek


Here's the full crash log:


May 02 16:16:53 xxx077 ceph-mds[3047358]:    -29>
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 Finished
replaying journal
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -28>
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 making mds
journal writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -27>
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.journaler.mdlog(ro)
set_writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -26>
2023-05-02T16:16:52.761+0200 7f51f878b700  2 mds.0.1711712 i am not
alone, moving to state resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -25>
2023-05-02T16:16:52.761+0200 7f51f878b700  3 mds.0.1711712 request_state
up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -24>
2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077
set_want_state: up:replay -> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -23>
2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 Sending
beacon up:resolve seq 15
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -22>
2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient:
_send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -21>
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -20>
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2023-05-02T16:16:23.118186+0200)
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -19>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.xxx077 Updating MDS map
to version 1711713 from mon.1
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -18>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712
handle_mds_map i am now mds.0.1711712
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -17>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712
handle_mds_map state change up:replay --> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -16>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 resolve_start
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -15>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 reopen_log
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -14>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -13>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -12>
2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request
con 0x5574fe74c400 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -11>
2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request
con 0x5574fe40fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -10>
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
con 0x5574f932fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -9>
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
con 0x5574ffce2000 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -8>
2023-05-02T16:16:53.377+0200 7f5202fa0700  5 mds.beacon.xxx077 received
beacon reply up:resolve seq 15 rtt 0.616008
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -7>
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -6>
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -5>
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -4>
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -3>
2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 mds.0.cache.den(0x1
storage) newly corrupt dentry to be committed: [dentry #0x1/storage
[2,head] auth (dversion lock) v=78956500 ino=0x10000000000
state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -2>
2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 log_channel(cluster) log
[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
#0x1/storage [2,head] auth (dversion lock) v=78956500 ino=0x10000000000
state=1610612736 | inodepin=1 dirty=1 0x5574f932db80]
May 02 16:16:53 xxx077 ceph-mds[3047358]:     -1>
2023-05-02T16:16:53.549+0200 7f51fff9a700 -1
/build/ceph-16.2.12/src/mds/CDentry.cc: In function 'bool
CDentry::check_corruption(bool)' thread 7f51fff9a700 time
2023-05-02T16:16:53.549536+0200
/build/ceph-16.2.12/src/mds/CDentry.cc: 697: ceph_abort_msg("abort()
called")

                                                  ceph version 16.2.12
(5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
                                                  1:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xe0) [0x7f52054c3495]
                                                  2:
(CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
                                                  3:
(EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
                                                  4:
(MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
                                                  5:
(MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
                                                  6:
(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
                                                  7:
(MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
                                                  8:
(MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
                                                  9:
(MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
                                                  10:
(MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const>
const&)+0x1e02) [0x5574f7d454c2]
                                                  11:
(MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144)
[0x5574f7d47bd4]
                                                  12:
(MDSRank::handle_message(boost::intrusive_ptr<Message const>
const&)+0x733) [0x5574f7bac793]
                                                  13:
(MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&,
bool)+0x643) [0x5574f7bcacb3]
                                                  14:
(MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const>
const&)+0x5c) [0x5574f7bcb34c]
                                                  15:
(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6)
[0x5574f7b9f226]
                                                  16:
(Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message>
const&)+0x460) [0x7f5205714020]
                                                  17:
(DispatchQueue::entry()+0x58f) [0x7f52057118bf]
                                                  18:
(DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
                                                  19:
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
                                                  20: clone()
May 02 16:16:53 xxx077 ceph-mds[3047358]:      0>
2023-05-02T16:16:53.553+0200 7f51fff9a700 -1 *** Caught signal (Aborted) **
                                                  in thread 7f51fff9a700
thread_name:ms_dispatch

                                                  ceph version 16.2.12
(5a2d516ce4b134bfafc80c4274532ac0d56fc1e2) pacific (stable)
                                                  1:
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f5205211420]
                                                  2: gsignal()
                                                  3: abort()
                                                  4:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x1af) [0x7f52054c3564]
                                                  5:
(CDentry::check_corruption(bool)+0x86b) [0x5574f7e3a91b]
                                                  6:
(EMetaBlob::add_dir_context(CDir*, int)+0x507) [0x5574f7f9f9d7]
                                                  7:
(MDCache::create_subtree_map()+0x13e1) [0x5574f7d20dc1]
                                                  8:
(MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x5574f7f2949d]
                                                  9:
(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x206) [0x5574f7f29866]
                                                  10:
(MDCache::log_leader_commit(metareqid_t)+0x277) [0x5574f7ccc3c7]
                                                  11:
(MDCache::finish_committed_leaders()+0x87) [0x5574f7ccd0d7]
                                                  12:
(MDCache::maybe_resolve_finish()+0x78) [0x5574f7d38358]
                                                  13:
(MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const>
const&)+0x1e02) [0x5574f7d454c2]
                                                  14:
(MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x144)
[0x5574f7d47bd4]
                                                  15:
(MDSRank::handle_message(boost::intrusive_ptr<Message const>
const&)+0x733) [0x5574f7bac793]
                                                  16:
(MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&,
bool)+0x643) [0x5574f7bcacb3]
                                                  17:
(MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const>
const&)+0x5c) [0x5574f7bcb34c]
                                                  18:
(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6)
[0x5574f7b9f226]
                                                  19:
(Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message>
const&)+0x460) [0x7f5205714020]
                                                  20:
(DispatchQueue::entry()+0x58f) [0x7f52057118bf]
                                                  21:
(DispatchQueue::DispatchThread::entry()+0x11) [0x7f52057df261]
                                                  22:
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5205205609]
                                                  23: clone()
                                                  NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to interpret this.
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- logging levels ---
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 none
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 lockdep
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 context
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 crush
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_balancer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_locker
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_log
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_log_expire
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mds_migrator
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 buffer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 timer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 filer
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 striper
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 1 objecter
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rados
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd_mirror
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd_replay
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 rbd_pwl
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 journaler
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 objectcacher
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 immutable_obj_cache
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 client
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 osd
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 optracker
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 objclass
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 3 filestore
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 3 journal
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 0 ms
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mon
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/10 monc
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 paxos
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 tp
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 auth
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 crypto
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 reserver
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 heartbeatmap
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 perfcounter
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 rgw
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 rgw_sync
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/10 civetweb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 javaclient
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 asok
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 1 throttle
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 0 refs
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 compressor
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 bluestore
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 bluefs
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 3 bdev
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 kstore
May 02 16:16:53 xxx077 ceph-mds[3047358]:    4/ 5 rocksdb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    4/ 5 leveldb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    4/ 5 memdb
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 fuse
May 02 16:16:53 xxx077 ceph-mds[3047358]:    2/ 5 mgr
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 mgrc
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 dpdk
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 eventtrace
May 02 16:16:53 xxx077 ceph-mds[3047358]:    1/ 5 prioritycache
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 test
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 cephfs_mirror
May 02 16:16:53 xxx077 ceph-mds[3047358]:    0/ 5 cephsqlite
May 02 16:16:53 xxx077 ceph-mds[3047358]:   99/99 (syslog threshold)
May 02 16:16:53 xxx077 ceph-mds[3047358]:   -2/-2 (stderr threshold)
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- pthread ID / name mapping
for recent threads ---
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990037739264 /
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990054524672 /
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990062917376 / MR_Finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990079702784 / PQ_Finisher
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990096488192 / ms_dispatch
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990130059008 / ceph-mds
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990146844416 / safe_timer
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990163629824 / ms_dispatch
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990180415232 /
io_context_pool
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990197200640 / admin_socket
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990205593344 / msgr-worker-2
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990213986048 / msgr-worker-1
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990222378752 / msgr-worker-0
May 02 16:16:53 xxx077 ceph-mds[3047358]:   139990239524736 / ceph-mds
May 02 16:16:53 xxx077 ceph-mds[3047358]:   max_recent     10000
May 02 16:16:53 xxx077 ceph-mds[3047358]:   max_new        10000
May 02 16:16:53 xxx077 ceph-mds[3047358]:   log_file
/var/lib/ceph/crash/2023-05-02T14:16:53.555508Z_0b05f5cb-130c-4979-95ca-1ba7f31cf7e5/log
May 02 16:16:53 xxx077 ceph-mds[3047358]: --- end dump of recent events ---



--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
