Aha, I knew it was too short to be true. It seems like a client trying to delete a file is what's triggering all this. There are many, many lines like -5 and -4 here.

-5> 2021-08-24T21:17:38.293+0000 7fe9a5b0b700 0 mds.0.cache.dir(0x609) _fetched badness: got (but i already had) [inode 0x100028e2fe5 [...2,head] ~mds0/stray3/100028e2fe5 auth v265487174 snaprealm=0x5608e16d3a00 s=14 nl=15 n(v0 rc2021-05-25T15:33:03.443023+0000 b14 1=1+0) (iversion lock) 0x5608e16d7180] mode 33188 mtime 2014-07-03T22:35:25.000000+0000
-4> 2021-08-24T21:17:38.293+0000 7fe9a5b0b700 -1 log_channel(cluster) log [ERR] : loaded dup inode 0x100028e2fe5 [2,head] v265540122 at ~mds0/stray9/100028e2fe5, but inode 0x100028e2fe5.head v265487174 already exists at ~mds0/stray3/100028e2fe5
-3> 2021-08-24T21:17:38.307+0000 7fe9a730e700 4 mds.0.server handle_client_request client_request(client.26332595:243938 unlink #0x100028e597b/1213982112.554772-6,128.165.213.35:2945,222.73.254.92:22.pcap 2021-08-23T17:51:55.994745+0000 RETRY=209 caller_uid=0, caller_gid=0{}) v2
-2> 2021-08-24T21:17:38.307+0000 7fe9a730e700 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on single dirfrag 0x100028e597b, inode has n(v1963 rc2021-08-23T17:51:55.994745+0000 b2195090776 5729=5728+1), dirfrag has n(v1963 rc2021-08-23T17:51:55.994745+0000)
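If it really is just that one retrying unlink (RETRY=209 from client.26332595 in the -3 line) re-tripping the assert after every restart, maybe evicting that client session would break the crash loop long enough to work on the metadata. A sketch, untested, with the session id taken from the log above:

# confirm the session is still there while the mds is briefly up
ceph tell mds.0 client ls

# then drop it (assumes 26332595 is still the right session id;
# note that eviction blocklists the client by default)
ceph tell mds.0 client evict id=26332595

No idea whether the mds would stay up long enough to accept the tell, though.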
________________________________
From: Pickett, Neale T
Sent: Tuesday, August 24, 2021 15:20
To: Dan van der Ster
Cc: ceph-users@xxxxxxx
Subject: Re: [EXTERNAL] Re: mds in death loop with [ERR] loaded dup inode XXX [2,head] XXX at XXX, but inode XXX already exists at XXX

Full backtrace below. Seems pretty short for a ceph backtrace!

I'll get started on a link scan for the time being. It'll keep the cluster from flapping in and out of CEPH_ERR!

-1> 2021-08-24T21:17:38.313+0000 7fe9a730e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7fe9a730e700 time 2021-08-24T21:17:38.308917+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/Server.cc: 7503: FAILED ceph_assert(in->first <= straydn->first)

ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fe9b4780b24]
2: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fe9b4780d3e]
3: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x106a) [0x5608c0c338ba]
4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x4d9) [0x5608c0c38589]
5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xefb) [0x5608c0c4ebab]
6: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x5608c0c59e2c]
7: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x5608c0c6290b]
8: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x5608c0bbec24]
9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x5608c0bc129b]
10: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x16) [0x5608c0bc1836]
11: (MDSContext::complete(int)+0x56) [0x5608c0ec0b96]
12: (MDSRank::_advance_queues()+0x84) [0x5608c0bc0304]
13: (MDSRank::ProgressThread::entry()+0xc5) [0x5608c0bc0a25]
14: /lib64/libpthread.so.0(+0x814a) [0x7fe9b352214a]
15: clone()

0> 2021-08-24T21:17:38.320+0000 7fe9a730e700 -1 *** Caught signal (Aborted) ** in thread 7fe9a730e700 thread_name:mds_rank_progr

ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
1: /lib64/libpthread.so.0(+0x12b20) [0x7fe9b352cb20]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7fe9b4780b75]
5: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fe9b4780d3e]
6: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x106a) [0x5608c0c338ba]
7: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x4d9) [0x5608c0c38589]
8: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xefb) [0x5608c0c4ebab]
9: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x3fc) [0x5608c0c59e2c]
10: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x5608c0c6290b]
11: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x5608c0bbec24]
12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x5608c0bc129b]
13: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x16) [0x5608c0bc1836]
14: (MDSContext::complete(int)+0x56) [0x5608c0ec0b96]
15: (MDSRank::_advance_queues()+0x84) [0x5608c0bc0304]
16: (MDSRank::ProgressThread::entry()+0xc5) [0x5608c0bc0a25]
17: /lib64/libpthread.so.0(+0x814a) [0x7fe9b352214a]
18: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140641483929344 / MR_Finisher
140641509107456 / mds_rank_progr
140641517500160 / ms_dispatch
140641567856384 / safe_timer
140641626605312 / msgr-worker-2
max_recent 10000
max_new 10000
log_file /var/lib/ceph/crash/2021-08-24T21:17:38.319894Z_4561a8f4-f452-45ef-82f5-3a3f2ffe682e/log
--- end dump of recent events ---

________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: Tuesday, August 24, 2021 01:13
To: Pickett, Neale T
Cc: ceph-users@xxxxxxx
Subject: [EXTERNAL] Re: mds in death loop with [ERR] loaded dup inode XXX [2,head] XXX at XXX, but inode XXX already exists at XXX

Hi,

What is the actual backtrace from the crash? We occasionally had dup inode errors like this in the past, but they never escalated to a crash.

You can see my old thread here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/036294.html

The developer at that time suggested some manipulations to remove the omap entries for the older version of the inode from the stray dirs, and also to eventually run cephfs-data-scan scan_links.
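From memory it had roughly the following shape. Treat this as a sketch only: the metadata pool name (cephfs_metadata) is a placeholder, and you need to verify which copy of each inode is the stale one before removing anything. Rank 0's stray dirs are inodes 0x600 through 0x609, so e.g. stray3 is object 603.00000000. For the 0x100028e324a example in your log, the stray3 copy (v265523356) is the older one:

# with the fs offline (all MDS stopped), find the stale dentry
rados -p cephfs_metadata listomapkeys 603.00000000 | grep 100028e324a

# save it somewhere before touching it, then remove it
rados -p cephfs_metadata getomapval 603.00000000 100028e324a_head /tmp/100028e324a_head.bak
rados -p cephfs_metadata rmomapkey 603.00000000 100028e324a_head

# then repair the link/stray accounting before bringing the mds back
cephfs-data-scan scan_links

But please double-check against the old thread before running any of that.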
Cheers, Dan

On Mon, Aug 23, 2021 at 9:01 PM Pickett, Neale T <neale@xxxxxxxx> wrote:
>
> hello, ceph-users!
>
>
> We have an old cephfs that is ten different kinds of broken, which we are attempting to (slowly) pull files from. The most recent issue we've hit is that the mds will start up, log hundreds of messages like below, then crash. This is happening in a loop; we can never actually get an mds to stay up long enough to try and fix it.
>
>
> The most recent thing we tried was to do an extents scan, followed by an inode scan. No luck.
>
>
> Any suggestions? Could I maybe just manually delete the stray inodes?
>
>
> 2021-08-23T12:59:40.803056-0600 mds.dellingr [ERR] loaded dup inode 0x100028e324a [2,head] v265523356 at ~mds0/stray3/100028e324a, but inode 0x100028e324a.head v265540146 already exists at ~mds0/stray1/100028e324a
> 2021-08-23T12:59:40.803082-0600 mds.dellingr [ERR] loaded dup inode 0x100028e3249 [2,head] v265523350 at ~mds0/stray3/100028e3249, but inode 0x100028e3249.head v265540138 already exists at ~mds0/stray1/100028e3249
> 2021-08-23T12:59:40.803109-0600 mds.dellingr [ERR] loaded dup inode 0x100028e3247 [2,head] v265523354 at ~mds0/stray3/100028e3247, but inode 0x100028e3247.head v265540140 already exists at ~mds0/stray1/100028e3247
> 2021-08-23T12:59:40.803136-0600 mds.dellingr [ERR] loaded dup inode 0x100028e3246 [2,head] v265523344 at ~mds0/stray3/100028e3246, but inode 0x100028e3246.head v265540130 already exists at ~mds0/stray1/100028e3246
> 2021-08-23T12:59:40.803162-0600 mds.dellingr [ERR] loaded dup inode 0x100028e3245 [2,head] v265523342 at ~mds0/stray3/100028e3245, but inode 0x100028e3245.head v265540128 already exists at ~mds0/stray1/100028e3245
> 2021-08-23T12:59:40.803188-0600 mds.dellingr [ERR] loaded dup inode 0x100028e3244 [2,head] v265523348 at ~mds0/stray3/100028e3244, but inode 0x100028e3244.head v265540126 already exists at ~mds0/stray1/100028e3244
> 2021-08-23T12:59:40.803214-0600 mds.dellingr [ERR] loaded dup inode 0x100028e3243 [2,head] v265523346 at ~mds0/stray3/100028e3243, but inode 0x100028e3243.head v265540136 already exists at ~mds0/stray1/100028e3243
> 2021-08-23T12:59:40.803241-0600 mds.dellingr [ERR] loaded dup inode 0x100028e323f [2,head] v265523338 at ~mds0/stray3/100028e323f, but inode 0x100028e323f.head v265540124 already exists at ~mds0/stray1/100028e323f
>
>
> Neale Pickett <neale@xxxxxxxx>
> A-4: Advanced Research in Cyber Systems
> Los Alamos National Laboratory
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx