On Fri, Oct 25, 2019 at 12:11 PM Pickett, Neale T <neale@xxxxxxxx> wrote:
> In the last week we have made a few changes to the down filesystem in an
> attempt to fix what we thought was an inode problem:
>
>     cephfs-data-scan scan_extents   # about 1 day with 64 processes
>     cephfs-data-scan scan_inodes    # about 1 day with 64 processes
>     cephfs-data-scan scan_links     # about 1 day

Did you reset the journals or perform any other disaster-recovery commands?
That process likely introduced the duplicate inodes.

> After these three, we tried to start an MDS and it stayed up. We then ran:
>
>     ceph tell mds.a scrub start / recursive repair
>
> The repair ran for about 3 days, spewing logs to `ceph -w` about duplicated
> inodes, until it stopped. All looked well until we began bringing production
> services back online, at which point many error messages appeared, the MDS
> went back into damaged, and the fs back to degraded. At this point I removed
> the objects you suggested, which brought everything back briefly.
>
> The latest crash is:
>
>     -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time 2019-1...
>
>     /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)

This error indicates a duplicate inode was loaded into the cache. Fixing this
probably requires significant intervention, and (meta)data loss for recent
changes:

- Stop/unmount all clients. (Probably already the case if the rank is
  damaged!)
- Reset the MDS journal [1], optionally recovering any dentries first. (This
  will hopefully resolve the ESubtreeMap errors you pasted.)
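For concreteness, the journal reset described above is normally done with
`cephfs-journal-tool`. A minimal sketch, assuming a filesystem named "cephfs"
and damaged rank 0 (substitute your own names; export a backup first):

```shell
# Sketch only -- "cephfs" and rank 0 are assumptions, not from the thread.

# Back up the journal before touching it.
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

# Optionally replay recoverable dentries from the journal into the
# metadata store before discarding it.
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# Then truncate/reset the journal.
cephfs-journal-tool --rank=cephfs:0 journal reset
```

These commands operate directly on the metadata pool, so run them only with
all MDS daemons for that rank stopped.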
  Note that some metadata may be lost through this command.
- Run `cephfs-data-scan scan_links` again. This should repair any duplicate
  inodes (by dropping the older dentries).
- Then you can try marking the rank as repaired.

Good luck!

[1] https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#journal-truncation

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat
Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
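[For reference, the final two steps in the list above might look like the
following. This is a sketch, not the thread's exact invocation: "cephfs" and
rank 0 are assumed names.]

```shell
# Sketch only -- substitute your filesystem name and damaged rank.

# Re-run the link scan; this drops the older of each pair of
# duplicate dentries.
cephfs-data-scan scan_links

# Tell the monitors the rank is repaired so a standby MDS can take it over.
ceph mds repaired cephfs:0
```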