Hello all,

The rank 0 MDS is still asserting. Is this duplicate-inode situation
one where I should consider using cephfs-journal-tool to export the
journal, recover dentries, and reset? (I've sketched the exact
commands I have in mind below the quoted message.)

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Hello all,
>
> I've got a 30-node cluster serving up lots of CephFS data.
>
> We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 on Monday earlier
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight
> one of the metadata daemons crashed several times with the following:
>
> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CInode::set_primary_parent(CDentry*)' thread
> 7f9f22405700 time 2019-05-16 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 ||
> g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent in the backtrace, and I thought the crash might be
> related to automatic balancing between the metadata servers.
>
> This caused one MDS to fail, the other crashed, and now rank 0 loads,
> goes active, and then crashes with the following:
>
> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700
> time 2019-05-16 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> then became rank 1 after the crash and the attempted drop to a single
> active MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> which then crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
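
P.S. For concreteness, the sequence I have in mind is the one from the
disaster-recovery docs, roughly as below. I'm using "cephfs" as a
stand-in for our filesystem name, running against rank 0, and assuming
the MDS daemons are stopped first; please correct me if I've misread
the docs.

    # back up the rank 0 journal before touching anything
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

    # flush recoverable dentries from the journal into the backing store
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

    # then truncate the journal
    cephfs-journal-tool --rank=cephfs:0 journal reset

My understanding is that the docs also suggest clearing the session
table afterwards (cephfs-table-tool all reset session) before letting
the MDS come back up, though I'm not sure whether that step is needed
for a duplicate-inode case.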