MDS Crashing 14.2.1

Adam Tygart <mozes@xxxxxxx> · Thu, 16 May 2019 05:51:28 +0000

Hello all,

I've got a 30 node cluster serving up lots of CephFS data.

We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 on Monday of
this week.

We've been running 2 MDS daemons in an active-active setup. Tonight
one of the metadata daemons crashed several times with the following
assertion failure:

    -1> 2019-05-16 00:20:56.775 7f9f22405700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))

I made a quick decision to move to a single MDS because I saw
set_primary_parent, and I thought it might be related to auto
balancing between the metadata servers.

This caused one MDS to fail and the other to crash. Now rank 0 loads,
goes active, and then crashes with the following:

    -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)

It now looks like we somehow have a duplicate inode in the MDS journal?

https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
then became rank 1 after the crash and the attempted drop to one
active MDS
https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
which crashed
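
For anyone skimming MDS logs of this size, here is a rough Python sketch for pulling out just the failed assertions. It is not part of Ceph; the function name find_asserts and both regular expressions are my own, and they assume the unwrapped log format is the usual "In function '...'" line followed by a "<line>: FAILED ceph_assert(...)" line.

```python
import re

# Assumed format: "<path>: In function '<signature>' thread <id> time <ts>"
FUNC_RE = re.compile(r"In function '([^']+)'")
# Assumed format: "<path>: <line>: FAILED ceph_assert(<condition>)"
ASSERT_RE = re.compile(r"(\d+): FAILED ceph_assert\((.*)\)\s*$")


def find_asserts(lines):
    """Return a list of (function, source_line, condition) tuples,
    one per failed ceph_assert found in the given log lines."""
    results = []
    current_func = None  # most recent "In function" seen, if any
    for line in lines:
        func_match = FUNC_RE.search(line)
        if func_match:
            current_func = func_match.group(1)
        assert_match = ASSERT_RE.search(line)
        if assert_match:
            results.append(
                (current_func, int(assert_match.group(1)), assert_match.group(2))
            )
    return results
```

To use it, open a downloaded log and pass the file object in, e.g. find_asserts(open("ceph-mds.mormo.log", errors="replace")). Each tuple names the function, the source line, and the asserted condition, which makes it easier to see whether the same assertion repeats across restarts.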

Anyone have any thoughts on this?

Thanks,
Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com