Re: MDS Crashing 14.2.1

I ended up backing up the journals of both MDS ranks, running recover_dentries on each, and then resetting the journals and the session table. The filesystem is back up. The recover_dentries stage didn't report any errors, so I'm still not sure why the MDS was asserting about duplicate inodes.
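For anyone who hits this later: the sequence was roughly the standard cephfs-journal-tool recovery flow. The filesystem name ("cephfs") and backup filenames below are placeholders; substitute whatever your cluster uses:

    cephfs-journal-tool --rank=cephfs:0 journal export backup.rank0.bin
    cephfs-journal-tool --rank=cephfs:1 journal export backup.rank1.bin
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:0 journal reset
    cephfs-journal-tool --rank=cephfs:1 journal reset
    cephfs-table-tool all reset session

Keep the exported journal backups around until you're confident the filesystem is healthy again.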

--
Adam

On Thu, May 16, 2019, 13:52 Adam Tygart <mozes@xxxxxxx> wrote:
Hello all,

The rank 0 MDS is still asserting. For this duplicate-inode situation,
should I be considering using cephfs-journal-tool to export the
journal, recover dentries, and reset?

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Hello all,
>
> I've got a 30 node cluster serving up lots of CephFS data.
>
> We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 on Monday of
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight,
> one of the metadata daemons crashed several times with the following assert:
>
>     -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CInode::set_primary_parent(CDentry*)'
> thread 7f9f22405700 time 2019-05-16 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
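>
> (Side note: the assert text itself names an escape hatch,
> mds_hack_allow_loading_invalid_metadata. Presumably something like
>
>     ceph config set mds mds_hack_allow_loading_invalid_metadata true
>
> would bypass the check, but given the "hack" in the name I'd want a
> developer to weigh in before flipping it.)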
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent in the assert, and I thought the crash might be
> related to automatic balancing between the metadata servers.
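>
> (For reference, moving to a single MDS here means the usual max_mds
> reduction; with the filesystem name as a placeholder, something like:
>
>     ceph fs set cephfs max_mds 1
> )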
>
> This caused one MDS to fail; the other crashed. Now rank 0 loads,
> goes active, and then crashes with the following:
>     -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void MDCache::add_inode(CInode*)'
> thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
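>
> (In case it helps with diagnosis: the journal can be examined
> read-only with cephfs-journal-tool, e.g., with the filesystem name
> and rank as placeholders:
>
>     cephfs-journal-tool --rank=cephfs:0 journal inspect
>     cephfs-journal-tool --rank=cephfs:0 event get summary
> )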
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> then became rank 1 after the crash and the attempted drop to one
> active MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> which crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
