Hello all,

The rank 0 MDS is still asserting. Is this duplicate-inode situation
one where I should consider using cephfs-journal-tool to export the
journal, recover dentries, and reset? (I've sketched the exact
commands I have in mind below the quoted message.)

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Hello all,
>
> I've got a 30-node cluster serving up lots of CephFS data.
>
> We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 on Monday earlier
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight
> one of the metadata daemons crashed several times with the following:
>
> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CInode::set_primary_parent(CDentry*)' thread
> 7f9f22405700 time 2019-05-16 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 ||
> g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent in the backtrace, and I thought the crash might be
> related to automatic balancing between the metadata servers.
>
> This caused one MDS to fail, the other crashed, and now rank 0 loads,
> goes active, and then crashes with the following:
>
> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700
> time 2019-05-16 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> then became rank 1 after the crash and the attempted drop to a single
> active MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> which then crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
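
P.S. For concreteness, the sequence I have in mind is the one from the
disaster-recovery docs, roughly as below. I'm using "cephfs" as a
stand-in for our filesystem name, running against rank 0, and assuming
the MDS daemons are stopped first; please correct me if I've misread
the docs.

    # back up the rank 0 journal before touching anything
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

    # flush recoverable dentries from the journal into the backing store
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

    # then truncate the journal
    cephfs-journal-tool --rank=cephfs:0 journal reset

My understanding is that the docs also suggest clearing the session
table afterwards (cephfs-table-tool all reset session) before letting
the MDS come back up, though I'm not sure whether that step is needed
for a duplicate-inode case.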