Re: MDS Crashing 14.2.1

I ended up backing up the journals of both MDS ranks, running recover_dentries on each, and then resetting the journals and the session table. The filesystem is back up. The recover_dentries stage didn't report any errors, so I'm still not sure why the MDS was asserting about duplicate inodes.
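For anyone who hits this later: the sequence was roughly the standard cephfs-journal-tool recovery flow. The filesystem name ("cephfs") and backup filenames below are placeholders; substitute whatever your cluster uses:

    cephfs-journal-tool --rank=cephfs:0 journal export backup.rank0.bin
    cephfs-journal-tool --rank=cephfs:1 journal export backup.rank1.bin
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:0 journal reset
    cephfs-journal-tool --rank=cephfs:1 journal reset
    cephfs-table-tool all reset session

Keep the exported journal backups around until you're confident the filesystem is healthy again.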

--
Adam

On Thu, May 16, 2019, 13:52 Adam Tygart <mozes@xxxxxxx> wrote:
Hello all,

The rank 0 MDS is still asserting. For this duplicate-inode situation,
should I be considering using cephfs-journal-tool to export the
journal, recover dentries, and reset?

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Hello all,
>
> I've got a 30 node cluster serving up lots of CephFS data.
>
> We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 on Monday of
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight,
> one of the metadata daemons crashed several times with the following assert:
>
>     -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CInode::set_primary_parent(CDentry*)'
> thread 7f9f22405700 time 2019-05-16 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
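>
> (Side note: the assert text itself names an escape hatch,
> mds_hack_allow_loading_invalid_metadata. Presumably something like
>
>     ceph config set mds mds_hack_allow_loading_invalid_metadata true
>
> would bypass the check, but given the "hack" in the name I'd want a
> developer to weigh in before flipping it.)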
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent in the assert, and I thought the crash might be
> related to automatic balancing between the metadata servers.
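>
> (For reference, moving to a single MDS here means the usual max_mds
> reduction; with the filesystem name as a placeholder, something like:
>
>     ceph fs set cephfs max_mds 1
> )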
>
> This caused one MDS to fail; the other crashed. Now rank 0 loads,
> goes active, and then crashes with the following:
>     -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void MDCache::add_inode(CInode*)'
> thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
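>
> (In case it helps with diagnosis: the journal can be examined
> read-only with cephfs-journal-tool, e.g., with the filesystem name
> and rank as placeholders:
>
>     cephfs-journal-tool --rank=cephfs:0 journal inspect
>     cephfs-journal-tool --rank=cephfs:0 event get summary
> )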
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> then became rank 1 after the crash and the attempted drop to one
> active MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> which crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
