Re: [lists.ceph.com on behalf of] Re: MDS Crashing 14.2.1

I followed the docs from here:
http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
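
Note that the journal and table tools are meant to be run while the MDS
daemons are not active. With the filesystem named "combined" (as in the
commands below), taking it offline first would look something like:

ceph fs set combined down true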

I exported the journals as a backup for both ranks. I was running 2
active MDS daemons at the time.

cephfs-journal-tool --rank=combined:0 journal export cephfs-journal-0-201905161412.bin
cephfs-journal-tool --rank=combined:1 journal export cephfs-journal-1-201905161412.bin
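
If you want to sanity-check the exports and the on-disk journals before
modifying anything, the tool also has an inspect mode:

cephfs-journal-tool --rank=combined:0 journal inspect
cephfs-journal-tool --rank=combined:1 journal inspect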

I recovered the dentries on both ranks:
cephfs-journal-tool --rank=combined:0 event recover_dentries summary
cephfs-journal-tool --rank=combined:1 event recover_dentries summary

I reset the journals of both ranks:
cephfs-journal-tool --rank=combined:1 journal reset
cephfs-journal-tool --rank=combined:0 journal reset

Then I reset the session table:
cephfs-table-tool all reset session
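
If you're curious what is in the table before wiping it, the tool can
dump it first, e.g.:

cephfs-table-tool all show session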

Once that was done, reboot all machines that were talking to CephFS
(or at least unmount and remount it).
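
For a kernel-client mount, that remount is just the usual pair. The
mount point, monitor address, and secret file below are placeholders
for your own:

umount /mnt/cephfs
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret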

On Fri, May 17, 2019 at 2:30 AM <wangzhigang@xxxxxxxxxxx> wrote:
>
> Hi
>    Can you tell me the detail recovery cmd ?
>
> I just started learning cephfs ,I would be grateful.
>
>
>
> From:         Adam Tygart <mozes@xxxxxxx>
> To:           Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Date:         2019/05/17 09:04
> Subject:      [lists.ceph.com on behalf of] Re: MDS Crashing 14.2.1
> Sent by:      "ceph-users" <ceph-users-bounces@xxxxxxxxxxxxxx>
> ________________________________
>
>
>
> I ended up backing up the journals of both MDS ranks, running recover_dentries for each, and resetting the journals and the session table. It is back up. The recover_dentries stage didn't show any errors, so I'm not even sure why the MDS was asserting about duplicate inodes.
>
> --
> Adam
>
> On Thu, May 16, 2019, 13:52 Adam Tygart <mozes@xxxxxxx> wrote:
> Hello all,
>
> The rank 0 mds is still asserting. Is this duplicate-inode situation
> one where I should consider using cephfs-journal-tool to export,
> recover dentries, and reset?
>
> Thanks,
> Adam
>
> On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mozes@xxxxxxx> wrote:
> >
> > Hello all,
> >
> > I've got a 30 node cluster serving up lots of CephFS data.
> >
> > We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> > this week.
> >
> > We've been running 2 MDS daemons in an active-active setup. Tonight
> > one of the metadata daemons crashed with the following several times:
> >
> >     -1> 2019-05-16 00:20:56.775 7f9f22405700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
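> >
> > (For what it's worth, the assert itself names an escape hatch,
> > mds_hack_allow_loading_invalid_metadata. It can in principle be set
> > with something like the line below, but as the name says it is a hack
> > that loads known-invalid metadata, so I wouldn't reach for it first:
> >
> > ceph config set mds mds_hack_allow_loading_invalid_metadata true)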
> >
> > I made a quick decision to move to a single MDS because I saw
> > set_primary_parent, and I thought it might be related to auto
> > balancing between the metadata servers.
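> >
> > (For reference, dropping to a single active MDS is done by lowering
> > max_mds; with this filesystem named "combined" that would be roughly:
> >
> > ceph fs set combined max_mds 1
> >
> > and on Nautilus the extra rank stops on its own once max_mds drops.)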
> >
> > This caused one MDS to fail, the other crashed, and now rank 0 loads,
> > goes active and then crashes with the following:
> >     -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)
> >
> > It now looks like we somehow have a duplicate inode in the MDS journal?
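> >
> > (If the log names the offending inode, the journal events touching it
> > can be listed before deciding on anything destructive; a sketch using
> > the tool's event filters, with a placeholder inode number:
> >
> > cephfs-journal-tool --rank=combined:0 event get --inode=1099511627776 list)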
> >
> > https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> > then became rank 1 after the crash and the attempted drop to one
> > active MDS
> > https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> > now crashed
> >
> > Anyone have any thoughts on this?
> >
> > Thanks,
> > Adam
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



