Re: [ceph-users] MDS Crashing 14.2.1

I've not done a scrub yet, but the recover_dentries pass gave no indication of any duplicate inodes. The MDS has been rock solid so far.

I'll probably start a scrub Monday.
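
(A rough sketch of how that scrub could be kicked off on Nautilus, assuming the
filesystem is named "combined" as in the journal-tool commands further down, and
that the tell-based scrub commands are available in 14.2.1:)

ceph tell mds.combined:0 scrub start / recursive
ceph tell mds.combined:0 scrub status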

--
Adam

On Fri, May 17, 2019, 18:40 Sergey Malinin <admin@xxxxxxxxxxxxxxx> wrote:
I've had a similar problem twice (with Mimic), and in both cases I ended up backing up and restoring to a fresh fs. Did you do an MDS scrub after the recovery? In my experience, recovering duplicate inodes is not a trivial process: in one case my MDS kept crashing on unlink() in some directories, and in the other case newly created fs entries would not pass the MDS scrub due to linkage errors.
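
(On that note, anything a scrub flags should also show up in the MDS damage table;
a quick check, again assuming the fs name "combined" used elsewhere in this thread:)

ceph tell mds.combined:0 damage ls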


May 17, 2019 3:40 PM, "Adam Tygart" <mozes@xxxxxxx> wrote:

> I followed the docs from here:
> http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> I exported the journals as a backup for both ranks. I was running 2
> active MDS daemons at the time.
>
> cephfs-journal-tool --rank=combined:0 journal export cephfs-journal-0-201905161412.bin
> cephfs-journal-tool --rank=combined:1 journal export cephfs-journal-1-201905161412.bin
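>
> (Before touching anything, the journal state can also be checked, and the backup
> re-imported later if something goes wrong; a sketch using the same rank spec:)
>
> cephfs-journal-tool --rank=combined:0 journal inspect
> cephfs-journal-tool --rank=combined:0 journal import cephfs-journal-0-201905161412.bin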
>
> I recovered the dentries on both ranks:
> cephfs-journal-tool --rank=combined:0 event recover_dentries summary
> cephfs-journal-tool --rank=combined:1 event recover_dentries summary
>
> I reset the journals of both ranks:
> cephfs-journal-tool --rank=combined:1 journal reset
> cephfs-journal-tool --rank=combined:0 journal reset
>
> Then I reset the session table:
> cephfs-table-tool all reset session
>
> Once that was done, reboot all machines that were talking to CephFS (or at
> least unmount/remount).
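>
> (For a kernel client, the unmount/remount is roughly the following; the mount
> point, monitor address, and secret file are placeholders:)
>
> umount /mnt/cephfs
> mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret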
>
> On Fri, May 17, 2019 at 2:30 AM <wangzhigang@xxxxxxxxxxx> wrote:
>
>> Hi,
>> Can you tell me the detailed recovery commands?
>>
>> I have just started learning CephFS; I would be grateful.
>>
>> From: Adam Tygart <mozes@xxxxxxx>
>> To: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
>> Date: 2019/05/17 09:04
>> Subject: [resent via lists.ceph.com] Re: [ceph-users] MDS Crashing 14.2.1
>> Sender: "ceph-users" <ceph-users-bounces@xxxxxxxxxxxxxx>
>> ________________________________
>>
>> I ended up backing up the journals of both MDS ranks, running recover_dentries for both of them,
>> and resetting the journals and the session table. It is back up. The recover_dentries stage didn't
>> show any errors, so I'm not even sure why the MDS was asserting about duplicate inodes.
>>
>> --
>> Adam
>>
>> On Thu, May 16, 2019, 13:52 Adam Tygart <mozes@xxxxxxx> wrote:
>> Hello all,
>>
>> The rank 0 MDS is still asserting. Is this duplicate inode situation
>> one where I should consider using cephfs-journal-tool to export the
>> journals, recover dentries, and reset?
>>
>> Thanks,
>> Adam
>>
>> On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mozes@xxxxxxx> wrote:
>>
>> Hello all,
>>
>> I've got a 30 node cluster serving up lots of CephFS data.
>>
>> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
>> this week.
>>
>> We've been running 2 MDS daemons in an active-active setup. Tonight
>> one of the metadata daemons crashed several times with the following:
>>
>> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
>>
>> I made a quick decision to move to a single MDS because I saw
>> set_primary_parent, and I thought it might be related to auto
>> balancing between the metadata servers.
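>>
>> (Dropping to a single active MDS boils down to something like the following,
>> where "combined" is the fs name from the journal-tool commands above:)
>>
>> ceph fs set combined max_mds 1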
>>
>> This caused one MDS to fail and the other to crash; now rank 0 loads,
>> goes active, and then crashes with the following:
>> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)
>>
>> It now looks like we somehow have a duplicate inode in the MDS journal?
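>>
>> (If it comes to hunting for the duplicate, the journal events can be dumped and
>> summarized with cephfs-journal-tool, and, if this release supports the --inode
>> filter, narrowed to a single inode; the inode number below is only a placeholder:)
>>
>> cephfs-journal-tool --rank=combined:0 event get summary
>> cephfs-journal-tool --rank=combined:0 event get --inode=1099511627776 list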
>>
>> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0, then became
>> rank 1 after the crash and the attempted drop to one active MDS
>> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- currently rank 0, and crashed
>>
>> Anyone have any thoughts on this?
>>
>> Thanks,
>> Adam
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
