As Paul said, the MDS is loading "duplicate inodes" and that's very bad. If you've already gone through some of the disaster recovery steps, those are likely what introduced them. But you'll need to provide a *lot* more information about what you've already done to the cluster before anyone can say for sure.
The backwards scan referred to is the scan_extents/scan_inodes work described in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects
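For reference, the backward scan that page describes boils down to roughly the following, run with the filesystem offline ("cephfs_data" here is a placeholder; substitute your own data pool name, and double-check each step against the doc before running it):

  # rebuild the root and MDS-dir metadata, then scan the data pool backwards
  cephfs-data-scan init
  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes cephfs_data
  cephfs-data-scan scan_links

scan_extents and scan_inodes walk every object in the data pool, so they can take a long time on a large pool.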
Be advised that there is limited user experience with *any* of these tools and that you have stumbled into some dark corners. I'm rather surprised a newish deployment needed any of this repair functionality; if you are deliberately breaking things to see how the cluster recovers, you should spend more time understanding plausible failure cases first. This functionality generally only comes up in the case of genuine data loss due to multiple simultaneous hardware failures.
-Greg
On Fri, Aug 10, 2018 at 9:05 AM Amit Handa <amit.handa@xxxxxxxxx> wrote:
Thanks a lot, Paul. We did (hopefully) follow through with the disaster recovery. However, please guide me on how to get the cluster back up!

Thanks,

On Fri, Aug 10, 2018 at 9:32 PM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:

Looks like you got some duplicate inodes due to corrupted metadata. You likely tried a disaster recovery and didn't follow through with it completely, or you hit some bug in Ceph.

The solution here is probably to do a full recovery of the metadata (a full backwards scan) after resetting the inodes. I've recovered a cluster from something similar just a few weeks ago. Annoying, but recoverable.
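Roughly, "resetting the inodes" means the table and journal resets from the disaster recovery page, done before the backwards scan. A sketch, assuming a single-rank filesystem (newer releases want an explicit --rank argument, and keep a journal backup either way):

  # keep a copy of the journal before touching anything
  cephfs-journal-tool journal export backup.bin
  # salvage what the journal still holds, then reset it
  cephfs-journal-tool event recover_dentries summary
  cephfs-journal-tool journal reset
  # reset the MDS tables, including the inode table
  cephfs-table-tool all reset session
  cephfs-table-tool all reset snap
  cephfs-table-tool all reset inode

Only after that do you rebuild the metadata from the data pool with the backwards scan.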
Paul

2018-08-10 13:26 GMT+02:00 Amit Handa <amit.handa@xxxxxxxxx>:

We are facing constant crashes from the Ceph MDS. We have installed mimic (v13.2.1).

mds: cephfs-1/1/1 up {0=node2=up:active(laggy or crashed)}
MDS logs: https://pastebin.com/AWGMLRm0
We have followed the disaster recovery steps listed at
http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/

Please help us resolve these errors :(
MDS crash stack trace:
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f984fc3ee1f]
2: (()+0x284fe7) [0x7f984fc3efe7]
3: (()+0x2087fe) [0x5563e88537fe]
4: (Server::prepare_new_inode(boost::intrusive_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0xf37) [0x5563e87ce777]
5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0xdb0) [0x5563e87d0bd0]
6: (Server::handle_client_request(MClientRequest*)+0x49e) [0x5563e87d3c0e]
7: (Server::dispatch(Message*)+0x2db) [0x5563e87d789b]
8: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x5563e87514b4]
9: (MDSRank::_dispatch(Message*, bool)+0x63b) [0x5563e875db5b]
10: (MDSRank::retry_dispatch(Message*)+0x12) [0x5563e875e302]
11: (MDSInternalContextBase::complete(int)+0x67) [0x5563e89afb57]
12: (MDSRank::_advance_queues()+0xd1) [0x5563e875cd51]
13: (MDSRank::ProgressThread::entry()+0x43) [0x5563e875d3e3]
14: (()+0x7e25) [0x7f984d869e25]
15: (clone()+0x6d) [0x7f984c949bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
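(For what it's worth, producing that disassembly on the node that ran the crashed MDS would look something like the line below; /usr/bin/ceph-mds is the usual packaged path but may differ on your install, and the ceph debuginfo package needs to be present for the source interleaving to be useful.)

  objdump -rdS /usr/bin/ceph-mds > ceph-mds.dump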
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com