Re: CephFS MDS server stuck in "resolve" state

"Dennis Kramer (DBS)" <dennis@xxxxxxxxx> · Thu, 5 Jul 2018 08:50:57 +0000

Hi,

On Thu, 2018-07-05 at 09:55 +0800, Yan, Zheng wrote:
> On Wed, Jul 4, 2018 at 7:02 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx>
> wrote:
> > 
> > 
> > Hi,
> > 
> > I have managed to get cephfs mds online again...for a while.
> > 
> > These topics cover more or less my symptoms and helped me get it up
> > and running again:
> > - https://www.spinics.net/lists/ceph-users/msg45696.html
> > - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023133.html
> > 
> > After some time it all goes down again and gets stuck in a loop: it
> > tries to reach an "active" state, then crashes again after a while.
> > Logs from the MDS right before it crashes:
> >     0> 2018-07-04 11:34:54.657595 7f50f1c29700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f50f1c29700 time 2018-07-04 11:34:54.638462
> > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> >
> > Cluster logging:
> > 2018-07-04 12:50:04.741625 mds.mds01 [ERR] dir 0x1000098a246 object missing on disk; some files may be lost (<obfuscated file path>)
> > 2018-07-04 12:50:16.352824 mon.mon01 [ERR] MDS health message (mds.0): Metadata damage detected
> > 2018-07-04 12:50:16.480045 mon.mon01 [ERR] Health check failed: 1 MDSs report damaged metadata (MDS_DAMAGE)
> > 2018-07-04 12:53:36.194056 mds.mds01 [ERR] loaded dup inode 0x10000989e52 [2,head] v1104251 at <file>, but inode 0x10000989e52.head v37 already exists at <another file>
> > 
> > CephFS won't stay up for long; after some time it crashes and I need
> > to reset the fs to get it back again.
> > 
> > I'm at a loss here.
> I guess you reset the MDS journal. Have you run the complete recovery
> sequence?
> 
> cephfs-data-scan init
> cephfs-data-scan scan_extents <data pool>
> cephfs-data-scan scan_inodes <data pool>
> cephfs-data-scan scan_links
> 
> see http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/

cephfs-data-scan init
Inode 0x0x1 already exists, skipping create.  Use --force-init to overwrite the existing object.
Inode 0x0x100 already exists, skipping create.  Use --force-init to overwrite the existing object.

Is it safe to run --force-init, or will it break my FS?
Furthermore, do I need to bring my MDS offline during the recovery process:

cephfs-data-scan init
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links
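
A minimal sketch of that sequence as a dry-run shell script, in case it helps to review the exact commands before anything touches the cluster. The pool name `cephfs_data` and the `DATA_POOL` / `DRY_RUN` variables are placeholders for illustration, not names from this thread. The script only prints the commands unless `DRY_RUN=0` is set, and it does not settle whether the MDS has to be offline or whether --force-init is safe.

```shell
#!/bin/sh
# Dry-run wrapper around the four cephfs-data-scan steps listed above.
# DATA_POOL and DRY_RUN are hypothetical names used for illustration only.
DATA_POOL="${DATA_POOL:-cephfs_data}"   # placeholder data pool name
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print, 0 = execute

# Print the command in dry-run mode; otherwise run it and stop on failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

# The steps in the order given in the quoted reply.
recover() {
    run cephfs-data-scan init
    run cephfs-data-scan scan_extents "$DATA_POOL"
    run cephfs-data-scan scan_inodes "$DATA_POOL"
    run cephfs-data-scan scan_links
}

recover
```

Saved under a name of your choosing, it would be run for real with `DRY_RUN=0 DATA_POOL=<data pool>` in the environment; by default it just lists the four steps.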

Thank you in advance,

> 
> > 
> > 
> > On Wed, 2018-06-27 at 21:38 +0800, Yan, Zheng wrote:
> > > 
> > > On Wed, Jun 27, 2018 at 6:16 PM Dennis Kramer (DT) <dennis@holmes.nl>
> > > wrote:
> > > > 
> > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > Currently I'm running Ceph Luminous 12.2.5.
> > > > 
> > > > This morning I tried running Multi MDS with:
> > > > ceph fs set <fs_name> max_mds 2
> > > > 
> > > > I have 5 MDS servers. After running the above command,
> > > > I had 2 active MDSs, 2 standby-active and 1 standby.
> > > >
> > > > After I tried a failover on one of the active MDSs, a
> > > > standby-active did a replay but crashed (laggy or crashed).
> > > > Memory and CPU usage on the MDS went sky high and it became
> > > > unresponsive after some time. I ended up with one active MDS
> > > > but got stuck with a degraded filesystem and warning messages
> > > > about MDS behind on trimming.
> > > >
> > > > I never got any additional MDS active since then. I tried
> > > > restarting the last active MDS (because the filesystem was
> > > > becoming unresponsive and had a load of slow requests) and it
> > > > never got past replay -> resolve. My MDS cluster still isn't
> > > > active... :(
> > > What is the 'ceph -w' output? If you have enabled multi-active
> > > MDS, all MDS ranks need to enter the 'resolve' state before they
> > > can continue to recover.
> > > 
> > > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > What is the "resolve" state? I never saw it pre-Luminous.
> > > > Debug on 20 doesn't give me much.
> > > > 
> > > > I also tried removing the Multi MDS setup, but my CephFS cluster
> > > > won't go active. How can I get my CephFS up and running again in
> > > > an active state?
> > > > 
> > > > Please help.
> > > > 
> > > > 
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com