On Thu, Jul 5, 2018 at 4:51 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
>
> Hi,
>
> On Thu, 2018-07-05 at 09:55 +0800, Yan, Zheng wrote:
> > On Wed, Jul 4, 2018 at 7:02 PM Dennis Kramer (DBS) <dennis@xxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > I have managed to get the cephfs mds online again... for a while.
> > >
> > > These threads cover more or less my symptoms and helped me get it
> > > up and running again:
> > > - https://www.spinics.net/lists/ceph-users/msg45696.html
> > > - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023133.html
> > >
> > > After some time it all goes down again and stays in a loop trying
> > > to get into an "active" state, then after a while it crashes again.
> > > Logs from the MDS right before it crashes:
> > >
> > > 0> 2018-07-04 11:34:54.657595 7f50f1c29700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f50f1c29700 time 2018-07-04 11:34:54.638462
> > > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> > >
> > > Cluster logging:
> > > 2018-07-04 12:50:04.741625 mds.mds01 [ERR] dir 0x1000098a246 object missing on disk; some files may be lost (<obfuscated file path>)
> > > 2018-07-04 12:50:16.352824 mon.mon01 [ERR] MDS health message (mds.0): Metadata damage detected
> > > 2018-07-04 12:50:16.480045 mon.mon01 [ERR] Health check failed: 1 MDSs report damaged metadata (MDS_DAMAGE)
> > > 2018-07-04 12:53:36.194056 mds.mds01 [ERR] loaded dup inode 0x10000989e52 [2,head] v1104251 at <file>, but inode 0x10000989e52.head v37 already exists at <another file>
> > >
> > > CephFS won't stay up for long; after some time it crashes and I
> > > need to reset the fs to get it back again.
> > >
> > > I'm at a loss here.
> >
> > I guess you did reset the mds journal. Have you run the complete
> > recovery sequence?
> >
> > cephfs-data-scan init
> > cephfs-data-scan scan_extents <data pool>
> > cephfs-data-scan scan_inodes <data pool>
> > cephfs-data-scan scan_links
> >
> > see http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>
> cephfs-data-scan init
> Inode 0x0x1 already exists, skipping create. Use --force-init to overwrite the existing object.
> Inode 0x0x100 already exists, skipping create. Use --force-init to overwrite the existing object.
>
> Is it safe to run --force-init or will it break my FS?
> Furthermore, do I need to bring my MDS offline while doing the recovery process of:
>
> cephfs-data-scan init
> cephfs-data-scan scan_extents <data pool>
> cephfs-data-scan scan_inodes <data pool>
> cephfs-data-scan scan_links
>
> Thank you in advance,

Yes, the mds needs to be offline. You can try running only "scan_links".

> > > On Wed, 2018-06-27 at 21:38 +0800, Yan, Zheng wrote:
> > > > On Wed, Jun 27, 2018 at 6:16 PM Dennis Kramer (DT) <dennis@holmes.nl> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Currently I'm running Ceph Luminous 12.2.5.
> > > > >
> > > > > This morning I tried running multi-MDS with:
> > > > > ceph fs set <fs_name> max_mds 2
> > > > >
> > > > > I have 5 MDS servers. After running the above command,
> > > > > I had 2 active MDSs, 2 standby-active and 1 standby.
> > > > >
> > > > > And after trying a failover on one of the active MDSs, a
> > > > > standby-active did a replay but crashed (laggy or crashed).
> > > > > Memory and CPU usage went sky-high on the MDS and it became
> > > > > unresponsive after some time. I ended up with the one active
> > > > > MDS, but got stuck with a degraded filesystem and warning
> > > > > messages about the MDS being behind on trimming.
> > > > >
> > > > > I never got any additional MDS active since then. I tried
> > > > > restarting the last active MDS (because the filesystem was
> > > > > becoming unresponsive and had a load of slow requests) and it
> > > > > never got past replay -> resolve. My MDS cluster still isn't
> > > > > active... :(
> > > >
> > > > What is the 'ceph -w' output? If you have enabled multi-active
> > > > mds, all mds ranks need to enter the 'resolve' state before they
> > > > can continue to recover.
> > > >
> > > > > What is the "resolve" state? I have never seen that before
> > > > > pre-Luminous. Debug on 20 doesn't give me much.
> > > > >
> > > > > I also tried removing the multi-MDS setup, but my CephFS
> > > > > cluster won't go active. How can I get my CephFS up and
> > > > > running again in an active state?
> > > > >
> > > > > Please help.
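
[Editor's note] For reference, the full offline recovery pass the thread
refers to looks roughly like the following. This is a sketch based on the
disaster-recovery doc linked above, assuming a single filesystem and
Luminous-era tool syntax; <data pool> is a placeholder from the thread, the
backup path is illustrative, and "0" assumes rank 0 is the damaged rank:

    # Stop every MDS daemon first -- the recovery tools must not run
    # alongside a live MDS (this answers the "offline" question above).
    systemctl stop ceph-mds.target           # on each MDS host

    # Take a backup of the journal before changing anything.
    cephfs-journal-tool journal export /root/mds-journal-backup.bin

    # Flush whatever the journal still holds into the metadata pool,
    # then reset the journal and the session table.
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset
    cephfs-table-tool all reset session

    # Rebuild metadata from the contents of the data pool. Per the tool's
    # own output quoted above, --force-init overwrites the existing
    # root/mdsdir objects, so only add it when you intend to do that.
    cephfs-data-scan init
    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

    # If the rank was marked damaged (MDS_DAMAGE), clear the flag and
    # bring one MDS back up.
    ceph mds repaired 0
    systemctl start ceph-mds.target

The scan_extents and scan_inodes steps walk every object in the data pool
and can take a long time; the linked doc also shows splitting them across
several parallel workers with the --worker_n/--worker_m options.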