On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
<Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
>
> Hi John,
>
> in fact I get an I/O error by hand too:
>
> rados get -p cephfs_metadata 200.00000000 200.00000000
> error getting cephfs_metadata/200.00000000: (5) Input/output error

Next step would be to go look for corresponding errors in your OSD
logs and system logs, and possibly also check things like the SMART
counters on your hard drives for possible root causes.

John

>
> Can this be recovered somehow?
>
> Thanks,
>
> Alessandro
>
> On 11/07/18 18:33, John Spray wrote:
> > On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
> > <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> after the upgrade to luminous 12.2.6 today, all our MDSes have been
> >> marked as damaged. Trying to restart the instances only results in
> >> standby MDSes. We currently have 2 active filesystems with 2 MDSes each.
> >>
> >> I found the following error messages in the mon:
> >>
> >> mds.0 <node1_IP>:6800/2412911269 down:damaged
> >> mds.1 <node2_IP>:6800/830539001 down:damaged
> >> mds.0 <node3_IP>:6800/4080298733 down:damaged
> >>
> >> Whenever I try to force the repaired state with `ceph mds repaired
> >> <fs_name>:<rank>` I get something like this in the MDS logs:
> >>
> >> 2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
> >> error getting journal off disk
> >> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
> >> [ERR] : Error recovering journal 0x201: (5) Input/output error
> > An EIO reading the journal header is pretty scary. The MDS itself
> > probably can't tell you much more about this: you need to dig down
> > into the RADOS layer. Try reading the 200.00000000 object (that
> > happens to be the rank 0 journal header; every CephFS filesystem
> > should have one) using the `rados` command line tool.
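[Editor's note: a concrete version of the "check OSD logs and SMART counters"
step suggested above might look like the sketch below. The OSD id, log path,
and device name are placeholders; `ceph osd map` reports which OSDs actually
hold the object.]

```shell
# Find which PG and which OSDs serve the damaged journal header object.
ceph osd map cephfs_metadata 200.00000000

# On each OSD host in the acting set (osd.12 and its log path are examples):
grep -iE 'error|eio' /var/log/ceph/ceph-osd.12.log | tail -n 20

# Look for signs of a failing disk behind that OSD (/dev/sdb is an example).
smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect'
```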
> >
> > John
> >
> >
> >> Any attempt at running the journal export results in errors, like this one:
> >>
> >> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
> >> Error ((5) Input/output error)
> >> 2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
> >> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
> >> readable, attempt object-by-object dump with `rados`
> >>
> >> The same happens for recover_dentries:
> >>
> >> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
> >> Events by type:
> >> 2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
> >> Errors: 0
> >>
> >> Is there something I could try in order to get the cluster back?
> >>
> >> I was able to dump the contents of the metadata pool with `rados export
> >> -p cephfs_metadata <filename>` and I'm currently trying the procedure
> >> described in
> >> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
> >> but I'm not sure if it will work, as it's apparently doing nothing at the
> >> moment (maybe it's just very slow).
> >>
> >> Any help is appreciated, thanks!
> >>
> >> Alessandro

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
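[Editor's note: the object-by-object dump that the cephfs-journal-tool error
message suggests could be sketched as below. The pool name and the 200. object
prefix come from the thread; the backup paths are placeholders.]

```shell
# Dump the whole metadata pool first (destination path is a placeholder).
rados export -p cephfs_metadata /backup/cephfs_metadata.export

# Fallback: copy the rank-0 journal objects (prefix 200.) one by one,
# recording any that fail with EIO so the damage can be mapped.
mkdir -p /backup/objects
for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do
  rados -p cephfs_metadata get "$obj" "/backup/objects/$obj" \
    || echo "unreadable: $obj"
done
```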