Re: MDS damaged

Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> · Thu, 12 Jul 2018 11:20:53 +0300

On 12/07/18 10:58, Dan van der Ster wrote:
On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
OK, I found where the object is:

ceph osd map cephfs_metadata 200.00000000
osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
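For reference, the PG id and the acting set can be pulled out of that output mechanically. This is only a sketch: the helper name parse_osd_map is made up for illustration, and it assumes the plain-text format shown above (the sample is the output quoted in this message, wrapped as in the mail).

```shell
# Parse the plain-text output of `ceph osd map <pool> <object>` read from stdin.
# Prints: <pgid> <comma-separated acting OSD ids>
# parse_osd_map is a hypothetical helper, not part of Ceph.
parse_osd_map() {
    # Join wrapped lines first, then capture the PG id in parentheses
    # and the OSD list that follows "acting (".
    paste -sd' ' - \
        | sed -n 's/.*-> pg [^ ]* (\([0-9a-f.]*\)) .* acting (\[\([0-9,]*\)].*/\1 \2/p'
}

parse_osd_map <<'EOF'
osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
EOF
# prints: 10.14 23,35,18
```

On a live cluster the same helper could be fed directly, e.g. `ceph osd map cephfs_metadata 200.00000000 | parse_osd_map`.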

So, looking at the logs of OSDs 23, 35 and 18, I do indeed see:

osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.00000000:head

osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.00000000:head

osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.00000000:head

So, basically the same error everywhere.
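A quick way to confirm that every replica reports the same pair of checksums is to count the distinct "read crc / expected crc" pairs in the OSD logs. The sketch below assumes the log message format quoted above; crc_pairs is a hypothetical helper name, and the sample input is the three log entries from this message on single lines.

```shell
# Read OSD log text on stdin and print each distinct
# "<count> <read crc> <expected crc>" combination.
# crc_pairs is a hypothetical helper, not part of Ceph.
crc_pairs() {
    grep -o 'full-object read crc 0x[0-9a-f]* != expected 0x[0-9a-f]*' \
        | awk '{ print $4, $7 }' \
        | sort | uniq -c \
        | awk '{ print $1, $2, $3 }'
}

crc_pairs <<'EOF'
2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
EOF
# prints: 3 0x976aefc5 0x9ef2b41b
```

A single output line means all replicas computed the same crc over the data they hold and all disagree with the same stored value. More than one line would point at replicas that differ from each other.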

I'm trying to issue a repair of PG 10.14, but I'm not sure whether it
will help.

No SMART errors (the fileservers are SANs, with RAID6 + LVM volumes), and
no disk problems anywhere. No relevant errors in the syslogs; the hosts are
just fine. I cannot rule out an error on the RAID controllers, but two of
the OSDs holding 10.14 are on one SAN system and the third is on a different
one, so I would tend to rule out that both had (silent) errors at the same
time.

That's fairly distressing. At this point I'd probably try extracting the object using ceph-objectstore-tool and seeing if it decodes properly as an mds journal. If it does, you might risk just putting it back in place to overwrite the crc.
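A rough sketch of the extraction step, for illustration only. Everything specific here is an assumption: OSD id 23 is taken from the acting set above, the data path is the default /var/lib/ceph/osd/ceph-<id> layout, the OSDs are assumed to run under systemd, and the output file name is made up. It also assumes that get-bytes accepts an output file argument, as the tool's usage text suggests; FileStore OSDs may additionally need --journal-path. The commands are only printed unless DRY_RUN=0 is set.

```shell
# Hypothetical values for illustration; adjust to the local cluster.
OSD_ID=23
PGID=10.14
OBJECT=200.00000000
DATA_PATH="/var/lib/ceph/osd/ceph-${OSD_ID}"   # assumed default data directory
OUT="/tmp/${OBJECT}.osd${OSD_ID}.bin"           # made-up output location

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# ceph-objectstore-tool works on an offline store, so the OSD is stopped
# first; noout keeps the cluster from rebalancing in the meantime.
run ceph osd set noout
run systemctl stop "ceph-osd@${OSD_ID}"
run ceph-objectstore-tool --data-path "$DATA_PATH" --pgid "$PGID" "$OBJECT" get-bytes "$OUT"
run systemctl start "ceph-osd@${OSD_ID}"
run ceph osd unset noout
```

The reverse operation is set-bytes. Whether writing the object back is safe is exactly the risk mentioned above, so the sketch deliberately stops at extraction.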

Wouldn't it be easier to scrub repair the PG to fix the crc?

This is what I have already instructed the cluster to do (a deep scrub),
but I'm not sure it can repair the object if all replicas are bad, as
seems to be the case.

Alessandro, did you already try a deep-scrub on pg 10.14?

I'm waiting for the cluster to do that; I sent the request earlier this morning.

I expect it'll show an inconsistent object. Though, I'm unsure if repair
will correct the crc given that in this case *all* replicas have a bad crc.

Exactly, this is what I wonder too.
Cheers,

    Alessandro

--Dan
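For readers following along, the scrub and repair steps discussed here map onto three commands. The wrappers below are a minimal sketch with made-up function names; the underlying `ceph pg deep-scrub`, `rados list-inconsistent-obj` and `ceph pg repair` commands are the standard ones. Whether repair can fix a PG in which all replicas carry the same bad crc is the open question in this thread, so this is not a recommendation to run it.

```shell
# Queue a deep scrub of one PG. The command returns immediately; the scrub
# itself runs later, when the OSDs schedule it.
scrub_pg() {
    ceph pg deep-scrub "$1"
}

# Once the scrub has completed, list the objects it flagged as inconsistent.
show_inconsistent() {
    rados list-inconsistent-obj "$1" --format=json-pretty
}

# Ask the cluster to repair the PG.
repair_pg() {
    ceph pg repair "$1"
}

# Usage (on a cluster node with admin credentials):
#   scrub_pg 10.14
#   show_inconsistent 10.14
#   repair_pg 10.14
```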

However, I'm also quite curious how it ended up that way, with a checksum mismatch but identical data (and identical checksums!) across the three replicas. Have you previously done some kind of scrub repair on the metadata pool? Did the PG perhaps get backfilled due to cluster changes?
-Greg

Thanks,

      Alessandro

On 11/07/18 18:56, John Spray wrote:
On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
<Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
Hi John,

in fact I get an I/O error when reading the object by hand too:

rados get -p cephfs_metadata 200.00000000 200.00000000
error getting cephfs_metadata/200.00000000: (5) Input/output error

The next step would be to look for corresponding errors in your OSD
logs and system logs, and possibly also to check things like the SMART
counters on your hard drives for possible root causes.

John

Can this be recovered somehow?

Thanks,

       Alessandro

On 11/07/18 18:33, John Spray wrote:
On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
<Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
Hi,

after the upgrade to Luminous 12.2.6 today, all our MDSes have been
marked as damaged. Trying to restart the instances only results in
standby MDSes. We currently have 2 active filesystems with 2 MDSes each.

I found the following error messages in the mon:

mds.0 <node1_IP>:6800/2412911269 down:damaged
mds.1 <node2_IP>:6800/830539001 down:damaged
mds.0 <node3_IP>:6800/4080298733 down:damaged

Whenever I try to force the repaired state with ceph mds repaired
<fs_name>:<rank> I get something like this in the MDS logs:

2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary.  The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200.00000000 object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) using the `rados` command line tool.

John
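The object name follows from the rank. As a sketch, assuming the MDS journal inode is 0x200 + rank and that the RADOS objects backing an inode are named "<inode in hex>.<object index as 8 hex digits>" (this matches the "journal 0x201" reported for mds.1 above), the header object for any rank can be computed as follows. The function name is made up for illustration.

```shell
# Print the name of the journal header object for a given MDS rank.
# Assumes journal inode = 0x200 + rank and object index 0 for the header.
journal_header_object() {
    rank=$1
    printf '%x.%08x\n' $((0x200 + rank)) 0
}

journal_header_object 0   # prints 200.00000000
journal_header_object 1   # prints 201.00000000
```

The resulting name can then be passed to `rados get -p <metadata pool> <object> <file>` to test whether the header is readable.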

Any attempt to run the journal export results in errors, like this one:

cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
Header 200.00000000 is unreadable

2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`
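The object-by-object dump that the tool suggests could look roughly like this. It is a sketch only: dump_journal_objects and the output directory are made-up names, and it assumes the journal objects of rank 0 all carry the "200." prefix. Objects that cannot be read (such as the header here) are reported and skipped, so the remaining ones are still saved.

```shell
# Copy every object whose name starts with "<prefix>." out of a pool,
# one file per object, reporting which reads succeed and which fail.
# dump_journal_objects is a hypothetical helper, not part of Ceph.
dump_journal_objects() {
    pool=$1
    prefix=$2
    outdir=$3
    mkdir -p "$outdir" || return 1
    rados -p "$pool" ls | grep "^${prefix}\." | sort | while read -r obj; do
        if rados -p "$pool" get "$obj" "$outdir/$obj"; then
            echo "ok $obj"
        else
            echo "FAILED $obj"
        fi
    done
}

# Usage (hypothetical output directory):
#   dump_journal_objects cephfs_metadata 200 ./journal-dump
```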

The same happens for recover_dentries:

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
200.00000000 is unreadable
Errors:
0

Is there anything I could try to get the cluster back?

I was able to dump the contents of the metadata pool with rados export
-p cephfs_metadata <filename> and I'm currently trying the procedure
described in
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
but I'm not sure if it will work as it's apparently doing nothing at the
moment (maybe it's just very slow).

Any help is appreciated, thanks!

        Alessandro

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com