Re: MDS damaged

Adam Tygart <mozes@xxxxxxx> · Fri, 13 Jul 2018 07:00:57 -0500

Bluestore.

On Fri, Jul 13, 2018, 05:56 Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
Hi Adam,

Are your osds bluestore or filestore?

-- dan

On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart <mozes@xxxxxxx> wrote:

>
> I've hit this today with an upgrade to 12.2.6 on my backup cluster.
> Unfortunately there were issues with the logs (in that the files
> weren't writable) until after the issue struck.
>
> 2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log [ERR] : 5.255 full-object read crc 0x4e97b4e != expected 0x6cfe829d on 5:aa448500:::500.00000000:head
>
> It is a backup cluster and I can keep it around or blow away the data
> (in this instance) as needed for testing purposes.
>
> --
> Adam
>
> On Thu, Jul 12, 2018 at 10:39 AM, Alessandro De Salvo
> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> > Some progress, and more pain...
> >
> > I was able to recover the 200.00000000 object using ceph-objectstore-tool
> > on one of the OSDs (all copies are identical), but re-injecting it with
> > just rados put gave no error, while rados get still returned the same
> > I/O error. So the solution was to rm the object and then put it again;
> > that worked.
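A minimal dry-run sketch of the extract/rm/put sequence described above. Assumptions not taken from the thread: systemd units named ceph-osd@N, the default /var/lib/ceph/osd/ceph-N data path, and a /tmp scratch file; the helper names run and reinject_object are hypothetical. Commands are only printed unless DRY_RUN=0 is exported.

```shell
#!/bin/sh
# Dry-run sketch of the "extract with ceph-objectstore-tool, rm, put"
# re-injection. Commands are printed, not executed, unless DRY_RUN=0 is
# exported. Unit names, data path and scratch file are assumptions.
set -eu

# Print a command, or run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper. Usage: reinject_object POOL OBJECT PGID OSD_ID
reinject_object() {
    pool=$1 obj=$2 pgid=$3 osd=$4
    out="/tmp/${obj}.bin"    # assumed scratch location

    # ceph-objectstore-tool needs exclusive access, so stop the OSD first.
    run systemctl stop "ceph-osd@${osd}"
    run ceph-objectstore-tool --data-path "/var/lib/ceph/osd/ceph-${osd}" \
        --pgid "$pgid" "$obj" get-bytes "$out"
    run systemctl start "ceph-osd@${osd}"

    # A plain "rados put" over the bad object did not clear the error in
    # the thread; removing the object first and then writing it did.
    run rados -p "$pool" rm "$obj"
    run rados -p "$pool" put "$obj" "$out"

    # Read the object back to confirm the I/O error is gone.
    run rados -p "$pool" get "$obj" "${out}.check"
}

reinject_object cephfs_metadata 200.00000000 10.14 23
```

The same helper could be pointed at another metadata object once its PG and acting OSDs are known from ceph osd map; FileStore OSDs may additionally need --journal-path.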

> >
> > However, after restarting one of the MDSes and setting it to repaired,
> > I've hit another, similar problem:
> >
> > 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] : error reading table object 'mds0_inotable' -5 ((5) Input/output error)
> >
> > Can I safely try to do the same as for object 200.00000000? Should I check
> > something before trying it? Again, checking the copies of the object, they
> > have identical md5sums on all the replicas.
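The md5sum comparison mentioned above can be scripted once the copies have been extracted from each replica (for example with ceph-objectstore-tool ... get-bytes). A minimal sketch; the function name and file names are hypothetical, and GNU coreutils md5sum is assumed. Matching sums only show that the replicas agree with each other, not that the bytes match the digest the OSDs expect.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeed (exit 0) only if every file given has the
# same md5sum, i.e. all extracted replica copies are byte-identical.
copies_identical() {
    distinct=$(md5sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}

# Example with hypothetical file names for the copies from osd.23/35/18:
# copies_identical osd23.bin osd35.bin osd18.bin && echo "replicas agree"
```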

> >
> > Thanks,
> >
> >     Alessandro
> >
> > On 12/07/18 16:46, Alessandro De Salvo wrote:
> >
> > Unfortunately yes, all the OSDs were restarted a few times, but no change.
> >
> > Thanks,
> >
> >     Alessandro
> >
> > On 12/07/18 15:55, Paul Emmerich wrote:
> >
> > This might seem like a stupid suggestion, but: have you tried to restart
> > the OSDs?
> >
> > I've also encountered some random CRC errors that only showed up when
> > trying to read an object, but not on scrubbing, and that magically
> > disappeared after restarting the OSD.
> >
> > However, in my case it was clearly related to
> > https://tracker.ceph.com/issues/22464, which doesn't seem to be the
> > issue here.
> >
> > Paul
> >
> > 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo
> > <Alessandro.DeSalvo@xxxxxxxxxxxxx>:
> >>
> >> On 12/07/18 11:20, Alessandro De Salvo wrote:
> >>
> >>>
> >>> On 12/07/18 10:58, Dan van der Ster wrote:
> >>>>
> >>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum <gfarnum@xxxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
> >>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> OK, I found where the object is:
> >>>>>>
> >>>>>> ceph osd map cephfs_metadata 200.00000000
> >>>>>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
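The PG id and acting set can be pulled out of the ceph osd map line above with a little text parsing. A sketch assuming the plain-text output format shown in this thread; the function names are hypothetical, and ceph osd map ... --format json is likely a more robust input where available.

```shell
#!/bin/sh
set -eu

# Print the PG id, e.g. "10.14" from "... -> pg 10.844f3494 (10.14) -> ...".
pg_of() {
    sed -n 's/.*-> pg [0-9a-f.]* (\([0-9a-f.]*\)).*/\1/p'
}

# Print the acting OSD ids, one per line, from "... acting ([23,35,18], p23)".
acting_osds() {
    sed -n 's/.*acting (\[\([0-9,]*\)].*/\1/p' | tr ',' '\n'
}

line="osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)"
echo "$line" | pg_of        # 10.14
echo "$line" | acting_osds  # 23, 35 and 18, one per line
```

The acting list is what tells you which OSD logs to check and which OSDs hold a copy that ceph-objectstore-tool could extract.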

> >>>>>>
> >>>>>> So, looking at the logs of OSDs 23, 35 and 18, I in fact see:
> >>>>>>
> >>>>>> osd.23:
> >>>>>>
> >>>>>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
> >>>>>>
> >>>>>> osd.35:
> >>>>>>
> >>>>>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
> >>>>>>
> >>>>>> osd.18:
> >>>>>>
> >>>>>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
> >>>>>>
> >>>>>> So, basically the same error everywhere.
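To compare these reports systematically, the computed crc, expected crc and object name can be extracted from every such error and the distinct combinations counted. A minimal sketch; the function name is hypothetical, and it assumes each error sits on a single log line, as it does in the OSD log files.

```shell
#!/bin/sh
# Extract "<computed crc> <expected crc> <object>" from OSD
# "full-object read crc" errors read on stdin, one result per error.
set -eu

# Hypothetical helper name; assumes each error is on a single log line.
crc_mismatches() {
    sed -n 's/.*full-object read crc \(0x[0-9a-f]*\) != expected \(0x[0-9a-f]*\) on \([^ ]*\).*/\1 \2 \3/p'
}

# Example, assuming the default log locations for osd.23, osd.35, osd.18.
# A single output line means every OSD reports the very same mismatch.
# cat /var/log/ceph/ceph-osd.23.log /var/log/ceph/ceph-osd.35.log \
#     /var/log/ceph/ceph-osd.18.log | crc_mismatches | sort -u
```

Identical computed crcs on all replicas suggest the replicas hold the same bytes and all disagree with the same stored (expected) digest, which is the situation Greg describes further down the thread.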

> >>>>>>
> >>>>>> I'm trying to issue a repair of pg 10.14, but I'm not sure it will
> >>>>>> help.
> >>>>>>
> >>>>>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes),
> >>>>>> and no disk problems anywhere. No relevant errors in syslogs; the
> >>>>>> hosts are just fine. I cannot exclude an error on the RAID
> >>>>>> controllers, but 2 of the OSDs with 10.14 are on one SAN system and
> >>>>>> the third is on a different one, so I would tend to rule out that
> >>>>>> both had (silent) errors at the same time.
> >>>>>
> >>>>> That's fairly distressing. At this point I'd probably try extracting
> >>>>> the object using ceph-objectstore-tool and seeing if it decodes properly as
> >>>>> an mds journal. If it does, you might risk just putting it back in place to
> >>>>> overwrite the crc.
> >>>>>
> >>>> Wouldn't it be easier to scrub repair the PG to fix the crc?
> >>>
> >>> This is what I already instructed the cluster to do (a deep scrub), but
> >>> I'm not sure it can repair the object if all replicas are bad, as seems
> >>> to be the case here.
> >>
> >> I finally managed (with the help of Dan) to perform the deep-scrub on pg
> >> 10.14, but the deep scrub did not detect anything wrong. Trying to
> >> repair 10.14 also had no effect.
> >> Still, when trying to access the object, I get this in the OSDs:
> >>
> >> 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 10:292cf221:::200.00000000:head
> >>
> >> Was deep-scrub supposed to detect the wrong crc? If yes, then it sounds
> >> like a bug.
> >> Can I force the repair somehow?
> >> Thanks,
> >>
> >>    Alessandro
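The scrub and repair steps discussed in this exchange can be written down as a short command sequence. A dry-run sketch; the helper names are hypothetical. Because deep-scrub only queues work, in real use one would wait for it to complete (for example until last_deep_scrub_stamp in the output of ceph pg 10.14 query changes) before listing inconsistencies. In this thread neither deep-scrub nor repair cleared the error.

```shell
#!/bin/sh
# Dry-run sketch of the scrub/repair sequence discussed for pg 10.14.
# Commands are printed, not executed, unless DRY_RUN=0 is exported.
set -eu

# Print a command, or run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper. Usage: scrub_and_repair PGID
scrub_and_repair() {
    pgid=$1
    # Queue a deep scrub; it runs asynchronously on the OSDs.
    run ceph pg deep-scrub "$pgid"
    # Once the scrub has finished, list the objects it flagged, if any.
    run rados list-inconsistent-obj "$pgid" --format=json-pretty
    # Ask the primary OSD to repair the inconsistencies scrub recorded.
    run ceph pg repair "$pgid"
}

scrub_and_repair 10.14
```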

> >>
> >>>
> >>>>
> >>>> Alessandro, did you already try a deep-scrub on pg 10.14?
> >>>
> >>> I'm waiting for the cluster to do that; I sent the request earlier this
> >>> morning.
> >>>
> >>>> I expect it'll show an inconsistent object. Though, I'm unsure if
> >>>> repair will correct the crc, given that in this case *all* replicas
> >>>> have a bad crc.
> >>>
> >>> Exactly, this is what I wonder too.
> >>> Cheers,
> >>>
> >>>     Alessandro
> >>>
> >>>>
> >>>> --Dan
> >>>>
> >>>>> However, I'm also quite curious how it ended up that way, with a
> >>>>> checksum mismatch but identical data (and identical checksums!) across the
> >>>>> three replicas. Have you previously done some kind of scrub repair on the
> >>>>> metadata pool? Did the PG perhaps get backfilled due to cluster changes?
> >>>>> -Greg
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>       Alessandro
> >>>>>>
> >>>>>> On 11/07/18 18:56, John Spray wrote:
> >>>>>>>
> >>>>>>> On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
> >>>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> Hi John,
> >>>>>>>>
> >>>>>>>> indeed, I get an I/O error when reading it by hand too:
> >>>>>>>>
> >>>>>>>> rados get -p cephfs_metadata 200.00000000 200.00000000
> >>>>>>>> error getting cephfs_metadata/200.00000000: (5) Input/output error
> >>>>>>>
> >>>>>>> Next step would be to go look for corresponding errors on your OSD
> >>>>>>> logs, system logs, and possibly also check things like the SMART
> >>>>>>> counters on your hard drives for possible root causes.
> >>>>>>>
> >>>>>>> John
> >>>>>>>
> >>>>>>>> Can this be recovered somehow?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>>        Alessandro
> >>>>>>>>
> >>>>>>>> On 11/07/18 18:33, John Spray wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
> >>>>>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> after the upgrade to luminous 12.2.6 today, all our MDSes have
> >>>>>>>>>> been marked as damaged. Trying to restart the instances only
> >>>>>>>>>> results in standby MDSes. We currently have 2 active filesystems,
> >>>>>>>>>> with 2 MDSes each.
> >>>>>>>>>>
> >>>>>>>>>> I found the following error messages in the mon:
> >>>>>>>>>>
> >>>>>>>>>> mds.0 <node1_IP>:6800/2412911269 down:damaged
> >>>>>>>>>> mds.1 <node2_IP>:6800/830539001 down:damaged
> >>>>>>>>>> mds.0 <node3_IP>:6800/4080298733 down:damaged
> >>>>>>>>>>
> >>>>>>>>>> Whenever I try to force the repaired state with ceph mds repaired
> >>>>>>>>>> <fs_name>:<rank> I get something like this in the MDS logs:
> >>>>>>>>>>
> >>>>>>>>>> 2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro) error getting journal off disk
> >>>>>>>>>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error
> >>>>>>>>>
> >>>>>>>>> An EIO reading the journal header is pretty scary. The MDS itself
> >>>>>>>>> probably can't tell you much more about this: you need to dig down
> >>>>>>>>> into the RADOS layer.  Try reading the 200.00000000 object (that
> >>>>>>>>> happens to be the rank 0 journal header, every CephFS filesystem
> >>>>>>>>> should have one) using the `rados` command line tool.
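Following John's note that 200.00000000 is the rank 0 journal header, the header object for any rank can be derived the same way, which also fits the "Error recovering journal 0x201" reported for mds.1. A sketch assuming the usual layout (journal inode 0x200 + rank, objects named <inode in hex>.<object index as 8 hex digits>, header = object 0); the function name is hypothetical.

```shell
#!/bin/sh
# Name of the MDS journal header object for a given rank, assuming the
# rank-N journal lives at inode 0x200 + N and its objects are named
# "<inode hex>.<object index as 8 hex digits>", the header being object 0.
set -eu

# Hypothetical helper. Usage: journal_header_object RANK
journal_header_object() {
    printf '%x.%08x\n' $((0x200 + $1)) 0
}

journal_header_object 0   # 200.00000000
journal_header_object 1   # 201.00000000
```

For example, rados -p cephfs_metadata get "$(journal_header_object 1)" /tmp/hdr.bin would test the rank 1 header the same way (the output path is arbitrary).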

> >>>>>>>>>
> >>>>>>>>> John
> >>>>>>>>>
> >>>>>>>>>> Any attempt at running the journal export results in errors, like
> >>>>>>>>>> this one:
> >>>>>>>>>>
> >>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
> >>>>>>>>>> Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
> >>>>>>>>>>
> >>>>>>>>>> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
> >>>>>>>>>>
> >>>>>>>>>> The same happens with recover_dentries:
> >>>>>>>>>>
> >>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
> >>>>>>>>>> Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
> >>>>>>>>>> Errors:
> >>>>>>>>>> 0
> >>>>>>>>>>
> >>>>>>>>>> Is there something I could try to get the cluster back?
> >>>>>>>>>>
> >>>>>>>>>> I was able to dump the contents of the metadata pool with rados
> >>>>>>>>>> export -p cephfs_metadata <filename> and I'm currently trying the
> >>>>>>>>>> procedure described in
> >>>>>>>>>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
> >>>>>>>>>> but I'm not sure if it will work, as it's apparently doing nothing
> >>>>>>>>>> at the moment (maybe it's just very slow).
> >>>>>>>>>>
> >>>>>>>>>> Any help is appreciated, thanks!
> >>>>>>>>>>
> >>>>>>>>>>         Alessandro
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> ceph-users mailing list
> >>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90