Hi Adam,

Are your osds bluestore or filestore?

-- dan

On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> I've hit this today with an upgrade to 12.2.6 on my backup cluster.
> Unfortunately there were issues with the logs (in that the files
> weren't writable) until after the issue struck.
>
> 2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log
> [ERR] : 5.255 full-object read crc 0x4e97b4e != expected 0x6cfe829d on
> 5:aa448500:::500.00000000:head
>
> It is a backup cluster and I can keep it around or blow away the data
> (in this instance) as needed for testing purposes.
>
> --
> Adam
>
> On Thu, Jul 12, 2018 at 10:39 AM, Alessandro De Salvo
> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> > Some progress, and more pain...
> >
> > I was able to recover the 200.00000000 object using the ceph-objectstore-tool
> > from one of the OSDs (all identical copies), but trying to re-inject it just
> > with rados put gave no error, while the get was still giving the same I/O
> > error. So the solution was to rm the object and then put it again; that worked.
> >
> > However, after restarting one of the MDSes and setting it to repaired, I've
> > hit another, similar problem:
> >
> > 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] :
> > error reading table object 'mds0_inotable' -5 ((5) Input/output error)
> >
> > Can I safely try to do the same as for object 200.00000000? Should I check
> > something before trying it? Again, checking the copies of the object, they
> > have identical md5sums on all the replicas.
> >
> > Thanks,
> >
> > Alessandro
> >
> > On 12/07/18 16:46, Alessandro De Salvo wrote:
> >
> > Unfortunately yes, all the OSDs were restarted a few times, but no change.
> >
> > Thanks,
> >
> > Alessandro
> >
> > On 12/07/18 15:55, Paul Emmerich wrote:
> >
> > This might seem like a stupid suggestion, but: have you tried restarting the
> > OSDs?
> >
> > I've also encountered some random CRC errors that only showed up when trying
> > to read an object, but not on scrubbing, and that magically disappeared after
> > restarting the OSD.
> >
> > However, in my case it was clearly related to
> > https://tracker.ceph.com/issues/22464, which doesn't
> > seem to be the issue here.
> >
> > Paul
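For reference, a minimal sketch of the rm-and-put recovery Alessandro describes
above, assuming the pool, PG and OSD ids quoted in this thread (cephfs_metadata,
pg 10.14, osd.23) and default OSD paths; the OSD must be stopped while
ceph-objectstore-tool is used, and the exact object spec and options can differ
per release, so treat this as an illustration rather than a verified procedure:

    # extract a copy of the object from one OSD holding it (hypothetical paths)
    systemctl stop ceph-osd@23
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 \
        200.00000000 get-bytes > /tmp/200.00000000
    # (if the bare object name is not accepted, use the JSON spec printed by --op list)
    systemctl start ceph-osd@23

    # a plain put over the bad object did not clear the error in this thread,
    # so remove it first, then re-inject the extracted copy and verify the read
    rados -p cephfs_metadata rm 200.00000000
    rados -p cephfs_metadata put 200.00000000 /tmp/200.00000000
    rados -p cephfs_metadata get 200.00000000 /tmp/200.00000000.check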
> >
> > 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo
> > <Alessandro.DeSalvo@xxxxxxxxxxxxx>:
> >>
> >> On 12/07/18 11:20, Alessandro De Salvo wrote:
> >>>
> >>> On 12/07/18 10:58, Dan van der Ster wrote:
> >>>>
> >>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum <gfarnum@xxxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
> >>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> OK, I found where the object is:
> >>>>>>
> >>>>>> ceph osd map cephfs_metadata 200.00000000
> >>>>>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg
> >>>>>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
> >>>>>>
> >>>>>> So, looking at the logs of osds 23, 35 and 18, I do in fact see:
> >>>>>>
> >>>>>> osd.23:
> >>>>>>
> >>>>>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
> >>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> >>>>>> 10:292cf221:::200.00000000:head
> >>>>>>
> >>>>>> osd.35:
> >>>>>>
> >>>>>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
> >>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> >>>>>> 10:292cf221:::200.00000000:head
> >>>>>>
> >>>>>> osd.18:
> >>>>>>
> >>>>>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
> >>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> >>>>>> 10:292cf221:::200.00000000:head
> >>>>>>
> >>>>>> So, basically the same error everywhere.
> >>>>>>
> >>>>>> I'm trying to issue a repair of pg 10.14, but I'm not sure whether it
> >>>>>> will help.
> >>>>>>
> >>>>>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and
> >>>>>> no disk problems anywhere. No relevant errors in the syslogs either; the
> >>>>>> hosts are just fine. I cannot exclude an error on the RAID controllers,
> >>>>>> but 2 of the OSDs with 10.14 are on one SAN system and one is on a
> >>>>>> different one, so I would tend to exclude that both had (silent) errors
> >>>>>> at the same time.
> >>>>>
> >>>>> That's fairly distressing. At this point I'd probably try extracting
> >>>>> the object using ceph-objectstore-tool and seeing if it decodes properly
> >>>>> as an mds journal. If it does, you might risk just putting it back in
> >>>>> place to overwrite the crc.
> >>>>>
> >>>> Wouldn't it be easier to scrub repair the PG to fix the crc?
> >>>
> >>> This is what I already instructed the cluster to do, a deep scrub, but
> >>> I'm not sure it can repair in the case where all replicas are bad, as
> >>> seems to be the case here.
> >>
> >> I finally managed (with the help of Dan) to perform the deep-scrub on pg
> >> 10.14, but the deep scrub did not detect anything wrong. Trying to repair
> >> 10.14 also has no effect.
> >> Still, when trying to access the object I get this in the OSDs:
> >>
> >> 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR]
> >> : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> >> 10:292cf221:::200.00000000:head
> >>
> >> Was deep-scrub supposed to detect the wrong crc? If yes, then it sounds
> >> like a bug.
> >> Can I force the repair somehow?
> >> Thanks,
> >>
> >> Alessandro
> >>
> >>>
> >>>> Alessandro, did you already try a deep-scrub on pg 10.14?
> >>>
> >>> I'm waiting for the cluster to do that; I sent the request earlier this
> >>> morning.
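Alessandro asks above how to force the repair; for reference, a short sketch of
the standard commands for driving a scrub/repair of a single PG by hand and
inspecting the result, using pg 10.14 from this thread (nothing here is specific
to this incident):

    ceph pg deep-scrub 10.14
    # once the scrub has run, list whatever it flagged (if anything)
    rados list-inconsistent-obj 10.14 --format=json-pretty
    ceph pg repair 10.14
    ceph health detail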
> >>>
> >>>> I expect
> >>>> it'll show an inconsistent object. Though, I'm unsure if repair will
> >>>> correct the crc, given that in this case *all* replicas have a bad crc.
> >>>
> >>> Exactly, this is what I wonder too.
> >>> Cheers,
> >>>
> >>> Alessandro
> >>>
> >>>>
> >>>> --Dan
> >>>>
> >>>>> However, I'm also quite curious how it ended up that way, with a
> >>>>> checksum mismatch but identical data (and identical checksums!) across
> >>>>> the three replicas. Have you previously done some kind of scrub repair
> >>>>> on the metadata pool? Did the PG perhaps get backfilled due to cluster
> >>>>> changes?
> >>>>> -Greg
> >>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alessandro
> >>>>>>
> >>>>>> On 11/07/18 18:56, John Spray wrote:
> >>>>>>>
> >>>>>>> On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
> >>>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> Hi John,
> >>>>>>>>
> >>>>>>>> in fact I get an I/O error by hand too:
> >>>>>>>>
> >>>>>>>> rados get -p cephfs_metadata 200.00000000 200.00000000
> >>>>>>>> error getting cephfs_metadata/200.00000000: (5) Input/output error
> >>>>>>>
> >>>>>>> Next step would be to look for corresponding errors in your OSD
> >>>>>>> logs and system logs, and possibly also check things like the SMART
> >>>>>>> counters on your hard drives for possible root causes.
> >>>>>>>
> >>>>>>> John
> >>>>>>>
> >>>>>>>> Can this be recovered somehow?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Alessandro
> >>>>>>>>
> >>>>>>>> On 11/07/18 18:33, John Spray wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
> >>>>>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> after the upgrade to luminous 12.2.6 today, all our MDSes have been
> >>>>>>>>>> marked as damaged. Trying to restart the instances only results in
> >>>>>>>>>> standby MDSes. We currently have 2 filesystems active, with 2 MDSes
> >>>>>>>>>> each.
> >>>>>>>>>>
> >>>>>>>>>> I found the following error messages in the mon:
> >>>>>>>>>>
> >>>>>>>>>> mds.0 <node1_IP>:6800/2412911269 down:damaged
> >>>>>>>>>> mds.1 <node2_IP>:6800/830539001 down:damaged
> >>>>>>>>>> mds.0 <node3_IP>:6800/4080298733 down:damaged
> >>>>>>>>>>
> >>>>>>>>>> Whenever I try to force the repaired state with ceph mds repaired
> >>>>>>>>>> <fs_name>:<rank> I get something like this in the MDS logs:
> >>>>>>>>>>
> >>>>>>>>>> 2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
> >>>>>>>>>> error getting journal off disk
> >>>>>>>>>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
> >>>>>>>>>> [ERR] : Error recovering journal 0x201: (5) Input/output error
> >>>>>>>>>
> >>>>>>>>> An EIO reading the journal header is pretty scary. The MDS itself
> >>>>>>>>> probably can't tell you much more about this: you need to dig down
> >>>>>>>>> into the RADOS layer. Try reading the 200.00000000 object (that
> >>>>>>>>> happens to be the rank 0 journal header; every CephFS filesystem
> >>>>>>>>> should have one) using the `rados` command line tool.
> >>>>>>>>>
> >>>>>>>>> John
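A minimal sketch of the check John suggests, assuming the metadata pool is named
cephfs_metadata as elsewhere in this thread:

    # find which PG and OSDs hold the rank-0 journal header
    ceph osd map cephfs_metadata 200.00000000
    # try to read the object directly; an EIO here confirms the problem is in
    # the RADOS layer rather than in the MDS itself
    rados -p cephfs_metadata stat 200.00000000
    rados -p cephfs_metadata get 200.00000000 /tmp/200.00000000.header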
> >>>>>>>>>
> >>>>>>>>>> Any attempt at running the journal export results in errors like
> >>>>>>>>>> this one:
> >>>>>>>>>>
> >>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
> >>>>>>>>>> Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
> >>>>>>>>>> Header 200.00000000 is unreadable
> >>>>>>>>>>
> >>>>>>>>>> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
> >>>>>>>>>> readable, attempt object-by-object dump with `rados`
> >>>>>>>>>>
> >>>>>>>>>> The same happens for recover_dentries:
> >>>>>>>>>>
> >>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
> >>>>>>>>>> Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
> >>>>>>>>>> 200.00000000 is unreadable
> >>>>>>>>>> Errors: 0
> >>>>>>>>>>
> >>>>>>>>>> Is there something I could try in order to get the cluster back?
> >>>>>>>>>>
> >>>>>>>>>> I was able to dump the contents of the metadata pool with rados export
> >>>>>>>>>> -p cephfs_metadata <filename> and I'm currently trying the procedure
> >>>>>>>>>> described in
> >>>>>>>>>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
> >>>>>>>>>> but I'm not sure if it will work, as it's apparently doing nothing at
> >>>>>>>>>> the moment (maybe it's just very slow).
> >>>>>>>>>>
> >>>>>>>>>> Any help is appreciated, thanks!
> >>>>>>>>>>
> >>>>>>>>>> Alessandro
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
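Once the 200.00000000 header object is readable again, a rough sketch of
re-checking the journal and clearing the damaged state, reusing only the
filesystem name and commands already quoted in this thread (cephfs, rank 0);
an illustration, not a verified recovery procedure:

    cephfs-journal-tool --rank=cephfs:0 journal inspect
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
    # mark the rank as repaired so a standby MDS can take it over again
    ceph mds repaired cephfs:0
    ceph fs status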