Re: MDS damaged

I've hit this today with an upgrade to 12.2.6 on my backup cluster.
Unfortunately there were issues with the logs (in that the files
weren't writable) until after the issue struck.

2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log
[ERR] : 5.255 full-object read crc 0x4e97b4e != expected 0x6cfe829d on
5:aa448500:::500.00000000:head

It is a backup cluster and I can keep it around or blow away the data
(in this instance) as needed for testing purposes.
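
If it helps with debugging, here is roughly how I'd poke at the affected
object directly (sketch; the pool name is a placeholder for whatever pool
has id 5 on this cluster):

ceph osd map <metadata pool> 500.00000000
rados -p <metadata pool> get 500.00000000 /tmp/500.00000000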

--
Adam

On Thu, Jul 12, 2018 at 10:39 AM, Alessandro De Salvo
<Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
> Some progress, and more pain...
>
> I was able to recover the 200.00000000 object using ceph-objectstore-tool
> on one of the OSDs (all copies are identical), but re-injecting it with a
> plain rados put gave no error while a subsequent get still returned the
> same I/O error. The solution was to rm the object and then put it again;
> that worked.
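>
> In concrete terms, roughly (object and pool names as in this thread, the
> /tmp path is just an example):
>
> rados -p cephfs_metadata rm 200.00000000
> rados -p cephfs_metadata put 200.00000000 /tmp/200.00000000
>
> where /tmp/200.00000000 is the copy extracted beforehand with
> ceph-objectstore-tool.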
>
> However, after restarting one of the MDSes and setting it to repaired, I've
> hit another, similar problem:
>
>
> 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] :
> error reading table object 'mds0_inotable' -5 ((5) Input/output error)
>
>
> Can I safely try to do the same as for object 200.00000000? Should I check
> something before trying it? Again, the copies of this object have identical
> md5sums on all the replicas.
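>
> In case it is useful, one way to cross-check the replicas would be
> something like this (sketch; OSD ids and paths are examples, and the OSD
> has to be stopped while ceph-objectstore-tool runs):
>
> ceph osd map cephfs_metadata mds0_inotable
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
>     --pgid <pgid> mds0_inotable get-bytes > /tmp/mds0_inotable.<id>
> md5sum /tmp/mds0_inotable.*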
>
> Thanks,
>
>
>     Alessandro
>
>
> On 12/07/18 16:46, Alessandro De Salvo wrote:
>
> Unfortunately yes, all the OSDs were restarted a few times, but no change.
>
> Thanks,
>
>
>     Alessandro
>
>
> On 12/07/18 15:55, Paul Emmerich wrote:
>
> This might seem like a stupid suggestion, but: have you tried to restart the
> OSDs?
>
> I've also encountered some random CRC errors that only showed up when
> trying to read an object (not during scrubbing) and that magically
> disappeared after restarting the OSD.
>
> However, in my case it was clearly related to
> https://tracker.ceph.com/issues/22464 which doesn't
> seem to be the issue here.
>
> Paul
>
> 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo
> <Alessandro.DeSalvo@xxxxxxxxxxxxx>:
>>
>>
>> On 12/07/18 11:20, Alessandro De Salvo wrote:
>>
>>>
>>>
>>>> On 12/07/18 10:58, Dan van der Ster wrote:
>>>>
>>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum <gfarnum@xxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> OK, I found where the object is:
>>>>>>
>>>>>>
>>>>>> ceph osd map cephfs_metadata 200.00000000
>>>>>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.00000000' -> pg
>>>>>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>>>>>>
>>>>>>
>>>>>> So, looking at the logs of osds 23, 35 and 18, in fact I see:
>>>>>>
>>>>>>
>>>>>> osd.23:
>>>>>>
>>>>>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
>>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b
>>>>>> on
>>>>>> 10:292cf221:::200.00000000:head
>>>>>>
>>>>>>
>>>>>> osd.35:
>>>>>>
>>>>>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
>>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b
>>>>>> on
>>>>>> 10:292cf221:::200.00000000:head
>>>>>>
>>>>>>
>>>>>> osd.18:
>>>>>>
>>>>>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
>>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b
>>>>>> on
>>>>>> 10:292cf221:::200.00000000:head
>>>>>>
>>>>>>
>>>>>> So, basically the same error everywhere.
>>>>>>
>>>>>> I'm trying to issue a repair of pg 10.14, but I'm not sure whether it
>>>>>> will help.
>>>>>>
>>>>>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes),
>>>>>> and no disk problems anywhere. No relevant errors in the syslogs; the
>>>>>> hosts are just fine. I cannot exclude an error on the RAID controllers,
>>>>>> but 2 of the OSDs holding 10.14 are on one SAN system and the third is
>>>>>> on a different one, so I would tend to exclude that they both had
>>>>>> (silent) errors at the same time.
>>>>>
>>>>>
>>>>> That's fairly distressing. At this point I'd probably try extracting
>>>>> the object using ceph-objectstore-tool and seeing if it decodes properly as
>>>>> an mds journal. If it does, you might risk just putting it back in place to
>>>>> overwrite the crc.
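>>>>>
>>>>> Roughly something like this (untested sketch; the OSD has to be stopped
>>>>> while ceph-objectstore-tool runs, and the pgid/paths come from the
>>>>> output earlier in the thread):
>>>>>
>>>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
>>>>>     --pgid 10.14 200.00000000 get-bytes > /tmp/200.00000000
>>>>> # after re-injecting it with "rados put", check whether the journal
>>>>> # still decodes cleanly:
>>>>> cephfs-journal-tool --rank=cephfs:0 journal inspect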
>>>>>
>>>> Wouldn't it be easier to scrub repair the PG to fix the crc?
>>>
>>>
>>> That is what I already instructed the cluster to do (a deep scrub), but
>>> I'm not sure it can repair anything when all the replicas are bad, as
>>> seems to be the case here.
>>
>>
>> I finally managed (with the help of Dan) to perform the deep-scrub on pg
>> 10.14, but the deep scrub did not detect anything wrong. Trying to repair
>> 10.14 also had no effect.
>> Still, trying to access the object I get in the OSDs:
>>
>> 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR]
>> : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
>> 10:292cf221:::200.00000000:head
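>>
>> Concretely, this is roughly what I issued (pg id from the earlier
>> "ceph osd map" output):
>>
>> ceph pg deep-scrub 10.14
>> ceph pg repair 10.14
>>
>> Would "rados list-inconsistent-obj 10.14" be expected to show anything
>> useful in a case like this?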
>>
>> Was the deep-scrub supposed to detect the wrong crc? If yes, then it
>> sounds like a bug.
>> Can I force the repair somehow?
>> Thanks,
>>
>>    Alessandro
>>
>>>
>>>>
>>>> Alessandro, did you already try a deep-scrub on pg 10.14?
>>>
>>>
>>> I'm waiting for the cluster to do that; I issued it earlier this
>>> morning.
>>>
>>>>   I expect
>>>> it'll show an inconsistent object. Though, I'm unsure if repair will
>>>> correct the crc given that in this case *all* replicas have a bad crc.
>>>
>>>
>>> Exactly, this is what I wonder too.
>>> Cheers,
>>>
>>>     Alessandro
>>>
>>>>
>>>> --Dan
>>>>
>>>>> However, I'm also quite curious how it ended up that way, with a
>>>>> checksum mismatch but identical data (and identical checksums!) across the
>>>>> three replicas. Have you previously done some kind of scrub repair on the
>>>>> metadata pool? Did the PG perhaps get backfilled due to cluster changes?
>>>>> -Greg
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>>       Alessandro
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/07/18 18:56, John Spray wrote:
>>>>>>>
>>>>>>> On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
>>>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> in fact I get an I/O error when reading it by hand too:
>>>>>>>>
>>>>>>>>
>>>>>>>> rados get -p cephfs_metadata 200.00000000 200.00000000
>>>>>>>> error getting cephfs_metadata/200.00000000: (5) Input/output error
>>>>>>>
>>>>>>> Next step would be to go look for corresponding errors on your OSD
>>>>>>> logs, system logs, and possibly also check things like the SMART
>>>>>>> counters on your hard drives for possible root causes.
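>>>>>>>
>>>>>>> For example, something along these lines (device names and OSD ids
>>>>>>> are just examples):
>>>>>>>
>>>>>>> grep -iE 'error|crc' /var/log/ceph/ceph-osd.23.log
>>>>>>> journalctl -k | grep -iE 'i/o error|ata|scsi'
>>>>>>> smartctl -a /dev/sda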
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Can this be recovered somehow?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>>        Alessandro
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/07/18 18:33, John Spray wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
>>>>>>>>> <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> after the upgrade to luminous 12.2.6 today, all our MDSes have
>>>>>>>>>> been marked as damaged. Trying to restart the instances only
>>>>>>>>>> results in standby MDSes. We currently have 2 active filesystems,
>>>>>>>>>> each with 2 MDSes.
>>>>>>>>>>
>>>>>>>>>> I found the following error messages in the mon:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> mds.0 <node1_IP>:6800/2412911269 down:damaged
>>>>>>>>>> mds.1 <node2_IP>:6800/830539001 down:damaged
>>>>>>>>>> mds.0 <node3_IP>:6800/4080298733 down:damaged
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Whenever I try to force the repaired state with ceph mds repaired
>>>>>>>>>> <fs_name>:<rank> I get something like this in the MDS logs:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2018-07-11 13:20:41.597970 7ff7e010e700  0
>>>>>>>>>> mds.1.journaler.mdlog(ro)
>>>>>>>>>> error getting journal off disk
>>>>>>>>>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster)
>>>>>>>>>> log
>>>>>>>>>> [ERR] : Error recovering journal 0x201: (5) Input/output error
>>>>>>>>>
>>>>>>>>> An EIO reading the journal header is pretty scary. The MDS itself
>>>>>>>>> probably can't tell you much more about this: you need to dig down
>>>>>>>>> into the RADOS layer. Try reading the 200.00000000 object (which
>>>>>>>>> happens to be the rank 0 journal header; every CephFS filesystem
>>>>>>>>> should have one) using the `rados` command line tool.
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Any attempt at running the journal export results in errors like
>>>>>>>>>> this one:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
>>>>>>>>>> Error ((5) Input/output error)2018-07-11 17:01:30.631571
>>>>>>>>>> 7f94354fff00 -1
>>>>>>>>>> Header 200.00000000 is unreadable
>>>>>>>>>>
>>>>>>>>>> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal
>>>>>>>>>> not
>>>>>>>>>> readable, attempt object-by-object dump with `rados`
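>>>>>>>>>>
>>>>>>>>>> An object-by-object dump, as suggested by the tool, would look
>>>>>>>>>> roughly like this (sketch; the rank-0 journal objects are the ones
>>>>>>>>>> named 200.<offset>):
>>>>>>>>>>
>>>>>>>>>> mkdir -p journal-dump
>>>>>>>>>> for o in $(rados -p cephfs_metadata ls | grep '^200\.'); do
>>>>>>>>>>     rados -p cephfs_metadata get "$o" "journal-dump/$o"
>>>>>>>>>> done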
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The same happens for recover_dentries:
>>>>>>>>>>
>>>>>>>>>> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
>>>>>>>>>> Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
>>>>>>>>>> 200.00000000 is unreadable
>>>>>>>>>> Errors:
>>>>>>>>>> 0
>>>>>>>>>>
>>>>>>>>>> Is there something I could try in order to get the cluster back?
>>>>>>>>>>
>>>>>>>>>> I was able to dump the contents of the metadata pool with rados
>>>>>>>>>> export
>>>>>>>>>> -p cephfs_metadata <filename> and I'm currently trying the
>>>>>>>>>> procedure
>>>>>>>>>> described in
>>>>>>>>>>
>>>>>>>>>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
>>>>>>>>>> but I'm not sure if it will work as it's apparently doing nothing
>>>>>>>>>> at the
>>>>>>>>>> moment (maybe it's just very slow).
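>>>>>>>>>>
>>>>>>>>>> For context, the core of that procedure is roughly the following
>>>>>>>>>> (very condensed sketch from memory; pool names are examples, and
>>>>>>>>>> the linked page has the exact flags for the alternate-pool
>>>>>>>>>> variant):
>>>>>>>>>>
>>>>>>>>>> ceph osd pool create cephfs_recovery_meta 64
>>>>>>>>>> cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta <data pool>
>>>>>>>>>> cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta <data pool>
>>>>>>>>>> cephfs-data-scan scan_links
>>>>>>>>>>
>>>>>>>>>> scan_extents walks every object in the data pool, so it can look
>>>>>>>>>> like it is doing nothing for a long time.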
>>>>>>>>>>
>>>>>>>>>> Any help is appreciated, thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Alessandro
>>>>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>>
>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
>
>
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



