Re: MDS damaged

Adam Tygart <mozes@xxxxxxx> · Sun, 15 Jul 2018 11:01:29 -0500

Check out the message titled "IMPORTANT: broken luminous 12.2.6
release in repo, do not upgrade".

It sounds like 12.2.7 should come *soon* to fix this transparently.
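
In the meantime, it may be worth confirming which daemons already run
the affected release. A minimal sketch, assuming the JSON layout that
`ceph versions` prints on Luminous (daemon type mapped to version string
and count). The sample input and the function name daemons_on_release
are made up for illustration, and <sha1> stands in for the real build
hash:

```python
import json

# Hypothetical sample shaped like `ceph versions` output; counts are invented.
SAMPLE = """
{
  "mon": {"ceph version 12.2.6 (<sha1>) luminous (stable)": 7},
  "osd": {"ceph version 12.2.5 (<sha1>) luminous (stable)": 10,
          "ceph version 12.2.6 (<sha1>) luminous (stable)": 2},
  "mds": {"ceph version 12.2.6 (<sha1>) luminous (stable)": 4}
}
"""


def daemons_on_release(versions_json, release):
    """Return {daemon_type: count} for daemons reporting `release`."""
    affected = {}
    for daemon_type, by_version in json.loads(versions_json).items():
        if daemon_type == "overall":
            continue  # summary entry; counting it would double-count daemons
        # The version number is the third whitespace-separated token.
        count = sum(n for version, n in by_version.items()
                    if version.split()[2:3] == [release])
        if count:
            affected[daemon_type] = count
    return affected


print(daemons_on_release(SAMPLE, "12.2.6"))
```

On a live cluster the input would presumably come from the real command
output rather than the sample string.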

--
Adam

On Sun, Jul 15, 2018 at 10:28 AM, Nicolas Huillard
<nhuillard@xxxxxxxxxxx> wrote:
> Hi all,
>
> I have the same problem here:
> * it happened during the upgrade from 12.2.5 to 12.2.6
> * I restarted all the OSD servers in turn, which did not trigger anything
> bad
> * a few minutes after upgrading the OSDs/MONs/MDSs/MGRs (all on the
> same set of servers) and unsetting noout, I upgraded the clients, which
> triggered a temporary loss of connectivity between the two datacenters
>
> 2018-07-15 12:49:09.851204 mon.brome mon.0 172.21.0.16:6789/0 98 : cluster [INF] Health check cleared: OSDMAP_FLAGS (was: noout flag(s) set)
> 2018-07-15 12:49:09.851286 mon.brome mon.0 172.21.0.16:6789/0 99 : cluster [INF] Cluster is now healthy
> 2018-07-15 12:56:26.446062 mon.soufre mon.5 172.22.0.20:6789/0 34 : cluster [INF] mon.soufre calling monitor election
> 2018-07-15 12:56:26.446288 mon.oxygene mon.3 172.22.0.16:6789/0 13 : cluster [INF] mon.oxygene calling monitor election
> 2018-07-15 12:56:26.522520 mon.macaret mon.6 172.30.0.3:6789/0 10 : cluster [INF] mon.macaret calling monitor election
> 2018-07-15 12:56:26.539575 mon.phosphore mon.4 172.22.0.18:6789/0 20 : cluster [INF] mon.phosphore calling monitor election
> 2018-07-15 12:56:36.485881 mon.oxygene mon.3 172.22.0.16:6789/0 14 : cluster [INF] mon.oxygene is new leader, mons oxygene,phosphore,soufre,macaret in quorum (ranks 3,4,5,6)
> 2018-07-15 12:56:36.930096 mon.oxygene mon.3 172.22.0.16:6789/0 19 : cluster [WRN] Health check failed: 3/7 mons down, quorum oxygene,phosphore,soufre,macaret (MON_DOWN)
> 2018-07-15 12:56:37.041888 mon.oxygene mon.3 172.22.0.16:6789/0 26 : cluster [WRN] overall HEALTH_WARN 3/7 mons down, quorum oxygene,phosphore,soufre,macaret
> 2018-07-15 12:56:55.456239 mon.oxygene mon.3 172.22.0.16:6789/0 57 : cluster [WRN] daemon mds.fluor is not responding, replacing it as rank 0 with standby daemon mds.brome
> 2018-07-15 12:56:55.456365 mon.oxygene mon.3 172.22.0.16:6789/0 58 : cluster [INF] Standby daemon mds.chlore is not responding, dropping it
> 2018-07-15 12:56:55.456486 mon.oxygene mon.3 172.22.0.16:6789/0 59 : cluster [WRN] daemon mds.brome is not responding, replacing it as rank 0 with standby daemon mds.oxygene
> 2018-07-15 12:56:55.464196 mon.oxygene mon.3 172.22.0.16:6789/0 60 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
> 2018-07-15 12:56:55.691674 mds.oxygene mds.0 172.22.0.16:6800/4212961230 1 : cluster [ERR] Error recovering journal 0x200: (5) Input/output error
> 2018-07-15 12:56:56.645914 mon.oxygene mon.3 172.22.0.16:6789/0 64 : cluster [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
>
> The log above gives a hint about journal 0x200. The underlying error
> appears much later in the logs:
>
> 2018-07-15 16:34:28.567267 osd.11 osd.11 172.22.0.20:6805/2150 21 : cluster [ERR] 6.14 full-object read crc 0x38f8faae != expected 0xed23f8df on 6:292cf221:::200.00000000:head
>
> I tried a repair and a deep-scrub on PG 6.14, with the same lack of result
> as Alessandro.
> I can't find any other error about the MDS journal object 200.00000000 on
> the other OSDs, so I can't check CRCs.
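
To hunt for more of these across the OSD logs, something like the
following might help. It is only a sketch: the regular expression is
modelled on the single line quoted above, the function name
parse_crc_error is made up, and splitting the object ID on ":" assumes
the object name itself contains no colon:

```python
import re

# Pattern derived from the one "full-object read crc" line in this thread.
CRC_ERROR = re.compile(
    r"(?P<osd>osd\.\d+) .*\[ERR\] (?P<pg>\d+\.[0-9a-f]+) "
    r"full-object read crc (?P<got>0x[0-9a-f]+) "
    r"!= expected (?P<expected>0x[0-9a-f]+) on (?P<oid>\S+)"
)

LINE = ("2018-07-15 16:34:28.567267 osd.11 osd.11 172.22.0.20:6805/2150 21 : "
        "cluster [ERR] 6.14 full-object read crc 0x38f8faae != expected "
        "0xed23f8df on 6:292cf221:::200.00000000:head")


def parse_crc_error(line):
    """Return the fields of a CRC mismatch line, or None if it does not match."""
    match = CRC_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumed layout: pool:hash:namespace:key:name:snapshot
    fields["object"] = fields["oid"].split(":")[4]
    return fields


print(parse_crc_error(LINE))
```
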
>
> I'll try the next steps taken by Alessandro, but I'm in unknown
> territory...
>
> On Wednesday, 11 July 2018 at 18:10 +0300, Alessandro De Salvo
> wrote:
>> Hi,
>>
>> after the upgrade to luminous 12.2.6 today, all our MDSes have been
>> marked as damaged. Trying to restart the instances only results in
>> standby MDSes. We currently have 2 active filesystems with 2 MDSes
>> each.
>>
>> I found the following error messages in the mon:
>>
>>
>> mds.0 <node1_IP>:6800/2412911269 down:damaged
>> mds.1 <node2_IP>:6800/830539001 down:damaged
>> mds.0 <node3_IP>:6800/4080298733 down:damaged
>>
>>
>> Whenever I try to force the repaired state with ceph mds repaired
>> <fs_name>:<rank> I get something like this in the MDS logs:
>>
>>
>> 2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro) error getting journal off disk
>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error
>>
>>
>> Any attempt to run the journal export results in errors, like
>> this one:
>>
>>
>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
>> Error ((5) Input/output error)
>> 2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
>> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
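
For the object-by-object route, the object names follow from the logs in
this thread: rank 0 uses journal 0x200 with header object 200.00000000,
and rank 1 uses journal 0x201. A small sketch, assuming the usual
<inode in hex>.<8-hex-digit index> naming; the function name
journal_object_name is made up here:

```python
MDS_JOURNAL_BASE_INO = 0x200  # journal inode of rank 0, per the logs above


def journal_object_name(rank, index=0):
    """Name of the journal object for an MDS rank; index 0 is the header."""
    return "%x.%08x" % (MDS_JOURNAL_BASE_INO + rank, index)


print(journal_object_name(0))  # header object of rank 0
print(journal_object_name(1))  # header object of rank 1
```

Each name could then be tried with
rados -p cephfs_metadata get <name> <file>, although a read of the
damaged header will probably fail with the same I/O error.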
>>
>>
>> The same happens for recover_dentries:
>>
>> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
>> 2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
>> Events by type:
>> Errors: 0
>>
>> Is there something I could try to get the cluster back?
>>
>> I was able to dump the contents of the metadata pool with
>> rados export -p cephfs_metadata <filename>, and I'm currently trying the
>> procedure described in
>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
>> but I'm not sure whether it will work, as it's apparently doing nothing
>> at the moment (maybe it's just very slow).
>>
>> Any help is appreciated, thanks!
>>
>>
>>      Alessandro
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> Nicolas Huillard
> Associé fondateur - Directeur Technique - Dolomède
>
> nhuillard@xxxxxxxxxxx
> Fixe : +33 9 52 31 06 10
> Mobile : +33 6 50 27 69 08
> http://www.dolomede.fr/
>
> https://reseauactionclimat.org/planetman/
> http://climat-2020.eu/
> http://www.350.org/