Check out the message titled "IMPORTANT: broken luminous 12.2.6
release in repo, do not upgrade".

It sounds like 12.2.7 should come *soon* to fix this transparently.

--
Adam

On Sun, Jul 15, 2018 at 10:28 AM, Nicolas Huillard <nhuillard@xxxxxxxxxxx> wrote:
> Hi all,
>
> I have the same problem here:
> * during the upgrade from 12.2.5 to 12.2.6
> * I restarted all the OSD servers in turn, which did not trigger
> anything bad
> * a few minutes after upgrading the OSDs/MONs/MDSs/MGRs (all on the
> same set of servers) and unsetting noout, I upgraded the clients, which
> triggered a temporary loss of connectivity between the two datacenters
>
> 2018-07-15 12:49:09.851204 mon.brome mon.0 172.21.0.16:6789/0 98 : cluster [INF] Health check cleared: OSDMAP_FLAGS (was: noout flag(s) set)
> 2018-07-15 12:49:09.851286 mon.brome mon.0 172.21.0.16:6789/0 99 : cluster [INF] Cluster is now healthy
> 2018-07-15 12:56:26.446062 mon.soufre mon.5 172.22.0.20:6789/0 34 : cluster [INF] mon.soufre calling monitor election
> 2018-07-15 12:56:26.446288 mon.oxygene mon.3 172.22.0.16:6789/0 13 : cluster [INF] mon.oxygene calling monitor election
> 2018-07-15 12:56:26.522520 mon.macaret mon.6 172.30.0.3:6789/0 10 : cluster [INF] mon.macaret calling monitor election
> 2018-07-15 12:56:26.539575 mon.phosphore mon.4 172.22.0.18:6789/0 20 : cluster [INF] mon.phosphore calling monitor election
> 2018-07-15 12:56:36.485881 mon.oxygene mon.3 172.22.0.16:6789/0 14 : cluster [INF] mon.oxygene is new leader, mons oxygene,phosphore,soufre,macaret in quorum (ranks 3,4,5,6)
> 2018-07-15 12:56:36.930096 mon.oxygene mon.3 172.22.0.16:6789/0 19 : cluster [WRN] Health check failed: 3/7 mons down, quorum oxygene,phosphore,soufre,macaret (MON_DOWN)
> 2018-07-15 12:56:37.041888 mon.oxygene mon.3 172.22.0.16:6789/0 26 : cluster [WRN] overall HEALTH_WARN 3/7 mons down, quorum oxygene,phosphore,soufre,macaret
> 2018-07-15 12:56:55.456239 mon.oxygene mon.3 172.22.0.16:6789/0 57 : cluster [WRN] daemon mds.fluor is not responding, replacing it as rank 0 with standby daemon mds.brome
> 2018-07-15 12:56:55.456365 mon.oxygene mon.3 172.22.0.16:6789/0 58 : cluster [INF] Standby daemon mds.chlore is not responding, dropping it
> 2018-07-15 12:56:55.456486 mon.oxygene mon.3 172.22.0.16:6789/0 59 : cluster [WRN] daemon mds.brome is not responding, replacing it as rank 0 with standby daemon mds.oxygene
> 2018-07-15 12:56:55.464196 mon.oxygene mon.3 172.22.0.16:6789/0 60 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
> 2018-07-15 12:56:55.691674 mds.oxygene mds.0 172.22.0.16:6800/4212961230 1 : cluster [ERR] Error recovering journal 0x200: (5) Input/output error
> 2018-07-15 12:56:56.645914 mon.oxygene mon.3 172.22.0.16:6789/0 64 : cluster [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
>
> The log above gives a hint about journal 0x200. The underlying error
> appears much later in the logs:
>
> 2018-07-15 16:34:28.567267 osd.11 osd.11 172.22.0.20:6805/2150 21 : cluster [ERR] 6.14 full-object read crc 0x38f8faae != expected 0xed23f8df on 6:292cf221:::200.00000000:head
>
> I tried a repair and a deep-scrub on PG 6.14, with the same lack of
> result as Alessandro.
> I can't find any other error about the MDS journal 200.00000000 on the
> other OSDs, so I can't check CRCs.
>
> I'll try the next steps taken by Alessandro, but I'm in unknown
> territory...
>
> On Wednesday, 11 July 2018 at 18:10 +0300, Alessandro De Salvo
> wrote:
>> Hi,
>>
>> after the upgrade to luminous 12.2.6 today, all our MDSes have been
>> marked as damaged. Trying to restart the instances only results in
>> standby MDSes. We currently have 2 filesystems active and 2 MDSes
>> each.
>>
>> I found the following error messages in the mon:
>>
>>
>> mds.0 <node1_IP>:6800/2412911269 down:damaged
>> mds.1 <node2_IP>:6800/830539001 down:damaged
>> mds.0 <node3_IP>:6800/4080298733 down:damaged
>>
>>
>> Whenever I try to force the repaired state with ceph mds repaired
>> <fs_name>:<rank> I get something like this in the MDS logs:
>>
>>
>> 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro) error getting journal off disk
>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error
>>
>>
>> Any attempt to run the journal export results in errors, like
>> this one:
>>
>>
>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
>> Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
>>
>> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
>>
>>
>> The same happens for recover_dentries:
>>
>> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
>> Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
>> Errors:
>> 0
>>
>> Is there something I could try to get the cluster back?
>>
>> I was able to dump the contents of the metadata pool with rados
>> export -p cephfs_metadata <filename> and I'm currently trying the
>> procedure described in
>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
>> but I'm not sure if it will work, as it's apparently doing nothing at
>> the moment (maybe it's just very slow).
>>
>> Any help is appreciated, thanks!
>>
>>
>> Alessandro
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> Nicolas Huillard
> Founding Partner - Technical Director - Dolomède
>
> nhuillard@xxxxxxxxxxx
> Landline: +33 9 52 31 06 10
> Mobile: +33 6 50 27 69 08
> http://www.dolomede.fr/
>
> https://reseauactionclimat.org/planetman/
> http://climat-2020.eu/
> http://www.350.org/
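
For anyone checking whether other objects are affected, the CRC mismatch
line quoted in this thread can be picked out of a cluster log mechanically.
What follows is only a minimal Python sketch, not a Ceph tool: the regular
expression is inferred from the single "full-object read crc" line quoted
above, and the names CRC_ERR and find_crc_mismatches are hypothetical. Other
Ceph versions may word the message differently.

```python
import re

# Pattern inferred from the one OSD error line quoted in this thread.
# Assumption: other releases may format this message differently.
CRC_ERR = re.compile(
    r"\b(?P<osd>osd\.\d+) .*\[ERR\] (?P<pg>\d+\.[0-9a-f]+) "
    r"full-object read crc (?P<actual>0x[0-9a-f]+) "
    r"!= expected (?P<expected>0x[0-9a-f]+) on (?P<object>\S+)"
)


def find_crc_mismatches(lines):
    """Return one dict per log line that reports a full-object CRC mismatch."""
    hits = []
    for line in lines:
        match = CRC_ERR.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# Two lines copied from the thread: only the second is a CRC mismatch.
sample = [
    "2018-07-15 12:56:56.645914 mon.oxygene mon.3 172.22.0.16:6789/0 64 : "
    "cluster [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)",
    "2018-07-15 16:34:28.567267 osd.11 osd.11 172.22.0.20:6805/2150 21 : "
    "cluster [ERR] 6.14 full-object read crc 0x38f8faae != expected "
    "0xed23f8df on 6:292cf221:::200.00000000:head",
]

hits = find_crc_mismatches(sample)
for hit in hits:
    print(hit["osd"], hit["pg"], hit["object"],
          hit["actual"], "!=", hit["expected"])
```

In the sample, the reported object name contains 200.00000000, the same
journal header that cephfs-journal-tool reports as unreadable. To scan a real
log, pass an open file object instead of the sample list; the log location
depends on the installation.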