Hi all,

I have the same problem here:
* during the upgrade from 12.2.5 to 12.2.6
* I restarted all the OSD servers in turn, which did not trigger anything bad
* a few minutes after upgrading the OSDs/MONs/MDSs/MGRs (all on the same set of servers) and unsetting noout, I upgraded the clients, which triggered a temporary loss of connectivity between the two datacenters

2018-07-15 12:49:09.851204 mon.brome mon.0 172.21.0.16:6789/0 98 : cluster [INF] Health check cleared: OSDMAP_FLAGS (was: noout flag(s) set)
2018-07-15 12:49:09.851286 mon.brome mon.0 172.21.0.16:6789/0 99 : cluster [INF] Cluster is now healthy
2018-07-15 12:56:26.446062 mon.soufre mon.5 172.22.0.20:6789/0 34 : cluster [INF] mon.soufre calling monitor election
2018-07-15 12:56:26.446288 mon.oxygene mon.3 172.22.0.16:6789/0 13 : cluster [INF] mon.oxygene calling monitor election
2018-07-15 12:56:26.522520 mon.macaret mon.6 172.30.0.3:6789/0 10 : cluster [INF] mon.macaret calling monitor election
2018-07-15 12:56:26.539575 mon.phosphore mon.4 172.22.0.18:6789/0 20 : cluster [INF] mon.phosphore calling monitor election
2018-07-15 12:56:36.485881 mon.oxygene mon.3 172.22.0.16:6789/0 14 : cluster [INF] mon.oxygene is new leader, mons oxygene,phosphore,soufre,macaret in quorum (ranks 3,4,5,6)
2018-07-15 12:56:36.930096 mon.oxygene mon.3 172.22.0.16:6789/0 19 : cluster [WRN] Health check failed: 3/7 mons down, quorum oxygene,phosphore,soufre,macaret (MON_DOWN)
2018-07-15 12:56:37.041888 mon.oxygene mon.3 172.22.0.16:6789/0 26 : cluster [WRN] overall HEALTH_WARN 3/7 mons down, quorum oxygene,phosphore,soufre,macaret
2018-07-15 12:56:55.456239 mon.oxygene mon.3 172.22.0.16:6789/0 57 : cluster [WRN] daemon mds.fluor is not responding, replacing it as rank 0 with standby daemon mds.brome
2018-07-15 12:56:55.456365 mon.oxygene mon.3 172.22.0.16:6789/0 58 : cluster [INF] Standby daemon mds.chlore is not responding, dropping it
2018-07-15 12:56:55.456486 mon.oxygene mon.3 172.22.0.16:6789/0 59 : cluster [WRN] daemon mds.brome is not responding, replacing it as rank 0 with standby daemon mds.oxygene
2018-07-15 12:56:55.464196 mon.oxygene mon.3 172.22.0.16:6789/0 60 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-07-15 12:56:55.691674 mds.oxygene mds.0 172.22.0.16:6800/4212961230 1 : cluster [ERR] Error recovering journal 0x200: (5) Input/output error
2018-07-15 12:56:56.645914 mon.oxygene mon.3 172.22.0.16:6789/0 64 : cluster [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)

The log above gives the hint about journal 0x200. The corresponding read error only shows up much later in the logs:

2018-07-15 16:34:28.567267 osd.11 osd.11 172.22.0.20:6805/2150 21 : cluster [ERR] 6.14 full-object read crc 0x38f8faae != expected 0xed23f8df on 6:292cf221:::200.00000000:head

I tried a repair and a deep-scrub on PG 6.14, with the same nil result as Alessandro. I can't find any other error about the MDS journal header 200.00000000 on the other OSDs, so I can't compare CRCs.

I'll try the next steps taken by Alessandro, but I'm in unknown territory...
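In case it helps anyone hitting the same thing, something along these lines should show which OSDs hold the journal header object and what the last scrub flagged as inconsistent (a rough sketch; I'm assuming the metadata pool is named cephfs_metadata as in Alessandro's message, so adjust to your own pool name):

  # Which PG / OSDs the MDS journal header object maps to
  ceph osd map cephfs_metadata 200.00000000

  # What the last deep-scrub recorded as inconsistent in that PG
  rados list-inconsistent-obj 6.14 --format=json-pretty

  # Try reading the object directly, bypassing the MDS
  # (this may return the same EIO if every replica has the bad CRC)
  rados -p cephfs_metadata get 200.00000000 ./200.00000000.bin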
On Wednesday, 11 July 2018 at 18:10 +0300, Alessandro De Salvo wrote:
> Hi,
> 
> after the upgrade to luminous 12.2.6 today, all our MDSes have been
> marked as damaged. Trying to restart the instances only results in
> standby MDSes. We currently have 2 filesystems active and 2 MDSes each.
> 
> I found the following error messages in the mon:
> 
> mds.0 <node1_IP>:6800/2412911269 down:damaged
> mds.1 <node2_IP>:6800/830539001 down:damaged
> mds.0 <node3_IP>:6800/4080298733 down:damaged
> 
> Whenever I try to force the repaired state with ceph mds repaired
> <fs_name>:<rank> I get something like this in the MDS logs:
> 
> 2018-07-11 13:20:41.597970 7ff7e010e700 0 mds.1.journaler.mdlog(ro)
> error getting journal off disk
> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
> [ERR] : Error recovering journal 0x201: (5) Input/output error
> 
> Any attempt at running the journal export results in errors, like
> this one:
> 
> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
> Error ((5) Input/output error)
> 2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
> 2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
> readable, attempt object-by-object dump with `rados`
> 
> The same happens for recover_dentries:
> 
> cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
> Events by type:
> 2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
> Errors: 0
> 
> Is there something I could try to do to get the cluster back?
> 
> I was able to dump the contents of the metadata pool with
> rados export -p cephfs_metadata <filename> and I'm currently trying
> the procedure described in
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
> but I'm not sure if it will work, as it's apparently doing nothing at
> the moment (maybe it's just very slow).
> 
> Any help is appreciated, thanks!
> 
> Alessandro

-- 
Nicolas Huillard
Founding partner - Technical Director - Dolomède
nhuillard@xxxxxxxxxxx
Landline: +33 9 52 31 06 10
Mobile: +33 6 50 27 69 08
http://www.dolomede.fr/
https://reseauactionclimat.org/planetman/
http://climat-2020.eu/
http://www.350.org/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com