Hi Kevin,

Are your OSDs bluestore or filestore?

-- dan

On Thu, Jul 12, 2018 at 11:30 PM Kevin <kevin@xxxxxxxxxx> wrote:
>
> Sorry for the long posting, but I am trying to cover everything.
>
> I woke up to find my cephfs filesystem down. This was in the logs:
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.00000000:head
>
> I had one standby MDS, but as far as I can tell it did not fail over.
> This was in the logs:
>
> (insufficient standby MDS daemons available)
>
> Currently my ceph status looks like this:
>
>   cluster:
>     id:     ......................
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 mds daemon damaged
>
>   services:
>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>     mgr: ids27(active)
>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>     osd: 5 osds: 5 up, 5 in
>
>   data:
>     pools:   3 pools, 202 pgs
>     objects: 1013k objects, 4018 GB
>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>     pgs:     201 active+clean
>              1   active+clean+scrubbing+deep
>
>   io:
>     client: 0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online, based on this page:
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200.00000000 is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200.00000000 is unreadable
> Errors: 0
>
> # cephfs-journal-tool journal reset
> (I think this command might have worked.)
>
> Next, I tried to reset the filesystem:
>
> # ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time, the same errors appear:
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.00000000:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is degraded; 1 mds daemon damaged
>
> I also tried to fail mds.ds27:
>
> # ceph mds fail ds27
> failed mds gid 1929168
>
> The command worked, but each time I run the reset command the same errors above appear.
>
> Online searches say the object with the read error has to be removed, but no object is listed. This page is the closest match to the issue:
> http://tracker.ceph.com/issues/20863
> It recommends fixing the error by hand. I tried running a deep scrub on pg 2.4; it completes, but the issue above remains.
>
> The final option is to attempt removing mds.ds27. If mds.ds29 was a standby and has the data, it should become live; if it was not, I assume we will lose the filesystem at that point.
>
> Why didn't the standby MDS fail over?
>
> Just looking for any way to recover the cephfs. Thanks!
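A few command sketches that relate to the steps above; anything marked as a guess is an assumption, not a value taken from this cluster.

On the bluestore-vs-filestore question, each OSD reports its backend in its own metadata, so something like this should answer it (osd.1 is just the OSD from the first log line):

# ceph osd metadata 1 | grep osd_objectstore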
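The unreadable object 200.00000000 is the header of the rank-0 MDS journal (journal inode 0x200), which lives in the metadata pool. To see which PG and OSDs serve it, ceph osd map can be used; "cephfs_metadata" below is a placeholder for the real metadata pool name:

# ceph osd map cephfs_metadata 200.00000000    # pool name is a guess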
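The object-by-object dump that cephfs-journal-tool suggests would look roughly like this, with the same pool-name caveat; the read may fail with the same EIO the MDS sees if the corrupted copy is the one on the primary OSD:

# rados -p cephfs_metadata get 200.00000000 /tmp/200.00000000.bin    # pool name is a guess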
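After the deep scrub of pg 2.4, the scrub findings can be listed, and if the PG is flagged inconsistent a repair can be attempted. Note that repair overwrites replicas from the copy the OSD considers authoritative, so treat it as a last resort on a metadata pool:

# ceph pg deep-scrub 2.4
# rados list-inconsistent-obj 2.4 --format=json-pretty
# ceph pg repair 2.4    # only if the PG actually shows as inconsistent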
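On the failover question: one reading of the log lines above is that the standby did take over (mds.ds27 was assigned rank 0) but then hit the same unreadable journal, so the rank was marked damaged again. Once the journal object is readable, the damaged rank has to be cleared explicitly, e.g.:

# ceph mds repaired test-cephfs-1:0

The separate "(insufficient standby MDS daemons available)" warning is driven by the filesystem's standby_count_wanted setting, which can be inspected or changed:

# ceph fs get test-cephfs-1 | grep standby_count_wanted
# ceph fs set test-cephfs-1 standby_count_wanted 1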
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com