Hi Kevin,

Are your OSDs bluestore or filestore?

-- dan

On Thu, Jul 12, 2018 at 11:30 PM Kevin <kevin@xxxxxxxxxx> wrote:
>
> Sorry for the long posting, but I am trying to cover everything.
>
> I woke up to find my cephfs filesystem down. This was in the logs:
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.00000000:head
>
> I had one standby MDS, but as far as I can tell it did not fail over.
> This was in the logs:
>
> (insufficient standby MDS daemons available)
>
> Currently my ceph status looks like this:
>
>   cluster:
>     id:     ......................
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 mds daemon damaged
>
>   services:
>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>     mgr: ids27(active)
>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>     osd: 5 osds: 5 up, 5 in
>
>   data:
>     pools:   3 pools, 202 pgs
>     objects: 1013k objects, 4018 GB
>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>     pgs:     201 active+clean
>              1   active+clean+scrubbing+deep
>
>   io:
>     client: 0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online, based on this page:
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200.00000000 is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200.00000000 is unreadable
> Errors: 0
>
> # cephfs-journal-tool journal reset
> (I think this command might have worked.)
>
> Next, I tried to reset the filesystem:
>
> # ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time, the same errors appear:
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.00000000:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is degraded; 1 mds daemon damaged
>
> I also tried to fail mds.ds27:
>
> # ceph mds fail ds27
> failed mds gid 1929168
>
> The command worked, but each time I run the reset command the same errors above appear.
>
> Online searches say the object with the read error has to be removed, but no object is listed. This page is the closest match to the issue:
> http://tracker.ceph.com/issues/20863
> It recommends fixing the error by hand. I tried running a deep scrub on pg 2.4; it completes, but the issue above remains.
>
> The final option is to attempt removing mds.ds27. If mds.ds29 was a standby and has the data, it should become live; if it was not, I assume we will lose the filesystem at that point.
>
> Why didn't the standby MDS fail over?
>
> Just looking for any way to recover the cephfs. Thanks!
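A few command sketches that relate to the steps above; anything marked as a guess is an assumption, not a value taken from this cluster.

On the bluestore-vs-filestore question, each OSD reports its backend in its own metadata, so something like this should answer it (osd.1 is just the OSD from the first log line):

# ceph osd metadata 1 | grep osd_objectstore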
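The unreadable object 200.00000000 is the header of the rank-0 MDS journal (journal inode 0x200), which lives in the metadata pool. To see which PG and OSDs serve it, ceph osd map can be used; "cephfs_metadata" below is a placeholder for the real metadata pool name:

# ceph osd map cephfs_metadata 200.00000000    # pool name is a guess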
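The object-by-object dump that cephfs-journal-tool suggests would look roughly like this, with the same pool-name caveat; the read may fail with the same EIO the MDS sees if the corrupted copy is the one on the primary OSD:

# rados -p cephfs_metadata get 200.00000000 /tmp/200.00000000.bin    # pool name is a guess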
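After the deep scrub of pg 2.4, the scrub findings can be listed, and if the PG is flagged inconsistent a repair can be attempted. Note that repair overwrites replicas from the copy the OSD considers authoritative, so treat it as a last resort on a metadata pool:

# ceph pg deep-scrub 2.4
# rados list-inconsistent-obj 2.4 --format=json-pretty
# ceph pg repair 2.4    # only if the PG actually shows as inconsistent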
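On the failover question: one reading of the log lines above is that the standby did take over (mds.ds27 was assigned rank 0) but then hit the same unreadable journal, so the rank was marked damaged again. Once the journal object is readable, the damaged rank has to be cleared explicitly, e.g.:

# ceph mds repaired test-cephfs-1:0

The separate "(insufficient standby MDS daemons available)" warning is driven by the filesystem's standby_count_wanted setting, which can be inspected or changed:

# ceph fs get test-cephfs-1 | grep standby_count_wanted
# ceph fs set test-cephfs-1 standby_count_wanted 1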
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com