Sorry for the long post, but I'm trying to cover everything.
I woke up to find my cephfs filesystem down. This was in the logs:
2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.00000000:head
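(For context, as far as I understand it object 200.00000000 is the header of rank 0's MDS journal, inode 0x200, and the "2:" prefix in that log line means it lives in pool 2, my metadata pool.)

If I have the syntax right, these show which pool that is and which PG/OSDs hold the object (cephfs_metadata below is just a stand-in for whatever the metadata pool is called):

# ceph osd lspools
# ceph osd map cephfs_metadata 200.00000000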
I had one standby MDS, but as far as I can tell it did not fail over.
This was in the logs:
(insufficient standby MDS daemons available)
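I'm guessing that warning means standby_count_wanted was set higher than the number of standbys available at the time, though I'm not sure it explains the failed failover. If it helps, I believe this is how to check and adjust it:

# ceph fs get test-cephfs-1 | grep standby_count_wanted
# ceph fs set test-cephfs-1 standby_count_wanted 1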
Currently my cluster status looks like this:
  cluster:
    id:     ......................
    health: HEALTH_ERR
            1 filesystem is degraded
            1 mds daemon damaged

  services:
    mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
    mgr: ids27(active)
    mds: test-cephfs-1-0/1/1 up, 3 up:standby, 1 damaged
    osd: 5 osds: 5 up, 5 in

  data:
    pools:   3 pools, 202 pgs
    objects: 1013k objects, 4018 GB
    usage:   12085 GB used, 6544 GB / 18630 GB avail
    pgs:     201 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 0 B/s rd, 0 op/s rd, 0 op/s wr
I started trying to get the damaged MDS back online, based on this page:
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
# cephfs-journal-tool journal export backup.bin
2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200.00000000 is
unreadable
2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`
Error ((5) Input/output error)
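I haven't yet tried the object-by-object dump with rados that the error message suggests. My understanding is it would look roughly like this (again using cephfs_metadata as a stand-in for the metadata pool name), copying each 200.* journal object out before doing anything destructive:

# rados -p cephfs_metadata ls | grep '^200\.'
# for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do rados -p cephfs_metadata get "$obj" "backup.$obj"; done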
# cephfs-journal-tool event recover_dentries summary
Events by type:
2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200.00000000 is
unreadable
Errors: 0
# cephfs-journal-tool journal reset
(I think this command might have worked.)
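If I keep following that disaster-recovery page, I believe the step that goes with the journal reset is wiping the session table, which I have not run yet:

# cephfs-table-tool all reset session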
Next up, I tried to reset the filesystem:
# ceph fs reset test-cephfs-1 --yes-i-really-mean-it
Each time I get the same errors:
2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared:
MDS_DAMAGE (was: 1 mds daemon damaged)
2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27
assigned to filesystem test-cephfs-1 as rank 0
2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal
0x200: (5) Input/output error
2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds
daemon damaged (MDS_DAMAGE)
2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.00000000:head
2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1
filesystem is degraded; 1 mds daemon damaged
I tried to 'fail' mds.ds27:
# ceph mds fail ds27
# failed mds gid 1929168
The command worked, but each time I run the reset command the same errors
as above appear.
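One thing I have not tried is telling the mons the rank is repaired. From the docs I believe the command is the following (assuming I have the filesystem:rank syntax right), though I expect it will just go damaged again while the journal header object is unreadable:

# ceph mds repaired test-cephfs-1:0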
Online searches say the object read error has to be resolved, but there's
no object listed. This web page is the closest match to the issue:
http://tracker.ceph.com/issues/20863
It recommends fixing the error by hand. I tried running a deep scrub on pg
2.4; it completes, but I still have the same issue as above.
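What I'm considering next, per that tracker, is locating the bad copy by hand. My rough plan (please tell me if this is wrong) is:

# rados list-inconsistent-obj 2.4 --format=json-pretty
# ceph pg repair 2.4

though since pg 2.4 still reports active+clean after the deep scrub, I suspect the first command will come back empty.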
The final option is to attempt removing mds.ds27. If mds.ds29 was a standby
and has the data, it should become live. If it was not, I assume we will
lose the filesystem at this point.
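Before going down that road I'll double-check which daemons the mons actually consider standbys. I believe this dumps the full FSMap including the standby list:

# ceph fs dump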
Why didn't the standby MDS fail over?
Just looking for any way to recover the cephfs, thanks!