Awesome! I'm glad it worked out this far! At least you have a working
filesystem now, even if it means that you may have to use a backup.
But now I can say it: Having only three OSDs is really not the best
idea. ;-) Are all those OSDs on the same host?
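If you want to double-check that, the output of
ceph osd tree
shows the CRUSH hierarchy, i.e. which host each OSD sits under.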
Quoting Sagara Wijetunga <sagarawmw@xxxxxxxxx>:
Hi Eugen
Ceph is now HEALTH_OK.
I think what we need to do now is:
1. Get MDS rank 0 to recover, discarding part of the object
200.00006048 if necessary, and bring MDS.0 back up.
Yes, I agree, I just can't tell what the best way is here. Maybe
remove all three objects from the disks (make a backup before doing
that, just in case) and then try the steps to recover the journal
(also make a backup of the journal first); see the sketch below for
locating and backing up the object, followed by the journal recovery
steps.
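Something along these lines should work for locating and backing up
the object before touching anything; note that the metadata pool name
used here (cephfs_metadata) is only an assumption, use whatever your
metadata pool is actually called:
mds01:~ # ceph osd map cephfs_metadata 200.00006048    # shows which PG and OSDs hold the object
mds01:~ # rados -p cephfs_metadata get 200.00006048 ./200.00006048.bak    # keep a copy
Then the journal recovery steps: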
mds01:~ # systemctl stop ceph-mds@mds01.service
mds01:~ # cephfs-journal-tool journal export myjournal.bin
mds01:~ # cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
mds01:~ # cephfs-journal-tool --rank=cephfs:0 journal reset
mds01:~ # cephfs-table-tool all reset session
mds01:~ # systemctl start ceph-mds@mds01.service
mds01:~ # ceph mds repaired 0
mds01:~ # ceph daemon mds.mds01 scrub_path / recursive repair
Only the last step above failed as follows:
# ceph daemon mds.a scrub_path / recursive repair
"mds_not_active"
failed
But the ceph -w showed:
2021-05-22 23:30:00.199164 mon.a [INF] Health check cleared:
MDS_DAMAGE (was: 1 mds daemon damaged)
2021-05-22 23:30:00.208558 mon.a [INF] Standby daemon mds.c assigned
to filesystem cephfs as rank 0
2021-05-22 23:30:00.208614 mon.a [INF] Health check cleared:
MDS_ALL_DOWN (was: 1 filesystem is offline)
2021-05-22 23:30:04.029282 mon.a [INF] daemon mds.c is now active in
filesystem cephfs as rank 0
2021-05-22 23:30:04.378670 mon.a [INF] Health check cleared:
FS_DEGRADED (was: 1 filesystem is degraded)
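mds.c has taken over rank 0 now, so I assume the scrub would have to
be issued against that daemon rather than mds.a, e.g.:
ceph daemon mds.c scrub_path / recursive repair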
Since most errors were fixed, I tried to repair PG 2.44:
ceph pg repair 2.44
ceph -w
2021-05-23 00:00:00.009926 mon.a [ERR] overall HEALTH_ERR 4 scrub
errors; Possible data damage: 1 pg inconsistent
2021-05-23 00:01:17.454975 mon.a [INF] Health check cleared:
OSD_SCRUB_ERRORS (was: 4 scrub errors)
2021-05-23 00:01:17.454993 mon.a [INF] Health check cleared:
PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
2021-05-23 00:01:17.455002 mon.a [INF] Cluster is now healthy
2021-05-23 00:01:13.544097 osd.0 [ERR] 2.44 repair : stat mismatch,
got 108/109 objects, 0/0 clones, 108/109 dirty, 108/109 omap, 0/0
pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/1555896 bytes, 0/0
manifest objects, 0/0 hit_set_archive bytes.
2021-05-23 00:01:13.544154 osd.0 [ERR] 2.44 repair 1 errors, 1 fixed
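In hindsight, running something like
rados list-inconsistent-obj 2.44 --format=json-pretty
before the repair, while the scrub errors were still recorded, would
presumably have shown exactly which object in 2.44 was damaged.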
# ceph -s
  cluster:
    id:     abc...
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 22h)
    mgr: a(active, since 22h), standbys: b, c
    mds: cephfs:1 {0=c=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 22h), 3 in (since 22h)

  task status:
    scrub status:
        mds.c: idle

  data:
    pools:   3 pools, 192 pgs
    objects: 281.06k objects, 327 GiB
    usage:   2.4 TiB used, 8.1 TiB / 11 TiB avail
    pgs:     192 active+clean
I mounted the CephFS as before and tried the following:
cephfs-data-scan pg_files /mnt/ceph/Home/sagara 2.44
But it complains about an invalid path. I'm trying to see which files
are affected by the missing object in PG 2.44.
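If I understand the tool correctly, pg_files may expect a path
relative to the CephFS root rather than the local mount point, so
something like this might be worth a try (assuming /Home/sagara is the
in-filesystem path corresponding to the mount above):
cephfs-data-scan pg_files /Home/sagara 2.44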
Thank you very much for helping this far.
But I would still like to understand whether any files were affected by this disaster.
Best regards
Sagara
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx