Hi,

First of all, I would suggest upgrading your cluster to one of the supported releases.

I think a full recovery is recommended to get the MDS back:

1. Stop the MDSes and all the clients.

2. Fail the fs:
   # ceph fs fail <fs name>

3. Back up the journal (if the command below fails, make a RADOS-level copy as described in http://tracker.ceph.com/issues/9902; a sketch follows this list). Since the MDS journal is already corrupted, we could skip this too?
   # cephfs-journal-tool --rank <fsname>:0 journal export backup.bin

4. Clean up ancillary data generated during any previous recovery:
   # cephfs-data-scan cleanup [<data pool>]

5. Recover dentries, reset the session table, and reset the journal:
   # cephfs-journal-tool --rank <fsname>:0 event recover_dentries list
   # cephfs-table-tool <fsname>:all reset session
   # cephfs-journal-tool --rank <fsname>:0 journal reset

6. Execute scan_extents on each of the x4 tools pods in parallel:
   # cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 --filesystem <fsname> <data-pool>
   # cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 --filesystem <fsname> <data-pool>
   # cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 --filesystem <fsname> <data-pool>
   # cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 --filesystem <fsname> <data-pool>

7. Execute scan_inodes on each of the x4 tools pods in parallel:
   # cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 --filesystem <fsname> <data-pool>
   # cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 --filesystem <fsname> <data-pool>
   # cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 --filesystem <fsname> <data-pool>
   # cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 --filesystem <fsname> <data-pool>

8. Run scan_links:
   # cephfs-data-scan scan_links --filesystem <fsname>

9. Mark the filesystem joinable from pod/rook-ceph-tools:
   # ceph fs set <fsname> joinable true

10. Start up the MDSes.

11. Scrub the online fs:
    # ceph tell mds.<fsname>-<active-mds[a|b]> scrub start / recursive repair

12. Check the scrub status:
    # ceph tell mds.<fsname>-<active-mds[a|b]> scrub status
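In case cephfs-journal-tool cannot even read the damaged journal in step 3, a minimal sketch of the RADOS-level copy (along the lines of the tracker ticket referenced above) could look like the following. It assumes you are recovering rank 0, whose journal objects are normally named 200.<seq> in the metadata pool, and uses <metadata-pool> as a placeholder for your metadata pool name; please verify both against your own cluster before running anything.

   # Sketch only: copy every rank-0 journal object out of the metadata pool.
   # '200.' assumes rank 0 (journal inode 0x200); <metadata-pool> is a placeholder.
   pool=<metadata-pool>
   rados -p "$pool" ls | grep '^200\.' > journal_objects.txt
   while read -r obj; do
       rados -p "$pool" get "$obj" "backup.$obj"    # one local file per journal object
   done < journal_objects.txt

This only gives you a raw copy to fall back on; it is not a replacement for the cephfs-journal-tool export.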
For more information please look into
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/

Thanks,
Kotresh H R

On Wed, Apr 26, 2023 at 3:08 AM <jack@xxxxxxxxxxxxxxxxxxx> wrote:
> Hi All,
>
> We have a CephFS cluster running Octopus with three control nodes each
> running an MDS, Monitor, and Manager on Ubuntu 20.04. The OS drive on one
> of these nodes failed recently and we had to do a fresh install, but made
> the mistake of installing Ubuntu 22.04 where Octopus is not available. We
> tried to force apt to use the Ubuntu 20.04 repo when installing Ceph so
> that it would install Octopus, but for some reason Quincy was still
> installed. We re-integrated this node and it seemed to work fine for about
> a week until our cluster reported damage to an MDS daemon and placed our
> filesystem into a degraded state.
>
>   cluster:
>     id:     692905c0-f271-4cd8-9e43-1c32ef8abd13
>     health: HEALTH_ERR
>             mons are allowing insecure global_id reclaim
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>             noout flag(s) set
>             161 scrub errors
>             Possible data damage: 24 pgs inconsistent
>             8 pgs not deep-scrubbed in time
>             4 pgs not scrubbed in time
>             6 daemons have recently crashed
>
>   services:
>     mon: 3 daemons, quorum database-0,file-server,webhost (age 12d)
>     mgr: database-0(active, since 4w), standbys: webhost, file-server
>     mds: cephfs:0/1 3 up:standby, 1 damaged
>     osd: 91 osds: 90 up (since 32h), 90 in (since 5M)
>          flags noout
>
>   task status:
>
>   data:
>     pools:   7 pools, 633 pgs
>     objects: 169.18M objects, 640 TiB
>     usage:   883 TiB used, 251 TiB / 1.1 PiB avail
>     pgs:     605 active+clean
>              23  active+clean+inconsistent
>              4   active+clean+scrubbing+deep
>              1   active+clean+scrubbing+deep+inconsistent
>
> We are not sure if the Quincy/Octopus version mismatch is the problem, but
> we are in the process of downgrading this node now to ensure all nodes are
> running Octopus. Before doing that, we ran the following commands to try
> and recover:
>
> $ cephfs-journal-tool --rank=cephfs:all journal export backup.bin
>
> $ sudo cephfs-journal-tool --rank=cephfs:all event recover_dentries summary:
>
>   Events by type:
>     OPEN: 29589
>     PURGED: 1
>     SESSION: 16
>     SESSIONS: 4
>     SUBTREEMAP: 127
>     UPDATE: 70438
>   Errors: 0
>
> $ cephfs-journal-tool --rank=cephfs:0 journal reset:
>
>   old journal was 170234219175~232148677
>   new journal start will be 170469097472 (2729620 bytes past old end)
>   writing journal head
>   writing EResetJournal entry
>   done
>
> $ cephfs-table-tool all reset session
>
> All of our MDS daemons are down and fail to restart with the following
> errors:
>
>     -3> 2023-04-20T10:25:15.072-0700 7f0465069700 -1 log_channel(cluster) log [ERR] : journal replay alloc 0x1000053af79 not in free [0x1000053af7d~0x1e8,0x1000053b35c~0x1f7,0x1000053b555~0x2,0x1000053b559~0x2,0x1000053b55d~0x2,0x1000053b561~0x2,0x1000053b565~0x1de,0x1000053b938~0x1fd,0x1000053bd2a~0x4,0x1000053bf23~0x4,0x1000053c11c~0x4,0x1000053cd7b~0x158,0x1000053ced8~0xffffac3128]
>     -2> 2023-04-20T10:25:15.072-0700 7f0465069700 -1 log_channel(cluster) log [ERR] : journal replay alloc [0x1000053af7a~0x1eb,0x1000053b35c~0x1f7,0x1000053b555~0x2,0x1000053b559~0x2,0x1000053b55d~0x2], only [0x1000053af7d~0x1e8,0x1000053b35c~0x1f7,0x1000053b555~0x2,0x1000053b559~0x2,0x1000053b55d~0x2] is in free [0x1000053af7d~0x1e8,0x1000053b35c~0x1f7,0x1000053b555~0x2,0x1000053b559~0x2,0x1000053b55d~0x2,0x1000053b561~0x2,0x1000053b565~0x1de,0x1000053b938~0x1fd,0x1000053bd2a~0x4,0x1000053bf23~0x4,0x1000053c11c~0x4,0x1000053cd7b~0x158,0x1000053ced8~0xffffac3128]
>     -1> 2023-04-20T10:25:15.072-0700 7f0465069700 -1 /build/ceph-15.2.15/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)' thread 7f0465069700 time 2023-04-20T10:25:15.076784-0700
> /build/ceph-15.2.15/src/mds/journal.cc: 1513: FAILED ceph_assert(inotablev == mds->inotable->get_version())
>
>  ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x155) [0x7f04717a3be1]
>  2: (()+0x26ade9) [0x7f04717a3de9]
>  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x67e2) [0x560feaca36f2]
>  4: (EUpdate::replay(MDSRank*)+0x42) [0x560feaca5bd2]
>  5: (MDLog::_replay_thread()+0x90c) [0x560feac393ac]
>  6: (MDLog::ReplayThread::entry()+0x11) [0x560fea920821]
>  7: (()+0x8609) [0x7f0471318609]
>  8: (clone()+0x43) [0x7f0470ee9163]
>
>      0> 2023-04-20T10:25:15.076-0700 7f0465069700 -1 *** Caught signal (Aborted) **
>  in thread 7f0465069700 thread_name:md_log_replay
>
>  ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
>  1: (()+0x143c0) [0x7f04713243c0]
>  2: (gsignal()+0xcb) [0x7f0470e0d03b]
>  3: (abort()+0x12b) [0x7f0470dec859]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f04717a3c3c]
>  5: (()+0x26ade9) [0x7f04717a3de9]
>  6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x67e2) [0x560feaca36f2]
>  7: (EUpdate::replay(MDSRank*)+0x42) [0x560feaca5bd2]
>  8: (MDLog::_replay_thread()+0x90c) [0x560feac393ac]
>  9: (MDLog::ReplayThread::entry()+0x11) [0x560fea920821]
>  10: (()+0x8609) [0x7f0471318609]
>  11: (clone()+0x43) [0x7f0470ee9163]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> At this point, we decided it's best to ask for some guidance before
> issuing any other recovery commands.
>
> Can anyone advise what we should do?
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx