Hello Ceph Users,

I was hoping to get some advice, or at least some questions answered, about the CephFS disaster recovery process detailed in the docs. My questions are as follows:

- Do all of the steps need to be performed, or can I check the status of the MDS after each one and stop as soon as it recovers?
- What does the journal truncation actually do? The name suggests it truncates part of the journal, but the warnings make it sound like it could delete unexpected data, or even delete the journal entirely.
- Where would I use the data saved by recover_dentries to rebuild the metadata?
- What sort of information would an "expert" need in order to perform a successful disaster recovery?

Beyond those questions, I was hoping for some advice on my situation and on whether I even need disaster recovery. A recent power blip reset the Ceph servers, and they came back barking about the CephFS MDS being unable to start; the status listed up:replay. On further investigation there appeared to be issues with the journal, and the MDS log showed errors during replay. A somewhat abridged log is here (abridged because it repeats the same messages): https://pastebin.com/FkypNkSZ

The main error lines, to my mind, are:

Jan 19 13:28:26 nxpmn01 ceph-mds[313765]: -3> 2022-01-19T13:28:26.091-0500 7f80a0ba7700 -1 log_channel(cluster) log [ERR] : journal replay inotablev mismatch 2 -> 2417
Jan 19 13:28:26 nxpmn01 ceph-mds[313765]: -2> 2022-01-19T13:28:26.091-0500 7f80a0ba7700 -1 log_channel(cluster) log [ERR] : EMetaBlob.replay sessionmap v 1160787 - 1 > table 0

Everything I've found online suggests I may need a journal truncation. I was hoping it wouldn't come to that, though, as I'm not an "expert" as described in the disaster recovery docs. (I've put the command sequence I'm contemplating in a P.S. below.)

Relevant info about my Ceph setup:

- 3 servers running Proxmox 6.4-13 and Ceph 15.2.10
- ceph -s returns:

    cluster:
      id:     642c8584-f642-4043-a43d-a984bbf75603
      health: HEALTH_WARN
              1 filesystem is degraded
              insufficient standby MDS daemons available
              99 daemons have recently crashed

    services:
      mon: 3 daemons, quorum nxpmn01,nxpmn02,nxpmn03 (age 5d)
      mgr: nxpmn02(active, since 9d), standbys: nxpmn03, nxpmn01
      mds: cephfs:1/1 {0=nxpmn01=up:replay(laggy or crashed)}
      osd: 18 osds: 18 up (since 5d), 18 in (since 3w)

    data:
      pools:   5 pools, 209 pgs
      objects: 4.25M objects, 16 TiB
      usage:   23 TiB used, 28 TiB / 51 TiB avail
      pgs:     209 active+clean

- All OSDs are up and in
- To my knowledge the filesystem has only one rank (rank 0)

Thanks
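
P.S. For concreteness, here is my current reading of the relevant tooling from the disaster recovery docs; please correct me if I've misunderstood anything. Before doing anything destructive, I gather the first step is to take a backup of the journal and inspect it for damage (the --rank value below matches my filesystem name "cephfs" and its single rank 0; backup.bin is just the example filename):

    # Export a backup of the journal before attempting any recovery
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

    # Report on the journal's integrity without modifying anything
    cephfs-journal-tool --rank=cephfs:0 journal inspect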
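If it does come to truncation, my understanding of the docs is that the sequence would be roughly the following. I'm hesitant to run it without confirmation, since the reset steps are the destructive part:

    # Scan recoverable events in the journal and write the dentries
    # they contain back into the metadata pool
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

    # Truncate the journal, discarding any entries not already written
    # back by recover_dentries
    cephfs-journal-tool --rank=cephfs:0 journal reset

    # Given the "sessionmap" replay error above, I assume the session
    # table would also need to be wiped
    cephfs-table-tool all reset session

Is that roughly right, and given the errors above, is it actually needed in my case?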