Re: MDS corrupt (also RADOS-level copy?)

Forgot to say: as for your corrupt rank 0, you should check the logs at a higher debug level. It looks like you were less lucky than we were. Your journal position may be incorrect; this could be fixed by editing the journal header. You might also try telling your MDS to skip corrupt entries. None of these operations are safe, though.
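For reference, inspecting and editing the journal header is done with cephfs-journal-tool, roughly like this (a sketch only; the offset is a placeholder, and you'd want a backup of the journal first):

# cephfs-journal-tool --rank=cephfs2:0 header get
# cephfs-journal-tool --rank=cephfs2:0 header set expire_pos <new-offset>

Skipping corrupt entries during replay is, as far as I know, controlled by the mds_log_skip_corrupt_events option; check "ceph config help mds_log_skip_corrupt_events" on your release before relying on it.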


On 31/05/2023 16:41, Janek Bevendorff wrote:
Hi Jake,

Very interesting. This sounds very much like what we have been experiencing the last two days. We also had a sudden fill-up of the metadata pool, which repeated last night. See my question here: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/

I also noticed that I couldn't dump the current journal using cephfs-journal-tool, as it would eat up all my RAM (probably not surprising with a journal that seems to be filling up a 16 TiB pool).
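If you only need the journal's extent and integrity rather than a full dump, these are much lighter on memory (sketch; adjust the fs name and rank):

# cephfs-journal-tool --rank=cephfs2:0 journal inspect
# cephfs-journal-tool --rank=cephfs2:0 header get

"journal inspect" reports whether the journal is readable, and "header get" prints the current trimmed/expire/write positions as JSON.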

Note: I did NOT need to reset the journal (and you probably don't need to either). I did, however, have to add extra capacity and balance out the data. After an MDS restart, the pool quickly cleared out again. The first MDS restart took an hour or so, and I had to increase the MDS beacon grace period (mds_beacon_grace), otherwise the MONs kept killing the MDS during the resolve phase. I set it to 1600 to be on the safe side.
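For the record, I bumped it roughly like this (a sketch; I set it globally since both the MONs and the MDSs consult it):

# ceph config set global mds_beacon_grace 1600

Remember to drop it back to the default (15) once the MDS is stable again, or genuinely stuck daemons will take much longer to be replaced.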

While your MDSs are recovering, you may want to set debug_mds to 10 for one of them and check the logs. Mine were being spammed with snapshot-related messages, but I cannot really make sense of them. Still hoping for a reply on the ML.
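You can change the debug level on a running daemon without a restart, something like this (substitute your MDS name):

# ceph tell mds.wilma-s3 config set debug_mds 10
(then check /var/log/ceph/ceph-mds.*.log)
# ceph tell mds.wilma-s3 config set debug_mds 1/5

1/5 is the default; level 10 is very chatty, so don't leave it on longer than needed.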

In any case, once you have recovered, I recommend you adjust the weights of some of your OSDs to be much lower than the others as a temporary safeguard. That way, only some OSDs would fill up and trigger your FULL watermark should this thing repeat, rather than the whole pool at once.
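The override weight can be adjusted on the fly and reverted later (sketch; the OSD ID is a placeholder):

# ceph osd reweight 12 0.2
# ceph osd reweight 12 1.0    (revert once the situation is resolved)

Note that "ceph osd reweight" sets the temporary override weight rather than the persistent CRUSH weight ("ceph osd crush reweight"), which is what you want for a temporary safeguard.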

Janek


On 31/05/2023 16:13, Jake Grimmett wrote:
Dear All,

We are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:

<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>

Symptoms: our MDS SSD pool (2 TB) filled completely over the weekend (it normally uses less than 400 GB), resulting in an MDS crash.

We added 4 extra SSDs to increase the pool capacity to 3.5 TB; however, the MDS did not recover.

# ceph fs status
cephfs2 - 0 clients
=======
RANK   STATE     MDS     ACTIVITY   DNS    INOS   DIRS   CAPS
 0     failed
 1    resolve  wilma-s3            8065   8063   8047      0
 2    resolve  wilma-s2             901k   802k  34.4k     0
      POOL         TYPE     USED  AVAIL
    mds_ssd      metadata  2296G  3566G
primary_fs_data    data       0   3566G
    ec82pool       data    2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then immediately fail.

Following the disaster-recovery-experts guide, the first step we took was to export the MDS journals, e.g.:

# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

So far so good; however, when we try to back up the final MDS journal, the process consumes all available RAM (470 GB) and has to be killed after 14 minutes.

# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 event recover_dentries summary"

At this point, we tried to follow the instructions and make a RADOS-level copy of the journal data; however, the link in the docs doesn't explain how to do this and just points to <http://tracker.ceph.com/issues/9902>
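As far as we understand it, the journal for rank N lives in the metadata pool as objects whose names start with the journal inode number (0x200 + rank) in hex, i.e. 202.* for rank 2. So a raw copy might look something like this (untested sketch; pool name taken from our "ceph fs status" output above):

# mkdir /root/journal.2
# rados -p mds_ssd ls | grep '^202\.' | while read obj; do rados -p mds_ssd get "$obj" /root/journal.2/"$obj"; done

This copies each journal object out as a plain file without going through cephfs-journal-tool, so it should not have the RAM problem.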

We are now tempted to reset the journal on MDS 2, but wanted to get a feeling from others for how dangerous this could be?
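For the record, the command we are considering (from the disaster-recovery guide) is:

# cephfs-journal-tool --rank=cephfs2:2 journal reset

Our understanding is that any events not yet flushed to the backing store are lost, which is why the guide has you run "event recover_dentries" first -- exactly the step that OOMs for us. The guide also follows the reset with "cephfs-table-tool all reset session", which we'd presumably need too.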

We have a backup, but as there is 1.8PB of data, it's going to take a few weeks to restore....

any ideas gratefully received.

Jake


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



