Re: MDS corrupt (also RADOS-level copy?)

Forgot to say: as for your corrupt rank 0, you should check the logs at a higher debug level. It looks like you were less lucky than we were. Your journal position may be incorrect; this could be fixed by editing the journal header. You might also try telling your MDS to skip corrupt entries. None of these operations are safe, though.
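For reference, inspecting and editing the journal header is done with cephfs-journal-tool, roughly like this (a sketch only; the offset is a placeholder, and you'd want a backup of the journal first):

# cephfs-journal-tool --rank=cephfs2:0 header get
# cephfs-journal-tool --rank=cephfs2:0 header set expire_pos <new-offset>

Skipping corrupt entries during replay is, as far as I know, controlled by the mds_log_skip_corrupt_events option; check "ceph config help mds_log_skip_corrupt_events" on your release before relying on it.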


On 31/05/2023 16:41, Janek Bevendorff wrote:
Hi Jake,

Very interesting. This sounds very much like what we have been experiencing the last two days. We also had a sudden fill-up of the metadata pool, which repeated last night. See my question here: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/

I also noticed that I couldn't dump the current journal using cephfs-journal-tool, as it would eat up all my RAM (probably not surprising with a journal that seems to be filling up a 16 TiB pool).
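If you only need the journal's extent and integrity rather than a full dump, these are much lighter on memory (sketch; adjust the fs name and rank):

# cephfs-journal-tool --rank=cephfs2:0 journal inspect
# cephfs-journal-tool --rank=cephfs2:0 header get

"journal inspect" reports whether the journal is readable, and "header get" prints the current trimmed/expire/write positions as JSON.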

Note: I did NOT need to reset the journal (and you probably don't need to either). I did, however, have to add extra capacity and balance out the data. After an MDS restart, the pool quickly cleared out again. The first MDS restart took an hour or so, and I had to increase the MDS beacon grace period (mds_beacon_grace), otherwise the MONs kept killing the MDS during the resolve phase. I set it to 1600 to be on the safe side.
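For the record, I bumped it roughly like this (a sketch; I set it globally since both the MONs and the MDSs consult it):

# ceph config set global mds_beacon_grace 1600

Remember to drop it back to the default (15) once the MDS is stable again, or genuinely stuck daemons will take much longer to be replaced.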

While your MDSs are recovering, you may want to set debug_mds to 10 for one of them and check the logs. Mine were being spammed with snapshot-related messages, but I cannot really make sense of them. Still hoping for a reply on the ML.
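You can change the debug level on a running daemon without a restart, something like this (substitute your MDS name):

# ceph tell mds.wilma-s3 config set debug_mds 10
(then check /var/log/ceph/ceph-mds.*.log)
# ceph tell mds.wilma-s3 config set debug_mds 1/5

1/5 is the default; level 10 is very chatty, so don't leave it on longer than needed.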

In any case, once you have recovered, I recommend you adjust the weights of some of your OSDs to be much lower than the others as a temporary safeguard. That way, only some OSDs would fill up and trigger your FULL watermark should this thing repeat, rather than the whole pool at once.
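The override weight can be adjusted on the fly and reverted later (sketch; the OSD ID is a placeholder):

# ceph osd reweight 12 0.2
# ceph osd reweight 12 1.0    (revert once the situation is resolved)

Note that "ceph osd reweight" sets the temporary override weight rather than the persistent CRUSH weight ("ceph osd crush reweight"), which is what you want for a temporary safeguard.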

Janek


On 31/05/2023 16:13, Jake Grimmett wrote:
Dear All,

We are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:

<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>

Symptoms: our MDS SSD pool (2 TB) filled completely over the weekend (it normally uses less than 400 GB), resulting in an MDS crash.

We added 4 extra SSDs to increase the pool capacity to 3.5 TB; however, the MDS did not recover.

# ceph fs status
cephfs2 - 0 clients
=======
RANK   STATE     MDS     ACTIVITY   DNS    INOS   DIRS   CAPS
 0     failed
 1    resolve  wilma-s3            8065   8063   8047      0
 2    resolve  wilma-s2             901k   802k  34.4k     0
      POOL         TYPE     USED  AVAIL
    mds_ssd      metadata  2296G  3566G
primary_fs_data    data       0   3566G
    ec82pool       data    2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then immediately fail.

Following the disaster-recovery-experts guide, the first step we took was to export the MDS journals, e.g.:

# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

So far so good; however, when we try to back up the final MDS journal, the process consumes all available RAM (470 GB) and has to be killed after 14 minutes.

# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 event recover_dentries summary"

At this point, we tried to follow the instructions and make a RADOS-level copy of the journal data; however, the link in the docs doesn't explain how to do this and just points to <http://tracker.ceph.com/issues/9902>
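As far as we understand it, the journal for rank N lives in the metadata pool as objects whose names start with the journal inode number (0x200 + rank) in hex, i.e. 202.* for rank 2. So a raw copy might look something like this (untested sketch; pool name taken from our "ceph fs status" output above):

# mkdir /root/journal.2
# rados -p mds_ssd ls | grep '^202\.' | while read obj; do rados -p mds_ssd get "$obj" /root/journal.2/"$obj"; done

This copies each journal object out as a plain file without going through cephfs-journal-tool, so it should not have the RAM problem.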

We are now tempted to reset the journal on MDS 2, but wanted to get a feeling from others for how dangerous this could be?
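For the record, the command we are considering (from the disaster-recovery guide) is:

# cephfs-journal-tool --rank=cephfs2:2 journal reset

Our understanding is that any events not yet flushed to the backing store are lost, which is why the guide has you run "event recover_dentries" first -- exactly the step that OOMs for us. The guide also follows the reset with "cephfs-table-tool all reset session", which we'd presumably need too.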

We have a backup, but as there is 1.8PB of data, it's going to take a few weeks to restore....

any ideas gratefully received.

Jake


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



