Forgot to say: as for your corrupt rank 0, you should check the logs
with a higher debug level. It looks like you were less lucky than we were.
Your journal position may be incorrect. This could be fixed by editing
the journal header. You might also try to tell your MDS to skip corrupt
entries. None of these operations are safe, though.
On 31/05/2023 16:41, Janek Bevendorff wrote:
Hi Jake,
Very interesting. This sounds very much like what we have been
experiencing the last two days. We also had a sudden fill-up of the
metadata pool, which repeated last night. See my question here:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/
I also noticed that I couldn't dump the current journal using the
cephfs-journal-tool, as it would eat up all my RAM (probably not
surprising with a journal that seems to be filling up a 16TiB pool).
Note: I did NOT need to reset the journal (and you probably don't need
to either). I did, however, have to add extra capacity and balance out
the data. After an MDS restart, the pool quickly cleared out again.
The first MDS restart took an hour or so and I had to increase the MDS
lag timeout (mds_beacon_grace), otherwise the MONs kept killing the
MDS during the resolve phase. I set it to 1600 to be on the safe side.
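If you want to go the same route, this is a minimal sketch of how to raise
it -- I'm assuming you set it globally via the config store, since it is
the MONs that do the killing:
# ceph config set global mds_beacon_grace 1600
Remember to drop it back to the default once the cluster has settled.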
While your MDS are recovering, you may want to set debug_mds to 10 for
one of your MDS and check the logs. My logs were being spammed with
snapshot-related messages, but I cannot really make sense of them.
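Something like this should do it for a single daemon -- wilma-s3 is just an
example name here, and you will want to turn it back down afterwards
because the logs grow very quickly at that level:
# ceph tell mds.wilma-s3 config set debug_mds 10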
Still hoping for a reply on the ML.
In any case, once you are recovered, I recommend you adjust the
weights of some of your OSDs to be much lower than others as a
temporary safeguard. That way, only some OSDs would fill up and trigger
the FULL watermark should this happen again, leaving you headroom elsewhere.
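As a rough sketch of what I mean (the OSD IDs are made up, pick a few of
your own):
# ceph osd reweight 12 0.5
# ceph osd reweight 13 0.5
This only changes the reweight value, not the CRUSH weight, so it is easy
to revert later with "ceph osd reweight <id> 1.0".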
Janek
On 31/05/2023 16:13, Jake Grimmett wrote:
Dear All,
we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:
<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>
Symptoms: the MDS SSD pool (2TB) filled completely over the weekend (it
normally uses less than 400GB), resulting in an MDS crash.
We added 4 extra SSDs to increase pool capacity to 3.5TB; however, the
MDS did not recover.
# ceph fs status
cephfs2 - 0 clients
=======
RANK   STATE      MDS       ACTIVITY   DNS     INOS    DIRS   CAPS
 0     failed
 1     resolve   wilma-s3              8065    8063    8047      0
 2     resolve   wilma-s2              901k    802k   34.4k      0
        POOL          TYPE      USED    AVAIL
      mds_ssd        metadata   2296G   3566G
  primary_fs_data    data           0   3566G
     ec82pool        data       2168T   3557T
STANDBY MDS
 wilma-s1
 wilma-s4
setting "ceph mds repaired 0" causes rank 0 to restart, and then
immediately fail.
Following the disaster-recovery-experts guide, the first step we did
was to export the MDS journals, e.g:
# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0
So far so good; however, when we try to back up the journal of the final
rank, the process consumes all available RAM (470GB) and needs to be
killed after 14 minutes.
# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2
similarly, "recover_dentries summary" consumes all RAM when applied
to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary
We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1
event recover_dentries summary"
At this point, we tried to follow the instructions and make a RADOS-level
copy of the journal data; however, the link in the docs doesn't explain
how to do this and just points to
<http://tracker.ceph.com/issues/9902>.
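In case it is useful, this is the rough approach we are considering for
the RADOS-level copy -- purely a sketch, assuming the rank 2 journal
objects follow the usual 202.* naming in the metadata pool (mds_ssd);
please correct us if that prefix is wrong:
# mkdir /root/journal.2
# rados -p mds_ssd ls | grep '^202\.' > /root/journal.2/objects.txt
# while read obj; do rados -p mds_ssd get "$obj" "/root/journal.2/$obj"; done < /root/journal.2/objects.txt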
We are now tempted to reset the journal on MDS 2, but wanted to get a
feel from others for how dangerous this could be.
We have a backup, but as there is 1.8PB of data, it's going to take a
few weeks to restore....
Any ideas gratefully received.
Jake
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx