Dear All,
We are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:
<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>
Symptoms: the MDS SSD pool (2TB) filled completely over the weekend
(it normally uses less than 400GB), which crashed the MDS.
We added 4 extra SSDs to increase the pool capacity to 3.5TB, but the MDS
did not recover:
# ceph fs status
cephfs2 - 0 clients
=======
RANK  STATE    MDS       ACTIVITY  DNS    INOS   DIRS   CAPS
0     failed
1     resolve  wilma-s3            8065   8063   8047   0
2     resolve  wilma-s2            901k   802k   34.4k  0
POOL             TYPE      USED   AVAIL
mds_ssd          metadata  2296G  3566G
primary_fs_data  data      0      3566G
ec82pool         data      2168T  3557T
STANDBY MDS
wilma-s1
wilma-s4
Running "ceph mds repaired 0" causes rank 0 to restart and then
immediately fail.
Following the disaster-recovery-experts guide, the first step we took was
to export the MDS journals, e.g.:
# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0
So far so good. However, when we try to back up the final MDS (rank 2), the
process consumes all available RAM (470GB) and has to be killed after 14 minutes:
# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2
Similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2:
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary
For ranks 0 and 1, we successfully ran "cephfs-journal-tool --rank=cephfs2:0
event recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1
event recover_dentries summary".
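
The per-rank export step could be wrapped in a loop with a memory cap, so that a runaway export fails on allocation instead of exhausting the host's RAM. This is only a minimal sketch and not something taken from the guide: FS_NAME, RANK_LIST, OUT_DIR, MEM_LIMIT_KB and DRY_RUN are illustrative names, and the 64 GiB cap is an arbitrary assumption. It defaults to a dry run that only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch: export each rank's journal under a virtual-memory cap.
# Assumptions (illustrative, not from the guide): the variable names below
# and the 64 GiB limit. The export command itself is the one shown above.
FS_NAME=${FS_NAME:-cephfs2}
RANK_LIST=${RANK_LIST:-"0 1 2"}
OUT_DIR=${OUT_DIR:-/root}
MEM_LIMIT_KB=${MEM_LIMIT_KB:-$((64 * 1024 * 1024))}   # 64 GiB, expressed in KiB
DRY_RUN=${DRY_RUN:-1}                                  # 1 = print commands only

export_cmd() {
    # Print the journal export command for one rank.
    local rank=$1
    echo "cephfs-journal-tool --rank=${FS_NAME}:${rank} journal export ${OUT_DIR}/backup.bin.${rank}"
}

run_capped() {
    # Run a command in a subshell with a virtual-memory limit applied.
    # If the limit cannot be set, the command is not run at all.
    ( ulimit -v "$MEM_LIMIT_KB" && "$@" )
}

for rank in $RANK_LIST; do
    cmd=$(export_cmd "$rank")
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run (cap ${MEM_LIMIT_KB} KiB): $cmd"
    else
        # Word splitting of $cmd is intentional; the paths contain no spaces.
        run_capped $cmd || echo "rank ${rank}: export failed or hit the memory cap" >&2
    fi
done
```

Setting DRY_RUN=0 would actually run the exports. The cap does not make the rank 2 export succeed; it only bounds how much memory the tool can take before it is stopped.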
At this point, we tried to follow the instructions and make a RADOS-level
copy of the journal data. However, the link in the docs doesn't
explain how to do this and just points to
<http://tracker.ceph.com/issues/9902>
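
For reference, a RADOS-level copy might look roughly like the sketch below. It rests on an assumption that should be verified with "rados ls" first: that the journal for rank N is stored in the metadata pool as objects named with the hex inode 0x200 + N as prefix (e.g. 202.00000000 for rank 2). META_POOL, DEST_DIR and the function names are hypothetical, and nothing is copied unless copy_journal_objects is called explicitly.

```bash
#!/usr/bin/env bash
# Sketch: fetch the raw journal objects for one rank from the metadata pool.
# Assumption to verify: rank N's journal objects are named
# "<hex(0x200 + N)>.<8-hex-digit index>", e.g. 202.00000000 for rank 2.
META_POOL=${META_POOL:-mds_ssd}
DEST_DIR=${DEST_DIR:-/root/journal-rados-copy}   # hypothetical destination

journal_prefix() {
    # Hex inode number assumed for the journal of the given rank.
    printf '%x' $((0x200 + $1))
}

copy_journal_objects() {
    local rank=$1 prefix obj
    prefix=$(journal_prefix "$rank")
    mkdir -p "${DEST_DIR}/${rank}"
    # List the pool once, keep only this rank's journal objects, fetch each one.
    rados -p "$META_POOL" ls | grep "^${prefix}\." | while read -r obj; do
        rados -p "$META_POOL" get "$obj" "${DEST_DIR}/${rank}/${obj}"
    done
}

# Example (not run automatically):
#   copy_journal_objects 2
```

Listing a large metadata pool can be slow, and if the rank 2 journal accounts for most of the pool growth, the destination would need correspondingly large free space.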
We are now tempted to reset the journal on MDS 2, but wanted
to get a feeling from others about how dangerous this could be.
We have a backup, but as there is 1.8PB of data, it's going to take a
few weeks to restore...
Any ideas gratefully received.
Jake
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.