On Sunday, May 23, 2021, 01:16:12 AM GMT+8, Eugen Block <eblock@xxxxxx> wrote:

> Awesome! I'm glad it worked out this far! At least you have a working
> filesystem now, even if it means that you may have to use a backup.
> But now I can say it: having only three OSDs is really not the best
> idea. ;-) Are all those OSDs on the same host?

1. To be on the safe side I ran a full deep scrub:

   ceph osd deep-scrub all

ceph -w shows no errors, only the following line repeating:

   2021-05-23 01:00:00.003140 mon.a [INF] overall HEALTH_OK
   2021-05-23 02:00:00.007661 mon.a [INF] overall HEALTH_OK

That is, everything in the cluster is clean.

2. I take daily rsync-based backups. I am still not sure what the removed metadata object represented.

3. I have allocated three (3) separate machines for the Ceph cluster; that is, I have 3 separate instances of MON, MGR, OSD and MDS running on 3 separate machines. I agree it would be better to allocate five (5) machines with a pool size of 5; that further reduces the risk of losing quorum when one machine is already down.

I think that to avoid this kind of mess happening again I have to use data-center-grade SSDs with PLP (Power Loss Protection); mine are hard disks. The problem with data-center-grade SSDs with PLP is that they are still low in capacity and very expensive. One less expensive option is to keep the journal on a separate data-center-grade SSD with PLP, but Ceph has to guarantee that flushing/syncing the journal to the high-capacity hard disks is fail-safe. What is your understanding on this? Is it fail-safe? Any links for me to read further?

Best regards,
Sagara

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
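
A minimal sketch of the "journal on a separate SSD" layout asked about above, assuming BlueStore OSDs created with ceph-volume (with BlueStore, the role of the old Filestore journal is taken by the RocksDB DB/WAL). The device names /dev/sdb (HDD holding the data) and /dev/nvme0n1 (PLP SSD holding the DB/WAL) and the OSD id 0 are placeholders, not taken from this thread:

   # Put the object data on the HDD and the RocksDB metadata + WAL on the PLP SSD.
   # As far as I understand, BlueStore acknowledges a write only after the
   # transaction is committed to the WAL, which is why a power-loss-protected
   # device is recommended for it.
   ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1

   # Check where the DB/WAL of an OSD actually ended up (OSD 0 here):
   ceph-volume lvm list
   ceph osd metadata 0 | grep bluefs

If only --block.db is given, the WAL is placed on the same device as the DB, so a single PLP SSD covers both.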