Re: One mds daemon damaged, filesystem is offline. How to recover?

Hi,

3. I have allocated three (3) separate machines for the Ceph cluster. That is, I have 3 separate instances of MON, MGR, OSD and MDS running on 3 separate machines.

okay, so at least those are three different hosts, although in a production environment I would strongly recommend using a dedicated MDS server. But why only three OSDs? In case of a disk failure the cluster stays in a degraded state until you recover or rebuild that one OSD on that host. If you had more disks per node, those PGs could at least be remapped to different OSDs and the cluster could recover on its own. The other thing is to put the CephFS metadata pool on SSDs; that's a common recommendation to reduce latency. And since the metadata pool is usually quite small, it wouldn't be that expensive.
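As a rough sketch of how that could look (assuming your SSD OSDs report the device class "ssd" and your metadata pool is called cephfs_metadata, adjust the names to your setup):

  # create a replicated rule that only selects OSDs with device class "ssd"
  ceph osd crush rule create-replicated replicated-ssd default host ssd
  # assign the CephFS metadata pool to that rule
  ceph osd pool set cephfs_metadata crush_rule replicated-ssd

Ceph will then remap and backfill the metadata PGs onto the SSDs on its own.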

Increasing the number of MONs to 5 is not unreasonable, although most of our customers (as well as our own cluster) are fine with 3 MONs. But increasing the pool size to 5 can or will have an impact on performance since it also increases latency: every write has to be acknowledged 5 times instead of 3. I think you'd be fine with pool size 3 (failure domain host), but you should move the metadata to SSDs and increase the overall number of OSDs.
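To verify what you currently have, something like this should do (the pool names are just the defaults from a typical CephFS setup, yours may differ):

  # show size, min_size and crush rule of all pools
  ceph osd pool ls detail
  # keep (or explicitly set) replication size 3
  ceph osd pool set cephfs_metadata size 3
  ceph osd pool set cephfs_data size 3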

There is no guarantee; you can only reduce the risk of data loss, and prepare for it with backups.



Quoting Sagara Wijetunga <sagarawmw@xxxxxxxxx>:

On Sunday, May 23, 2021, 01:16:12 AM GMT+8, Eugen Block <eblock@xxxxxx> wrote:

Awesome! I'm glad it worked out this far! At least you have a working filesystem now, even if it means that you may have to use a backup.
But now I can say it: Having only three OSDs is really not the best idea. ;-) Are all those OSDs on the same host?

1. To be on the safe side I did a full deep scrub:

ceph osd deep-scrub all

ceph -w shows no errors, only the following line repeating:

2021-05-23 01:00:00.003140 mon.a [INF] overall HEALTH_OK
2021-05-23 02:00:00.007661 mon.a [INF] overall HEALTH_OK

That is, everything in the cluster is clean.


2. I take daily rsync-based backups.
I'm still not sure what the removed metadata object represented.


3. I have allocated three (3) separate machines for the Ceph cluster. That is, I have 3 separate instances of MON, MGR, OSD and MDS running on 3 separate machines. I agree it is better to allocate five (5) different machines with pool size 5; it further reduces the risk of losing quorum if one machine is already down.

I think to avoid this kind of mess happening again I have to use data center-grade SSDs with PLP (Power Loss Protection); mine are hard disks. The issue with data center-grade SSDs with PLP is that they are still low in capacity and very expensive. One not so expensive option is to keep the journal on a separate data center-grade SSD with PLP. But Ceph would have to guarantee that flushing or syncing the journal to the high-capacity hard disks is fail safe. What's your understanding on this? Is it fail safe? Any link for me to read further?
Best regards
Sagara


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





