Hi,
basically, with EC pools you usually have a min_size of k + 1 to
prevent data loss. There was a thread about that just a few days ago
on this list. So in your case your min_size is probably 9, which
pauses IO as soon as two chunks become unavailable. If your crush
failure domain is host (it seems like it is) and you have "only" 10
hosts, I'd recommend adding a host if possible so the cluster can
fully recover while one host is down. Otherwise the PGs stay degraded
until the host comes back.
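To verify that, you can check the pool's min_size and the crush
rule's failure domain, for example (replace <your-ec-pool> with your
actual pool name):

  ceph osd pool ls detail                   # shows size, min_size and crush_rule per pool
  ceph osd pool get <your-ec-pool> min_size
  ceph osd crush rule dump                  # the chooseleaf step should show "type": "host"
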
So in your case your cluster can handle only one down host, e.g. for
maintenance. If another host goes down (disk, network, whatever) you
hit the min_size limit. You can temporarily set min_size = k, but you
shouldn't take any unnecessary risks and should increase it back to
k + 1 after a successful recovery. It's not possible to change the EC
profile of an existing pool; you'd have to create a new pool with the
new profile and copy the data over.
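Just as a rough sketch (pool and profile names below are only
placeholders for yours), the temporary min_size change and a new pool
with a different profile would look something like this:

  # only while recovering, with k=8:
  ceph osd pool set <your-ec-pool> min_size 8
  # after recovery has finished, back to k + 1:
  ceph osd pool set <your-ec-pool> min_size 9

  # if you ever want a different profile (e.g. 8+3), create a new
  # profile and a new pool and migrate the data:
  ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
  ceph osd pool create <new-pool> erasure ec-8-3
  # then copy the data, e.g. with "rados cppool" or on the client
  # side, depending on how the pool is used (RBD, CephFS, RGW).
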
Check out the EC docs [1] for more details.
Regards,
Eugen
[1]
https://docs.ceph.com/en/quincy/rados/operations/erasure-code/?highlight=k%2B1#erasure-coded-pool-recovery
Quoting Nguyễn Hữu Khôi <nguyenhuukhoinw@xxxxxxxxx>:
Hello guys.
I see many docs and threads talking about OSD failures. I have a
question: how many nodes in a cluster can fail?
I am using EC 8+2 (10 OSD nodes), and when I shut down 2 nodes my
cluster crashes and cannot write anymore.
Thank you. Regards
Nguyen Huu Khoi
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx