On Tue, Dec 5, 2023 at 6:35 AM Patrick Begou
<Patrick.Begou@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Ok, so I've misunderstood the meaning of failure domain. If there is no
> way to request using 2 osd/node and node as failure domain, with 5 nodes
> k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a
> raid1 setup. A little bit better than replication in the point of view
> of global storage capacity.
>

I'm not sure what you mean by requesting 2 OSDs/node.

If the failure domain is set to host, then by default k/m refer to
hosts: the PGs will be spread across all OSDs on all hosts, but any
particular PG will only be present on one OSD on each host. You can get
fancy with device classes and CRUSH rules and be more specific about how
the chunks are allocated, but that is the typical behavior.

Since k/m refer to hosts, k+m must be less than or equal to the number
of hosts, or you'll have a degraded pool because there won't be enough
hosts to allocate all the chunks. It won't ever stack chunks across
multiple OSDs on the same host with that configuration.

k=2,m=2 with min_size=3 would require at least 4 hosts (k+m). It would
let you operate degraded with a single host down; with two hosts down
the PGs would become inactive, but the data would still be recoverable.
While strictly speaking only 4 hosts are required, you'd do better to
have more than that, since then the cluster can immediately recover from
the loss of a host, assuming you have sufficient space. (There's a rough
sketch of the commands below my sig.)

As you say, it is no more space-efficient than RAID1 or size=2, and it
suffers write amplification on modifications, but it does allow recovery
after the loss of up to two hosts, and you can keep operating with one
host down, which gives you a reasonable degree of high availability.

--
Rich
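
P.S. A rough sketch of what that setup might look like (untested, and
the profile/pool names are just placeholders):

   # EC profile: k=2 data + m=2 coding chunks, one chunk per host
   ceph osd erasure-code-profile set ec22profile \
       k=2 m=2 crush-failure-domain=host

   # create an erasure-coded pool using that profile
   ceph osd pool create ecpool erasure ec22profile

   # min_size should already default to k+1 (=3 here), but you can set
   # it explicitly
   ceph osd pool set ecpool min_size 3

With min_size=3 the pool keeps serving I/O with one host down and goes
inactive (but remains recoverable) with two down, which is the behavior
described above.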