Hi Frank,
Thanks for this detailed answer. About your point that 4+2 or similar schemes
defeat the purpose of a 3-datacenter configuration: you're right in
principle. In our case, the goal is to avoid any impact for replicated
pools (in particular RBD for the cloud), but it may be acceptable for some
pools to be read-only during a short period. Still, I'll explore your
alternative k+m scenarios, as some may be interesting.
I'm also interested in feedback from experience with LRC EC, even though I
don't think it changes the problem of resilience to a DC failure.
Best regards,
Michel
Sent from my mobile
On 3 April 2023 at 21:57:41, Frank Schilder <frans@xxxxxx> wrote:
Hi Michel,
failure domain = datacenter doesn't work, because crush wants to put 1
shard per failure domain and you have 3 data centers, not 6. The
modified crush rule you wrote should work, I believe equally well with x=0
or x=2 -- but try it out before doing anything to your cluster.
The easiest way to test non-destructively is to download the osdmap from
your cluster and extract the crush map from it. You can then, *without*
modifying your cluster, update the crush map in the (off-line) copy of the
OSD map and let it compute mappings (the commands for all of this are in
the ceph docs, look for osdmaptool). You can then check whether these
mappings are as you want. There was an earlier case where someone posted a
script to confirm mappings automatically; I used some awk magic, it's not
that difficult.
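As a rough sketch of that workflow (untested here, adjust file names and
the pool id to your setup):

  # grab the current OSD map and extract + decompile its crush map
  ceph osd getmap -o osdmap.bin
  osdmaptool osdmap.bin --export-crush crush.bin
  crushtool -d crush.bin -o crush.txt

  # edit crush.txt (add the modified rule), recompile, and inject it
  # into the *off-line* copy of the OSD map
  crushtool -c crush.txt -o crush.new
  osdmaptool osdmap.bin --import-crush crush.new

  # dump the computed PG mappings for the pool you care about
  osdmaptool osdmap.bin --test-map-pgs-dump --pool <pool-id>

Only the local osdmap.bin copy is modified; the cluster is never touched.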
As a word of warning: if you want to be failure resistant, don't use 4+2.
It's not worth the effort of having 3 data centers. In case you lose one
DC, you have only 4 shards left, and the pool becomes read-only.
Don't even consider setting min_size=4; it again completely defeats the
purpose of having 3 DCs in the first place.
The smallest profile you can use that will ensure RW access in case of a DC
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards
left, which is equivalent to 5+1, so you keep RW access with 1 DC down.
However, k=5 is a prime number, which has a negative performance impact;
ideal values for k are powers of 2. The alternative is k=4, m=5 (44% usable
capacity) with good performance but higher redundancy overhead.
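For k=4, m=5 that could look roughly like this (profile/pool names and the
PG count are placeholders, and the autogenerated rule still has to be
replaced by the multi-step rule discussed above, here with 3 shards per DC):

  ceph osd erasure-code-profile set ec-4-5 k=4 m=5 crush-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure ec-4-5

with rule steps along the lines of (the root name is setup-specific):

  step take default
  step choose indep 3 type datacenter
  step chooseleaf indep 3 type host
  step emit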
You can construct valid schemes by looking at all N that are multiples of 3
and trying k <= (2N/3 - 1): losing one DC costs you N/3 shards, so 2N/3
survive, and RW access needs at least k+1 of them (see the small script
after this list):
N=6  -> k=2, m=4
N=9  -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6 ; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6
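As a quick sanity check, a few lines of shell enumerate the candidates from
the inequality above (nothing ceph-specific, just arithmetic):

  for N in 6 9 12 15; do
    kmax=$(( 2*N/3 - 1 ))            # survivors 2N/3 must be >= k+1
    for (( k=2; k<=kmax; k++ )); do
      echo "N=$N -> k=$k, m=$((N-k))"
    done
  done

This also prints the smaller k values for each N; the list above picks out
the most useful ones.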
As you can see, the larger N, the smaller the overhead. The downside is
larger stripes, meaning that large N only makes sense if you have large
files/objects. An often overlooked advantage of profiles with large m is
that you can defeat tail latencies for read operations by setting
fast_read=true on the pool. This is really great when you have silently
failing disks. Unfortunately, there is no fast_write counterpart (which
would not be useful in your use case anyway).
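Enabling it is a one-liner (the pool name is a placeholder):

  ceph osd pool set ecpool fast_read true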
There are only very few useful profiles with k a power of 2 (4+5, 8+7).
Some people use 7+5 with success, and 5+4 looks somewhat OK as well. If
you use a recent ceph with bluestore min_alloc_size = 4K, stripe size is
less of an issue, and 8+7 is a really good candidate that I would give a
shot in a benchmark.
You should benchmark a number of different profiles on your system to get
an idea of how important the profile is for performance and how much
redundancy overhead you can afford. Remember to benchmark in degraded
condition as well. While as an admin you might be happy that stuff is up,
users will still complain if things suddenly become unbearably slow. Run
long tests in degraded state to catch all the pitfalls (MONs not trimming
logs etc.) so that you end up with a reliable configuration that doesn't
let you down the first time it rains.
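For baseline numbers, rados bench against a test pool built with the
candidate profile works well, first healthy, then degraded (a sketch; pool
name, runtimes and which OSDs you stop need adapting):

  # healthy baseline: 10 minutes of 4M writes, then sequential reads
  rados bench -p ecpool-test 600 write -b 4M --no-cleanup
  rados bench -p ecpool-test 600 seq

  # degraded run: prevent rebalancing, stop one DC's OSDs, re-run both
  ceph osd set noout
  # ... stop the OSDs in one datacenter, repeat the two benchmarks ...
  ceph osd unset noout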
Good luck and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@xxxxxxx
Subject: Crushmap rule for multi-datacenter erasure coding
Hi,
We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise resilience in case one
datacenter goes down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with
failure domain = datacenter but it doesn't work as, I guess, it would
like to ensure there are always 5 OSDs up (so that the pool remains
R/W), whereas with failure domain = datacenter the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:
step choose indep 3 type datacenter
step chooseleaf indep x type host
step emit
Is it the right approach? If yes, what should be 'x'? Would 0 work?
From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current crush map, edit
it and upload the modified version. Am I right?
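If I understand the docs correctly, the round trip would be roughly:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt, then:
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new

but I'd like to confirm this is the recommended way.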
Thanks in advance for your help or suggestions. Best regards,
Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx