Hi Frank,
Thanks for this detailed answer. About your point that 4+2 or similar schemes
defeat the purpose of a 3-datacenter configuration: you're right in
principle. In our case, the goal is to avoid any impact for replicated
pools (in particular RBD for the cloud), but it may be acceptable for some
pools to be read-only during a short period. Still, I'll explore your
alternative k+m scenarios, as some may be interesting.
I'm also interested in feedback from experience with LRC EC, even though I
don't think it changes the problem of resilience to a DC failure.
Best regards,
Michel
Sent from my mobile
On 3 April 2023 at 21:57:41, Frank Schilder <frans@xxxxxx> wrote:
Hi Michel,
failure domain = datacenter doesn't work, because crush wants to put 1
shard per failure domain and you have 3 data centers, not 6. The
modified crush rule you wrote should work, I believe equally well with x=0
or x=2 -- but try it out before doing anything to your cluster.
The easiest way to test non-destructively is to download the osdmap from
your cluster and extract the crush map from it. You can then, *without*
modifying your cluster, update the crush map in the (off-line) copy of the
OSD map and let it compute mappings (the commands for all of this are in
the ceph docs, look for osdmaptool). You can then check whether these
mappings are as you want. There was an earlier case where someone posted a
script to confirm mappings automatically; I used some awk magic, it's not
that difficult.
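As a rough sketch of that workflow (untested here, adjust file names and
the pool id to your setup):

  # grab the current OSD map and extract + decompile its crush map
  ceph osd getmap -o osdmap.bin
  osdmaptool osdmap.bin --export-crush crush.bin
  crushtool -d crush.bin -o crush.txt

  # edit crush.txt (add the modified rule), recompile, and inject it
  # into the *off-line* copy of the OSD map
  crushtool -c crush.txt -o crush.new
  osdmaptool osdmap.bin --import-crush crush.new

  # dump the computed PG mappings for the pool you care about
  osdmaptool osdmap.bin --test-map-pgs-dump --pool <pool-id>

Only the local osdmap.bin copy is modified; the cluster is never touched.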
As a word of warning: if you want to be failure resistant, don't use 4+2.
It's not worth the effort of having 3 data centers. In case you lose one
DC, you have only 4 shards left, and the pool becomes read-only.
Don't even consider setting min_size=4; it again completely defeats the
purpose of having 3 DCs in the first place.
The smallest profile you can use that will ensure RW access in case of a DC
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards
left, which is equivalent to 5+1, so you keep RW access with 1 DC down.
However, k=5 is a prime number, which has a negative performance impact;
ideal values for k are powers of 2. The alternative is k=4, m=5 (44% usable
capacity) with good performance but higher redundancy overhead.
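For k=4, m=5 that could look roughly like this (profile/pool names and the
PG count are placeholders, and the autogenerated rule still has to be
replaced by the multi-step rule discussed above, here with 3 shards per DC):

  ceph osd erasure-code-profile set ec-4-5 k=4 m=5 crush-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure ec-4-5

with rule steps along the lines of (the root name is setup-specific):

  step take default
  step choose indep 3 type datacenter
  step chooseleaf indep 3 type host
  step emit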
You can construct valid schemes by looking at all N that are multiples of 3
and trying k <= (2N/3 - 1): losing one DC costs you N/3 shards, so 2N/3
survive, and RW access needs at least k+1 of them (see the small script
after this list):
N=6  -> k=2, m=4
N=9  -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6 ; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6
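As a quick sanity check, a few lines of shell enumerate the candidates from
the inequality above (nothing ceph-specific, just arithmetic):

  for N in 6 9 12 15; do
    kmax=$(( 2*N/3 - 1 ))            # survivors 2N/3 must be >= k+1
    for (( k=2; k<=kmax; k++ )); do
      echo "N=$N -> k=$k, m=$((N-k))"
    done
  done

This also prints the smaller k values for each N; the list above picks out
the most useful ones.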
As you can see, the larger N, the smaller the overhead. The downside is
larger stripes, meaning that large N only makes sense if you have large
files/objects. An often overlooked advantage of profiles with large m is
that you can defeat tail latencies for read operations by setting
fast_read=true on the pool. This is really great when you have silently
failing disks. Unfortunately, there is no fast_write counterpart (which
would not be useful in your use case anyway).
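Enabling it is a one-liner (the pool name is a placeholder):

  ceph osd pool set ecpool fast_read true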
There are only very few useful profiles with k a power of 2 (4+5, 8+7).
Some people use 7+5 with success, and 5+4 looks somewhat OK as well. If
you use a recent ceph with bluestore min_alloc_size = 4K, stripe size is
less of an issue, and 8+7 is a really good candidate that I would give a
shot in a benchmark.
You should benchmark a number of different profiles on your system to get
an idea of how important the profile is for performance and how much
redundancy overhead you can afford. Remember to benchmark in degraded
condition as well. While as an admin you might be happy that stuff is up,
users will still complain if things suddenly become unbearably slow. Run
long tests in degraded state to catch all the pitfalls (MONs not trimming
logs etc.) so that you end up with a reliable configuration that doesn't
let you down the first time it rains.
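For baseline numbers, rados bench against a test pool built with the
candidate profile works well, first healthy, then degraded (a sketch; pool
name, runtimes and which OSDs you stop need adapting):

  # healthy baseline: 10 minutes of 4M writes, then sequential reads
  rados bench -p ecpool-test 600 write -b 4M --no-cleanup
  rados bench -p ecpool-test 600 seq

  # degraded run: prevent rebalancing, stop one DC's OSDs, re-run both
  ceph osd set noout
  # ... stop the OSDs in one datacenter, repeat the two benchmarks ...
  ceph osd unset noout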
Good luck and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@xxxxxxx
Subject: Crushmap rule for multi-datacenter erasure coding
Hi,
We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise resilience in case one
datacenter goes down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with
failure domain = datacenter but it doesn't work as, I guess, it would
like to ensure there are always 5 OSDs up (so that the pool remains
R/W), whereas with failure domain = datacenter the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:
step choose indep 3 type datacenter
step chooseleaf indep x type host
step emit
Is it the right approach? If yes, what should be 'x'? Would 0 work?
From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current crush map, edit
it and upload the modified version. Am I right?
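If I understand the docs correctly, the round trip would be roughly:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt, then:
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new

but I'd like to confirm this is the recommended way.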
Thanks in advance for your help or suggestions. Best regards,
Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx