Re: Crushmap rule for multi-datacenter erasure coding

Hi Michel,

Failure domain = datacenter doesn't work because CRUSH wants to put one shard per failure domain, and you have 3 data centers, not 6. The modified crush rule you wrote should work, I believe equally well with x=0 or x=2 -- but try it out before doing anything to your cluster.
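
For reference, the complete rule in a decompiled crush map could look roughly like the sketch below (the x=2 variant); the rule name and id are placeholders, the root bucket is assumed to be called "default", and the two set_*_tries steps are just the usual retry settings for EC rules:

rule ec_two_per_dc {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type datacenter
    step chooseleaf indep 2 type host
    step emit
}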

The easiest way to test non-destructively is to download the osdmap from your cluster and extract the crush map from it. You can then update the crush map in the (off-line) copy of the OSD map *without* modifying your cluster and let it compute the resulting mappings (commands for all this are in the ceph docs, look for osdmaptool). You can then check whether these mappings are what you want. In an earlier thread someone posted a script to verify mappings automatically; I used some awk magic, it's not that difficult.
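
A minimal sketch of that workflow; the file paths, the rule id 2 and <pool-id> are placeholders:

ceph osd getmap -o /tmp/osdmap                   # download the current OSD map
osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt    # decompile to editable text
# edit /tmp/crush.txt and add the modified rule, then:
crushtool -c /tmp/crush.txt -o /tmp/crush.new
osdmaptool /tmp/osdmap --import-crush /tmp/crush.new
osdmaptool /tmp/osdmap --test-map-pgs-dump --pool <pool-id>
# alternatively, test the rule directly without needing a pool:
crushtool -i /tmp/crush.new --test --rule 2 --num-rep 6 --show-mappings

Each line of the dump shows a PG and the OSDs it maps to; all you need to check is that every PG gets exactly two OSDs from each datacenter.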

As a word of warning: if you want to be failure resistant, don't use 4+2. It's not worth the effort of having 3 data centers. If you lose one DC, you have only 4 shards left, which equals k, and with the default min_size=k+1=5 the pool becomes read-only. Don't even consider setting min_size=4; that again completely defeats the purpose of having 3 DCs in the first place.

The smallest profile that will ensure RW access in case of a DC failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards left, which is equivalent to 5+1, so you retain RW access with 1 DC down. However, k=5 is a prime number, which has a negative performance impact; powers of 2 are ideal for k. The alternative is k=4, m=5 (44% usable capacity) with good performance but higher redundancy overhead.
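
Creating such a profile and attaching the custom rule to the pool could look roughly like this (profile, pool and rule names are placeholders; note that the pool still needs the two-step rule -- the failure domain in the profile only affects the auto-generated rule):

ceph osd erasure-code-profile set ec-4-5 k=4 m=5 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec-4-5
ceph osd pool set ecpool crush_rule ec_two_per_dc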

You can construct valid schemes by looking at all multiples of 3 for the total shard count N=k+m and trying k <= 2N/3 - 1, so that the 2N/3 shards that survive a DC failure are still more than k:

N=6  -> k=2, m=4
N=9  -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6 ; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6
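
If you want to play with other N, a throwaway sketch that enumerates every k passing the test (the list above only keeps the lowest-overhead candidates):

for N in 6 9 12 15; do
  for k in $(seq 2 $(( 2*N/3 - 1 ))); do
    echo "N=$N -> k=$k, m=$(( N - k ))"
  done
done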

As you can see, the larger N, the smaller the overhead. The downside is larger stripes, meaning that large N only makes sense if you have large files/objects. An often overlooked advantage of profiles with large m is that you can defeat tail latencies for read operations by setting fast_read=true on the pool. This is really great when you have silently failing disks. Unfortunately, there is no fast_write counterpart (which would not be useful in your use case anyway).
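
Enabling it is a one-liner (pool name is a placeholder):

ceph osd pool set ecpool fast_read true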

There are only very few useful profiles with k a power of 2 (4+5, 8+7). Some people use 7+5 with success, and 5+4 looks somewhat OK as well. If you run a recent ceph with a bluestore min_alloc_size of 4K, stripe size is less of an issue and 8+7 is a really good candidate that I would give a shot in a benchmark.

You should benchmark a number of different profiles on your system to get an idea of how much the profile matters for performance and how much redundancy overhead you can afford. Remember to benchmark in degraded condition as well. While as an admin you might be happy that everything is up, users will still complain if things suddenly become unbearably slow. Run long tests in degraded state to catch all the pitfalls (MONs not trimming logs etc.) so that you end up with a reliable configuration that doesn't let you down the first time it rains.
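
A starting point for such a benchmark, assuming a throwaway test pool (pool name, runtimes and thread counts are placeholders to tune):

rados bench -p ecpool 600 write -b 4M -t 16 --no-cleanup
rados bench -p ecpool 600 seq -t 16
rados bench -p ecpool 600 rand -t 16
rados -p ecpool cleanup
# for the degraded case, repeat after taking one DC's OSDs down, e.g.:
ceph osd set noout
systemctl stop ceph-osd.target    # on the hosts of one datacenter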

Good luck and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@xxxxxxx
Subject: Crushmap rule for multi-datacenter erasure coding

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise the resilience in case of 1
datacenter being down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with a
failure domain = datacenter but it doesn't work as, I guess, it would
like to ensure it always has 5 OSDs up (to ensure that the pool remains
R/W) whereas with a failure domain = datacenter, the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:

step choose indep 3 datacenter
step chooseleaf indep x host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current crush map, edit
it and upload the modified version. Am I right?

Thanks in advance for your help or suggestions. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx