Re: Crushmap rule for multi-datacenter erasure coding

Frank Schilder <frans@xxxxxx> · Tue, 4 Apr 2023 08:10:01 +0000

Hi Michel,

I don't have experience with LRC profiles. They may reduce cross-site traffic at the expense of extra overhead. But this might actually be unproblematic with EC profiles that have a large m anyway. If you do experiments with this, please let the list know.

I would like to add a remark here that might be of general interest. Besides k being a power of 2, the 4+5 and 8+7 profiles have one particular advantage. If a DC fails, 6 or 10 shards remain available, respectively, which is equivalent to running 4+2 or 8+2. In other words, the 4+5 and 8+7 profiles allow maintenance under degraded conditions, because one can lose 1 DC + 1 other host and still have RW access. This is an advantage over 3-times replicated pools (with less overhead!), where one really needs to keep everything else running at all costs if 1 DC is down.

In addition to that, the 4+5 profile tolerates 4 hosts down and the 8+7 profile 6. This means that one can use DCs that are administered independently. A Ceph upgrade would require only minimal coordination to synchronize the major steps. Specifically, after MONs and MGRs are upgraded, the OSD upgrade can proceed independently, host by host, in each DC without service outage. Even if something goes wrong in one DC, the others could proceed without service outage.
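
The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration under assumptions that are not spelled out above: shards are spread evenly over 3 DCs with one shard per host, and the pool keeps min_size=k+1 (the default for EC pools). The function name is made up for this example.

```python
def degraded_headroom(k, m, num_dcs=3):
    """Shard arithmetic for an EC k+m pool spread evenly over num_dcs DCs.

    Assumptions (not from the thread): one shard per host, the same number
    of shards in every DC, and min_size = k + 1.
    """
    n = k + m
    if n % num_dcs != 0:
        raise ValueError("k+m must be a multiple of the number of DCs")
    min_size = k + 1
    left_after_dc_loss = n - n // num_dcs
    return {
        "shards_left_after_dc_loss": left_after_dc_loss,
        # Extra hosts that may fail after a DC is gone while the pool stays RW.
        "extra_hosts_after_dc_loss": left_after_dc_loss - min_size,
        # Hosts that may be down with all DCs up while the pool stays RW.
        "hosts_down_tolerated": n - min_size,
    }


for k, m in [(4, 5), (8, 7)]:
    print(f"{k}+{m}:", degraded_headroom(k, m))
```

A negative value for extra_hosts_after_dc_loss would mean the pool drops below min_size as soon as one DC is lost.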

Many discussions overlook the impact of redundancy on maintainability (under degraded conditions). I'm just adding that here, because the value of maintainability often greatly outweighs the cost of the extra redundancy. For example, I gave up on trying to squeeze the last byte out of disks that are actually very cheap compared to my salary, and would rather have a system that runs without me doing overtime on incidents. It's much cheaper in the long run. A cluster with 50-60% usable capacity is very easy and cheap to administer.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>
Sent: Monday, April 3, 2023 10:19 PM
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: Crushmap rule for multi-datacenter erasure coding

Hi Frank,

Thanks for this detailed answer. About your point that 4+2 or similar schemes defeat the purpose of a 3-datacenter configuration: you're right in principle. In our case, the goal is to avoid any impact on replicated pools (in particular RBD for the cloud), but it may be acceptable for some pools to be read-only for a short period. I'll explore your alternative k+m scenarios, as some may be interesting.

I'm also interested in feedback from experience with LRC EC, even if I don't think it changes the problem of resilience to a DC failure.

Best regards,

Michel

On 3 April 2023 21:57:41, Frank Schilder <frans@xxxxxx> wrote:

Hi Michel,

Failure domain = datacenter doesn't work, because crush wants to put 1 shard per failure domain and you have 3 data centers, not 6. The modified crush rule you wrote should work, I believe equally well with x=0 or x=2, but try it out before doing anything to your cluster.

The easiest way to test non-destructively is to download the osdmap from your cluster and extract the crush map from it. You can then, *without* modifying your cluster, update the crush map in the (off-line) copy of the OSD map and let it compute mappings (commands for all this are in the ceph docs, look for osdmaptool). You can then check whether these mappings are what you want. There was an earlier case where someone posted a script to confirm mappings automatically. I used some awk magic, it's not that difficult.
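
As an illustration of such a check (in Python rather than awk), here is a minimal sketch. It assumes that each mapping line, for example from `osdmaptool --test-map-pgs-dump`, contains the OSD list in square brackets, and that you supply your own OSD-to-datacenter table. The sample lines, the table and the function names below are all made up; adapt the parsing to what your tool actually prints.

```python
import re
from collections import Counter


def shards_per_dc(mapping_line, osd_to_dc):
    """Count shards per datacenter for one mapping line.

    Returns None if the line contains no bracketed OSD list.
    OSD ids missing from osd_to_dc are counted under "unknown".
    """
    match = re.search(r"\[([\d,\s]*)\]", mapping_line)
    if match is None:
        return None
    osds = [int(x) for x in re.findall(r"\d+", match.group(1))]
    return Counter(osd_to_dc.get(osd, "unknown") for osd in osds)


def bad_mappings(lines, osd_to_dc, per_dc, num_dcs=3):
    """Return the lines that do not place per_dc shards in each of num_dcs DCs."""
    bad = []
    for line in lines:
        counts = shards_per_dc(line, osd_to_dc)
        if counts is None:
            continue  # not a mapping line (header, summary, ...)
        if len(counts) != num_dcs or any(c != per_dc for c in counts.values()):
            bad.append(line)
    return bad


# Hypothetical cluster: OSDs 0-5 in dc1, 6-11 in dc2, 12-17 in dc3.
osd_to_dc = {osd: f"dc{osd // 6 + 1}" for osd in range(18)}

# Made-up sample lines; real tool output may look different.
sample = [
    "1.0 [0,3,7,10,13,16]",  # 2 shards per DC
    "1.1 [0,1,2,7,13,16]",   # 3 shards in dc1, 1 in dc2
]
print(bad_mappings(sample, osd_to_dc, per_dc=2))
```

Mappings with a missing shard end up with an id that is not in the table and are therefore flagged as well, which is usually what you want in such a test.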

As a note of warning, if you want to be failure resistant, don't use 4+2. It's not worth the effort of having 3 data centers. If you lose one DC, you have only 4 shards left, in which case the pool becomes read-only. Don't even consider setting min_size=4, it again completely defeats the purpose of having 3 DCs in the first place.

The smallest profile you can use that will ensure RW access in case of a DC failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards left, which is equivalent to 5+1, so you keep RW access with 1 DC down. However, k=5 is a prime number, which has a negative performance impact; powers of 2 are ideal for k. The alternative is k=4, m=5 (44% usable capacity), with good performance but higher redundancy overhead.

You can construct valid schemes by looking at all N=k+m that are multiples of 3 and trying k<=(2N/3-1), so that the 2N/3 shards left after a DC failure are still at least k+1:

N=6  -> k=2, m=4
N=9  -> k=4, m=5; k=5, m=4
N=12 -> k=6, m=6; k=7, m=5
N=15 -> k=8, m=7; k=9, m=6
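
The bound can also be enumerated mechanically. The following Python sketch assumes 3 DCs with N/3 shards each and min_size=k+1; the function name is hypothetical. It prints the two largest k for each N, i.e. the candidates with the lowest overhead. Note that the bound as written also admits k=3 for N=6, which the list above does not show.

```python
def valid_profiles(n, num_dcs=3):
    """List (k, m, usable_fraction) for EC profiles with k+m = n that keep
    at least k+1 shards after losing one of num_dcs DCs.

    Assumes an even spread of shards over the DCs and min_size = k + 1.
    """
    if n % num_dcs != 0:
        raise ValueError("n must be a multiple of the number of DCs")
    shards_left = n - n // num_dcs  # 2N/3 for 3 DCs
    k_max = shards_left - 1         # need shards_left >= k + 1
    return [(k, n - k, k / n) for k in range(k_max, 0, -1)]


for n in (6, 9, 12, 15):
    # Only the two largest k per N, the lowest-overhead candidates.
    for k, m, usable in valid_profiles(n)[:2]:
        print(f"N={n:2d}: k={k}, m={m}, usable={usable:.0%}")
```

Filtering the result for k being a power of 2 leaves 4+5 and 8+7 among the profiles shown.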

As you can see, the larger N, the smaller the overhead. The downside is larger stripes, meaning that larger N only makes sense if you have large files/objects. An often overlooked advantage of profiles with large m is that you can defeat tail latencies for read operations by setting fast_read=true for the pool. This is really great when you have silently failing disks. Unfortunately, there is no fast_write counterpart (which would not be useful in your use case anyway).

There are only very few useful profiles with k a power of 2 (4+5, 8+7). Some people use 7+5 with success, and 5+4 looks somewhat OK as well. If you use the latest Ceph with bluestore min alloc size = 4K, stripe size is less of an issue and 8+7 is a really good candidate that I would give a shot in a benchmark.

You should benchmark a number of different profiles on your system to get an idea of how important the profile is for performance and how much replication overhead you can afford. Remember to also benchmark in degraded condition. While as an admin you might be happy that stuff is up, users will still complain if things are suddenly unbearably slow. Run long tests in degraded state to catch all the pitfalls (MONs not trimming logs etc.), so that you end up with a reliable configuration that doesn't let you down the first time it rains.

Good luck and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@xxxxxxx
Subject: Crushmap rule for multi-datacenter erasure coding

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise the resilience in case of 1
datacenter being down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with a
failure domain = datacenter but it doesn't work as, I guess, it would
like to ensure it always has 5 OSDs up (to ensure that the pool remains
R/W) where with a failure domain = datacenter, the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:

step choose indep 3 datacenter
step chooseleaf indep x host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current CRUSHMAP, edit
it and upload the modified version. Am I right?

Thanks in advance for your help or suggestions. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx