Re: Recomand number of k and m erasure code

I would like to add here a detail that is often overlooked: maintainability under degraded conditions.

For production systems I would recommend using EC profiles with at least m=3. The reason is that if a node is down for an extended period and m=2, it is not possible to do any maintenance on the system without losing write access. Don't trust what users claim they are willing to tolerate; at least get it in writing. Once a problem occurs they will be at your doorstep no matter what they said before.

Similarly, when doing a longer maintenance task with m=2, any disk failure during maintenance means losing write access.

Having m=3 or larger allows 2 (or more) hosts/OSDs to be unavailable simultaneously while service remains fully operational. That can be a lifesaver in many situations.
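
To make the argument above concrete, here is a minimal Python sketch. It assumes one shard per failure domain and that the pool's min_size is k+1 (check your pool's actual setting). The function names are made up for illustration and are not part of any Ceph tooling.

```python
def pool_state(k: int, m: int, shards_down: int) -> str:
    """Classify a k+m EC pool by the number of unavailable shards.

    Assumes min_size = k + 1 and one shard per failure domain.
    """
    if not 0 <= shards_down <= k + m:
        raise ValueError("shards_down must be between 0 and k+m")
    if shards_down <= m - 1:
        return "available"                 # still at or above min_size
    if shards_down == m:
        return "unavailable, data intact"  # below min_size, still recoverable
    return "unrecoverable if the shards are permanently lost"


def spare_for_maintenance(m: int, already_down: int) -> int:
    """How many more hosts can be taken down while the pool stays available."""
    return max(0, (m - 1) - already_down)


# One node is already down for a longer period:
print(spare_for_maintenance(m=2, already_down=1))  # 0: no room for maintenance
print(spare_for_maintenance(m=3, already_down=1))  # 1: one more host may go down
```

With m=2 a single long-running node outage uses up the whole margin, which is the situation described above.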

An additional reason for a larger m is systematic drive failures if your vendor doesn't mix drives from different batches and factories. If a batch has a systematic production error, failures are no longer statistically independent. In such a situation, if one drive fails, the likelihood that more drives fail at the same time is very high. Having a larger number of parity shards increases the chances of recovering from such events.

For similar reasons I would recommend deploying 5 MONs instead of 3. My life got so much better after having the extra redundancy.

As some background, in our situation we experience(d) somewhat heavy maintenance operations including modifying/updating ceph nodes (hardware, not software), exchanging racks, switches, cooling, power etc. This required longer downtime and/or moving of servers, and moving the ceph hardware was the easiest compared with other systems because of the extra redundancy built into it. We had no service outages during such operations.

Best regards,
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: Saturday, January 13, 2024 5:36 PM
To: Phong Tran Thanh
Cc: ceph-users@xxxxxxx
Subject:  Re: Recomand number of k and m erasure code

There are nuances, but in general the higher the sum of m+k, the lower the performance, because *every* operation has to hit that many drives, which is especially impactful with HDDs.  So there’s a tradeoff between storage efficiency and performance.  And as you’ve seen, larger parity groups especially mean slower recovery/backfill.

There’s also a modest benefit to choosing values of m and k that have small prime factors, but I wouldn’t worry too much about that.

You can find EC efficiency tables on the net:

I should really add a table to the docs, making a note to do that.

There’s a nice calculator at the OSNEXUS site.

The overhead factor is (k+m) / k

So for a 4,2 profile, that’s 6 / 4 = 1.5

For 6,2, 8 / 6 = 1.33

For 10,2, 12 / 10 = 1.2

and so forth.  As k increases, the incremental efficiency gain sees diminishing returns, but performance continues to decrease.
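
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the (k+m) / k formula; the function names are mine.

```python
def overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a k+m profile."""
    return (k + m) / k


def usable_fraction(k: int, m: int) -> float:
    """Fraction of raw capacity that holds user data."""
    return k / (k + m)


for k, m in [(4, 2), (6, 2), (10, 2)]:
    print(f"{k},{m}: overhead {overhead(k, m):.2f}, usable {usable_fraction(k, m):.0%}")
# 4,2: overhead 1.50, usable 67%
# 6,2: overhead 1.33, usable 75%
# 10,2: overhead 1.20, usable 83%
```

Going from k=4 to k=6 gains about 8 points of usable capacity, and going from k=6 to k=10 gains about 8 more for four additional data shards, which is the diminishing return mentioned above.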

Think of m as the number of shards you can lose without losing data, and m-1 as the number you can lose / have down and still have data *available*.

I also suggest that the number of failure domains (in your case this means OSD nodes) be *at least* k+m+1, so with your 10 nodes you want k+m to be at most 9.
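
As a rough illustration of that rule, the following Python sketch lists the profiles that fit a given number of failure domains. The function name and the min_k / min_m defaults are assumptions chosen for the example, not Ceph limits.

```python
def candidate_profiles(failure_domains: int, min_k: int = 2, min_m: int = 2):
    """Return (k, m) pairs with k + m + 1 <= failure_domains."""
    max_width = failure_domains - 1  # keep one failure domain spare
    return [
        (k, m)
        for m in range(min_m, max_width + 1)
        for k in range(min_k, max_width - m + 1)
    ]


profiles = candidate_profiles(10)
print((4, 2) in profiles)  # True:  4 + 2 = 6  <= 9
print((6, 2) in profiles)  # True:  6 + 2 = 8  <= 9
print((6, 3) in profiles)  # True:  6 + 3 = 9  <= 9
print((8, 2) in profiles)  # False: 8 + 2 = 10 >  9
```

The spare failure domain gives recovery somewhere to rebuild shards when a whole node is lost.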

With RBD and many CephFS implementations, we mostly have relatively large RADOS objects that are striped over many OSDs.

When using RGW especially, one should attend to average and median S3 object size.  There’s an analysis of the potential for space amplification in the docs so I won’t repeat it here in detail. This sheet visually demonstrates this.

Basically, for an RGW bucket pool, or for a CephFS data pool storing unusually small objects, if you have a lot of S3 objects that are only a few KB in size, you waste a significant fraction of underlying storage.  This is exacerbated by EC, and the larger the sum of k+m, the more waste.
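
A simplified model shows where the waste comes from. The Python sketch below assumes each of the k+m shards is rounded up to a whole allocation unit on disk, and it uses 4096 bytes as an assumed allocation unit (the real value is your OSDs' BlueStore min_alloc_size, which may differ). It ignores metadata and other overheads, and the function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def ec_allocated_bytes(object_size: int, k: int, m: int, alloc_unit: int = 4096) -> int:
    """Raw bytes allocated for one object in a k+m EC pool (simplified model)."""
    if object_size == 0:
        return 0
    per_shard = ceil_div(object_size, k)                     # data per shard
    per_shard_alloc = ceil_div(per_shard, alloc_unit) * alloc_unit
    return (k + m) * per_shard_alloc                         # parity shards are the same size


def replicated_allocated_bytes(object_size: int, copies: int = 3, alloc_unit: int = 4096) -> int:
    """Raw bytes allocated for one object in a replicated pool (same model)."""
    return copies * ceil_div(object_size, alloc_unit) * alloc_unit


KiB, MiB = 1024, 1024 * 1024
print(ec_allocated_bytes(4 * KiB, 4, 2) / (4 * KiB))   # 6.0
print(ec_allocated_bytes(4 * KiB, 8, 3) / (4 * KiB))   # 11.0
print(replicated_allocated_bytes(4 * KiB) / (4 * KiB)) # 3.0
print(ec_allocated_bytes(4 * MiB, 4, 2) / (4 * MiB))   # 1.5
```

Under these assumptions a 4 KiB object costs more raw space on 4,2 than on 3x replication, and the nominal 1.5 overhead is only reached for objects that are large compared with k times the allocation unit.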

When people ask me about replication vs EC and EC profile, the first question I ask is what they’re storing.  When EC isn’t a non-starter, I tend to recommend 4,2 as a profile until / unless someone has specific needs and can understand the tradeoffs. This lets you store roughly 2x the data of 3x replication while not going overboard on the performance hit.

If you care about your data, do not set m=1.

If you need to survive the loss of many drives, say if your cluster is across multiple buildings or sites, choose a larger value of m.  There are people running profiles like 4,6 because they have unusual and specific needs.

> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh <tranphong079@xxxxxxxxx> wrote:
> Hi ceph user!
> I need to determine which erasure code values (k and m) to choose for a
> Ceph cluster with 10 nodes.
> I am using the reef version with rbd. Furthermore, when using a larger k,
> for example, ec6+2 and ec4+2, which erasure coding performance is better,
> and what are the criteria for choosing the appropriate erasure coding?
> Please help me
> Email: tranphong079@xxxxxxxxx
> Skype: tranphong079
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx