Re: Recommend number of k and m erasure code



There are nuances, but in general the higher the sum of m+k, the lower the performance, because *every* operation has to hit that many drives, which is especially impactful with HDDs.  So there’s a tradeoff between storage efficiency and performance.  And as you’ve seen, larger parity groups especially mean slower recovery/backfill.

There’s also a modest benefit to choosing values of m and k that have small prime factors, but I wouldn’t worry too much about that.

You can find EC efficiency tables on the net:

I should really add a table to the docs, making a note to do that.

There’s a nice calculator at the OSNEXUS site.

The overhead factor is (k+m) / k 

So for a 4,2 profile, that’s 6 / 4 = 1.5

For 6,2, 8 / 6 = 1.33

For 10,2, 12 / 10 = 1.2

and so forth.  As k increases, the incremental efficiency gain sees diminishing returns, but performance continues to decrease.
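The arithmetic above is trivial to script; here’s a quick Python sketch of the same formula (nothing Ceph-specific assumed, the function name is mine):

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw-to-usable overhead factor for a k+m erasure-code profile."""
    return (k + m) / k

# Reproduce the profiles discussed above:
for k, m in [(4, 2), (6, 2), (10, 2)]:
    print(f"{k},{m}: overhead {ec_overhead(k, m):.2f}")
```

Note how the gain shrinks: going from 4,2 to 6,2 saves 0.17x of raw capacity, but 6,2 to 10,2 saves only 0.13x more, while every operation now touches twelve drives instead of eight.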

Think of m as the number of shards you can lose without losing data, and m-1 as the number you can lose / have down and still have data *available* (assuming the usual min_size of k+1).

I also suggest that the number of failure domains — in your case this means OSD nodes — be *at least* k+m+1, so in your case you want k+m to be at most 9.
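That rule of thumb is easy to sanity-check for any cluster; a tiny illustrative helper (the name is mine, this isn’t from any Ceph tooling):

```python
def max_km(failure_domains: int) -> int:
    """Largest k+m satisfying the rule failure_domains >= k + m + 1."""
    return failure_domains - 1

# With 10 OSD nodes as the failure domains, k+m should be at most 9,
# so 4,2 and 6,2 both fit, but something like 8,3 would not.
print(max_km(10))  # → 9
```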

With RBD and many CephFS implementations, we mostly have relatively large RADOS objects that are striped over many OSDs.

When using RGW especially, one should attend to average and median S3 object size.  There’s an analysis of the potential for space amplification in the docs so I won’t repeat it here in detail. This sheet visually demonstrates this.

Basically, for an RGW bucket pool — or for a CephFS data pool storing unusually small objects — if you have a lot of S3 objects in the multiples of KB size, you waste a significant fraction of underlying storage.  This is exacerbated by EC, and the larger the sum of k+m, the more waste.
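To make the small-object waste concrete, here’s a rough sketch. It assumes each of the k+m chunks is rounded up to BlueStore’s allocation unit (4 KiB by default on recent releases); the function and names are illustrative, not from the docs:

```python
import math

ALLOC_KIB = 4  # assumed bluestore_min_alloc_size of 4 KiB

def stored_kib(object_kib: float, k: int, m: int, alloc: int = ALLOC_KIB) -> float:
    """Approximate raw KiB consumed by one object under a k,m EC profile."""
    chunk = object_kib / k                        # each data chunk's share
    per_chunk = math.ceil(chunk / alloc) * alloc  # rounded up to the alloc unit
    return per_chunk * (k + m)                    # k data + m coding chunks

# A 4 KiB S3 object on a 4,2 profile: 24 KiB of raw space, a 6x amplification.
small = stored_kib(4, 4, 2)
print(small, small / 4)

# A 1 MiB object on the same profile lands near the nominal 1.5x overhead.
big = stored_kib(1024, 4, 2)
print(big, big / 1024)
```

The takeaway matches the paragraph above: the larger k+m is, the more alloc-unit round-up a small object pays, so tiny-object workloads amplify worst on wide profiles.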

When people ask me about replication vs EC and EC profile, the first question I ask is what they’re storing.  When EC isn’t a non-starter, I tend to recommend 4,2 as a profile until / unless someone has specific needs and can understand the tradeoffs. This lets you store roughly 2x the data of 3x replication while not going overboard on the performance hit.

If you care about your data, do not set m=1.

If you need to survive the loss of many drives, say if your cluster is across multiple buildings or sites, choose a larger value of m.  There are people running profiles like 4,6 because they have unusual and specific needs.

> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh <tranphong079@xxxxxxxxx> wrote:
> Hi ceph user!
> I need to determine which erasure code values (k and m) to choose for a
> Ceph cluster with 10 nodes.
> I am using the reef version with rbd. Furthermore, when using a larger k,
> for example, ec6+2 and ec4+2, which erasure coding performance is better,
> and what are the criteria for choosing the appropriate erasure coding?
> Please help me
> Email: tranphong079@xxxxxxxxx
> Skype: tranphong079
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

