Re: Recommended number of k and m for erasure code

Dear Frank,

"For production systems I would recommend to use EC profiles with at least
m=3" -> can i set min_size with min_size=4 for ec4+2 it's ok for
productions? My data is video from the camera system, it's hot data, write
and delete in some day, 10-15 day ex... Read and write availability is more
important than data loss

Thanks Frank

On Mon, 15 Jan 2024 at 16:46, Frank Schilder <frans@xxxxxx> wrote:

> I would like to add here a detail that is often overlooked:
> maintainability under degraded conditions.
>
> For production systems I would recommend using EC profiles with at least
> m=3. The reason is that if a node is down for an extended period and m=2,
> it is not possible to do any maintenance on the system without losing
> write access. Don't trust what users claim they are willing to tolerate -
> at least get it in writing. Once a problem occurs they will be at your
> doorstep no matter what they said before.
>
> Similarly, with m=2, any disk failure during a longer maintenance task
> will mean losing write access.
>
> Having m=3 or larger allows two (or more) hosts/OSDs to be unavailable
> simultaneously while the service remains fully operational. That can be a
> life saver in many situations.
>
> An additional reason for a larger m is systematic drive failures when your
> vendor doesn't mix drives from different batches and factories. If a batch
> has a systematic production error, failures are no longer statistically
> independent. In such a situation, if one drive fails, the likelihood that
> more drives fail at the same time is very high. Having a larger number of
> parity shards increases the chances of recovering from such events.
>
> For similar reasons I would recommend to deploy 5 MONs instead of 3. My
> life got so much better after having the extra redundancy.
>
> As some background, in our situation we experience(d) somewhat heavy
> maintenance operations, including modifying/updating Ceph nodes (hardware,
> not software) and exchanging racks, switches, cooling, power, etc. This
> required longer downtime and/or moving of servers, and moving the Ceph
> hardware was the easiest of all our systems thanks to the extra redundancy
> bits in it. We had no service outages during such operations.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> Sent: Saturday, January 13, 2024 5:36 PM
> To: Phong Tran Thanh
> Cc: ceph-users@xxxxxxx
> Subject: Re: Recommended number of k and m for erasure code
>
> There are nuances, but in general the higher the sum of m+k, the lower the
> performance, because *every* operation has to hit that many drives, which
> is especially impactful with HDDs.  So there’s a tradeoff between storage
> efficiency and performance.  And as you’ve seen, larger parity groups
> especially mean slower recovery/backfill.
>
> There’s also a modest benefit to choosing values of m and k that have
> small prime factors, but I wouldn’t worry too much about that.
>
>
> You can find EC efficiency tables on the net:
>
>
>
> https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html
>
>
> I should really add a table to the docs, making a note to do that.
>
> There’s a nice calculator at the OSNEXUS site:
>
> https://www.osnexus.com/ceph-designer
>
>
> The overhead factor is (k+m) / k
>
> So for a 4,2 profile, that’s 6 / 4 == 1.5
>
> For 6,2, 8 / 6 = 1.33
>
> For 10,2, 12 / 10 = 1.2
>
> and so forth.  As k increases, the incremental efficiency gain sees
> diminishing returns, but performance continues to decrease.
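>
> A quick back-of-the-envelope check of that arithmetic in Python (the
> profiles below are just examples):
>
>     # Raw-to-usable overhead for an EC profile: (k + m) / k
>     def overhead(k: int, m: int) -> float:
>         return (k + m) / k
>
>     for k, m in [(4, 2), (6, 2), (10, 2)]:
>         print(f"EC {k}+{m}: overhead {overhead(k, m):.2f}x, "
>               f"usable fraction {k / (k + m):.0%}")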
>
> Think of m as the number of shards (OSDs) you can lose without losing
> data, and m-1 as the number you can lose / have down and still have data
> *available*.
>
> I also suggest that the number of failure domains — in your case this
> means OSD nodes — be *at least* k+m+1, so in your case you want k+m to be
> at most 9.
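>
> A small sketch that applies those two rules of thumb, assuming the usual
> min_size = k+1 default for EC pools and the 10 nodes from your original
> message:
>
>     def check_profile(k: int, m: int, hosts: int) -> None:
>         print(f"EC {k}+{m} on {hosts} hosts:")
>         print(f"  shards you can lose without losing data: {m}")
>         print(f"  shards that can be down with I/O still flowing "
>               f"(min_size = k+1): {m - 1}")
>         ok = hosts >= k + m + 1
>         print(f"  failure-domain guideline (hosts >= k+m+1): "
>               f"{'ok' if ok else 'too few hosts'}")
>
>     for k, m in [(4, 2), (6, 2), (8, 3)]:
>         check_profile(k, m, hosts=10)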
>
> With RBD and many CephFS implementations, we mostly have relatively large
> RADOS objects that are striped over many OSDs.
>
> When using RGW especially, one should attend to average and median S3
> object size.  There’s an analysis of the potential for space amplification
> in the docs so I won’t repeat it here in detail. This sheet
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
> visually demonstrates this.
>
> Basically, for an RGW bucket pool — or for a CephFS data pool storing
> unusually small objects — if you have a lot of S3 objects that are only a
> few KB in size, you waste a significant fraction of the underlying
> storage. This is exacerbated by EC, and the larger the sum of k+m, the
> more waste.
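>
> A very rough model of that amplification, assuming a 4 KiB BlueStore
> allocation unit and ignoring stripe_unit and RGW head-object details, so
> treat the numbers as illustrative only:
>
>     import math
>
>     ALLOC = 4096  # assumed BlueStore min_alloc_size in bytes
>
>     def raw_bytes(obj_size: int, k: int, m: int) -> int:
>         # Each of the k data chunks holds ceil(obj_size / k) bytes, rounded
>         # up to the allocation unit; the m parity chunks are the same size.
>         chunk = math.ceil(obj_size / k / ALLOC) * ALLOC
>         return (k + m) * chunk
>
>     for size in (4096, 16384, 65536, 4 * 1024 * 1024):
>         raw = raw_bytes(size, k=4, m=2)
>         print(f"{size:>9} B object -> {raw:>9} B raw ({raw / size:.1f}x)")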
>
> When people ask me about replication vs EC and EC profile, the first
> question I ask is what they’re storing.  When EC isn’t a non-starter, I
> tend to recommend 4,2 as a profile until / unless someone has specific
> needs and can understand the tradeoffs. This lets you store ~2x the data
> of 3x replication while not going overboard on the performance hit.
>
> If you care about your data, do not set m=1.
>
> If you need to survive the loss of many drives, say if your cluster is
> spread across multiple buildings or sites, choose a larger value of m.
> There are people running profiles like 4,6 because they have unusual and
> specific needs.
>
>
>
>
> > On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh <tranphong079@xxxxxxxxx>
> wrote:
> >
> > Hi ceph user!
> >
> > I need to determine which erasure code values (k and m) to choose for a
> > Ceph cluster with 10 nodes.
> >
> > I am using the Reef release with RBD. Furthermore, with a larger k, for
> > example EC 6+2 versus EC 4+2, which gives better erasure coding
> > performance, and what are the criteria for choosing an appropriate
> > erasure code profile? Please help me.
> >
> > Email: tranphong079@xxxxxxxxx
> > Skype: tranphong079
>


-- 
Best regards,
----------------------------------------------------------------------------

*Tran Thanh Phong*

Email: tranphong079@xxxxxxxxx
Skype: tranphong079
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



