Re: Erasure coded pool chunk count k

A couple of notes to this:

Ideally you should have at least two more failure domains than your base
resilience (K+M for EC, or size=N for replicated). The reasoning: maintenance
has to happen, so chances are that every now and then you take a host down
for a few hours or possibly days to do an upgrade, fix something broken,
etc. While that host is down you are running in a degraded state, since only
K+M-1 shards are available. If a drive in another host then dies on you,
recovery for it is blocked because there are not enough failure domains
available, and things start getting uncomfortable depending on how large M
is. Or a whole host dies on you in that state ...
Generally, planning your cluster resources right along the fault lines is
going to bite you and cause high levels of stress and anxiety. I know
budgets have a limit, but there is plenty of history on this list of
desperate calls for help simply because clusters were only planned for the
happy-day case.
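
To make the rule of thumb concrete, here is a minimal Python sketch (the
function name, the "+2" headroom framing and the example numbers are mine,
purely for illustration) of how much slack a layout leaves once every shard
needs its own failure domain:

    # Illustrative only: spare failure domains beyond what the pool needs
    # to place all of its shards.
    def headroom(failure_domains: int, k: int, m: int = 0) -> int:
        """For EC pass k and m; for a replicated pool pass k=size, m=0."""
        required = k + m
        return failure_domains - required

    # 6 hosts, EC 4+2: zero spare hosts.  One host in maintenance means PGs
    # are degraded and cannot rebuild onto another host; a further failure
    # in that window leaves data under-protected until the host comes back.
    print(headroom(failure_domains=6, k=4, m=2))   # -> 0
    # 8 hosts, EC 4+2: a host in maintenance plus a second failed host can
    # still be recovered onto the remaining hosts.
    print(headroom(failure_domains=8, k=4, m=2))   # -> 2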

Unlike replicated pools, you cannot change the profile of an EC pool after
it has been created - so if you ever decide to change the EC profile, that
means creating a new pool and migrating the data. Just something to keep in
mind.
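
If you do end up migrating, the shape of it is roughly: create the new pool
with the new profile, copy the objects across, then repoint clients. Below is
a rough sketch using the python-rados bindings; the pool names are made up,
and it deliberately ignores omap data, xattrs, snapshots and whatever
RBD/RGW/CephFS layer on top, so treat it as an outline rather than a
migration tool:

    import rados

    # Assumed pool names, for illustration only.
    SRC_POOL = "old-ec-pool"
    DST_POOL = "new-ec-pool"   # created beforehand with the new EC profile

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        src = cluster.open_ioctx(SRC_POOL)
        dst = cluster.open_ioctx(DST_POOL)
        for obj in src.list_objects():
            size, _mtime = src.stat(obj.key)        # object size in bytes
            data = src.read(obj.key, length=size)   # whole-object read into memory
            dst.write_full(obj.key, data)           # create/overwrite in the new pool
        src.close()
        dst.close()
    finally:
        cluster.shutdown()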

On Tue, 5 Oct 2021 at 14:58, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

>
> The larger the value of K relative to M, the more efficient the raw ::
> usable ratio ends up.
>
> There are tradeoffs and caveats.  Here are some of my thoughts; if I’m
> off-base here, I welcome enlightenment.
>
>
>
> When possible, it’s ideal to have at least K+M failure domains — often
> racks, sometimes hosts, chassis, etc.  Thus smaller clusters, say with 5-6
> nodes, aren’t good fits for larger sums of K+M if your data is valuable.
>
> Larger sums of K+M also mean that more drives will be touched by each read
> or write, especially during recovery.  This could be a factor if one is
> IOPS-limited.  Same with scrubs.
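
A crude way to picture that fan-out (this is my simplification - it ignores
partial-stripe updates, fast_read, and recovery traffic):

    # Crude model: OSDs touched by a single client op on a healthy EC pool.
    def osds_touched(k: int, m: int, op: str) -> int:
        # Reads need the k data shards; writes update all k+m shards.
        return k if op == "read" else k + m

    for k, m in [(4, 2), (8, 3), (17, 3)]:
        print(f"EC {k},{m}: read={osds_touched(k, m, 'read')}  "
              f"write={osds_touched(k, m, 'write')}")
    # EC 4,2: read=4  write=6
    # EC 8,3: read=8  write=11
    # EC 17,3: read=17  write=20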
>
> When using a pool for, e.g., RGW buckets, larger sums of K+M may result in
> greater overhead when storing small objects, since AIUI Ceph / RGW only
> writes full stripes.  So say you have an EC pool of 17,3 on drives with the
> default 4 kB bluestore_min_alloc_size.  A 1 kB S3 object would thus allocate
> 17+3=20 x 4 kB == 80 kB of storage, which is 7900% overhead.  This is an
> extreme example to illustrate the point.
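
For anyone who wants to play with those numbers, a quick back-of-the-envelope
version (this simplifies the allocation model to one min_alloc-sized chunk
per shard, which is roughly the worst case for tiny objects):

    # Simplified small-object allocation estimate for an EC pool.
    def ec_small_object_alloc(k: int, m: int, min_alloc_kb: int = 4) -> int:
        """kB actually allocated for an object smaller than one stripe chunk."""
        return (k + m) * min_alloc_kb

    obj_kb = 1                                   # 1 kB S3 object
    alloc = ec_small_object_alloc(k=17, m=3)     # -> 80 kB
    overhead_pct = (alloc - obj_kb) / obj_kb * 100
    print(alloc, overhead_pct)                   # 80 7900.0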
>
> Larger sums of K+M may present more IOPS to each storage drive, depending
> on the workload and the EC plugin selected.
>
> With larger objects (including RBD) the modulo factor is dramatically
> smaller.  One’s use-case and dataset per-pool may thus inform the EC
> profiles that make sense; workloads that are predominantly smaller objects
> might opt for replication instead.
>
> There was a post ….. a year ago? suggesting that values with small prime
> factors are advantageous, but I never saw a discussion of why that might be.
>
> In some cases where one might be pressured to use replication with only 2
> copies of data, a 2,2 EC profile might achieve the same efficiency with
> greater safety.
>
> Geo / stretch clusters or ones in challenging environments are a special
> case; they might choose values of M equal to or even larger than K.
>
> That said, I think 4,2 is a reasonable place to *start*, adjusted by one’s
> specific needs.  You get a raw :: usable ratio of 1.5 without getting too
> complicated.
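
The raw :: usable arithmetic behind that, for a few of the layouts mentioned
in this thread (treating N-way replication as k=1, m=N-1 just for the
comparison):

    # raw :: usable ratio and shard-failure tolerance for a few layouts.
    def raw_to_usable(k: int, m: int) -> float:
        return (k + m) / k      # replication size=N behaves like k=1, m=N-1 here

    for name, k, m in [("replica 2", 1, 1), ("EC 2,2", 2, 2),
                       ("EC 4,2", 4, 2), ("EC 8,3", 8, 3), ("EC 17,3", 17, 3)]:
        print(f"{name:9s}  ratio={raw_to_usable(k, m):.2f}  failures tolerated={m}")
    # replica 2  ratio=2.00  failures tolerated=1
    # EC 2,2     ratio=2.00  failures tolerated=2
    # EC 4,2     ratio=1.50  failures tolerated=2
    # EC 8,3     ratio=1.38  failures tolerated=3
    # EC 17,3    ratio=1.18  failures tolerated=3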
>
> ymmv
>
>
>
>
>
>
> >
> > Hi,
> >
> > It depends on hardware, failure domain, use case, and overhead.
> >
> > I don’t see an easy way to choose k and m values.
> >
> > -
> > Etienne Menguy
> > etienne.menguy@xxxxxxxx
> >
> >
> >> On 4 Oct 2021, at 16:57, Golasowski Martin <martin.golasowski@xxxxxx> wrote:
> >>
> >> Hello guys,
> >> how does one estimate the number of chunks for an erasure coded pool
> >> (k = ?)?  I see that the number of m chunks determines the pool’s
> >> resiliency; however, I did not find a clear guideline on how to determine k.
> >>
> >> Red Hat states that they support only the following combinations:
> >>
> >> k=8, m=3
> >> k=8, m=4
> >> k=4, m=2
> >>
> >> without any rationale behind them.
> >> The table is taken from
> >> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/storage_strategies_guide/erasure_code_pools
> >> .
> >>
> >> Thanks!
> >>
> >> Regards,
> >> Martin
> >>
> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



