Thanks! I'll stick to 3:2 for now then.

/Z

On Fri, Nov 5, 2021 at 1:55 PM Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:

> With 6 servers I'd go with 3:2, with 7 you can go with 4:2.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
> -----Original Message-----
> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> Sent: Friday, November 5, 2021 6:45 PM
> To: ceph-users <ceph-users@xxxxxxx>
> Subject: Re: Optimal Erasure Code profile?
>
> Many thanks for your detailed advice, gents, I very much appreciate it!
>
> I read in various places that for production environments it's advised to keep (k+m) <= host count. Looks like for my setup it is 3+2 then. Would it be best to proceed with 3+2, or should we go with 4+2?
>
> /Z
>
> On Fri, Nov 5, 2021 at 1:33 PM Sebastian Mazza <sebastian@xxxxxxxxxxx> wrote:
>
> > Hi Zakhar,
> >
> > I don't have much experience with Ceph, so you should read my words with reasonable skepticism.
> >
> > If your failure domain should be the host level, then k=4, m=2 is your most space-efficient option for 6 servers that still allows write IO when one of the servers fails, assuming that you want your pool min_size to be at least k+1. However, your cluster will not be able to “heal itself” in the event of a single server outage, since 5 servers are not enough to distribute the PGs of a “k=4, m=2” pool.
> >
> > When you add one more server (7 in total) with enough space, your cluster will be able to self-heal from one server outage. Or you can create a new CRUSH rule with “k=5, m=2”, “rebalance” your data with this new rule, and get better space efficiency.
> >
> > I think for a production setup you should go with a CRUSH rule that establishes a simple host-based failure domain, as you already stated. Furthermore, your pool min_size for write IO should be at least k+1, so that you are not in immediate danger of losing data after the first OSD/host failure. However, for the sake of completeness I want to mention that it is possible to create CRUSH rules that do not consider hosts and only use the OSD level as failure domain. This would allow you to use erasure coding with a larger k (e.g. k=6, m=2). Furthermore, it is also possible to create CRUSH rules that remain resilient against a server outage even while the pool has “k+m > your server count”.
> > See
> > * http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033502.html
> > * http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters
> >
> > Example of a CRUSH rule that uses 5 hosts and 2 OSDs per host:
> > ---------------
> > rule ec5x2hdd {
> >     id 123
> >     type erasure
> >     min_size 10
> >     max_size 10
> >     step set_chooseleaf_tries 5
> >     step set_choose_tries 100
> >     step take default class hdd
> >     step choose indep 5 type host
> >     step choose indep 2 type osd
> >     step emit
> > }
> > ---------------
> > This selects 5 servers and uses 2 OSDs on each server. With an erasure-coded pool of “k=7, m=3” the system could take one failing server or two failing HDDs before you lose write IO. The system can also survive one host plus one OSD, or 3 OSD failures, before you lose data.
> > This would give you theoretically 70% usable space with only 5 active servers, instead of 66% with the simple host failure domain on 6 active servers and “k=4, m=2” erasure coding.
> > But I don't advise you to do that!
> >
> > Best
> > Sebastian
> >
> > > On 05.11.2021, at 06:14, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >
> > > Hi!
> > >
> > > I've got a Ceph 16.2.6 cluster, the hardware is 6 x Supermicro SSG-6029P nodes, each equipped with:
> > >
> > > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > > 384 GB RAM
> > > 2 x boot drives
> > > 2 x 1.6 TB enterprise NVMe drives (DB/WAL)
> > > 2 x 6.4 TB enterprise drives (storage tier)
> > > 9 x 9 TB HDDs (storage tier)
> > > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> > >
> > > Please help me understand the calculation / choice of the optimal EC profile for this setup. I would like the EC pool to span all 6 nodes on HDD only and have the optimal combination of resiliency and efficiency, with the view that the cluster will expand. Previously, when I had only 3 nodes, I tested EC with:
> > >
> > > crush-device-class=hdd
> > > crush-failure-domain=host
> > > crush-root=default
> > > jerasure-per-chunk-alignment=false
> > > k=2
> > > m=1
> > > plugin=jerasure
> > > technique=reed_sol_van
> > > w=8
> > >
> > > I am leaning towards using the above profile with k=4, m=2 for "production" use, but am not sure that I understand the math correctly, that this profile is optimal for my current setup, and that I'll be able to scale it properly by adding new nodes. I would very much appreciate any advice!
> > >
> > > Best regards,
> > > Zakhar
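For reference, a minimal sketch of how the 3+2 profile settled on above could be created on the HDD class with a host failure domain. The profile and pool names (ec32hdd, ecpool_hdd) are placeholders, and pg_num is left to the pg_autoscaler, which is on by default in 16.2:
---------------
# create an EC profile matching the parameters discussed in the thread
ceph osd erasure-code-profile set ec32hdd k=3 m=2 plugin=jerasure technique=reed_sol_van crush-root=default crush-failure-domain=host crush-device-class=hdd
# verify the profile before using it
ceph osd erasure-code-profile get ec32hdd
# create the EC pool from that profile
ceph osd pool create ecpool_hdd erasure ec32hdd
# confirm min_size is k+1 = 4, as Sebastian recommends, and set it if needed
ceph osd pool get ecpool_hdd min_size
ceph osd pool set ecpool_hdd min_size 4
---------------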
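If anyone does experiment with a multi-OSD-per-host rule like Sebastian's ec5x2hdd example, it can be sanity-checked offline with crushtool before being injected into the cluster. A rough sketch, where rule id 123 and --num-rep 10 follow his example and the file names are arbitrary:
---------------
# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt to add the custom rule, then recompile it
crushtool -c crushmap.txt -o crushmap-new.bin
# simulate placements for a 10-chunk (k=7, m=3) pool against the new rule
crushtool -i crushmap-new.bin --test --rule 123 --num-rep 10 --show-mappings
# only then apply the new map to the cluster
ceph osd setcrushmap -i crushmap-new.bin
---------------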