Re: Failure Domain = NVMe?

Christian Wuerdig <christian.wuerdig@xxxxxxxxx> · Fri, 12 Mar 2021 07:28:19 +1300

For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2
shards, similar to this:
https://ceph.io/planet/erasure-code-on-small-clusters/
If a host dies or goes down you can still recover all data (although at that
stage your cluster is no longer available for client I/O).
You shouldn't just consider failures but also maintenance scenarios, which
will require a node to be offline for some time. In particular, Ceph upgrades
can take some time, especially if something goes wrong. You have no
breathing room left at that stage, and your cluster will be dead until all
nodes are up again.
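
To make the shard arithmetic concrete, here is a rough sketch in Python. It
is not a Ceph API. The class name EcLayout and its methods are made up for
illustration, and it assumes the pool's min_size is left at k + 1 (the usual
default for EC pools) and that a failed host held its full share of shards
for the placement group in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcLayout:
    """Hypothetical helper: shard counts for an EC pool, not a Ceph API."""
    k: int                  # data shards
    m: int                  # coding shards
    shards_per_domain: int  # shards placed in each failure domain (e.g. host)

    @property
    def min_size(self) -> int:
        # Assumption: min_size is k + 1, the usual default for EC pools.
        return self.k + 1

    def domains_needed(self) -> int:
        # Ceiling division: failure domains required to hold all k + m shards.
        return -(-(self.k + self.m) // self.shards_per_domain)

    def after_failures(self, failed_domains: int) -> dict:
        # Worst case: every failed domain held its full share of shards.
        lost = failed_domains * self.shards_per_domain
        surviving = max(self.k + self.m - lost, 0)
        return {
            "surviving_shards": surviving,
            "recoverable": surviving >= self.k,        # data can be rebuilt
            "client_io": surviving >= self.min_size,   # PG stays active
        }


if __name__ == "__main__":
    two_per_host = EcLayout(k=8, m=2, shards_per_domain=2)
    print(two_per_host.domains_needed())   # hosts required
    print(two_per_host.after_failures(1))  # one host down
    one_per_host = EcLayout(k=8, m=2, shards_per_domain=1)
    print(one_per_host.after_failures(1))  # one host down, 10 hosts
```

With 2 shards per host, one host down leaves exactly k = 8 shards: enough to
recover, but below min_size, so client I/O stops. With 1 shard per failure
domain, a single failure leaves 9 shards and the pool keeps serving I/O.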

On Fri, 12 Mar 2021 at 02:03, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:

> Istvan,
>
> I agree that there is always risk with failure-domain < node, especially
> with EC pools.  We are accepting this risk to lower the financial barrier
> to entry.
>
> In our minds, we have good power protection and new hardware, so the
> greatest immediate risks for our smaller cluster (approaching 6 OSD nodes
> and 48 HDDs) are NVMe write exhaustion and HDD failures.   Since we have
> multiple OSDs sharing a single NVMe device it occurs to me that we might
> want to get Ceph to 'map' against that.  In a way, NVMe devices are our
> 'nodes' at the current size of our cluster.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
>
> On Wed, Mar 10, 2021 at 10:41 PM Szabo, Istvan (Agoda) <
> Istvan.Szabo@xxxxxxxxx> wrote:
>
> > Don't forget that if you have a server failure you might lose many
> > objects. If the failure domain is osd and you have, let's say, 12 drives
> > in each server, then in an unlucky situation all shards of an 8+2 EC
> > placement group can end up in a single server.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---------------------------------------------------
> > Agoda Services Co., Ltd.
> > e: istvan.szabo@xxxxxxxxx
> > ---------------------------------------------------
> >
> > -----Original Message-----
> > From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
> > Sent: Wednesday, March 10, 2021 11:42 PM
> > To: ceph-users <ceph-users@xxxxxxx>
> > Subject:  Failure Domain = NVMe?
> >
> > Hello,
> >
> > In some documentation I was reading last night about laying out OSDs, it
> > was suggested that if more than one OSD uses the same NVMe drive, the
> > failure-domain should probably be set to node. However, for a small
> > cluster the inclination is to use EC-pools and failure-domain = OSD.
> >
> > I was wondering if there is a middle ground - could we define
> > failure-domain = NVMe?  I think the map would need to be defined manually
> > in the same way that failure-domain = rack requires information about
> > which nodes are in each rack.
> >
> > Example:  My latest OSD nodes have 8 HDDs and 3 U.2 NVMe.  I'd set up the
> > WAL/DB for with HDDs per OSD  (wasted space on the 3rd NVMe).
> > Across all my OSD nodes I will have 8 HDDs and either 2 or 3 NVMe
> > devices per node - 15 total NVMe devices.   My preferred EC-pool profile
> > is 8+2.  It seems that this profile could be safely dispersed across 15
> > failure domains, resulting in protection against NVMe failure.
> >
> > Please let me know if this is worth pursuing.
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdhall@xxxxxxxxxxxxxx
> > 607-760-2328 (Cell)
> > 607-777-4641 (Office)
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
> > email to ceph-users-leave@xxxxxxx
> >