Re: Failure Domain = NVMe?

One potential issue is maintenance after an NVMe failure. Depending on how
the hardware is configured, you may need to bring the whole node down to
replace the failed NVMe, which could cause PGs to become read-only if you
are close to your min_size threshold. I think the additional risk is not
worth it, but if you move ahead anyway you should either avoid EC entirely
or use a wider EC ratio such as 8+3 or 8+4 - or, since you only have 48
HDDs, a smaller profile such as 6+3 might be better.
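
For reference, a minimal sketch of what that could look like on the CLI,
assuming you stay with failure-domain = osd (profile name, pool name and
PG counts below are placeholders, not anything from this thread):

    # wider EC profile: more chunks can be lost before data is unavailable
    ceph osd erasure-code-profile set ec63-osd k=6 m=3 crush-failure-domain=osd
    # EC pool using that profile
    ceph osd pool create ecpool 128 128 erasure ec63-osd

min_size for an EC pool typically defaults to k+1, so a 6+3 pool would keep
serving I/O with up to two chunks unavailable.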

On Thu, Mar 11, 2021 at 8:03 AM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:

> Istvan,
>
> I agree that there is always risk with failure-domain < node, especially
> with EC pools.  We are accepting this risk to lower the financial barrier
> to entry.
>
> In our minds, we have good power protection and new hardware, so the
> greatest immediate risks for our smaller cluster (approaching 6 OSD nodes
> and 48 HDDs) are NVMe write exhaustion and HDD failures.   Since we have
> multiple OSDs sharing a single NVMe device it occurs to me that we might
> want to get Ceph to 'map' against that.  In a way, NVMe devices are our
> 'nodes' at the current size of our cluster.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
>
> On Wed, Mar 10, 2021 at 10:41 PM Szabo, Istvan (Agoda) <
> Istvan.Szabo@xxxxxxxxx> wrote:
>
> > Don't forget that if you have a server failure you might lose many
> > objects. If the failure domain is osd, then with, say, 12 drives per
> > server, all chunks of an 8+2 EC object can in an unlucky situation end
> > up on a single server.
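> >
> > For anyone who wants to sanity check this on a test pool, one quick way
> > to see where the chunks of a given PG actually landed (pool name, PG id
> > and OSD id below are placeholders):
> >
> >     ceph pg ls-by-pool ecpool    # list PGs and their acting OSD sets
> >     ceph pg map 2.1f             # up/acting OSDs for one PG
> >     ceph osd find 17             # which host osd.17 sits on
> >
> > If several OSDs in one acting set resolve to the same host, that PG is
> > exposed to exactly this single-server failure.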
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---------------------------------------------------
> > Agoda Services Co., Ltd.
> > e: istvan.szabo@xxxxxxxxx
> > ---------------------------------------------------
> >
> > -----Original Message-----
> > From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
> > Sent: Wednesday, March 10, 2021 11:42 PM
> > To: ceph-users <ceph-users@xxxxxxx>
> > Subject:  Failure Domain = NVMe?
> >
> > Hello,
> >
> > In some documentation I was reading last night about laying out OSDs, it
> > was suggested that if more than one OSD uses the same NVMe drive, the
> > failure domain should probably be set to node. However, for a small
> > cluster the inclination is to use EC pools and failure-domain = OSD.
> >
> > I was wondering if there is a middle ground - could we define
> > failure-domain = NVMe?  I think the map would need to be defined manually
> > in the same way that failure-domain = rack requires information about
> > which nodes are in each rack.
> >
> > Example:  My latest OSD nodes have 8 HDDs and 3 U.2 NVMe.  I'd set up the
> > WAL/DB for the HDDs on the NVMe devices (with some wasted space on the
> > 3rd NVMe).  Across all my OSD nodes I will have 8 HDDs and either 2 or 3
> > NVMe devices per node - 15 total NVMe devices.  My preferred EC-pool
> > profile is 8+2.  It seems that this profile could be safely dispersed
> > across 15 failure domains, resulting in protection against NVMe failure.
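> >
> > A rough sketch of how that might look in CRUSH (bucket, profile and pool
> > names below are made up for illustration): one bucket per NVMe device,
> > nested under its host, reusing the otherwise unused 'chassis' type since
> > CRUSH does not enforce the ordering of bucket types.  Depending on the
> > release, moving individual OSDs may require 'ceph osd crush set osd.N
> > <weight> <location>' instead of 'crush move':
> >
> >     ceph osd crush add-bucket node1-nvme0 chassis
> >     ceph osd crush move node1-nvme0 host=node1
> >     ceph osd crush move osd.0 chassis=node1-nvme0
> >     ceph osd crush move osd.1 chassis=node1-nvme0
> >     # ...repeat for every NVMe device on every node, then:
> >     ceph osd erasure-code-profile set ec82-nvme k=8 m=2 crush-failure-domain=chassis
> >     ceph osd pool create ecpool 128 128 erasure ec82-nvme
> >
> > A cleaner variant would be to add a real 'nvme' bucket type by
> > decompiling and editing the CRUSH map with crushtool, but the idea is
> > the same.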
> >
> > Please let me know if this is worth pursuing.
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdhall@xxxxxxxxxxxxxx
> > 607-760-2328 (Cell)
> > 607-777-4641 (Office)
>


-- 
Steven Pine

*E * steven.pine@xxxxxxxxxx  |  *P * 516.938.4100 x
*Webair* | 501 Franklin Avenue Suite 200, Garden City NY, 11530
webair.com
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


