Re: [External Email] Re: Re: Failure Domain = NVMe?

Steven,

In my current hardware configurations each NVMe supports multiple OSDs.  In
my earlier nodes it is 8 OSDs sharing one NVMe (which is also too small).
In the near term I will add NVMe to those nodes, but I'll still have 5 OSDs
per NVMe on some nodes, and 2 or 3 on all the others.  So an NVMe failure
will take out at least 2 OSDs.  Because of this it seems potentially
worthwhile to go through the trouble of defining failure domain = nvme to
ensure maximum resilience.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)


On Thu, Mar 11, 2021 at 2:20 PM Steven Pine <steven.pine@xxxxxxxxxx> wrote:

> Setting the failure domain on a per-node basis will prevent data loss in the
> case of an NVMe failure; you would need multiple NVMe failures across
> different hosts. If data loss is the primary concern then again, you will
> want a higher EC ratio, 6:3 or 6:4 (or, with only 6 OSD hosts, 4:2 or even
> 3:3), or skip EC altogether and use 3x replication, which is likely the
> safest and best-tested option. You can also take backups of your Ceph
> cluster and send them elsewhere; a tool like backy2 can do this with fairly
> minimal setup.
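>
> To make that concrete, the commands involved look roughly like this (the
> pool names, PG counts, and profile name are just examples):
>
>     # 4+2 EC profile with a per-host failure domain
>     ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
>     ceph osd pool create data-ec 128 128 erasure ec42
>
>     # or plain 3x replication
>     ceph osd pool create data-rep 128 128 replicated
>     ceph osd pool set data-rep size 3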
>
> But if you have some magical insistence on using the setup you had already
> determined prior to asking the mailing list, then go ahead, and good luck.
>
> On Thu, Mar 11, 2021 at 1:56 PM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>
>> Hello,
>>
>> While I appreciate and acknowledge the concerns regarding host failure and
>> maintenance shutdowns, our main concern at this time is data loss.  Our use
>> case at this time allows for suspension of client I/O and/or for a full
>> cluster shutdown for maintenance, but loss of data would be catastrophic.
>> It seems that with my current configuration an NVMe failure could cause
>> data loss unless the shards are organized to survive this.
>>
>> So my question is not whether this is prudent, but whether this is actually
>> possible, and whether anybody could point me to hints on how to implement it.
>>
>> Thanks.
>>
>> -Dave
>>
>> --
>> Dave Hall
>> Binghamton University
>> kdhall@xxxxxxxxxxxxxx
>> 607-760-2328 (Cell)
>> 607-777-4641 (Office)
>>
>>
>> On Thu, Mar 11, 2021 at 1:28 PM Christian Wuerdig <
>> christian.wuerdig@xxxxxxxxx> wrote:
>>
>> > For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2
>> > shards similar to this:
>> > https://ceph.io/planet/erasure-code-on-small-clusters/
>> > If a host dies/goes down you can still recover all data (although at
>> that
>> > stage your cluster is no longer available for client io).
>> > You shouldn't just consider failure but also maintenance scenarios, which
>> > will require a node to be offline for some time. In particular, Ceph
>> > upgrades can take some time, especially if something goes wrong. You have
>> > no breathing room left at that stage, and your cluster will be dead until
>> > all nodes are up again.
>> >
>> >
>> > On Fri, 12 Mar 2021 at 02:03, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>> >
>> >> Istvan,
>> >>
>> >> I agree that there is always risk with failure-domain < node,
>> especially
>> >> with EC pools.  We are accepting this risk to lower the financial
>> barrier
>> >> to entry.
>> >>
>> >> In our minds, we have good power protection and new hardware, so the
>> >> greatest immediate risks for our smaller cluster (approaching 6 OSD
>> nodes
>> >> and 48 HDDs) are NVMe write exhaustion and HDD failures.   Since we
>> have
>> >> multiple OSDs sharing a single NVMe device, it occurs to me that we might
>> >> want the CRUSH map to reflect that.  In a way, NVMe devices are our
>> >> 'nodes' at the current size of our cluster.
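>> >>
>> >> Concretely, once an 'nvme' bucket type exists in the CRUSH map, I imagine
>> >> something like the following (untested; names are placeholders):
>> >>
>> >>     ceph osd crush add-bucket node1-nvme0 nvme
>> >>     ceph osd crush move node1-nvme0 host=node1
>> >>     ceph osd crush move osd.0 nvme=node1-nvme0
>> >>     ceph osd crush move osd.1 nvme=node1-nvme0
>> >>
>> >> (or 'ceph osd crush set osd.N <weight> ...' on releases where 'move' only
>> >> accepts buckets), so that each NVMe's OSDs sit under their own bucket.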
>> >>
>> >> -Dave
>> >>
>> >> --
>> >> Dave Hall
>> >> Binghamton University
>> >>
>> >> On Wed, Mar 10, 2021 at 10:41 PM Szabo, Istvan (Agoda) <
>> >> Istvan.Szabo@xxxxxxxxx> wrote:
>> >>
>> >> > Don't forget that if you have a server failure you might lose many
>> >> > objects. If the failure domain is osd then, say you have 12 drives in
>> >> > each server, all the shards of an 8+2 EC placement group can in an
>> >> > unlucky situation end up in 1 server.
>> >> >
>> >> > Istvan Szabo
>> >> > Senior Infrastructure Engineer
>> >> > ---------------------------------------------------
>> >> > Agoda Services Co., Ltd.
>> >> > e: istvan.szabo@xxxxxxxxx
>> >> > ---------------------------------------------------
>> >> >
>> >> > -----Original Message-----
>> >> > From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
>> >> > Sent: Wednesday, March 10, 2021 11:42 PM
>> >> > To: ceph-users <ceph-users@xxxxxxx>
>> >> > Subject:  Failure Domain = NVMe?
>> >> >
>> >> > Hello,
>> >> >
>> >> > In some documentation I was reading last night about laying out
>> OSDs, it
>> >> > was suggested that if more than one OSD uses the same NVMe drive, the
>> >> > failure-domain should probably be set to node. However, for a small
>> >> cluster
>> >> > the inclination is to use EC-pools and failure-domain = OSD.
>> >> >
>> >> > I was wondering if there is a middle ground - could we define
>> >> > failure-domain = NVMe?  I think the map would need to be defined
>> >> manually
>> >> > in the same way that failure-domain = rack requires information about
>> >> which
>> >> > nodes are in each rack.
>> >> >
>> >> > Example:  My latest OSD nodes have 8 HDDs and 3 U.2 NVMe.  I'd set up the
>> >> > WAL/DB for the HDD OSDs across the NVMe devices (with some wasted space
>> >> > on the 3rd NVMe).
>> >> > Across all my OSD nodes I will have 8 HDDs and either 2 or 3 NVMe
>> >> > devices per node - 15 total NVMe devices.   My preferred EC-pool
>> profile
>> >> > is 8+2.  It seems that this profile could be safely dispersed across
>> 15
>> >> > failure domains, resulting in protection against NVMe failure.
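>> >> >
>> >> > If such a bucket type were in place, I imagine the profile and pool would
>> >> > look something like this (untested; names and PG counts are placeholders):
>> >> >
>> >> >     ceph osd erasure-code-profile set ec82nvme k=8 m=2 \
>> >> >         crush-failure-domain=nvme crush-device-class=hdd
>> >> >     ceph osd pool create data-ec 256 256 erasure ec82nvme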
>> >> >
>> >> > Please let me know if this is worth pursuing.
>> >> >
>> >> > Thanks.
>> >> >
>> >> > -Dave
>> >> >
>> >> > --
>> >> > Dave Hall
>> >> > Binghamton University
>> >> > kdhall@xxxxxxxxxxxxxx
>> >> > 607-760-2328 (Cell)
>> >> > 607-777-4641 (Office)
>> >
>
>
> --
> Steven Pine
>
> *E * steven.pine@xxxxxxxxxx  |  *P * 516.938.4100 x
> *Webair* | 501 Franklin Avenue Suite 200, Garden City NY, 11530
> webair.com
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


