Re: [External Email] Re: Re: Failure Domain = NVMe?

Setting the failure domain to host will accomplish nearly the same goal and
provide better results during maintenance, host reboots, and of course host
failures.
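
For what it's worth, that only takes the crush-failure-domain setting on the
erasure code profile (host should be the default anyway).  Roughly something
like this - the profile/pool names, k/m values, and PG counts below are just
placeholders:

    # example only - names and numbers are placeholders
    ceph osd erasure-code-profile set ec-k4m2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec-k4m2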

Otherwise you can try manually editing the CRUSH map to define an nvme
failure domain with the OSDs under it, but the additional work and room for
error this creates means it is not recommended.
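
If you do want to try it anyway, the rough (untested) outline would be:
decompile the crush map, add a custom "nvme" bucket type, create one bucket
per NVMe device, move the OSDs under them, and add a rule that spreads
shards across the nvme buckets.  All names and ids below are just examples:

    # dump and decompile the current crush map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in crushmap.txt, add a new bucket type between osd and host, e.g.
    #   type 0 osd
    #   type 1 nvme
    #   type 2 host
    #   ...            (renumbering the remaining types to match)
    # then define one nvme bucket per device, move its OSDs under it, and
    # add an EC rule that places shards per nvme bucket, something like:
    #   rule ec-by-nvme {
    #       id 2
    #       type erasure
    #       step take default
    #       step chooseleaf indep 0 type nvme
    #       step emit
    #   }

    # recompile and inject the edited map
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

Once the custom type exists, crush-failure-domain=nvme on the erasure code
profile should also work, but again, this is exactly the extra machinery
and room for error I mentioned above.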

On Thu, Mar 11, 2021 at 3:27 PM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:

> Steven,
>
> In my current hardware configurations each NVMe supports multiple OSDs.
> In my earlier nodes it is 8 OSDs sharing one NVMe (which is also too
> small).  In the near term I will add NVMe to those nodes, but I'll still
> have 5 OSDs on some NVMe devices, and 2 or 3 on all the others.  So an
> NVMe failure will take out at least 2 OSDs.  Because of this it seems
> potentially worthwhile to go through the trouble of defining failure
> domain = nvme to ensure maximum resilience.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdhall@xxxxxxxxxxxxxx
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
>
>
> On Thu, Mar 11, 2021 at 2:20 PM Steven Pine <steven.pine@xxxxxxxxxx>
> wrote:
>
>> Setting the failure domain on a per-node basis will prevent data loss in
>> the case of an NVMe failure; you would need multiple NVMe failures across
>> different hosts to lose data.  If data loss is the primary concern then,
>> again, you will want a higher EC ratio, 6:3 or 6:4 - but with only 6 OSD
>> nodes, more like 4:2 or even 3:3 - or skip EC altogether and use 3x
>> replication, which is likely the safest and best-tested use case.  You
>> can also take backups of your Ceph cluster and send them elsewhere; a
>> tool like backy2 can do this with somewhat minimal setup.
>>
>> But if you have some magical insistence on using the setup you had
>> already determined prior to asking the mailing list, then go ahead, and
>> good luck.
>>
>> On Thu, Mar 11, 2021 at 1:56 PM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>>
>>> Hello,
>>>
>>> While I appreciate and acknowledge the concerns regarding host failure
>>> and maintenance shutdowns, our main concern at this time is data loss.
>>> Our use case allows for suspension of client I/O and/or a full cluster
>>> shutdown for maintenance, but loss of data would be catastrophic.  It
>>> seems that with my current configuration an NVMe failure could cause
>>> data loss unless the shards are organized to survive this.
>>>
>>> So my question is not whether this is prudent, but actually whether this
>>> is
>>> possible, and if anybody could point to hints on how to implement it.
>>>
>>> Thanks.
>>>
>>> -Dave
>>>
>>> --
>>> Dave Hall
>>> Binghamton University
>>> kdhall@xxxxxxxxxxxxxx
>>> 607-760-2328 (Cell)
>>> 607-777-4641 (Office)
>>>
>>>
>>> On Thu, Mar 11, 2021 at 1:28 PM Christian Wuerdig <
>>> christian.wuerdig@xxxxxxxxx> wrote:
>>>
>>> > For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2
>>> > shards, similar to this:
>>> > https://ceph.io/planet/erasure-code-on-small-clusters/
>>> > If a host dies/goes down you can still recover all data (although at
>>> > that stage your cluster is no longer available for client io).
>>> > You shouldn't just consider failures but also maintenance scenarios,
>>> > which will require a node to be offline for some time. In particular,
>>> > Ceph upgrades can take some time - especially if something goes wrong.
>>> > You have no breathing room left at that stage and your cluster will be
>>> > dead until all nodes are up again.
>>> >
>>> >
>>> > On Fri, 12 Mar 2021 at 02:03, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>>> >
>>> >> Istvan,
>>> >>
>>> >> I agree that there is always risk with failure-domain < node,
>>> >> especially with EC pools.  We are accepting this risk to lower the
>>> >> financial barrier to entry.
>>> >>
>>> >> In our minds, we have good power protection and new hardware, so the
>>> >> greatest immediate risks for our smaller cluster (approaching 6 OSD
>>> >> nodes and 48 HDDs) are NVMe write exhaustion and HDD failures.  Since
>>> >> we have multiple OSDs sharing a single NVMe device, it occurs to me
>>> >> that we might want to get Ceph to 'map' against that.  In a way, NVMe
>>> >> devices are our 'nodes' at the current size of our cluster.
>>> >>
>>> >> -Dave
>>> >>
>>> >> --
>>> >> Dave Hall
>>> >> Binghamton University
>>> >>
>>> >> On Wed, Mar 10, 2021 at 10:41 PM Szabo, Istvan (Agoda) <
>>> >> Istvan.Szabo@xxxxxxxxx> wrote:
>>> >>
>>> >> > Don't forget that if you have a server failure you might lose many
>>> >> > objects.  If the failure domain is osd - say you have 12 drives in
>>> >> > each server - then in an unlucky situation all shards of an 8+2 EC
>>> >> > object can be located in 1 server.
>>> >> >
>>> >> > Istvan Szabo
>>> >> > Senior Infrastructure Engineer
>>> >> > ---------------------------------------------------
>>> >> > Agoda Services Co., Ltd.
>>> >> > e: istvan.szabo@xxxxxxxxx
>>> >> > ---------------------------------------------------
>>> >> >
>>> >> > -----Original Message-----
>>> >> > From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
>>> >> > Sent: Wednesday, March 10, 2021 11:42 PM
>>> >> > To: ceph-users <ceph-users@xxxxxxx>
>>> >> > Subject:  Failure Domain = NVMe?
>>> >> >
>>> >> > Hello,
>>> >> >
>>> >> > In some documentation I was reading last night about laying out
>>> >> > OSDs, it was suggested that if more than one OSD uses the same NVMe
>>> >> > drive, the failure-domain should probably be set to node.  However,
>>> >> > for a small cluster the inclination is to use EC-pools and
>>> >> > failure-domain = OSD.
>>> >> >
>>> >> > I was wondering if there is a middle ground - could we define
>>> >> > failure-domain = NVMe?  I think the map would need to be defined
>>> >> > manually, in the same way that failure-domain = rack requires
>>> >> > information about which nodes are in each rack.
>>> >> >
>>> >> > Example:  My latest OSD nodes have 8 HDDs and 3 U.2 NVMe.  I'd set
>>> >> > up the WAL/DB for the HDDs on two of the NVMe devices (wasted space
>>> >> > on the 3rd NVMe).  Across all my OSD nodes I will have 8 HDDs and
>>> >> > either 2 or 3 NVMe devices per node - 15 total NVMe devices.  My
>>> >> > preferred EC-pool profile is 8+2.  It seems that this profile could
>>> >> > be safely dispersed across 15 failure domains, resulting in
>>> >> > protection against NVMe failure.
>>> >> >
>>> >> > Please let me know if this is worth pursuing.
>>> >> >
>>> >> > Thanks.
>>> >> >
>>> >> > -Dave
>>> >> >
>>> >> > --
>>> >> > Dave Hall
>>> >> > Binghamton University
>>> >> > kdhall@xxxxxxxxxxxxxx
>>> >> > 607-760-2328 (Cell)
>>> >> > 607-777-4641 (Office)
>>> >>
>>> >
>>>
>>
>>
>> --
>> Steven Pine
>>
>

-- 
Steven Pine

E: steven.pine@xxxxxxxxxx  |  P: 516.938.4100 x
Webair | 501 Franklin Avenue Suite 200, Garden City NY, 11530
webair.com
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


