Re: Failure Domain = NVMe?

Hello,

While I appreciate and acknowledge the concerns regarding host failure and
maintenance shutdowns, our main concern at this time is data loss.  Our use
case allows for suspension of client I/O and/or a full cluster shutdown for
maintenance, but loss of data would be catastrophic.  It seems that with my
current configuration an NVMe failure could cause data loss unless the
shards are organized to survive it.

So my question is not whether this is prudent, but whether it is actually
possible, and whether anybody can point me to hints on how to implement it.
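
From what I've read, I imagine the implementation would look something
like the sketch below, assuming an 'nvme' bucket type can be added to the
CRUSH map and that each NVMe device gets its own bucket holding the OSDs
it backs.  I haven't tested any of this, and the pool name and PG counts
are just placeholders:

    # untested sketch - 'nvme' is a custom bucket type, 'ecpool' is a
    # placeholder pool name
    ceph osd erasure-code-profile set ec82-nvme \
        k=8 m=2 crush-failure-domain=nvme
    ceph osd pool create ecpool 128 128 erasure ec82-nvme

As I understand it, that should end up generating a CRUSH rule that picks
10 different nvme buckets (8 data + 2 coding shards) for every object.  Is
that roughly the right direction?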

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)


On Thu, Mar 11, 2021 at 1:28 PM Christian Wuerdig <
christian.wuerdig@xxxxxxxxx> wrote:

> For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2
> shards similar to this:
> https://ceph.io/planet/erasure-code-on-small-clusters/
> If a host dies/goes down you can still recover all data (although at that
> stage your cluster is no longer available for client io).
> You shouldn't just consider failure but also maintenance scenarios, which
> will require a node to be offline for some time. In particular, a Ceph
> upgrade can take some time - especially if something goes wrong. You have
> no breathing room left at that stage and your cluster will be dead until
> all nodes are up again.
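>
> The key piece in that article is a CRUSH rule along roughly these lines
> (just a sketch for 8+2 over 5 hosts - ids and exact steps will depend on
> your map and Ceph version):
>
>     rule ec82_by_host {
>         id 1
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default
>         step choose indep 5 type host     # pick 5 distinct hosts
>         step chooseleaf indep 2 type osd  # then 2 OSDs on each host
>         step emit
>     }
>
> That way no host ever holds more than 2 of the 10 shards, so with m=2 a
> single host failure does not lose data.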
>
>
> On Fri, 12 Mar 2021 at 02:03, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>
>> Istvan,
>>
>> I agree that there is always risk with failure-domain < node, especially
>> with EC pools.  We are accepting this risk to lower the financial barrier
>> to entry.
>>
>> In our minds, we have good power protection and new hardware, so the
>> greatest immediate risks for our smaller cluster (approaching 6 OSD nodes
>> and 48 HDDs) are NVMe write exhaustion and HDD failures.   Since we have
>> multiple OSDs sharing a single NVMe device it occurs to me that we might
>> want to get Ceph to 'map' against that.  In a way, NVMe devices are our
>> 'nodes' at the current size of our cluster.
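>>
>> Conceptually, what I picture in the decompiled CRUSH map is something
>> like this (illustrative names, ids and weights - not from a real map):
>>
>>     # bucket types, with a custom 'nvme' level between osd and host
>>     type 0 osd
>>     type 1 nvme
>>     type 2 host
>>     ...
>>
>>     nvme node1-nvme0 {
>>         id -21
>>         alg straw2
>>         hash 0
>>         item osd.0 weight 7.3
>>         item osd.1 weight 7.3
>>     }
>>
>>     host node1 {
>>         id -3
>>         alg straw2
>>         hash 0
>>         item node1-nvme0 weight 14.6
>>         item node1-nvme1 weight 14.6
>>     }
>>
>> with an EC rule whose failure domain is 'nvme' instead of 'host' or
>> 'osd'.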
>>
>> -Dave
>>
>> --
>> Dave Hall
>> Binghamton University
>>
>> On Wed, Mar 10, 2021 at 10:41 PM Szabo, Istvan (Agoda) <
>> Istvan.Szabo@xxxxxxxxx> wrote:
>>
>> > Don't forget that if you have a server failure you might lose many
>> > objects. If the failure domain is osd then, say you have 12 drives in
>> > each server, all shards of an 8+2 EC object could in an unlucky
>> > situation end up on 1 server.
>> >
>> > Istvan Szabo
>> > Senior Infrastructure Engineer
>> > ---------------------------------------------------
>> > Agoda Services Co., Ltd.
>> > e: istvan.szabo@xxxxxxxxx
>> > ---------------------------------------------------
>> >
>> > -----Original Message-----
>> > From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
>> > Sent: Wednesday, March 10, 2021 11:42 PM
>> > To: ceph-users <ceph-users@xxxxxxx>
>> > Subject:  Failure Domain = NVMe?
>> >
>> > Hello,
>> >
>> > In some documentation I was reading last night about laying out OSDs, it
>> > was suggested that if more than one OSD uses the same NVMe drive, the
>> > failure-domain should probably be set to node.  However, for a small
>> > cluster the inclination is to use EC-pools and failure-domain = OSD.
>> >
>> > I was wondering if there is a middle ground - could we define
>> > failure-domain = NVMe?  I think the map would need to be defined
>> > manually in the same way that failure-domain = rack requires
>> > information about which nodes are in each rack.
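>> >
>> > Presumably that would mean something along these lines (untested - just
>> > my reading of the docs):
>> >
>> >     ceph osd getcrushmap -o crush.bin
>> >     crushtool -d crush.bin -o crush.txt
>> >     # edit crush.txt: add an 'nvme' bucket type, define one nvme bucket
>> >     # per device, and move each OSD under its nvme bucket
>> >     crushtool -c crush.txt -o crush-new.bin
>> >     ceph osd setcrushmap -i crush-new.bin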
>> >
>> > Example:  My latest OSD nodes have 8 HDDs and 3 U.2 NVMe.  I'd set up
>> > the WAL/DB for several HDDs per NVMe device (with some wasted space on
>> > the 3rd NVMe).  Across all my OSD nodes I will have 8 HDDs and either 2
>> > or 3 NVMe devices per node - 15 NVMe devices in total.  My preferred
>> > EC-pool profile is 8+2.  It seems that this profile could be safely
>> > dispersed across 15 failure domains, resulting in protection against
>> > NVMe failure.
>> >
>> > Please let me know if this is worth pursuing.
>> >
>> > Thanks.
>> >
>> > -Dave
>> >
>> > --
>> > Dave Hall
>> > Binghamton University
>> > kdhall@xxxxxxxxxxxxxx
>> > 607-760-2328 (Cell)
>> > 607-777-4641 (Office)
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


