Steven,

In my current hardware configuration each NVMe device supports multiple OSDs. In my earlier nodes it is 8 OSDs sharing one NVMe (which is also too small). In the near term I will add NVMe to those nodes, but I'll still have 5 OSDs on some NVMe devices, and 2 or 3 on all the others. So an NVMe failure will take out at least 2 OSDs. Because of this it seems worthwhile to go through the trouble of defining failure domain = nvme to ensure maximum resilience.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)
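A minimal sketch of what "failure domain = nvme" could look like on the CLI, assuming a custom 'nvme' bucket type is first added to the CRUSH map. All bucket, profile, and pool names and the numbers below are illustrative, not taken from this cluster, and on some releases an OSD may need to be re-parented with 'ceph osd crush set' or 'create-or-move' rather than 'move':

    # Export the CRUSH map, add an 'nvme' entry to its 'types' section
    # (any unused type id works), recompile, and inject it back:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # ... edit crushmap.txt by hand ...
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # Create one bucket per NVMe device and nest it under its host:
    ceph osd crush add-bucket ceph01-nvme0 nvme
    ceph osd crush move ceph01-nvme0 host=ceph01

    # Re-parent each OSD under the NVMe bucket that carries its WAL/DB
    # (this re-parenting triggers data movement):
    ceph osd crush move osd.0 nvme=ceph01-nvme0

    # Build an 8+2 EC profile whose generated rule separates shards
    # across the new bucket type:
    ceph osd erasure-code-profile set ec82-nvme k=8 m=2 \
        crush-failure-domain=nvme crush-device-class=hdd
    ceph osd pool create ec82-nvme-pool 256 256 erasure ec82-nvme

The edited map can be sanity-checked before injecting it, e.g. with 'crushtool -i crushmap.new --test --rule <id> --num-rep 10 --show-mappings', to see where 10 shards would be placed before any data actually moves.
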
On Thu, Mar 11, 2021 at 2:20 PM Steven Pine <steven.pine@xxxxxxxxxx> wrote:

> Setting the failure domain on a per-node basis will prevent data loss in the case of an NVMe failure; you would need multiple NVMe failures across different hosts. If data loss is the primary concern then, again, you will want a higher EC ratio, 6:3 or 6:4; with only 6 OSD nodes, then 4:2 or even 3:3; or skip EC altogether and use 3x replication, which is likely the safest and best-tested configuration. You can also take backups of your Ceph cluster and send them elsewhere; a tool like backy2 can do this with fairly minimal setup.
>
> But if you have some magical insistence on using the setup you had already determined prior to asking the mailing list, then go ahead, and good luck.
>
> On Thu, Mar 11, 2021 at 1:56 PM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>
>> Hello,
>>
>> While I appreciate and acknowledge the concerns regarding host failure and maintenance shutdowns, our main concern at this time is data loss. Our use case currently allows for suspension of client I/O and/or a full cluster shutdown for maintenance, but loss of data would be catastrophic. It seems that with my current configuration an NVMe failure could cause data loss unless the shards are organized to survive it.
>>
>> So my question is not whether this is prudent, but whether it is actually possible, and whether anybody could point to hints on how to implement it.
>>
>> Thanks.
>>
>> -Dave
>>
>> --
>> Dave Hall
>> Binghamton University
>> kdhall@xxxxxxxxxxxxxx
>> 607-760-2328 (Cell)
>> 607-777-4641 (Office)
>>
>> On Thu, Mar 11, 2021 at 1:28 PM Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:
>>
>> > For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2 shards, similar to this: https://ceph.io/planet/erasure-code-on-small-clusters/
>> > If a host dies or goes down you can still recover all data (although at that stage your cluster is no longer available for client I/O).
>> > You shouldn't just consider failures but also maintenance scenarios, which will require a node to be offline for some time. In particular, Ceph upgrades can take a while - especially if something goes wrong. You have no breathing room left at that stage, and your cluster will be dead until all nodes are up again.
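The rule described in that linked article comes down to something like the sketch below (the rule name and id are arbitrary, and min_size/max_size may be deprecated or absent depending on release): choose 5 distinct hosts, then 2 OSDs under each, so losing one host costs at most 2 of the 10 shards of an 8+2 placement group.

    rule ec82_two_per_host {
        id 50
        type erasure
        min_size 10
        max_size 10
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 5 type host
        step chooseleaf indep 2 type osd
        step emit
    }

The same pattern would apply to per-NVMe buckets; with 10 or more of them available, the simpler profile-generated rule (chooseleaf indep 0 type nvme) is enough on its own.
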
>> >
>> > On Fri, 12 Mar 2021 at 02:03, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>> >
>> >> Istvan,
>> >>
>> >> I agree that there is always risk with failure-domain < node, especially with EC pools. We are accepting this risk to lower the financial barrier to entry.
>> >>
>> >> In our minds, we have good power protection and new hardware, so the greatest immediate risks for our smaller cluster (approaching 6 OSD nodes and 48 HDDs) are NVMe write exhaustion and HDD failures. Since we have multiple OSDs sharing a single NVMe device, it occurs to me that we might want to get Ceph to 'map' against that. In a way, NVMe devices are our 'nodes' at the current size of our cluster.
>> >>
>> >> -Dave
>> >>
>> >> --
>> >> Dave Hall
>> >> Binghamton University
>> >>
>> >> On Wed, Mar 10, 2021 at 10:41 PM Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
>> >>
>> >> > Don't forget that if you have a server failure you might lose many objects. If the failure domain is osd and you have, say, 12 drives in each server, then in an unlucky situation all shards of an 8+2 EC placement can be located on 1 server.
>> >> >
>> >> > Istvan Szabo
>> >> > Senior Infrastructure Engineer
>> >> > ---------------------------------------------------
>> >> > Agoda Services Co., Ltd.
>> >> > e: istvan.szabo@xxxxxxxxx
>> >> > ---------------------------------------------------
>> >> >
>> >> > -----Original Message-----
>> >> > From: Dave Hall <kdhall@xxxxxxxxxxxxxx>
>> >> > Sent: Wednesday, March 10, 2021 11:42 PM
>> >> > To: ceph-users <ceph-users@xxxxxxx>
>> >> > Subject: Failure Domain = NVMe?
>> >> >
>> >> > Hello,
>> >> >
>> >> > In some documentation I was reading last night about laying out OSDs, it was suggested that if more than one OSD uses the same NVMe drive, the failure domain should probably be set to node. However, for a small cluster the inclination is to use EC-pools and failure-domain = OSD.
>> >> >
>> >> > I was wondering if there is a middle ground - could we define failure-domain = NVMe? I think the map would need to be defined manually, in the same way that failure-domain = rack requires information about which nodes are in each rack.
>> >> >
>> >> > Example: My latest OSD nodes have 8 HDDs and 3 U.2 NVMe. I'd put the WAL/DBs for the HDD OSDs on the NVMe devices, several per device (with wasted space on the 3rd NVMe). Across all my OSD nodes I will have 8 HDDs and either 2 or 3 NVMe devices per node - 15 NVMe devices in total. My preferred EC-pool profile is 8+2. It seems that this profile could be safely dispersed across 15 failure domains, resulting in protection against NVMe failure.
>> >> >
>> >> > Please let me know if this is worth pursuing.
>> >> >
>> >> > Thanks.
>> >> >
>> >> > -Dave
>> >> >
>> >> > --
>> >> > Dave Hall
>> >> > Binghamton University
>> >> > kdhall@xxxxxxxxxxxxxx
>> >> > 607-760-2328 (Cell)
>> >> > 607-777-4641 (Office)
>
> --
> Steven Pine
>
> E: steven.pine@xxxxxxxxxx | P: 516.938.4100 x
> Webair | 501 Franklin Avenue Suite 200, Garden City NY, 11530
> webair.com
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx