This thread caught my attention. I have a smaller cluster with a lot of
OSDs sharing the same SSD on each OSD node. I mentioned in an earlier
post that I found a statement in
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
indicating that if the SSD/NVMe in a node is not very big, one could put
the DB on the HDD and keep only the WAL and/or journal on the NVMe.

In the context of this thread we're talking about resilience in the
face of an SSD/NVMe failure. It would be interesting to know which
parts are critical for recovering from an OSD failure: the DB, for
sure, but is it possible to recover an OSD if it loses its WAL or
journal (or both)?

I'm sure that to some extent this depends on the replication set-up:
with 3-way replication and 3 OSD nodes, if all of the SSDs on one node
fail, you can replace the SSD and recover. However, for an EC pool with
failure-domain = OSD, data loss may be possible due to the failure of a
shared SSD/NVMe. Maybe for a small cluster it's worth suggesting to
place the WAL/DB on the HDD and use the SSD/NVMe only for the journal?
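
For anyone who wants to experiment with that split, here is a rough
sketch of the two layouts at OSD-creation time with ceph-volume. The
device paths are placeholders for illustration only:

# usual layout: data on the HDD, DB (and with it the WAL) on the fast device
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

# split discussed above: the DB stays with the data on the HDD,
# only the WAL goes to the fast device
ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p2

Note that when only --block.db is given, the WAL is placed together
with the DB, so the second form is the one that keeps the DB on the
HDD.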

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)

On Sun, Nov 8, 2020 at 3:19 PM Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:

> Sorry for the confusion; what I meant to say is that having all WAL/DBs
> on one SSD will result in a single point of failure. If that SSD goes
> down, all OSDs depending on it will also stop working.
>
> What I'd like to confirm is that there is no benefit to putting the
> WAL/DB on SSD when there is either a cache tier or such a primary SSD
> with HDDs for the replicas, and that distributing the WAL/DB onto each
> HDD will eliminate that single point of failure.
>
> So in your case, with the SSD as the primary OSD, do you put the WAL/DB
> for the secondary HDDs on an SSD, or just distribute it to each HDD?
>
>
> Thanks!
> Tony
> > -----Original Message-----
> > From: 胡 玮文 <huww98@xxxxxxxxxxx>
> > Sent: Sunday, November 8, 2020 5:47 AM
> > To: Tony Liu <tonyliu0592@xxxxxxxxxxx>
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Re: The feasibility of mixed SSD and HDD replicated pool
> >
> >
> > > On 8 Nov 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
> > >
> > > Is it FileStore or BlueStore? With this SSD-HDD solution, is the
> > > journal or WAL/DB on SSD or HDD? My understanding is that there is
> > > no benefit to putting the journal or WAL/DB on SSD with such a
> > > solution. It would also eliminate the single point of failure of
> > > having all WAL/DBs on one SSD. Just want to confirm.
> >
> > We are building a new cluster, so BlueStore. I think putting the
> > WAL/DB on SSD is more about performance. How is this related to
> > eliminating a single point of failure? I'm going to deploy the WAL/DB
> > on SSD for my HDD OSDs and, of course, just use a single device for
> > the SSD OSDs.
> >
> > > Another thought is to have separate pools, like an all-SSD pool and
> > > an all-HDD pool. Each pool would be used for a different purpose.
> > > For example, images, backups and objects could go in the all-HDD
> > > pool and VM volumes in the all-SSD pool.
> >
> > Yes, I think the same.
> >
> > > Thanks!
> > > Tony
> > >> -----Original Message-----
> > >> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> > >> Sent: Monday, October 26, 2020 9:20 AM
> > >> To: Frank Schilder <frans@xxxxxx>
> > >> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
> > >> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>
> > >>
> > >>> On 26 Oct 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
> > >>>
> > >>>> I've never seen anything that implies that lead OSDs within an
> > >>>> acting set are a function of CRUSH rule ordering.
> > >>>
> > >>> This is actually a good question. I believed that I had seen/heard
> > >>> that somewhere, but I might be wrong.
> > >>>
> > >>> Looking at the definition of a PG, it states that a PG is an
> > >>> ordered set of OSD (IDs) and that the first up OSD will be the
> > >>> primary. In other words, it seems that the lowest OSD ID is
> > >>> decisive. If the SSDs were deployed before the HDDs, they have the
> > >>> smallest IDs and, hence, will be preferred as primary OSDs.
> > >>
> > >> I don't think this is correct. From my experiments, using the
> > >> previously mentioned CRUSH rule, no matter what the IDs of the SSD
> > >> OSDs are, the primary OSDs are always SSDs.
> > >>
> > >> I also had a look at the code. If I understand it correctly:
> > >>
> > >> * If the default primary affinity is not changed, the logic around
> > >> primary affinity is skipped, and the primary will be the first OSD
> > >> returned by the CRUSH algorithm [1].
> > >> * The order of OSDs returned by CRUSH still matters if you change
> > >> the primary affinity. The affinity is the probability of a test
> > >> succeeding. The first OSD is tested first and thus has a higher
> > >> probability of becoming primary. [2]
> > >>   * If any OSD has primary affinity = 1.0, its test always
> > >>     succeeds, and no OSD after it will ever be primary.
> > >>   * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5.
> > >>     Then the 2nd OSD has a probability of 0.25 of being primary and
> > >>     the 3rd a probability of 0.125; otherwise the 1st is primary.
> > >>   * If no test succeeds (suppose all OSDs have an affinity of 0),
> > >>     the 1st OSD becomes primary as a fallback.
> > >>
> > >> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> > >> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
> > >>
> > >> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> > >> sufficient for them to be the primaries in my case.
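> > >>
> > >> For reference, a minimal sketch of that on the CLI (the OSD IDs
> > >> are placeholders; on some older releases you may additionally need
> > >> to enable mon_osd_allow_primary_affinity first):
> > >>
> > >> # prefer an SSD OSD as primary
> > >> ceph osd primary-affinity osd.0 1.0
> > >> # make an HDD OSD ineligible as primary
> > >> ceph osd primary-affinity osd.10 0
> > >>
> > >> Per the logic above, an affinity of 1.0 on the first OSD returned
> > >> by CRUSH short-circuits the test, so no OSD after it can become
> > >> primary.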
> > >>
> > >> Do you think I should contribute these findings to the
> > >> documentation?
> > >>
> > >>> This, however, is not a sustainable situation. Any addition of
> > >>> OSDs will mess this up and the distribution scheme will fail in
> > >>> the future. A way out seems to be:
> > >>>
> > >>> - subdivide your HDD storage using device classes:
> > >>>   * define a device class for HDDs with primary affinity = 0; for
> > >>>     example, pick 5 HDDs and change their device class to hdd_np
> > >>>     (for "no primary")
> > >>>   * set the primary affinity of these HDD OSDs to 0
> > >>>   * modify your crush rule to use "step take default class hdd_np"
> > >>>   * this will create a pool with primaries on SSD and balanced
> > >>>     storage distribution between SSD and HDD
> > >>>   * all-HDD pools are deployed as usual on class hdd
> > >>>   * when increasing capacity, one needs to take care to add disks
> > >>>     to the hdd_np class and set their primary affinity to 0
> > >>>   * somewhat increased admin effort, but a fully working solution
> > >>>     (see the command sketch below)
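> > >>>
> > >>> In commands, those steps would look roughly like this (osd.20 is
> > >>> a placeholder; hdd_np is the custom class name from above):
> > >>>
> > >>> # move an HDD OSD into the no-primary device class
> > >>> ceph osd crush rm-device-class osd.20
> > >>> ceph osd crush set-device-class hdd_np osd.20
> > >>> # bar it from ever being picked as primary
> > >>> ceph osd primary-affinity osd.20 0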
> > >>>
> > >>> Best regards,
> > >>> =================
> > >>> Frank Schilder
> > >>> AIT Risø Campus
> > >>> Bygning 109, rum S14
> > >>>
> > >>> ________________________________________
> > >>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > >>> Sent: 25 October 2020 17:07:15
> > >>> To: ceph-users@xxxxxxx
> > >>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>>
> > >>>> I'm not entirely sure if primary on SSD will actually make the
> > >>>> read happen on SSD.
> > >>>
> > >>> My understanding is that by default reads always happen from the
> > >>> lead OSD in the acting set. Octopus seems to (finally) have an
> > >>> option to spread the reads around, which IIRC defaults to false.
> > >>>
> > >>> I've never seen anything that implies that lead OSDs within an
> > >>> acting set are a function of CRUSH rule ordering. I'm not
> > >>> asserting that they aren't, though, but I'm ... skeptical.
> > >>>
> > >>> Setting primary affinity would do the job, and you'd want to have
> > >>> cron continually update it across the cluster to react to
> > >>> topology changes. I was told of this strategy back in 2014, but
> > >>> haven't personally seen it implemented.
> > >>>
> > >>> That said, HDDs are more of a bottleneck for writes than reads
> > >>> and might be just fine for your application. Tiny reads are going
> > >>> to limit you to some degree regardless of drive type, and you do
> > >>> mention throughput, not IOPS.
> > >>>
> > >>> I must echo Frank's notes about capacity too. Ceph can do a lot
> > >>> of things, but that doesn't mean something exotic is necessarily
> > >>> the best choice. You're concerned about 3R only yielding 1/3 of
> > >>> raw capacity on an all-SSD cluster, but the architecture you
> > >>> propose limits you anyway because of drive size. Consider also
> > >>> chassis, CPU, RAM, RU and switch-port costs, and the cost of you
> > >>> fussing over an exotic solution instead of the hundreds of other
> > >>> things in your backlog.
> > >>>
> > >>> And your cluster as described is *tiny*. Honestly I'd suggest
> > >>> considering one of these alternatives:
> > >>>
> > >>> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are
> > >>>   really promising for replacing HDDs for density in this kind of
> > >>>   application. You might even consider ARM if IOPS aren't a
> > >>>   concern.
> > >>> * An NVMeoF solution
> > >>>
> > >>> Cache tiers are "deprecated", but then so are custom cluster
> > >>> names. Neither appears
> > >>>
> > >>>> For EC pools there is an option "fast_read"
> > >>>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
> > >>>> which states that a read will return as soon as the first k
> > >>>> shards have arrived. The default is to wait for all k+m shards
> > >>>> (all replicas). This option is not available for replicated
> > >>>> pools.
> > >>>>
> > >>>> Now, I'm not sure whether this option is unavailable for
> > >>>> replicated pools because the read will always be served by the
> > >>>> acting primary, or because it currently waits for all replicas.
> > >>>> In the latter case, reads will wait for the slowest device.
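> > >>>>
> > >>>> For completeness, toggling fast_read on an existing EC pool is a
> > >>>> one-liner (the pool name is a placeholder), so it is cheap to
> > >>>> benchmark both settings:
> > >>>>
> > >>>> ceph osd pool set my-ec-pool fast_read true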
> > >>>>
> > >>>> I'm not sure if I interpret this correctly. I think you should
> > >>>> test the setup with HDD only and with SSD+HDD to see if read
> > >>>> speed improves. Note that write speed will always depend on the
> > >>>> slowest device.
> > >>>>
> > >>>> Best regards,
> > >>>> =================
> > >>>> Frank Schilder
> > >>>> AIT Risø Campus
> > >>>> Bygning 109, rum S14
> > >>>>
> > >>>> ________________________________________
> > >>>> From: Frank Schilder <frans@xxxxxx>
> > >>>> Sent: 25 October 2020 15:03:16
> > >>>> To: 胡 玮文; Alexander E. Patrakov
> > >>>> Cc: ceph-users@xxxxxxx
> > >>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>>>
> > >>>> A cache pool might be an alternative, heavily depending on how
> > >>>> much data is hot. However, you will then have much less SSD
> > >>>> capacity available, because it also requires replication.
> > >>>>
> > >>>> Looking at your setup, with only 10 * 1T = 10T of SSD but
> > >>>> 20 * 6T = 120T of HDD, you will probably run short of SSD
> > >>>> capacity. Or, looking at it the other way around, with copies on
> > >>>> 1 SSD + 3 HDDs you will only be able to use about 30T out of the
> > >>>> 120T of HDD capacity.
> > >>>>
> > >>>> With this replication, the usable storage will be 10T, and the
> > >>>> raw usage will be 10T of SSD and 30T of HDD. If you can't do
> > >>>> anything else with the HDD space, you will need more SSDs. If
> > >>>> your servers have free disk slots, you can add SSDs over time
> > >>>> until you have at least 40T of SSD capacity to balance the SSD
> > >>>> and HDD capacity.
> > >>>>
> > >>>> Personally, I think 1 SSD + 3 HDD is a good option compared with
> > >>>> a cache pool. You have the data security of 3-times replication
> > >>>> and, if everything is up, need only 1 copy in the SSD cache,
> > >>>> which means that you have 3 times the cache capacity.
> > >>>>
> > >>>> Best regards,
> > >>>> =================
> > >>>> Frank Schilder
> > >>>> AIT Risø Campus
> > >>>> Bygning 109, rum S14
> > >>>>
> > >>>> ________________________________________
> > >>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> > >>>> Sent: 25 October 2020 13:40:55
> > >>>> To: Alexander E. Patrakov
> > >>>> Cc: ceph-users@xxxxxxx
> > >>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>>>
> > >>>> Yes. This is the limitation of the CRUSH algorithm, to my mind.
> > >>>> In order to guard against 2 host failures, I'm going to use 4
> > >>>> replicas: 1 on SSD and 3 on HDD. This will work as intended,
> > >>>> right? Because at least I can ensure the 3 HDDs are from
> > >>>> different hosts.
> > >>>>
> > >>>>> On 25 Oct 2020, at 20:04, Alexander E. Patrakov
> > >>>>> <patrakov@xxxxxxxxx> wrote:
> > >>>>>
> > >>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
> > >>>>> <huww98@xxxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> We are planning a new pool to store our dataset using CephFS.
> > >>>>>> These data are almost read-only (but not guaranteed) and
> > >>>>>> consist of a lot of small files. Each node in our cluster has
> > >>>>>> 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such
> > >>>>>> nodes. We are aiming for the highest read throughput.
> > >>>>>>
> > >>>>>> If we just use a replicated pool of size 3 on SSD, we should
> > >>>>>> get the best performance; however, that leaves us only 1/3 of
> > >>>>>> the SSD space usable. And EC pools are not friendly to such
> > >>>>>> small-object read workloads, I think.
> > >>>>>>
> > >>>>>> Now I'm evaluating a mixed SSD and HDD replication strategy.
> > >>>>>> Ideally, I want 3 data replicas, each on a different host (the
> > >>>>>> failure domain): 1 of them on SSD, the other 2 on HDD, with
> > >>>>>> every read request normally directed to the SSD. So, if every
> > >>>>>> SSD OSD is up, I'd expect the same read throughput as an
> > >>>>>> all-SSD deployment.
> > >>>>>>
> > >>>>>> I've read the documents and done some tests. Here is the crush
> > >>>>>> rule I'm testing with:
> > >>>>>>
> > >>>>>> rule mixed_replicated_rule {
> > >>>>>>     id 3
> > >>>>>>     type replicated
> > >>>>>>     min_size 1
> > >>>>>>     max_size 10
> > >>>>>>     step take default class ssd
> > >>>>>>     step chooseleaf firstn 1 type host
> > >>>>>>     step emit
> > >>>>>>     step take default class hdd
> > >>>>>>     step chooseleaf firstn -1 type host
> > >>>>>>     step emit
> > >>>>>> }
> > >>>>>>
> > >>>>>> Now I have the following conclusions, but I'm not very sure:
> > >>>>>>
> > >>>>>> * The first OSD produced by CRUSH will be the primary OSD (at
> > >>>>>>   least if I don't change the "primary affinity"). So, the
> > >>>>>>   above rule is guaranteed to map an SSD OSD as the primary of
> > >>>>>>   each PG, and every read request will be served from SSD if
> > >>>>>>   it is up.
> > >>>>>> * It is currently not possible to enforce that the SSD and HDD
> > >>>>>>   OSDs are chosen from different hosts. So, if I want to
> > >>>>>>   ensure data availability even if 2 hosts fail, I need to
> > >>>>>>   choose 1 SSD and 3 HDD OSDs. That means setting the
> > >>>>>>   replication size to 4, instead of the ideal value of 3, on
> > >>>>>>   the pool using the above crush rule.
> > >>>>>>
> > >>>>>> Am I correct about the above statements? How would this work
> > >>>>>> in your experience? Thanks.
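> > >>>>>>
> > >>>>>> For reference, a custom rule like this can be compiled into
> > >>>>>> the CRUSH map and attached to a pool roughly as follows (the
> > >>>>>> pool name is a placeholder):
> > >>>>>>
> > >>>>>> ceph osd getcrushmap -o crush.bin
> > >>>>>> crushtool -d crush.bin -o crush.txt
> > >>>>>> # add the rule above to crush.txt, then recompile and load it
> > >>>>>> crushtool -c crush.txt -o crush.new
> > >>>>>> ceph osd setcrushmap -i crush.new
> > >>>>>> ceph osd pool set mixed-pool crush_rule mixed_replicated_rule
> > >>>>>> ceph osd pool set mixed-pool size 4
> > >>>>>>
> > >>>>>> The "size 4" matches the 1 SSD + 3 HDD layout from the second
> > >>>>>> conclusion above.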
> > >>>>>
> > >>>>> This works (i.e. guards against host failures) only if you have
> > >>>>> strictly separate sets of hosts that have SSDs and hosts that
> > >>>>> have HDDs. That is, there should be no host that has both;
> > >>>>> otherwise there is a chance that one HDD and one SSD from that
> > >>>>> host will be picked.
> > >>>>>
> > >>>>> --
> > >>>>> Alexander E. Patrakov
> > >>>>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx