For a read-only workload this should make no difference, since all reads
are normally served from the SSD anyway. But I think it is still
beneficial for writes, backfill, and recovery. I will also have some
HDD-only pools, so WAL/DB on SSD will definitely improve performance for
those pools. I will always put the WAL/DB on SSD if any SSD is installed
on that host. After all, disk failures are rare, and performance is more
important. In case the SSD fails, I need to rebuild more than one OSD,
which will take longer, but it should not result in data loss, right?
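For reference, a minimal sketch of how an HDD OSD could be created with
its WAL/DB on a shared SSD (the device path and LV name are hypothetical;
when only --block.db is given, the WAL is co-located on the DB device):

  # Illustrative only: /dev/sdb is an HDD; ceph-db/db-sdb is a
  # pre-created logical volume on the shared SSD, sized for the DB
  # (and the implicit WAL).
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-sdb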
> On Nov 9, 2020, at 04:17, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
>
> Sorry for the confusion; what I meant to say is that having all WAL/DBs
> on one SSD will result in a single point of failure. If that SSD goes
> down, all OSDs depending on it will also stop working.
>
> What I'd like to confirm is that there is no benefit to putting the
> WAL/DB on SSD when there is either a cache tier or such a primary SSD
> with HDDs for the replicas, and that distributing the WAL/DB onto each
> HDD will eliminate that single point of failure.
>
> So in your case, with SSD as the primary OSD, do you put the WAL/DB on
> an SSD for the secondary HDDs, or just distribute it onto each HDD?
>
> Thanks!
> Tony
>
>> -----Original Message-----
>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>> Sent: Sunday, November 8, 2020 5:47 AM
>> To: Tony Liu <tonyliu0592@xxxxxxxxxxx>
>> Cc: ceph-users@xxxxxxx
>> Subject: Re: Re: The feasibility of mixed SSD and HDD replicated pool
>>
>>> On Nov 8, 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
>>>
>>> Is it FileStore or BlueStore? With this SSD-HDD solution, is the
>>> journal or WAL/DB on SSD or HDD? My understanding is that there is no
>>> benefit to putting the journal or WAL/DB on SSD with such a solution.
>>> It would also eliminate the single point of failure of having all
>>> WAL/DBs on one SSD. Just want to confirm.
>>
>> We are building a new cluster, so BlueStore. I think putting the WAL/DB
>> on SSD is more about performance; how is it related to eliminating a
>> single point of failure? I'm going to deploy the WAL/DB on SSD for my
>> HDD OSDs, and of course just use a single device for the SSD OSDs.
>>
>>> Another thought is to have separate pools, like an all-SSD pool and an
>>> all-HDD pool, each used for a different purpose. For example, images,
>>> backups, and objects can be in the all-HDD pool and VM volumes in the
>>> all-SSD pool.
>>
>> Yes, I think the same.
>>
>>> Thanks!
>>> Tony
>>>
>>>> -----Original Message-----
>>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>>>> Sent: Monday, October 26, 2020 9:20 AM
>>>> To: Frank Schilder <frans@xxxxxx>
>>>> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>
>>>>> On Oct 26, 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>>>>>
>>>>>> I’ve never seen anything that implies that lead OSDs within an
>>>>>> acting set are a function of CRUSH rule ordering.
>>>>>
>>>>> This is actually a good question. I believed that I had seen/heard
>>>>> that somewhere, but I might be wrong.
>>>>>
>>>>> Looking at the definition of a PG, it states that a PG is an ordered
>>>>> set of OSD (IDs) and the first up OSD will be the primary. In other
>>>>> words, it seems that the lowest OSD ID is decisive. If the SSDs were
>>>>> deployed before the HDDs, they have the smallest IDs and, hence,
>>>>> will be preferred as primary OSDs.
>>>>
>>>> I don’t think this is correct. From my experiments, using the
>>>> previously mentioned CRUSH rule, no matter what the IDs of the SSD
>>>> OSDs are, the primary OSDs are always on SSD.
>>>>
>>>> I also had a look at the code. If I understand it correctly:
>>>>
>>>> * If the default primary affinity is not changed, the logic around
>>>>   primary affinity is skipped, and the primary is the first OSD
>>>>   returned by the CRUSH algorithm [1].
>>>> * The order of OSDs returned by CRUSH still matters if you change the
>>>>   primary affinity. The affinity represents the probability of a test
>>>>   succeeding. The first OSD is tested first and thus has the highest
>>>>   probability of becoming primary [2].
>>>>   * If an OSD has primary affinity = 1.0, the test always succeeds,
>>>>     and no OSD after it can become primary.
>>>>   * Suppose CRUSH returns 3 OSDs, each with primary affinity 0.5.
>>>>     Then the 2nd OSD has probability 0.25 of being primary and the
>>>>     3rd 0.125; otherwise, the 1st is primary.
>>>>   * If no test succeeds (suppose all OSDs have affinity 0), the 1st
>>>>     OSD becomes primary as a fallback.
>>>>
>>>> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
>>>> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>>>>
>>>> So, setting the primary affinity of all SSD OSDs to 1.0 should be
>>>> sufficient for them to be the primaries in my case.
>>>>
>>>> Do you think I should contribute these findings to the documentation?
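To make the affinity step concrete, a minimal sketch (the loop and the
device-class names are assumptions, not from the thread; very old
releases may additionally require mon_osd_allow_primary_affinity):

  # Sketch: make SSD OSDs the preferred primaries and exclude HDD OSDs.
  # Assumes SSDs/HDDs are in the standard "ssd"/"hdd" device classes.
  for id in $(ceph osd crush class ls-osd ssd); do
      ceph osd primary-affinity "osd.${id}" 1.0
  done
  for id in $(ceph osd crush class ls-osd hdd); do
      ceph osd primary-affinity "osd.${id}" 0
  done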
>>>>> This, however, is not a sustainable situation. Any addition of OSDs
>>>>> will mess this up and the distribution scheme will fail in the
>>>>> future. A way out seems to be:
>>>>>
>>>>> - subdivide your HDD storage using device classes:
>>>>>   * define a device class for HDDs with primary affinity = 0; for
>>>>>     example, pick 5 HDDs and change their device class to hdd_np
>>>>>     (for "no primary")
>>>>>   * set the primary affinity of these HDD OSDs to 0
>>>>>   * modify your CRUSH rule to use "step take default class hdd_np"
>>>>>   * this will create a pool with primaries on SSD and balanced
>>>>>     storage distribution between SSD and HDD
>>>>>   * all-HDD pools are deployed as usual on class hdd
>>>>>   * when increasing capacity, one needs to take care to add disks to
>>>>>     the hdd_np class and set their primary affinity to 0
>>>>>   * somewhat increased admin effort, but a fully working solution
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
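A sketch of that reclassification for a single OSD (the OSD id is
hypothetical; an OSD's existing class must be removed before a new one
can be set):

  # Move osd.12 into a custom "hdd_np" class and bar it from being primary.
  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class hdd_np osd.12
  ceph osd primary-affinity osd.12 0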
>>>>> ________________________________________
>>>>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>>>> Sent: 25 October 2020 17:07:15
>>>>> To: ceph-users@xxxxxxx
>>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>>
>>>>>> I'm not entirely sure if primary on SSD will actually make the read
>>>>>> happen on SSD.
>>>>>
>>>>> My understanding is that by default reads always happen from the
>>>>> lead OSD in the acting set. Octopus seems to (finally) have an
>>>>> option to spread the reads around, which IIRC defaults to false.
>>>>>
>>>>> I’ve never seen anything that implies that lead OSDs within an
>>>>> acting set are a function of CRUSH rule ordering. I’m not asserting
>>>>> that they aren’t, though; I’m … skeptical.
>>>>>
>>>>> Setting primary affinity would do the job, and you’d want to have
>>>>> cron continually update it across the cluster to react to topology
>>>>> changes. I was told of this strategy back in 2014, but haven’t
>>>>> personally seen it implemented.
>>>>>
>>>>> That said, HDDs are more of a bottleneck for writes than reads and
>>>>> just might be fine for your application. Tiny reads are going to
>>>>> limit you to some degree regardless of drive type, and you do
>>>>> mention throughput, not IOPS.
>>>>>
>>>>> I must echo Frank’s notes about capacity too. Ceph can do a lot of
>>>>> things, but that doesn’t mean something exotic is necessarily the
>>>>> best choice. You’re concerned about 3R only yielding 1/3 of raw
>>>>> capacity if using an all-SSD cluster, but the architecture you
>>>>> propose limits you anyway because of drive size. Consider also
>>>>> chassis, CPU, RAM, RU, and switch port costs, and the cost of you
>>>>> fussing over an exotic solution instead of the hundreds of other
>>>>> things in your backlog.
>>>>>
>>>>> And your cluster as described is *tiny*. Honestly I’d suggest
>>>>> considering one of these alternatives:
>>>>>
>>>>> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are
>>>>>   really promising for replacing HDDs for density in this kind of
>>>>>   application. You might even consider ARM if IOPS aren’t a concern.
>>>>> * An NVMeoF solution
>>>>>
>>>>> Cache tiers are “deprecated”, but then so are custom cluster names.
>>>>> Neither appears
>>>>>
>>>>>> For EC pools there is an option "fast_read"
>>>>>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
>>>>>> which states that a read will return as soon as the first k shards
>>>>>> have arrived. The default is to wait for all k+m shards (all
>>>>>> replicas). This option is not available for replicated pools.
>>>>>>
>>>>>> Now, I'm not sure if this option is unavailable for replicated
>>>>>> pools because the read will always be served by the acting primary,
>>>>>> or if it currently waits for all replicas. In the latter case,
>>>>>> reads will wait for the slowest device.
>>>>>>
>>>>>> I'm not sure if I interpret this correctly. I think you should test
>>>>>> the setup with HDD only and SSD+HDD to see if read speed improves.
>>>>>> Note that write speed will always depend on the slowest device.
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Frank Schilder <frans@xxxxxx>
>>>>>> Sent: 25 October 2020 15:03:16
>>>>>> To: 胡 玮文; Alexander E. Patrakov
>>>>>> Cc: ceph-users@xxxxxxx
>>>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>>>
>>>>>> A cache pool might be an alternative, heavily depending on how much
>>>>>> data is hot. However, you will then have much less SSD capacity
>>>>>> available, because it also requires replication.
>>>>>>
>>>>>> Looking at the setup, you have only 10*1T = 10T SSD but 20*6T =
>>>>>> 120T HDD, so you will probably run short of SSD capacity. Or,
>>>>>> looking at it the other way around, with copies on 1 SSD + 3 HDDs,
>>>>>> you will only be able to use about 30T out of the 120T HDD
>>>>>> capacity.
>>>>>>
>>>>>> With this replication, the usable storage will be 10T, and raw
>>>>>> usage will be 10T SSD and 30T HDD. If you can't do anything else
>>>>>> with the HDD space, you will need more SSDs. If your servers have
>>>>>> free disk slots, you can add SSDs over time until you have at least
>>>>>> 40T of SSD capacity to balance the SSD and HDD capacity.
>>>>>>
>>>>>> Personally, I think the 1 SSD + 3 HDD scheme is a good option
>>>>>> compared with a cache pool. You have the data security of 3-times
>>>>>> replication and, if everything is up, need only 1 copy on the SSDs,
>>>>>> which means that you have 3 times the cache capacity.
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>>>>>> Sent: 25 October 2020 13:40:55
>>>>>> To: Alexander E. Patrakov
>>>>>> Cc: ceph-users@xxxxxxx
>>>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>>>
>>>>>> Yes, this is the limitation of the CRUSH algorithm, in my mind. In
>>>>>> order to guard against 2 host failures, I’m going to use 4
>>>>>> replicas: 1 on SSD and 3 on HDD. This will work as intended, right?
>>>>>> Because at least I can ensure the 3 HDDs are on different hosts.
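For concreteness, a sketch of the pool setup this implies (the pool name,
PG count, and min_size are assumptions; the rule is the one quoted in the
original post below):

  # Sketch: replicated pool with 4 copies (1 SSD + 3 HDD) on the mixed rule.
  ceph osd pool create mixed 128 128 replicated mixed_replicated_rule
  ceph osd pool set mixed size 4
  ceph osd pool set mixed min_size 2   # assumption: keep serving I/O with 2 copies left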
>>>>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov
>>>>>>> <patrakov@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
>>>>>>> <huww98@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We are planning a new pool to store our dataset using CephFS.
>>>>>>>> These data are almost read-only (but not guaranteed) and consist
>>>>>>>> of a lot of small files. Each node in our cluster has 1 * 1T SSD
>>>>>>>> and 2 * 6T HDD, and we will deploy about 10 such nodes. We aim at
>>>>>>>> the highest read throughput.
>>>>>>>>
>>>>>>>> If we just use a replicated pool of size 3 on SSD, we should get
>>>>>>>> the best performance; however, that only leaves us 1/3 of the
>>>>>>>> usable SSD space. And EC pools are not friendly to such a
>>>>>>>> small-object read workload, I think.
>>>>>>>>
>>>>>>>> Now I’m evaluating a mixed SSD and HDD replication strategy.
>>>>>>>> Ideally, I want 3 data replicas, each on a different host
>>>>>>>> (failure domain): 1 of them on SSD, the other 2 on HDD. And
>>>>>>>> normally every read request is directed to the SSD. So, if every
>>>>>>>> SSD OSD is up, I’d expect the same read throughput as an all-SSD
>>>>>>>> deployment.
>>>>>>>>
>>>>>>>> I’ve read the documents and did some tests. Here is the CRUSH
>>>>>>>> rule I’m testing with:
>>>>>>>>
>>>>>>>> rule mixed_replicated_rule {
>>>>>>>>     id 3
>>>>>>>>     type replicated
>>>>>>>>     min_size 1
>>>>>>>>     max_size 10
>>>>>>>>     step take default class ssd
>>>>>>>>     step chooseleaf firstn 1 type host
>>>>>>>>     step emit
>>>>>>>>     step take default class hdd
>>>>>>>>     step chooseleaf firstn -1 type host
>>>>>>>>     step emit
>>>>>>>> }
>>>>>>>>
>>>>>>>> Now I have the following conclusions, but I’m not very sure:
>>>>>>>>
>>>>>>>> * The first OSD produced by CRUSH will be the primary OSD (at
>>>>>>>>   least if I don’t change the “primary affinity”). So the above
>>>>>>>>   rule is guaranteed to map an SSD OSD as the primary in each PG,
>>>>>>>>   and every read request will be served from SSD if it is up.
>>>>>>>> * It is currently not possible to force the SSD and HDD OSDs to
>>>>>>>>   be chosen from different hosts. So, if I want to ensure data
>>>>>>>>   availability even if 2 hosts fail, I need to choose 1 SSD and 3
>>>>>>>>   HDD OSDs. That means setting the replication size to 4, instead
>>>>>>>>   of the ideal value 3, on the pool using the above CRUSH rule.
>>>>>>>>
>>>>>>>> Am I correct about the above statements? How would this work in
>>>>>>>> your experience? Thanks.
>>>>>>>
>>>>>>> This works (i.e. guards against host failures) only if you have
>>>>>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>>>>>> I.e., there should be no host that has both; otherwise there is a
>>>>>>> chance that one HDD and one SSD from that host will be picked.
>>>>>>>
>>>>>>> --
>>>>>>> Alexander E. Patrakov
>>>>>>> CV: http://pc.cd/PLz7
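One way to sanity-check such a rule offline is to simulate its mappings
with crushtool (a sketch; the file name is arbitrary, and rule id 3
matches the rule quoted above):

  # Sketch: dump the live CRUSH map and simulate the rule's placements.
  ceph osd getcrushmap -o crush.bin
  crushtool -i crush.bin --test --rule 3 --num-rep 4 --show-mappings
  # The first OSD in each emitted mapping should be an SSD OSD.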
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx