> On Nov 8, 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
>
> Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
> or WAL/DB on SSD or HDD? My understanding is that there is no
> benefit to putting the journal or WAL/DB on SSD with such a solution. It would
> also eliminate the single point of failure of having all WAL/DBs
> on one SSD. Just want to confirm.

We are building a new cluster, so BlueStore. I think putting the WAL/DB on SSD is more about performance; how is it related to eliminating a single point of failure? I'm going to deploy the WAL/DB on SSD for my HDD OSDs, and of course just use a single device for the SSD OSDs.

> Another thought is to have separate pools, like an all-SSD pool and an
> all-HDD pool. Each pool would be used for a different purpose. For example,
> images, backups and objects can go in the all-HDD pool and VM volumes in
> the all-SSD pool.

Yes, I am thinking the same.

> Thanks!
> Tony

>> -----Original Message-----
>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>> Sent: Monday, October 26, 2020 9:20 AM
>> To: Frank Schilder <frans@xxxxxx>
>> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> On Oct 26, 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>>>
>>>> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering.
>>>
>>> This is actually a good question. I believed that I had seen/heard that somewhere, but I might be wrong.
>>>
>>> Looking at the definition of a PG, it states that a PG is an ordered set of OSD (IDs) and that the first up OSD will be the primary. In other words, it seems that the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they have the smallest IDs and will hence be preferred as primary OSDs.
>>
>> I don’t think this is correct. From my experiments with the previously mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are always on SSD.
>>
>> I also had a look at the code. If I understand it correctly:
>>
>> * If the default primary affinity is not changed, the primary-affinity logic is skipped entirely, and the primary is the first OSD returned by the CRUSH algorithm [1].
>> * The order of OSDs returned by CRUSH still matters if you change the primary affinity. The affinity is the probability that a test succeeds. The first OSD is tested first and therefore has the highest probability of becoming primary. [2]
>> * If any OSD has primary affinity = 1.0, its test always succeeds, and no OSD after it can ever become primary.
>> * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5. Then the 2nd OSD has a probability of 0.25 of becoming primary and the 3rd a probability of 0.125; otherwise the 1st will be primary.
>> * If no test succeeds (suppose all OSDs have an affinity of 0), the 1st OSD becomes primary as a fallback.
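For reference, primary affinity is set per OSD with the standard CLI. A minimal sketch of the mechanism described above; the OSD IDs are placeholders, not taken from this thread:

    # SSD OSD: affinity 1.0 means its test always succeeds, so it stays primary
    ceph osd primary-affinity osd.0 1.0
    # HDD OSD: affinity 0 means it is never chosen as primary while an SSD OSD is up
    ceph osd primary-affinity osd.7 0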
>>
>> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
>> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>>
>> So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient for them to be the primaries in my case.
>>
>> Do you think I should contribute these to the documentation?
>>
>>> This, however, is not a sustainable situation. Any addition of OSDs will mess this up and the distribution scheme will fail in the future. A way out seems to be:
>>>
>>> - subdivide your HDD storage using device classes:
>>>   * define a device class for HDDs with primary affinity = 0; for example, pick 5 HDDs and change their device class to hdd_np (for "no primary")
>>>   * set the primary affinity of these HDD OSDs to 0
>>>   * modify your crush rule to use "step take default class hdd_np"
>>>   * this will create a pool with primaries on SSD and a balanced storage distribution between SSD and HDD
>>>   * all-HDD pools are deployed as usual on class hdd
>>>   * when increasing capacity, one needs to take care of adding disks to the hdd_np class and setting their primary affinity to 0
>>>   * somewhat increased admin effort, but a fully working solution
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
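In command form, Frank's hdd_np recipe might look roughly like the sketch below. The OSD IDs are placeholders, and the rule change itself still has to be made in the CRUSH map:

    # move a few HDD OSDs into a dedicated "no primary" device class
    ceph osd crush rm-device-class osd.10 osd.11 osd.12
    ceph osd crush set-device-class hdd_np osd.10 osd.11 osd.12
    # make sure these OSDs are never picked as primary
    ceph osd primary-affinity osd.10 0
    ceph osd primary-affinity osd.11 0
    ceph osd primary-affinity osd.12 0
    # the HDD part of the mixed rule would then use: step take default class hdd_np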
>>> ________________________________________
>>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>> Sent: 25 October 2020 17:07:15
>>> To: ceph-users@xxxxxxx
>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>
>>>> I'm not entirely sure if primary on SSD will actually make the read happen on SSD.
>>>
>>> My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false.
>>>
>>> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering. I’m not asserting that they aren’t, but I’m … skeptical.
>>>
>>> Setting primary affinity would do the job, and you’d want cron to continually update it across the cluster to react to topology changes. I was told of this strategy back in 2014, but haven’t personally seen it implemented.
>>>
>>> That said, HDDs are more of a bottleneck for writes than for reads and might just be fine for your application. Tiny reads are going to limit you to some degree regardless of drive type, and you do mention throughput, not IOPS.
>>>
>>> I must echo Frank’s notes about capacity too. Ceph can do a lot of things, but that doesn’t mean something exotic is necessarily the best choice. You’re concerned about 3R only yielding 1/3 of raw capacity if using an all-SSD cluster, but the architecture you propose limits you anyway because of drive size. Consider chassis, CPU, RAM, RU, and switch-port costs as well, and the cost of you fussing over an exotic solution instead of the hundreds of other things in your backlog.
>>>
>>> And your cluster as described is *tiny*. Honestly I’d suggest considering one of these alternatives:
>>>
>>> * Ditch the HDDs and use QLC flash. The emerging EDSFF drives are really promising for replacing HDDs for density in this kind of application. You might even consider ARM if IOPS aren’t a concern.
>>> * An NVMeoF solution
>>>
>>> Cache tiers are “deprecated”, but then so are custom cluster names. Neither appears
>>>
>>>> For EC pools there is an option "fast_read" (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values), which states that a read will return as soon as the first k shards have arrived. The default is to wait for all k+m shards (all replicas). This option is not available for replicated pools.
>>>>
>>>> Now, I am not sure whether this option is unavailable for replicated pools because the read will always be served by the acting primary, or because it currently waits for all replicas. In the latter case, reads will wait for the slowest device.
>>>>
>>>> I'm not sure if I interpret this correctly. I think you should test the setup with HDD only and with SSD+HDD to see if read speed improves. Note that write speed will always depend on the slowest device.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Frank Schilder <frans@xxxxxx>
>>>> Sent: 25 October 2020 15:03:16
>>>> To: 胡 玮文; Alexander E. Patrakov
>>>> Cc: ceph-users@xxxxxxx
>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>
>>>> A cache pool might be an alternative, heavily depending on how much data is hot. However, you would then have much less SSD capacity available, because the cache also requires replication.
>>>>
>>>> Looking at the setup, you have only 10*1T = 10T of SSD but 20*6T = 120T of HDD, so you will probably run short of SSD capacity. Or, looking at it the other way around, with copies on 1 SSD + 3 HDDs you will only be able to use about 30T out of the 120T of HDD capacity.
>>>>
>>>> With this replication, the usable storage will be 10T, and the raw usage will be 10T of SSD and 30T of HDD. If you can't do anything else with the HDD space, you will need more SSDs. If your servers have free disk slots, you can add SSDs over time until you have at least 40T of SSD capacity to balance SSD and HDD capacity.
>>>>
>>>> Personally, I think 1 SSD + 3 HDD is a good option compared with a cache pool. You have the data security of 3-times replication and, if everything is up, you need only 1 copy in the SSD "cache", which means you have 3 times the cache capacity.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
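As a side note, the capacity balance Frank works out above can be watched directly on the cluster; recent releases break raw usage down by device class (ssd vs hdd):

    # raw and per-pool usage, split by device class
    ceph df detail
    # per-OSD utilisation, grouped by the CRUSH tree
    ceph osd df tree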
>>>> ________________________________________
>>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>>>> Sent: 25 October 2020 13:40:55
>>>> To: Alexander E. Patrakov
>>>> Cc: ceph-users@xxxxxxx
>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>
>>>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? Because at least I can ensure the 3 HDDs are on different hosts.
>>>>
>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
>>>>
>>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx <huww98@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are planning a new pool to store our dataset using CephFS. The data is almost read-only (but not guaranteed to be) and consists of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDDs, and we will deploy about 10 such nodes. We are aiming for the highest read throughput.
>>>>>>
>>>>>> If we just used a replicated pool of size 3 on SSD, we should get the best performance; however, that would leave us only 1/3 of the SSD space as usable. And EC pools are not friendly to such a small-object read workload, I think.
>>>>>>
>>>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD, with every read request normally directed to the SSD. So, if every SSD OSD is up, I’d expect the same read throughput as an all-SSD deployment.
>>>>>>
>>>>>> I’ve read the documents and did some tests. Here is the crush rule I’m testing with:
>>>>>>
>>>>>> rule mixed_replicated_rule {
>>>>>>     id 3
>>>>>>     type replicated
>>>>>>     min_size 1
>>>>>>     max_size 10
>>>>>>     step take default class ssd
>>>>>>     step chooseleaf firstn 1 type host
>>>>>>     step emit
>>>>>>     step take default class hdd
>>>>>>     step chooseleaf firstn -1 type host
>>>>>>     step emit
>>>>>> }
>>>>>>
>>>>>> Now I have the following conclusions, but I’m not very sure:
>>>>>>
>>>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the “primary affinity”). So the above rule is guaranteed to map an SSD OSD as the primary of each PG, and every read request will be served from SSD if it is up.
>>>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value 3, on the pool using the above crush rule.
>>>>>>
>>>>>> Am I correct about the above statements? How would this work in your experience? Thanks.
>>>>>
>>>>> This works (i.e. guards against host failures) only if you have strictly separate sets of hosts that have SSDs and hosts that have HDDs. I.e., there should be no host that has both; otherwise there is a chance that one HDD and one SSD from the same host will be picked.
>>>>>
>>>>> --
>>>>> Alexander E. Patrakov
>>>>> CV: http://pc.cd/PLz7
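For completeness, a rough sketch of how a rule like the one quoted above could be installed and attached to a pool. The pool name "cephfs_data" and the file names are placeholders, and the rule text itself is added by hand while editing the decompiled map:

    # export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt to add mixed_replicated_rule, then recompile and inject it
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    # point the pool at the rule and use 4 copies (1 SSD + 3 HDD), as discussed
    ceph osd pool set cephfs_data crush_rule mixed_replicated_rule
    ceph osd pool set cephfs_data size 4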
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx