Is it FileStore or BlueStore? With this SSD+HDD solution, is the
journal or WAL/DB on SSD or on HDD? My understanding is that there is
no benefit to putting the journal or WAL/DB on SSD with such a
solution. Keeping them on HDD would also eliminate the single point of
failure of having all WAL/DBs on one SSD. Just want to confirm.

Another thought is to have separate pools, e.g. an all-SSD pool and an
all-HDD pool, each used for a different purpose. For example, images,
backups and objects could live in the all-HDD pool, and VM volumes in
the all-SSD pool.

Thanks!
Tony

> -----Original Message-----
> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> Sent: Monday, October 26, 2020 9:20 AM
> To: Frank Schilder <frans@xxxxxx>
> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>
> On 26 Oct 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>
> >> I've never seen anything that implies that lead OSDs within an
> >> acting set are a function of CRUSH rule ordering.
> >
> > This is actually a good question. I believed that I had seen/heard
> > that somewhere, but I might be wrong.
> >
> > Looking at the definition of a PG, it states that a PG is an
> > ordered set of OSD (IDs) and that the first up OSD will be the
> > primary. In other words, it seems that the lowest OSD ID is
> > decisive. If the SSDs were deployed before the HDDs, they have the
> > smallest IDs and, hence, will be preferred as primary OSDs.
>
> I don't think this is correct. From my experiments with the
> previously mentioned CRUSH rule, no matter what the IDs of the SSD
> OSDs are, the primary OSDs are always the SSDs.
>
> I also had a look at the code. If I understand it correctly:
>
> * If the default primary affinity is unchanged, the primary-affinity
> logic is skipped entirely, and the primary is simply the first OSD
> returned by the CRUSH algorithm [1].
> * The order of OSDs returned by CRUSH still matters if you change
> the primary affinity. The affinity is the probability that a test
> succeeds. The first OSD is tested first and therefore has the
> highest probability of becoming primary [2].
> * If any OSD has primary affinity = 1.0, its test always succeeds,
> and no OSD after it can ever become primary.
> * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5.
> Then the 2nd OSD has probability 0.25 of being primary and the 3rd
> has probability 0.125; otherwise, the 1st is primary.
> * If no test succeeds (suppose all OSDs have affinity 0), the 1st
> OSD becomes primary as a fallback.
>
> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>
> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> sufficient for them to be the primaries in my case.
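>
> For example, something like this (an untested sketch; osd.0 through
> osd.9 are placeholders for whatever IDs your SSD OSDs actually
> have):
>
>     # pin every SSD OSD at full primary affinity
>     for id in 0 1 2 3 4 5 6 7 8 9; do
>         ceph osd primary-affinity osd.$id 1.0
>     done
>
> Alternatively, setting the affinity of the HDD OSDs to 0 would keep
> them from ever being chosen as primary, regardless of CRUSH
> ordering.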
>
> Do you think I should contribute these findings to the
> documentation?
>
> > This, however, is not a sustainable situation. Any addition of
> > OSDs will mess this up and the distribution scheme will fail in
> > the future. A way out seems to be:
> >
> > - subdivide your HDD storage using device classes:
> >   * define a device class for HDDs with primary affinity = 0; for
> >     example, pick 5 HDDs and change their device class to hdd_np
> >     (for "no primary")
> >   * set the primary affinity of these HDD OSDs to 0
> >   * modify your crush rule to use "step take default class hdd_np"
> >   * this will create a pool with primaries on SSD and a balanced
> >     storage distribution between SSD and HDD
> >   * all-HDD pools are deployed as usual on class hdd
> >   * when increasing capacity, one needs to take care to add disks
> >     to the hdd_np class and set their primary affinity to 0
> >   * somewhat increased admin effort, but a fully working solution
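> >
> > A rough sketch of the mechanics (untested; osd.20 stands in for
> > each HDD OSD you move into the new class):
> >
> >   # re-classify one HDD OSD and make sure it is never primary
> >   ceph osd crush rm-device-class osd.20
> >   ceph osd crush set-device-class hdd_np osd.20
> >   ceph osd primary-affinity osd.20 0
> >
> > The mixed crush rule posted earlier in this thread would then take
> > its HDD copies from the new class:
> >
> >   step take default class ssd
> >   step chooseleaf firstn 1 type host
> >   step emit
> >   step take default class hdd_np
> >   step chooseleaf firstn -1 type host
> >   step emit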
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > Sent: 25 October 2020 17:07:15
> > To: ceph-users@xxxxxxx
> > Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> >
> >> I'm not entirely sure if primary on SSD will actually make the
> >> read happen on SSD.
> >
> > My understanding is that by default reads always happen from the
> > lead OSD in the acting set. Octopus seems to (finally) have an
> > option to spread the reads around, which IIRC defaults to false.
> >
> > I've never seen anything that implies that lead OSDs within an
> > acting set are a function of CRUSH rule ordering. I'm not
> > asserting that they aren't, though, but I'm … skeptical.
> >
> > Setting primary affinity would do the job, and you'd want to have
> > cron continually update it across the cluster to react to topology
> > changes. I was told of this strategy back in 2014, but haven't
> > personally seen it implemented.
> >
> > That said, HDDs are more of a bottleneck for writes than for
> > reads, and might be just fine for your application. Tiny reads are
> > going to limit you to some degree regardless of drive type, and
> > you do mention throughput, not IOPS.
> >
> > I must echo Frank's notes about capacity too. Ceph can do a lot of
> > things, but that doesn't mean something exotic is necessarily the
> > best choice. You're concerned about 3R only yielding 1/3 of raw
> > capacity in an all-SSD cluster, but the architecture you propose
> > limits you anyway because of drive sizes. Consider also chassis,
> > CPU, RAM, RU and switch-port costs, and the cost of you fussing
> > over an exotic solution instead of the hundreds of other things in
> > your backlog.
> >
> > And your cluster as described is *tiny*. Honestly, I'd suggest
> > considering one of these alternatives:
> >
> > * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are
> > really promising for replacing HDDs for density in this kind of
> > application. You might even consider ARM if IOPS aren't a concern.
> > * An NVMeoF solution
> >
> > Cache tiers are "deprecated", but then so are custom cluster
> > names. Neither appears
> >
> >> For EC pools there is an option "fast_read"
> >> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
> >> which states that a read will return as soon as the first k
> >> shards have arrived. The default is to wait for all k+m shards
> >> (all replicas). This option is not available for replicated
> >> pools.
> >>
> >> Now, I am not sure whether this option is unavailable for
> >> replicated pools because a read is always served by the acting
> >> primary, or whether a read currently waits for all replicas. In
> >> the latter case, reads will wait for the slowest device.
> >>
> >> I'm not sure I interpret this correctly. I think you should test
> >> the setup with HDD only and with SSD+HDD to see if read speed
> >> improves. Note that write speed will always depend on the slowest
> >> device.
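> >>
> >> If you want to experiment with it on an EC pool, it is a per-pool
> >> setting (the pool name here is hypothetical):
> >>
> >>   ceph osd pool set ec_data fast_read 1
> >>   ceph osd pool get ec_data fast_read    # verify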
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: Frank Schilder <frans@xxxxxx>
> >> Sent: 25 October 2020 15:03:16
> >> To: 胡 玮文; Alexander E. Patrakov
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> >>
> >> A cache pool might be an alternative, heavily depending on how
> >> much data is hot. However, you would then have much less SSD
> >> capacity available, because the cache also requires replication.
> >>
> >> Looking at your setup, you have only 10*1T = 10T of SSD but
> >> 20*6T = 120T of HDD, so you will probably run short of SSD
> >> capacity. Or, looking at it the other way around, with copies on
> >> 1 SSD + 3 HDDs you will only be able to use about 30T out of the
> >> 120T of HDD capacity.
> >>
> >> With this replication, the usable storage will be 10T, and the
> >> raw usage will be 10T of SSD and 30T of HDD. If you can't do
> >> anything else with the remaining HDD space, you will need more
> >> SSDs. If your servers have free disk slots, you can add SSDs over
> >> time until you have at least 40T of SSD capacity to balance SSD
> >> and HDD capacity.
> >>
> >> Personally, I think 1 SSD + 3 HDDs is a good option compared with
> >> a cache pool. You have the data security of 3-times replication
> >> and, if everything is up, need only 1 copy in the SSD "cache",
> >> which means that you have 3 times the cache capacity.
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> >> Sent: 25 October 2020 13:40:55
> >> To: Alexander E. Patrakov
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> >>
> >> Yes. This is the limitation of the CRUSH algorithm, in my mind.
> >> In order to guard against 2 host failures, I'm going to use 4
> >> replicas: 1 on SSD and 3 on HDD. This should work as intended,
> >> right? Because at least I can ensure that the 3 HDDs are on
> >> different hosts.
> >>
> >> On 25 Oct 2020, at 20:04, Alexander E. Patrakov
> >> <patrakov@xxxxxxxxx> wrote:
> >>
> >>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
> >>> <huww98@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> We are planning a new pool to store our dataset using CephFS.
> >>>> These data are almost read-only (but not guaranteed) and
> >>>> consist of a lot of small files. Each node in our cluster has
> >>>> 1 * 1T SSD and 2 * 6T HDDs, and we will deploy about 10 such
> >>>> nodes. We aim at getting the highest read throughput.
> >>>>
> >>>> If we just use a replicated pool of size 3 on SSD, we should
> >>>> get the best performance; however, that leaves us only 1/3 of
> >>>> the SSD space usable. And EC pools are not friendly to such a
> >>>> small-file read workload, I think.
> >>>>
> >>>> Now I'm evaluating a mixed SSD and HDD replication strategy.
> >>>> Ideally, I want 3 data replicas, each on a different host (the
> >>>> failure domain): 1 of them on SSD, the other 2 on HDD, with
> >>>> every read request normally directed to the SSD. So, if every
> >>>> SSD OSD is up, I'd expect the same read throughput as an
> >>>> all-SSD deployment.
> >>>>
> >>>> I've read the documents and did some tests. Here is the crush
> >>>> rule I'm testing with:
> >>>>
> >>>> rule mixed_replicated_rule {
> >>>>     id 3
> >>>>     type replicated
> >>>>     min_size 1
> >>>>     max_size 10
> >>>>     step take default class ssd
> >>>>     step chooseleaf firstn 1 type host
> >>>>     step emit
> >>>>     step take default class hdd
> >>>>     step chooseleaf firstn -1 type host
> >>>>     step emit
> >>>> }
> >>>>
> >>>> Now I have the following conclusions, but I'm not very sure:
> >>>> * The first OSD produced by CRUSH will be the primary OSD (at
> >>>> least if I don't change the "primary affinity"). So the above
> >>>> rule is guaranteed to map an SSD OSD as the primary in each PG,
> >>>> and every read request will be served from SSD if it is up.
> >>>> * It is currently not possible to enforce that the SSD and HDD
> >>>> OSDs are chosen from different hosts. So, if I want to ensure
> >>>> data availability even if 2 hosts fail, I need to choose 1 SSD
> >>>> and 3 HDD OSDs. That means setting the replication size to 4,
> >>>> instead of the ideal value 3, on the pool using the above crush
> >>>> rule.
> >>>>
> >>>> Am I correct about the above statements? How would this work in
> >>>> your experience? Thanks.
> >>>
> >>> This works (i.e. guards against host failures) only if you have
> >>> strictly separate sets of hosts that have SSDs and hosts that
> >>> have HDDs. I.e., there should be no host that has both;
> >>> otherwise there is a chance that one HDD and one SSD from that
> >>> host will be picked.
> >>>
> >>> --
> >>> Alexander E. Patrakov
> >>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx