Is it FileStore or BlueStore? With this SSD+HDD solution, is the
journal or WAL/DB on SSD or on HDD? My understanding is that there is
no benefit to putting the journal or WAL/DB on SSD with such a
solution. Keeping them on HDD would also eliminate the single point of
failure of having all WAL/DBs on one SSD. Just want to confirm.

Another thought is to have separate pools, e.g. an all-SSD pool and an
all-HDD pool, each used for a different purpose. For example, images,
backups and objects could live in the all-HDD pool, and VM volumes in
the all-SSD pool.

Thanks!
Tony

> -----Original Message-----
> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> Sent: Monday, October 26, 2020 9:20 AM
> To: Frank Schilder <frans@xxxxxx>
> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>
> On 26 Oct 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>
> >> I've never seen anything that implies that lead OSDs within an
> >> acting set are a function of CRUSH rule ordering.
> >
> > This is actually a good question. I believed that I had seen/heard
> > that somewhere, but I might be wrong.
> >
> > Looking at the definition of a PG, it states that a PG is an
> > ordered set of OSD (IDs) and that the first up OSD will be the
> > primary. In other words, it seems that the lowest OSD ID is
> > decisive. If the SSDs were deployed before the HDDs, they have the
> > smallest IDs and, hence, will be preferred as primary OSDs.
>
> I don't think this is correct. From my experiments with the
> previously mentioned CRUSH rule, no matter what the IDs of the SSD
> OSDs are, the primary OSDs are always the SSDs.
>
> I also had a look at the code. If I understand it correctly:
>
> * If the default primary affinity is unchanged, the primary-affinity
> logic is skipped entirely, and the primary is simply the first OSD
> returned by the CRUSH algorithm [1].
> * The order of OSDs returned by CRUSH still matters if you change
> the primary affinity. The affinity is the probability that a test
> succeeds. The first OSD is tested first and therefore has the
> highest probability of becoming primary [2].
> * If any OSD has primary affinity = 1.0, its test always succeeds,
> and no OSD after it can ever become primary.
> * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5.
> Then the 2nd OSD has probability 0.25 of being primary and the 3rd
> has probability 0.125; otherwise, the 1st is primary.
> * If no test succeeds (suppose all OSDs have affinity 0), the 1st
> OSD becomes primary as a fallback.
>
> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>
> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> sufficient for them to be the primaries in my case.
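>
> For example, something like this (an untested sketch; osd.0 through
> osd.9 are placeholders for whatever IDs your SSD OSDs actually
> have):
>
>     # pin every SSD OSD at full primary affinity
>     for id in 0 1 2 3 4 5 6 7 8 9; do
>         ceph osd primary-affinity osd.$id 1.0
>     done
>
> Alternatively, setting the affinity of the HDD OSDs to 0 would keep
> them from ever being chosen as primary, regardless of CRUSH
> ordering.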
>
> Do you think I should contribute these findings to the
> documentation?
>
> > This, however, is not a sustainable situation. Any addition of
> > OSDs will mess this up and the distribution scheme will fail in
> > the future. A way out seems to be:
> >
> > - subdivide your HDD storage using device classes:
> >   * define a device class for HDDs with primary affinity = 0; for
> >     example, pick 5 HDDs and change their device class to hdd_np
> >     (for "no primary")
> >   * set the primary affinity of these HDD OSDs to 0
> >   * modify your crush rule to use "step take default class hdd_np"
> >   * this will create a pool with primaries on SSD and a balanced
> >     storage distribution between SSD and HDD
> >   * all-HDD pools are deployed as usual on class hdd
> >   * when increasing capacity, one needs to take care to add disks
> >     to the hdd_np class and set their primary affinity to 0
> >   * somewhat increased admin effort, but a fully working solution
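> >
> > A rough sketch of the mechanics (untested; osd.20 stands in for
> > each HDD OSD you move into the new class):
> >
> >   # re-classify one HDD OSD and make sure it is never primary
> >   ceph osd crush rm-device-class osd.20
> >   ceph osd crush set-device-class hdd_np osd.20
> >   ceph osd primary-affinity osd.20 0
> >
> > The mixed crush rule posted earlier in this thread would then take
> > its HDD copies from the new class:
> >
> >   step take default class ssd
> >   step chooseleaf firstn 1 type host
> >   step emit
> >   step take default class hdd_np
> >   step chooseleaf firstn -1 type host
> >   step emit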
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > Sent: 25 October 2020 17:07:15
> > To: ceph-users@xxxxxxx
> > Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> >
> >> I'm not entirely sure if primary on SSD will actually make the
> >> read happen on SSD.
> >
> > My understanding is that by default reads always happen from the
> > lead OSD in the acting set. Octopus seems to (finally) have an
> > option to spread the reads around, which IIRC defaults to false.
> >
> > I've never seen anything that implies that lead OSDs within an
> > acting set are a function of CRUSH rule ordering. I'm not
> > asserting that they aren't, though, but I'm … skeptical.
> >
> > Setting primary affinity would do the job, and you'd want to have
> > cron continually update it across the cluster to react to topology
> > changes. I was told of this strategy back in 2014, but haven't
> > personally seen it implemented.
> >
> > That said, HDDs are more of a bottleneck for writes than for
> > reads, and might be just fine for your application. Tiny reads are
> > going to limit you to some degree regardless of drive type, and
> > you do mention throughput, not IOPS.
> >
> > I must echo Frank's notes about capacity too. Ceph can do a lot of
> > things, but that doesn't mean something exotic is necessarily the
> > best choice. You're concerned about 3R only yielding 1/3 of raw
> > capacity in an all-SSD cluster, but the architecture you propose
> > limits you anyway because of drive sizes. Consider also chassis,
> > CPU, RAM, RU and switch-port costs, and the cost of you fussing
> > over an exotic solution instead of the hundreds of other things in
> > your backlog.
> >
> > And your cluster as described is *tiny*. Honestly, I'd suggest
> > considering one of these alternatives:
> >
> > * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are
> > really promising for replacing HDDs for density in this kind of
> > application. You might even consider ARM if IOPS aren't a concern.
> > * An NVMeoF solution
> >
> > Cache tiers are "deprecated", but then so are custom cluster
> > names. Neither appears
> >
> >> For EC pools there is an option "fast_read"
> >> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
> >> which states that a read will return as soon as the first k
> >> shards have arrived. The default is to wait for all k+m shards
> >> (all replicas). This option is not available for replicated
> >> pools.
> >>
> >> Now, I am not sure whether this option is unavailable for
> >> replicated pools because a read is always served by the acting
> >> primary, or whether a read currently waits for all replicas. In
> >> the latter case, reads will wait for the slowest device.
> >>
> >> I'm not sure I interpret this correctly. I think you should test
> >> the setup with HDD only and with SSD+HDD to see if read speed
> >> improves. Note that write speed will always depend on the slowest
> >> device.
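> >>
> >> If you want to experiment with it on an EC pool, it is a per-pool
> >> setting (the pool name here is hypothetical):
> >>
> >>   ceph osd pool set ec_data fast_read 1
> >>   ceph osd pool get ec_data fast_read    # verify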
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: Frank Schilder <frans@xxxxxx>
> >> Sent: 25 October 2020 15:03:16
> >> To: 胡 玮文; Alexander E. Patrakov
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> >>
> >> A cache pool might be an alternative, heavily depending on how
> >> much data is hot. However, you would then have much less SSD
> >> capacity available, because the cache also requires replication.
> >>
> >> Looking at your setup, you have only 10*1T = 10T of SSD but
> >> 20*6T = 120T of HDD, so you will probably run short of SSD
> >> capacity. Or, looking at it the other way around, with copies on
> >> 1 SSD + 3 HDDs you will only be able to use about 30T out of the
> >> 120T of HDD capacity.
> >>
> >> With this replication, the usable storage will be 10T, and the
> >> raw usage will be 10T of SSD and 30T of HDD. If you can't do
> >> anything else with the remaining HDD space, you will need more
> >> SSDs. If your servers have free disk slots, you can add SSDs over
> >> time until you have at least 40T of SSD capacity to balance SSD
> >> and HDD capacity.
> >>
> >> Personally, I think 1 SSD + 3 HDDs is a good option compared with
> >> a cache pool. You have the data security of 3-times replication
> >> and, if everything is up, need only 1 copy in the SSD "cache",
> >> which means that you have 3 times the cache capacity.
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> >> Sent: 25 October 2020 13:40:55
> >> To: Alexander E. Patrakov
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> >>
> >> Yes. This is the limitation of the CRUSH algorithm, in my mind.
> >> In order to guard against 2 host failures, I'm going to use 4
> >> replicas: 1 on SSD and 3 on HDD. This should work as intended,
> >> right? Because at least I can ensure that the 3 HDDs are on
> >> different hosts.
> >>
> >> On 25 Oct 2020, at 20:04, Alexander E. Patrakov
> >> <patrakov@xxxxxxxxx> wrote:
> >>
> >>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
> >>> <huww98@xxxxxxxxxxx> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> We are planning a new pool to store our dataset using CephFS.
> >>>> These data are almost read-only (but not guaranteed) and
> >>>> consist of a lot of small files. Each node in our cluster has
> >>>> 1 * 1T SSD and 2 * 6T HDDs, and we will deploy about 10 such
> >>>> nodes. We aim at getting the highest read throughput.
> >>>>
> >>>> If we just use a replicated pool of size 3 on SSD, we should
> >>>> get the best performance; however, that leaves us only 1/3 of
> >>>> the SSD space usable. And EC pools are not friendly to such a
> >>>> small-file read workload, I think.
> >>>>
> >>>> Now I'm evaluating a mixed SSD and HDD replication strategy.
> >>>> Ideally, I want 3 data replicas, each on a different host (the
> >>>> failure domain): 1 of them on SSD, the other 2 on HDD, with
> >>>> every read request normally directed to the SSD. So, if every
> >>>> SSD OSD is up, I'd expect the same read throughput as an
> >>>> all-SSD deployment.
> >>>>
> >>>> I've read the documents and did some tests. Here is the crush
> >>>> rule I'm testing with:
> >>>>
> >>>> rule mixed_replicated_rule {
> >>>>     id 3
> >>>>     type replicated
> >>>>     min_size 1
> >>>>     max_size 10
> >>>>     step take default class ssd
> >>>>     step chooseleaf firstn 1 type host
> >>>>     step emit
> >>>>     step take default class hdd
> >>>>     step chooseleaf firstn -1 type host
> >>>>     step emit
> >>>> }
> >>>>
> >>>> Now I have the following conclusions, but I'm not very sure:
> >>>> * The first OSD produced by CRUSH will be the primary OSD (at
> >>>> least if I don't change the "primary affinity"). So the above
> >>>> rule is guaranteed to map an SSD OSD as the primary in each PG,
> >>>> and every read request will be served from SSD if it is up.
> >>>> * It is currently not possible to enforce that the SSD and HDD
> >>>> OSDs are chosen from different hosts. So, if I want to ensure
> >>>> data availability even if 2 hosts fail, I need to choose 1 SSD
> >>>> and 3 HDD OSDs. That means setting the replication size to 4,
> >>>> instead of the ideal value 3, on the pool using the above crush
> >>>> rule.
> >>>>
> >>>> Am I correct about the above statements? How would this work in
> >>>> your experience? Thanks.
> >>>
> >>> This works (i.e. guards against host failures) only if you have
> >>> strictly separate sets of hosts that have SSDs and hosts that
> >>> have HDDs. I.e., there should be no host that has both;
> >>> otherwise there is a chance that one HDD and one SSD from that
> >>> host will be picked.
> >>>
> >>> --
> >>> Alexander E. Patrakov
> >>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx