Re: The feasibility of mixed SSD and HDD replicated pool

Sorry for the confusion. What I meant to say is that having all WAL/DB
on one SSD creates a single point of failure: if that SSD goes
down, all OSDs depending on it will also stop working.

What I'd like to confirm is that there is no benefit to putting WAL/DB
on SSD when there is either a cache tier or a primary SSD with HDDs
for the replicas, and that distributing WAL/DB onto each HDD eliminates
that single point of failure.

So in your case, with SSD as the primary OSD, do you put WAL/DB on
an SSD for the secondary HDDs, or just distribute it to each HDD?


Thanks!
Tony
> -----Original Message-----
> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> Sent: Sunday, November 8, 2020 5:47 AM
> To: Tony Liu <tonyliu0592@xxxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Re: The feasibility of mixed SSD and HDD
> replicated pool
> 
> 
> > On Nov 8, 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
> >
> > Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
> > or WAL/DB on SSD or HDD? My understanding is that there is no benefit
> > to putting the journal or WAL/DB on SSD with such a solution. Keeping it
> > on the HDDs also eliminates the single point of failure of having all
> > WAL/DB on one SSD. Just want to confirm.
> 
> We are building a new cluster, so BlueStore. I think putting WAL/DB on SSD
> is more about performance. How is this related to eliminating a single
> point of failure? I’m going to deploy WAL/DB on SSD for my HDD OSDs and,
> of course, just use a single device for the SSD OSDs.
> 
> > Another thought is to have separate pools, like an all-SSD pool and
> > an all-HDD pool. Each pool will be used for a different purpose. For
> > example, images, backups, and objects can be in the all-HDD pool and VM
> > volumes can be in the all-SSD pool.
> 
> Yes, I think the same.
> 
> > Thanks!
> > Tony
> >> -----Original Message-----
> >> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> >> Sent: Monday, October 26, 2020 9:20 AM
> >> To: Frank Schilder <frans@xxxxxx>
> >> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
> >> Subject:  Re: The feasibility of mixed SSD and HDD
> >> replicated pool
> >>
> >>
> >>>> On Oct 26, 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
> >>>
> >>>
> >>>> I’ve never seen anything that implies that lead OSDs within an
> >>>> acting
> >> set are a function of CRUSH rule ordering.
> >>>
> >>> This is actually a good question. I believed that I had seen/heard
> >> that somewhere, but I might be wrong.
> >>>
> >>> Looking at the definition of a PG, it states that a PG is an ordered
> >>> set of OSD (IDs) and the first up OSD will be the primary. In other
> >>> words, it seems that the lowest OSD ID is decisive. If the SSDs were
> >>> deployed before the HDDs, they have the smallest IDs and, hence, will
> >>> be preferred as primary OSDs.
> >>
> >> I don’t think this is correct. In my experiments with the previously
> >> mentioned CRUSH rule, the primary OSDs are always SSDs, no matter what
> >> the IDs of the SSD OSDs are.
> >>
> >> I also had a look at the code. If I understand it correctly:
> >>
> >> * If the default primary affinity is not changed, the logic
> >> about primary affinity is skipped, and the primary is the first
> >> OSD returned by the CRUSH algorithm [1].
> >>
> >> * The order of OSDs returned by CRUSH still matters if you change
> >> the primary affinity. The affinity represents the probability that a
> >> test succeeds. The first OSD is tested first, so it has a
> >> higher probability of becoming primary. [2]
> >>  * If any OSD has primary affinity = 1.0, its test always
> >> succeeds, and no OSD after it will ever be primary.
> >>  * Suppose CRUSH returns 3 OSDs, each with primary affinity set
> >> to 0.5. Then the 2nd OSD has a probability of 0.25 of being primary and
> >> the 3rd has a probability of 0.125. Otherwise (probability 0.625), the
> >> 1st will be primary.
> >>  * If no test succeeds (suppose all OSDs have an affinity of 0), the 1st
> >> OSD will be primary as a fallback.
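
The selection logic described in the bullets above can be sketched as a small probability calculation. This is only a model of that description, not the code in OSDMap.cc: the function name is made up for illustration, and it assumes each OSD's test is independent and succeeds with probability equal to its primary affinity.

```python
def primary_probabilities(affinities):
    """Probability that each CRUSH-ordered OSD ends up primary.

    Model (an assumption taken from the description above, not from
    the Ceph source): OSDs are tested in CRUSH order, each test
    succeeds with probability equal to that OSD's primary affinity,
    the first success wins, and the first OSD is the fallback when
    every test fails.
    """
    probs = [0.0] * len(affinities)
    all_failed_so_far = 1.0
    for position, affinity in enumerate(affinities):
        # This OSD wins only if every earlier test failed and its own passes.
        probs[position] = all_failed_so_far * affinity
        all_failed_so_far *= 1.0 - affinity
    # Fallback: no test succeeded, so the first OSD becomes primary.
    probs[0] += all_failed_so_far
    return probs


if __name__ == "__main__":
    print(primary_probabilities([0.5, 0.5, 0.5]))  # three OSDs at 0.5
    print(primary_probabilities([1.0, 0.3, 0.3]))  # SSD first at 1.0
    print(primary_probabilities([0.0, 0.0, 0.0]))  # everything at 0
```

With the SSD first in CRUSH order and its affinity at 1.0, the model puts all of the probability on the first position, which matches the observation that the primaries are always SSDs.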
> >>
> >> [1]:
> >> https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> >> [2]:
> >> https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
> >>
> >> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> >> sufficient for them to be the primary in my case.
> >>
> >> Do you think I should contribute this to the documentation?
> >>
> >>> This, however, is not a sustainable situation. Any addition of OSDs
> >>> will mess this up and the distribution scheme will fail in the
> >>> future. A way out seems to be:
> >>>
> >>> - subdivide your HDD storage using device classes:
> >>> * define a device class for HDDs with primary affinity=0, for
> >>> example, pick 5 HDDs and change their device class to hdd_np (for no
> >>> primary)
> >>> * set the primary affinity of these HDD OSDs to 0
> >>> * modify your crush rule to use "step take default class hdd_np"
> >>> * this will create a pool with primaries on SSD and balanced storage
> >>> distribution between SSD and HDD
> >>> * all-HDD pools deployed as usual on class hdd
> >>> * when increasing capacity, one needs to take care of adding disks
> >>> to hdd_np class and set their primary affinity to 0
> >>> * somewhat increased admin effort, but fully working solution
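
As a rough illustration of the admin steps in the list above, the helper below only builds the command strings for moving a set of HDD OSDs into the hdd_np class and zeroing their primary affinity. The function name and the OSD IDs are made up. The commands follow the `ceph osd crush rm-device-class`, `ceph osd crush set-device-class` and `ceph osd primary-affinity` forms as I understand them, so please check them against your release before running anything.

```python
def hdd_np_commands(osd_ids, device_class="hdd_np"):
    """Build (but do not run) the ceph CLI commands for each OSD.

    osd_ids is a hypothetical list of numeric OSD IDs; the class name
    hdd_np comes from the scheme described above.
    """
    commands = []
    for osd_id in osd_ids:
        name = f"osd.{osd_id}"
        # An existing device class has to be removed before a new one is set.
        commands.append(f"ceph osd crush rm-device-class {name}")
        commands.append(f"ceph osd crush set-device-class {device_class} {name}")
        # Primary affinity 0 keeps this OSD from being chosen as primary.
        commands.append(f"ceph osd primary-affinity {name} 0")
    return commands


if __name__ == "__main__":
    # Example IDs only; replace with the HDD OSDs picked for the mixed pool.
    for command in hdd_np_commands([12, 13]):
        print(command)
```

Printing the commands rather than executing them leaves room to review the list first, which seems prudent given that new disks have to be handled the same way every time capacity is added.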
> >>>
> >>> Best regards,
> >>> =================
> >>> Frank Schilder
> >>> AIT Risø Campus
> >>> Bygning 109, rum S14
> >>>
> >>> ________________________________________
> >>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> >>> Sent: 25 October 2020 17:07:15
> >>> To: ceph-users@xxxxxxx
> >>> Subject:  Re: The feasibility of mixed SSD and HDD
> >>> replicated pool
> >>>
> >>>> I'm not entirely sure if primary on SSD will actually make the read
> >> happen on SSD.
> >>>
> >>> My understanding is that by default reads always happen from the
> >>> lead
> >> OSD in the acting set.  Octopus seems to (finally) have an option to
> >> spread the reads around, which IIRC defaults to false.
> >>>
> >>> I’ve never seen anything that implies that lead OSDs within an
> >>> acting
> >> set are a function of CRUSH rule ordering. I’m not asserting that
> >> they aren’t though, but I’m … skeptical.
> >>>
> >>> Setting primary affinity would do the job, and you’d want to have
> >>> cron
> >> continually update it across the cluster to react to topology changes.
> >> I was told of this strategy back in 2014, but haven’t personally seen
> >> it implemented.
> >>>
> >>> That said, HDDs are more of a bottleneck for writes than reads and
> >> just might be fine for your application.  Tiny reads are going to
> >> limit you to some degree regardless of drive type, and you do mention
> >> throughput, not IOPS.
> >>>
> >>> I must echo Frank’s notes about capacity too.  Ceph can do a lot of
> >>> things, but that doesn’t mean something exotic is necessarily the
> >>> best choice.  You’re concerned about 3R only yielding 1/3 of raw
> >>> capacity if using an all-SSD cluster, but the architecture you
> >>> propose limits you anyway because of drive size. Consider also chassis,
> >>> CPU, RAM, RU, and switch port costs, and the cost of you fussing
> >>> over an exotic solution instead of the hundreds of other things in
> >>> your backlog.
> >>>
> >>> And your cluster as described is *tiny*.  Honestly I’d suggest
> >> considering one of these alternatives:
> >>>
> >>> * Ditch the HDDs, use QLC flash.  The emerging EDSFF drives are
> >>> really
> >> promising for replacing HDDs for density in this kind of application.
> >> You might even consider ARM if IOPs aren’t a concern.
> >>> * An NVMeoF solution
> >>>
> >>>
> >>> Cache tiers are “deprecated”, but then so are custom cluster names.
> >>> Neither appears
> >>>
> >>>> For EC pools there is an option "fast_read"
> >>>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
> >>>> which states that a read will return as soon as the
> >>>> first k shards have arrived. The default is to wait for all k+m
> >>>> shards (all replicas). This option is not available for replicated
> >>>> pools.
> >>>>
> >>>> Now, not sure if this option is not available for replicated pools
> >>>> because the read will always be served by the acting primary, or if
> >>>> it currently waits for all replicas. In the latter case, reads will
> >>>> wait for the slowest device.
> >>>>
> >>>> I'm not sure if I interpret this correctly. I think you should test
> >>>> the setup with HDD only and SSD+HDD to see if read speed improves.
> >>>> Note that write speed will always depend on the slowest device.
> >>>>
> >>>> Best regards,
> >>>> =================
> >>>> Frank Schilder
> >>>> AIT Risø Campus
> >>>> Bygning 109, rum S14
> >>>> ________________________________________
> >>>> From: Frank Schilder <frans@xxxxxx>
> >>>> Sent: 25 October 2020 15:03:16
> >>>> To: 胡 玮文; Alexander E. Patrakov
> >>>> Cc: ceph-users@xxxxxxx
> >>>> Subject:  Re: The feasibility of mixed SSD and HDD
> >>>> replicated pool
> >>>>
> >>>> A cache pool might be an alternative, heavily
> >>>> depending on how much data is hot. However, then you will have much
> >>>> less SSD capacity available, because it also requires replication.
> >>>>
> >>>> Looking at the setup, you have only 10*1T = 10T SSD but 20*6T =
> >>>> 120T HDD, so you will probably run short of SSD capacity. Or, looking at
> >>>> it the other way around, with copies on 1 SSD + 3 HDD, you will only be
> >>>> able to use about 30T out of 120T HDD capacity.
> >>>>
> >>>> With this replication, the usable storage will be 10T and raw used
> >>>> will be 10T SSD and 30T HDD. If you can't do anything else on the HDD
> >>>> space, you will need more SSDs. If your servers have more free disk
> >>>> slots, you can add SSDs over time until you have at least 40T SSD
> >>>> capacity to balance SSD and HDD capacity.
> >>>>
> >>>> Personally, I think the 1 SSD + 3 HDD is a good option compared with
> >>>> a cache pool. You have the data security of 3-times replication and, if
> >>>> everything is up, need only 1 copy in the SSD cache, which means that
> >>>> you have 3 times the cache capacity.
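
The capacity figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the function name is invented, and it ignores full ratios, uneven PG distribution and any other use of the HDD space.

```python
def mixed_pool_capacity(ssd_raw_tb, hdd_raw_tb, ssd_copies=1, hdd_copies=3):
    """Estimate capacity for a pool keeping ssd_copies on SSD and hdd_copies on HDD."""
    # Usable space is limited by whichever device class fills up first.
    usable_tb = min(ssd_raw_tb / ssd_copies, hdd_raw_tb / hdd_copies)
    return {
        "usable_tb": usable_tb,
        "ssd_raw_used_tb": usable_tb * ssd_copies,
        "hdd_raw_used_tb": usable_tb * hdd_copies,
        # SSD capacity at which both device classes fill up together.
        "ssd_raw_to_balance_tb": hdd_raw_tb / hdd_copies * ssd_copies,
    }


if __name__ == "__main__":
    # 10 x 1T SSD and 20 x 6T HDD, as in the setup discussed above.
    print(mixed_pool_capacity(ssd_raw_tb=10, hdd_raw_tb=120))
```

For the 10T SSD / 120T HDD setup this gives 10T usable, 30T of raw HDD in use, and 40T of SSD needed before the HDDs become the limit.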
> >>>> Best regards,
> >>>> =================
> >>>> Frank Schilder
> >>>> AIT Risø Campus
> >>>> Bygning 109, rum S14
> >>>> ________________________________________
> >>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> >>>> Sent: 25 October 2020 13:40:55
> >>>> To: Alexander E. Patrakov
> >>>> Cc: ceph-users@xxxxxxx
> >>>> Subject:  Re: The feasibility of mixed SSD and HDD
> >>>> replicated pool
> >>>>
> >>>> Yes. This is a limitation of the CRUSH algorithm, in my
> >>>> mind. In order to guard against 2 host failures, I’m going to use 4
> >>>> replicas, 1 on SSD and 3 on HDD. This will work as intended,
> >>>> right? Because at least I can ensure the 3 HDDs are from different hosts.
> >>>>
> >>>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov@xxxxxxxxx>
> >>>>>> wrote:
> >>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
> >> <huww98@xxxxxxxxxxx> wrote:
> >>>>>> Hi all,
> >>>>>> We are planning a new pool to store our dataset using CephFS.
> >>>>>> The data is almost read-only (but not guaranteed to be) and consists
> >>>>>> of a lot of small files. Each node in our cluster has 1 * 1T SSD and
> >>>>>> 2 * 6T HDD, and we will deploy about 10 such nodes. We aim to get the
> >>>>>> highest read throughput.
> >>>>>>
> >>>>>> If we just use a replicated pool of size 3 on SSD, we should get
> >>>>>> the best performance; however, that only leaves us 1/3 of the SSD
> >>>>>> space as usable. And EC pools are not friendly to such a small-object
> >>>>>> read workload, I think.
> >>>>>>
> >>>>>> Now I’m evaluating a mixed SSD and HDD replication strategy.
> >>>>>> Ideally, I want 3 data replicas, each on a different host (failure
> >>>>>> domain): 1 of them on SSD, the other 2 on HDD. And normally every
> >>>>>> read request is directed to SSD. So, if every SSD OSD is up, I’d
> >>>>>> expect the same read throughput as the all-SSD deployment.
> >>>>>>
> >>>>>> I’ve read the documentation and done some tests. Here is the CRUSH
> >>>>>> rule I’m testing with:
> >>>>>>
> >>>>>> rule mixed_replicated_rule {
> >>>>>>    id 3
> >>>>>>    type replicated
> >>>>>>    min_size 1
> >>>>>>    max_size 10
> >>>>>>    step take default class ssd
> >>>>>>    step chooseleaf firstn 1 type host
> >>>>>>    step emit
> >>>>>>    step take default class hdd
> >>>>>>    step chooseleaf firstn -1 type host
> >>>>>>    step emit
> >>>>>> }
> >>>>>>
> >>>>>> Now I have the following conclusions, but I’m not very sure:
> >>>>>>
> >>>>>> * The first OSD produced by CRUSH will be the primary OSD (at
> >>>>>> least if I don’t change the “primary affinity”). So, the above rule is
> >>>>>> guaranteed to map an SSD OSD as primary in each PG. And every read
> >>>>>> request will read from SSD if it is up.
> >>>>>> * It is currently not possible to enforce that the SSD and HDD OSDs
> >>>>>> are chosen from different hosts. So, if I want to ensure data
> >>>>>> availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD
> >>>>>> OSDs. That means setting the replication size to 4, instead of the
> >>>>>> ideal value 3, on the pool using the above CRUSH rule.
> >>>>>>
> >>>>>> Am I correct about the above statements? How would this work in
> >>>>>> your experience? Thanks.
> >>>>>
> >>>>> This works (i.e. guards against host failures) only if you have
> >>>>> strictly separate sets of hosts that have SSDs and that have HDDs.
> >>>>> I.e., there should be no host that has both, otherwise there is a
> >>>>> chance that one hdd and one ssd from that host will be picked.
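
To make the host-overlap concern concrete, here is a small brute-force check. It is a simplified model rather than a CRUSH simulation: it assumes every host carries both SSD and HDD OSDs (as in the cluster described above), that each "take ... chooseleaf ... emit" block picks distinct hosts internally, and that the two blocks choose independently of each other. The function name is made up.

```python
from itertools import combinations


def survives_any_two_host_failures(num_hosts, pool_size):
    """True if every possible placement keeps a replica after any 2 hosts fail.

    Placement model (an assumption): 1 replica on an SSD of any host,
    plus pool_size - 1 replicas on HDDs of distinct hosts, where the
    SSD host may coincide with one of the HDD hosts.
    """
    hosts = range(num_hosts)
    for ssd_host in hosts:
        for hdd_hosts in combinations(hosts, pool_size - 1):
            hosts_with_replica = {ssd_host, *hdd_hosts}
            for failed in combinations(hosts, 2):
                # Data is lost when every replica sits on a failed host.
                if hosts_with_replica <= set(failed):
                    return False
    return True


if __name__ == "__main__":
    for size in (3, 4):
        print(size, survives_any_two_host_failures(num_hosts=10, pool_size=size))
```

Under this model, size 3 can place the SSD copy and one HDD copy on the same host, so losing that host plus the other HDD host loses all replicas. Size 4 always spans at least three hosts, which is the reasoning behind choosing 1 SSD + 3 HDD.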
> >>>>> --
> >>>>> Alexander E. Patrakov
> >>>>> CV: http://pc.cd/PLz7
> >>>> _______________________________________________
> >>>> ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send
> >>>> an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



