This thread caught my attention. I have a smaller cluster with a lot of
OSDs sharing the same SSD on each OSD node. I mentioned in an earlier
post that I found a statement in
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/
indicating that if the SSD/NVMe in a node is not very big, one could put
the DB on the HDD and keep only the WAL and/or journal on the NVMe.

In the context of this thread we're talking about resilience in the
face of an SSD/NVMe failure. It would be interesting to know which
parts are critical for recovering from an OSD failure: the DB, for
sure, but is it possible to recover an OSD if it loses its WAL or
journal (or both)?

I'm sure that to some extent this depends on the replication set-up:
with 3-way replication and 3 OSD nodes, if all of the SSDs on one node
fail, you can replace the SSD and recover. However, for an EC pool with
failure-domain = OSD, data loss may be possible due to the failure of a
shared SSD/NVMe. Maybe for a small cluster it's worth suggesting to
place the WAL/DB on the HDD and use the SSD/NVMe only for the journal?
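
For anyone who wants to experiment with that split, here is a rough
sketch of the two layouts at OSD-creation time with ceph-volume. The
device paths are placeholders for illustration only:

# usual layout: data on the HDD, DB (and with it the WAL) on the fast device
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

# split discussed above: the DB stays with the data on the HDD,
# only the WAL goes to the fast device
ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p2

Note that when only --block.db is given, the WAL is placed together
with the DB, so the second form is the one that keeps the DB on the
HDD.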

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
607-760-2328 (Cell)
607-777-4641 (Office)

On Sun, Nov 8, 2020 at 3:19 PM Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:

> Sorry for the confusion; what I meant to say is that having all WAL/DBs
> on one SSD will result in a single point of failure. If that SSD goes
> down, all OSDs depending on it will also stop working.
>
> What I'd like to confirm is that there is no benefit to putting the
> WAL/DB on SSD when there is either a cache tier or such a primary SSD
> with HDDs for the replicas, and that distributing the WAL/DB onto each
> HDD will eliminate that single point of failure.
>
> So in your case, with the SSD as the primary OSD, do you put the WAL/DB
> for the secondary HDDs on an SSD, or just distribute it to each HDD?
>
>
> Thanks!
> Tony
> > -----Original Message-----
> > From: 胡 玮文 <huww98@xxxxxxxxxxx>
> > Sent: Sunday, November 8, 2020 5:47 AM
> > To: Tony Liu <tonyliu0592@xxxxxxxxxxx>
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Re: The feasibility of mixed SSD and HDD replicated pool
> >
> >
> > > On 8 Nov 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
> > >
> > > Is it FileStore or BlueStore? With this SSD-HDD solution, is the
> > > journal or WAL/DB on SSD or HDD? My understanding is that there is
> > > no benefit to putting the journal or WAL/DB on SSD with such a
> > > solution. It would also eliminate the single point of failure of
> > > having all WAL/DBs on one SSD. Just want to confirm.
> >
> > We are building a new cluster, so BlueStore. I think putting the
> > WAL/DB on SSD is more about performance. How is this related to
> > eliminating a single point of failure? I'm going to deploy the WAL/DB
> > on SSD for my HDD OSDs and, of course, just use a single device for
> > the SSD OSDs.
> >
> > > Another thought is to have separate pools, like an all-SSD pool and
> > > an all-HDD pool. Each pool would be used for a different purpose.
> > > For example, images, backups and objects could go in the all-HDD
> > > pool and VM volumes in the all-SSD pool.
> >
> > Yes, I think the same.
> >
> > > Thanks!
> > > Tony
> > >> -----Original Message-----
> > >> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> > >> Sent: Monday, October 26, 2020 9:20 AM
> > >> To: Frank Schilder <frans@xxxxxx>
> > >> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
> > >> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>
> > >>
> > >>> On 26 Oct 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
> > >>>
> > >>>> I've never seen anything that implies that lead OSDs within an
> > >>>> acting set are a function of CRUSH rule ordering.
> > >>>
> > >>> This is actually a good question. I believed that I had seen/heard
> > >>> that somewhere, but I might be wrong.
> > >>>
> > >>> Looking at the definition of a PG, it states that a PG is an
> > >>> ordered set of OSD (IDs) and that the first up OSD will be the
> > >>> primary. In other words, it seems that the lowest OSD ID is
> > >>> decisive. If the SSDs were deployed before the HDDs, they have the
> > >>> smallest IDs and, hence, will be preferred as primary OSDs.
> > >>
> > >> I don't think this is correct. From my experiments, using the
> > >> previously mentioned CRUSH rule, no matter what the IDs of the SSD
> > >> OSDs are, the primary OSDs are always SSDs.
> > >>
> > >> I also had a look at the code. If I understand it correctly:
> > >>
> > >> * If the default primary affinity is not changed, the logic around
> > >> primary affinity is skipped, and the primary will be the first OSD
> > >> returned by the CRUSH algorithm [1].
> > >> * The order of OSDs returned by CRUSH still matters if you change
> > >> the primary affinity. The affinity is the probability of a test
> > >> succeeding. The first OSD is tested first and thus has a higher
> > >> probability of becoming primary. [2]
> > >>   * If any OSD has primary affinity = 1.0, its test always
> > >>     succeeds, and no OSD after it will ever be primary.
> > >>   * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5.
> > >>     Then the 2nd OSD has a probability of 0.25 of being primary and
> > >>     the 3rd a probability of 0.125; otherwise the 1st is primary.
> > >>   * If no test succeeds (suppose all OSDs have an affinity of 0),
> > >>     the 1st OSD becomes primary as a fallback.
> > >>
> > >> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> > >> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
> > >>
> > >> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> > >> sufficient for them to be the primaries in my case.
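> > >>
> > >> For reference, a minimal sketch of that on the CLI (the OSD IDs
> > >> are placeholders; on some older releases you may additionally need
> > >> to enable mon_osd_allow_primary_affinity first):
> > >>
> > >> # prefer an SSD OSD as primary
> > >> ceph osd primary-affinity osd.0 1.0
> > >> # make an HDD OSD ineligible as primary
> > >> ceph osd primary-affinity osd.10 0
> > >>
> > >> Per the logic above, an affinity of 1.0 on the first OSD returned
> > >> by CRUSH short-circuits the test, so no OSD after it can become
> > >> primary.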
> > >>
> > >> Do you think I should contribute these findings to the
> > >> documentation?
> > >>
> > >>> This, however, is not a sustainable situation. Any addition of
> > >>> OSDs will mess this up and the distribution scheme will fail in
> > >>> the future. A way out seems to be:
> > >>>
> > >>> - subdivide your HDD storage using device classes:
> > >>>   * define a device class for HDDs with primary affinity = 0; for
> > >>>     example, pick 5 HDDs and change their device class to hdd_np
> > >>>     (for "no primary")
> > >>>   * set the primary affinity of these HDD OSDs to 0
> > >>>   * modify your crush rule to use "step take default class hdd_np"
> > >>>   * this will create a pool with primaries on SSD and balanced
> > >>>     storage distribution between SSD and HDD
> > >>>   * all-HDD pools are deployed as usual on class hdd
> > >>>   * when increasing capacity, one needs to take care to add disks
> > >>>     to the hdd_np class and set their primary affinity to 0
> > >>>   * somewhat increased admin effort, but a fully working solution
> > >>>     (see the command sketch below)
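> > >>>
> > >>> In commands, those steps would look roughly like this (osd.20 is
> > >>> a placeholder; hdd_np is the custom class name from above):
> > >>>
> > >>> # move an HDD OSD into the no-primary device class
> > >>> ceph osd crush rm-device-class osd.20
> > >>> ceph osd crush set-device-class hdd_np osd.20
> > >>> # bar it from ever being picked as primary
> > >>> ceph osd primary-affinity osd.20 0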
> > >>>
> > >>> Best regards,
> > >>> =================
> > >>> Frank Schilder
> > >>> AIT Risø Campus
> > >>> Bygning 109, rum S14
> > >>>
> > >>> ________________________________________
> > >>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > >>> Sent: 25 October 2020 17:07:15
> > >>> To: ceph-users@xxxxxxx
> > >>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>>
> > >>>> I'm not entirely sure if primary on SSD will actually make the
> > >>>> read happen on SSD.
> > >>>
> > >>> My understanding is that by default reads always happen from the
> > >>> lead OSD in the acting set. Octopus seems to (finally) have an
> > >>> option to spread the reads around, which IIRC defaults to false.
> > >>>
> > >>> I've never seen anything that implies that lead OSDs within an
> > >>> acting set are a function of CRUSH rule ordering. I'm not
> > >>> asserting that they aren't, though, but I'm ... skeptical.
> > >>>
> > >>> Setting primary affinity would do the job, and you'd want to have
> > >>> cron continually update it across the cluster to react to
> > >>> topology changes. I was told of this strategy back in 2014, but
> > >>> haven't personally seen it implemented.
> > >>>
> > >>> That said, HDDs are more of a bottleneck for writes than reads
> > >>> and might be just fine for your application. Tiny reads are going
> > >>> to limit you to some degree regardless of drive type, and you do
> > >>> mention throughput, not IOPS.
> > >>>
> > >>> I must echo Frank's notes about capacity too. Ceph can do a lot
> > >>> of things, but that doesn't mean something exotic is necessarily
> > >>> the best choice. You're concerned about 3R only yielding 1/3 of
> > >>> raw capacity on an all-SSD cluster, but the architecture you
> > >>> propose limits you anyway because of drive size. Consider also
> > >>> chassis, CPU, RAM, RU and switch-port costs, and the cost of you
> > >>> fussing over an exotic solution instead of the hundreds of other
> > >>> things in your backlog.
> > >>>
> > >>> And your cluster as described is *tiny*. Honestly I'd suggest
> > >>> considering one of these alternatives:
> > >>>
> > >>> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are
> > >>>   really promising for replacing HDDs for density in this kind of
> > >>>   application. You might even consider ARM if IOPS aren't a
> > >>>   concern.
> > >>> * An NVMeoF solution
> > >>>
> > >>> Cache tiers are "deprecated", but then so are custom cluster
> > >>> names. Neither appears
> > >>>
> > >>>> For EC pools there is an option "fast_read"
> > >>>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
> > >>>> which states that a read will return as soon as the first k
> > >>>> shards have arrived. The default is to wait for all k+m shards
> > >>>> (all replicas). This option is not available for replicated
> > >>>> pools.
> > >>>>
> > >>>> Now, I'm not sure whether this option is unavailable for
> > >>>> replicated pools because the read will always be served by the
> > >>>> acting primary, or because it currently waits for all replicas.
> > >>>> In the latter case, reads will wait for the slowest device.
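> > >>>>
> > >>>> For completeness, toggling fast_read on an existing EC pool is a
> > >>>> one-liner (the pool name is a placeholder), so it is cheap to
> > >>>> benchmark both settings:
> > >>>>
> > >>>> ceph osd pool set my-ec-pool fast_read true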
> > >>>>
> > >>>> I'm not sure if I interpret this correctly. I think you should
> > >>>> test the setup with HDD only and with SSD+HDD to see if read
> > >>>> speed improves. Note that write speed will always depend on the
> > >>>> slowest device.
> > >>>>
> > >>>> Best regards,
> > >>>> =================
> > >>>> Frank Schilder
> > >>>> AIT Risø Campus
> > >>>> Bygning 109, rum S14
> > >>>>
> > >>>> ________________________________________
> > >>>> From: Frank Schilder <frans@xxxxxx>
> > >>>> Sent: 25 October 2020 15:03:16
> > >>>> To: 胡 玮文; Alexander E. Patrakov
> > >>>> Cc: ceph-users@xxxxxxx
> > >>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>>>
> > >>>> A cache pool might be an alternative, heavily depending on how
> > >>>> much data is hot. However, you will then have much less SSD
> > >>>> capacity available, because it also requires replication.
> > >>>>
> > >>>> Looking at your setup, with only 10 * 1T = 10T of SSD but
> > >>>> 20 * 6T = 120T of HDD, you will probably run short of SSD
> > >>>> capacity. Or, looking at it the other way around, with copies on
> > >>>> 1 SSD + 3 HDDs you will only be able to use about 30T out of the
> > >>>> 120T of HDD capacity.
> > >>>>
> > >>>> With this replication, the usable storage will be 10T, and the
> > >>>> raw usage will be 10T of SSD and 30T of HDD. If you can't do
> > >>>> anything else with the HDD space, you will need more SSDs. If
> > >>>> your servers have free disk slots, you can add SSDs over time
> > >>>> until you have at least 40T of SSD capacity to balance the SSD
> > >>>> and HDD capacity.
> > >>>>
> > >>>> Personally, I think 1 SSD + 3 HDD is a good option compared with
> > >>>> a cache pool. You have the data security of 3-times replication
> > >>>> and, if everything is up, need only 1 copy in the SSD cache,
> > >>>> which means that you have 3 times the cache capacity.
> > >>>>
> > >>>> Best regards,
> > >>>> =================
> > >>>> Frank Schilder
> > >>>> AIT Risø Campus
> > >>>> Bygning 109, rum S14
> > >>>>
> > >>>> ________________________________________
> > >>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
> > >>>> Sent: 25 October 2020 13:40:55
> > >>>> To: Alexander E. Patrakov
> > >>>> Cc: ceph-users@xxxxxxx
> > >>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
> > >>>>
> > >>>> Yes. This is the limitation of the CRUSH algorithm, to my mind.
> > >>>> In order to guard against 2 host failures, I'm going to use 4
> > >>>> replicas: 1 on SSD and 3 on HDD. This will work as intended,
> > >>>> right? Because at least I can ensure the 3 HDDs are from
> > >>>> different hosts.
> > >>>>
> > >>>>> On 25 Oct 2020, at 20:04, Alexander E. Patrakov
> > >>>>> <patrakov@xxxxxxxxx> wrote:
> > >>>>>
> > >>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
> > >>>>> <huww98@xxxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> We are planning a new pool to store our dataset using CephFS.
> > >>>>>> These data are almost read-only (but not guaranteed) and
> > >>>>>> consist of a lot of small files. Each node in our cluster has
> > >>>>>> 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such
> > >>>>>> nodes. We are aiming for the highest read throughput.
> > >>>>>>
> > >>>>>> If we just use a replicated pool of size 3 on SSD, we should
> > >>>>>> get the best performance; however, that leaves us only 1/3 of
> > >>>>>> the SSD space usable. And EC pools are not friendly to such
> > >>>>>> small-object read workloads, I think.
> > >>>>>>
> > >>>>>> Now I'm evaluating a mixed SSD and HDD replication strategy.
> > >>>>>> Ideally, I want 3 data replicas, each on a different host (the
> > >>>>>> failure domain): 1 of them on SSD, the other 2 on HDD, with
> > >>>>>> every read request normally directed to the SSD. So, if every
> > >>>>>> SSD OSD is up, I'd expect the same read throughput as an
> > >>>>>> all-SSD deployment.
> > >>>>>>
> > >>>>>> I've read the documents and done some tests. Here is the crush
> > >>>>>> rule I'm testing with:
> > >>>>>>
> > >>>>>> rule mixed_replicated_rule {
> > >>>>>>     id 3
> > >>>>>>     type replicated
> > >>>>>>     min_size 1
> > >>>>>>     max_size 10
> > >>>>>>     step take default class ssd
> > >>>>>>     step chooseleaf firstn 1 type host
> > >>>>>>     step emit
> > >>>>>>     step take default class hdd
> > >>>>>>     step chooseleaf firstn -1 type host
> > >>>>>>     step emit
> > >>>>>> }
> > >>>>>>
> > >>>>>> Now I have the following conclusions, but I'm not very sure:
> > >>>>>>
> > >>>>>> * The first OSD produced by CRUSH will be the primary OSD (at
> > >>>>>>   least if I don't change the "primary affinity"). So, the
> > >>>>>>   above rule is guaranteed to map an SSD OSD as the primary of
> > >>>>>>   each PG, and every read request will be served from SSD if
> > >>>>>>   it is up.
> > >>>>>> * It is currently not possible to enforce that the SSD and HDD
> > >>>>>>   OSDs are chosen from different hosts. So, if I want to
> > >>>>>>   ensure data availability even if 2 hosts fail, I need to
> > >>>>>>   choose 1 SSD and 3 HDD OSDs. That means setting the
> > >>>>>>   replication size to 4, instead of the ideal value of 3, on
> > >>>>>>   the pool using the above crush rule.
> > >>>>>>
> > >>>>>> Am I correct about the above statements? How would this work
> > >>>>>> in your experience? Thanks.
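> > >>>>>>
> > >>>>>> For reference, a custom rule like this can be compiled into
> > >>>>>> the CRUSH map and attached to a pool roughly as follows (the
> > >>>>>> pool name is a placeholder):
> > >>>>>>
> > >>>>>> ceph osd getcrushmap -o crush.bin
> > >>>>>> crushtool -d crush.bin -o crush.txt
> > >>>>>> # add the rule above to crush.txt, then recompile and load it
> > >>>>>> crushtool -c crush.txt -o crush.new
> > >>>>>> ceph osd setcrushmap -i crush.new
> > >>>>>> ceph osd pool set mixed-pool crush_rule mixed_replicated_rule
> > >>>>>> ceph osd pool set mixed-pool size 4
> > >>>>>>
> > >>>>>> The "size 4" matches the 1 SSD + 3 HDD layout from the second
> > >>>>>> conclusion above.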
> > >>>>>
> > >>>>> This works (i.e. guards against host failures) only if you have
> > >>>>> strictly separate sets of hosts that have SSDs and hosts that
> > >>>>> have HDDs. That is, there should be no host that has both;
> > >>>>> otherwise there is a chance that one HDD and one SSD from that
> > >>>>> host will be picked.
> > >>>>>
> > >>>>> --
> > >>>>> Alexander E. Patrakov
> > >>>>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx