Re: The feasibility of mixed SSD and HDD replicated pool


 



For a read-only workload this should make no difference, since all reads are normally served from the SSD primary. But I think it is still beneficial for writes, backfill, and recovery. I will also have some HDD-only pools, so WAL/DB on SSD will definitely improve performance for those pools. I will always put WAL/DB on SSD if any SSD is installed on that host. After all, disk failures are rare, and performance is more important. If the SSD fails, I need to rebuild more than one OSD, which will take longer, but it should not result in data loss, right?
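
Roughly what I have in mind when creating the OSDs (sketch only; device and partition names are placeholders, and I would carve the DB partitions out of the same SSD that backs the SSD OSD):

  # HDD OSDs with their DB on the shared SSD; the WAL is colocated with
  # the DB when no separate --block.wal device is given
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sda5
  ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/sda6
  # the SSD OSD itself just uses a single (remaining) partition
  ceph-volume lvm create --bluestore --data /dev/sda4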

> On Nov 9, 2020, at 04:17, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
> 
> Sorry for the confusion, what I meant to say is that "having all WAL/DB
> on one SSD will result in a single point of failure". If that SSD goes
> down, all OSDs depending on it will also stop working.
> 
> What I'd like to confirm is that there is no benefit to putting WAL/DB
> on SSD when there is either a cache tier or such a primary SSD with HDDs
> for replication. And distributing the WAL/DB across the HDDs will
> eliminate that single point of failure.
> 
> So in your case, with SSD as the primary OSD, do you put the WAL/DB on
> an SSD for the secondary HDDs, or just keep it on each HDD?
> 
> 
> Thanks!
> Tony
>> -----Original Message-----
>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>> Sent: Sunday, November 8, 2020 5:47 AM
>> To: Tony Liu <tonyliu0592@xxxxxxxxxxx>
>> Cc: ceph-users@xxxxxxx
>> Subject: Re:  Re: The feasibility of mixed SSD and HDD
>> replicated pool
>> 
>> 
>>>> On Nov 8, 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
>>> 
>>> Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
>>> or WAL/DB on SSD or HDD? My understanding is that there is no benefit
>>> to putting the journal or WAL/DB on SSD with such a solution. It would
>>> also eliminate the single point of failure of having all WAL/DB on one
>>> SSD. Just want to confirm.
>> 
>> We are building a new cluster, so BlueStore. I think putting WAL/DB on SSD
>> is more about performance. How is this related to eliminating a single
>> point of failure? I'm going to deploy WAL/DB on SSD for my HDD OSDs, and
>> of course just use a single device for the SSD OSDs.
>> 
>>> Another thought is to have separate pools, like an all-SSD pool and an
>>> all-HDD pool. Each pool would be used for a different purpose. For
>>> example, images, backups, and objects can go in the all-HDD pool and VM
>>> volumes in the all-SSD pool.
>> 
>> Yes, I think the same.
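>> 
>> Something like this should do for the class-specific pools (rule names,
>> pool names, and PG counts are only examples, not tuned):
>> 
>>   ceph osd crush rule create-replicated rule-ssd default host ssd
>>   ceph osd crush rule create-replicated rule-hdd default host hdd
>>   ceph osd pool create vm-volumes 128 128 replicated rule-ssd
>>   ceph osd pool create backups 64 64 replicated rule-hdd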
>> 
>>> Thanks!
>>> Tony
>>>> -----Original Message-----
>>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>>>> Sent: Monday, October 26, 2020 9:20 AM
>>>> To: Frank Schilder <frans@xxxxxx>
>>>> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
>>>> Subject:  Re: The feasibility of mixed SSD and HDD
>>>> replicated pool
>>>> 
>>>> 
>>>>>> On Oct 26, 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>>>>> 
>>>>> 
>>>>>> I’ve never seen anything that implies that lead OSDs within an
>>>>>> acting
>>>> set are a function of CRUSH rule ordering.
>>>>> 
>>>>> This is actually a good question. I believed that I had seen/heard
>>>> that somewhere, but I might be wrong.
>>>>> 
>>>>> Looking at the definition of a PG, it states that a PG is an ordered
>>>> set of OSD (IDs) and the first up OSD will be the primary. In other
>>>> words, it seems that the lowest OSD ID is decisive. If the SSDs were
>>>> deployed before the HDDs, they have the smallest IDs and, hence, will
>>>> be preferred as primary OSDs.
>>>> 
>>>> I don't think this is correct. From my experiments using the previously
>>>> mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the
>>>> primary OSDs are always SSDs.
>>>> 
>>>> I also had a look at the code; if I understand it correctly:
>>>> 
>>>> * If the default primary affinity is not changed, then the logic
>>>> about primary affinity is skipped, and the primary will be the first
>>>> one returned by the CRUSH algorithm [1].
>>>> 
>>>> * The order of OSDs returned by CRUSH still matters if you change the
>>>> primary affinity. The affinity represents the probability that a
>>>> test succeeds. The first OSD is tested first, and thus has the
>>>> highest probability of becoming primary. [2]
>>>> * If any OSD has primary affinity = 1.0, its test will always
>>>> succeed, and any OSD after it will never be primary.
>>>> * Suppose CRUSH returned 3 OSDs, each with primary affinity set to
>>>> 0.5. Then the 2nd OSD has a probability of 0.25 of being primary, the
>>>> 3rd one a probability of 0.125. Otherwise, the 1st will be primary.
>>>> * If no test succeeds (suppose all OSDs have an affinity of 0), the
>>>> 1st OSD will be primary as a fallback.
>>>> 
>>>> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
>>>> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>>>> 
>>>> So, setting the primary affinity of all SSD OSDs to 1.0 should be
>>>> sufficient for them to be the primaries in my case.
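>>>> 
>>>> As a rough sketch (osd.0 stands in for one of my SSD OSDs, 3.1f for
>>>> some PG id; 1.0 is already the default affinity):
>>>> 
>>>>   ceph osd primary-affinity osd.0 1.0
>>>>   # the first OSD in the acting set reported here is the primary
>>>>   ceph pg map 3.1f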
>>>> 
>>>> Do you think I should contribute these to documentation?
>>>> 
>>>>> This, however, is not a sustainable situation. Any addition of OSDs
>>>> will mess this up and the distribution scheme will fail in the
>>>> future. A way out seems to be:
>>>>> 
>>>>> - subdivide your HDD storage using device classes:
>>>>> * define a device class for HDDs with primary affinity=0, for
>>>>> example, pick 5 HDDs and change their device class to hdd_np (for no
>>>>> primary)
>>>>> * set the primary affinity of these HDD OSDs to 0
>>>>> * modify your crush rule to use "step take default class hdd_np"
>>>>> * this will create a pool with primaries on SSD and balanced storage
>>>>> distribution between SSD and HDD
>>>>> * all-HDD pools deployed as usual on class hdd
>>>>> * when increasing capacity, one needs to take care of adding disks
>>>>> to the hdd_np class and setting their primary affinity to 0 (rough
>>>>> commands below)
>>>>> * somewhat increased admin effort, but fully working solution
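>>>>> 
>>>>> Per-OSD, the reclassification would look roughly like this (osd.12 is
>>>>> just an example id):
>>>>> 
>>>>>   ceph osd crush rm-device-class osd.12
>>>>>   ceph osd crush set-device-class hdd_np osd.12
>>>>>   ceph osd primary-affinity osd.12 0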
>>>>> 
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>> 
>>>>> ________________________________________
>>>>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>>>> Sent: 25 October 2020 17:07:15
>>>>> To: ceph-users@xxxxxxx
>>>>> Subject:  Re: The feasibility of mixed SSD and HDD
>>>>> replicated pool
>>>>> 
>>>>>> I'm not entirely sure if primary on SSD will actually make the read
>>>> happen on SSD.
>>>>> 
>>>>> My understanding is that by default reads always happen from the
>>>>> lead
>>>> OSD in the acting set.  Octopus seems to (finally) have an option to
>>>> spread the reads around, which IIRC defaults to false.
>>>>> 
>>>>> I’ve never seen anything that implies that lead OSDs within an
>>>>> acting
>>>> set are a function of CRUSH rule ordering. I’m not asserting that
>>>> they aren’t though, but I’m … skeptical.
>>>>> 
>>>>> Setting primary affinity would do the job, and you’d want to have
>>>>> cron
>>>> continually update it across the cluster to react to topology changes.
>>>> I was told of this strategy back in 2014, but haven’t personally seen
>>>> it implemented.
>>>>> 
>>>>> That said, HDDs are more of a bottleneck for writes than reads and
>>>> just might be fine for your application.  Tiny reads are going to
>>>> limit you to some degree regardless of drive type, and you do mention
>>>> throughput, not IOPS.
>>>>> 
>>>>> I must echo Frank’s notes about capacity too.  Ceph can do a lot of
>>>> things, but that doesn’t mean something exotic is necessarily the
>>>> best choice.  You’re concerned about 3R only yielding 1/3 of raw
>>>> capacity if using an all-SSD cluster, but the architecture you
>>>> propose limits you anyway because of drive size. Consider also chassis,
>>>> CPU, RAM, RU, switch port costs as well, and the cost of you fussing
>>>> over an exotic solution instead of the hundreds of other things in
>> your backlog.
>>>>> 
>>>>> And your cluster as described is *tiny*.  Honestly I’d suggest
>>>> considering one of these alternatives:
>>>>> 
>>>>> * Ditch the HDDs, use QLC flash.  The emerging EDSFF drives are
>>>>> really
>>>> promising for replacing HDDs for density in this kind of application.
>>>> You might even consider ARM if IOPS aren’t a concern.
>>>>> * An NVMeoF solution
>>>>> 
>>>>> 
>>>>> Cache tiers are “deprecated”, but then so are custom cluster names.
>>>>> Neither appears
>>>>> 
>>>>>> For EC pools there is an option "fast_read"
>>>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
>>>> which states that a read will return as soon as the first k shards
>>>> have arrived. The default is to wait for all k+m shards (all
>>>> replicas). This option is not available for replicated pools.
>>>>>> Now, not sure if this option is not available for replicated pools
>>>> because the read will always be served by the acting primary, or if
>>>> it currently waits for all replicas. In the latter case, reads will
>>>> wait for the slowest device.
>>>>>> I'm not sure if I interpret this correctly. I think you should test
>>>> the setup with HDD only and SSD+HDD to see if read speed improves.
>>>> Note that write speed will always depend on the slowest device.
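>>>>>> 
>>>>>> A quick way to compare the two setups would be rados bench, e.g.
>>>>>> (pool name and runtime are only examples):
>>>>>> 
>>>>>>   rados bench -p testpool 60 write --no-cleanup
>>>>>>   rados bench -p testpool 60 seq
>>>>>>   rados bench -p testpool 60 rand
>>>>>> 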
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>> ________________________________________
>>>>>> From: Frank Schilder <frans@xxxxxx>
>>>>>> Sent: 25 October 2020 15:03:16
>>>>>> To: 胡 玮文; Alexander E. Patrakov
>>>>>> Cc: ceph-users@xxxxxxx
>>>>>> Subject:  Re: The feasibility of mixed SSD and HDD
>>>>>> replicated pool
>>>>>> 
>>>>>> A cache pool might be an alternative, heavily depending on how much
>>>>>> data is hot. However, then you will have much less SSD capacity
>>>>>> available, because it also requires replication.
>>>>>> Looking at the setup, you have only 10*1T = 10T SSD, but 20*6T =
>>>> 120T HDD; you will probably run short of SSD capacity. Or, looking at
>>>> it the other way around, with copies on 1 SSD + 3 HDDs, you will only
>>>> be able to use about 30T out of the 120T HDD capacity.
>>>>>> With this replication, the usable storage will be 10T and raw used
>>>> will be 10T SSD and 30T HDD. If you can't do anything else on the HDD
>>>> space, you will need more SSDs. If your servers have more free disk
>>>> slots, you can add SSDs over time until you have at least 40T SSD
>>>> capacity to balance SSD and HDD capacity.
>>>>>> Personally, I think the 1SSD + 3HDD is a good option compared with
>>>>>> a
>>>> cache pool. You have the data security of 3-times replication and, if
>>>> everything is up, need only 1 copy in the SSD cache, which means that
>>>> you have 3 times the cache capacity.
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>> ________________________________________
>>>>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>>>>>> Sent: 25 October 2020 13:40:55
>>>>>> To: Alexander E. Patrakov
>>>>>> Cc: ceph-users@xxxxxxx
>>>>>> Subject:  Re: The feasibility of mixed SSD and HDD
>>>>>> replicated pool
>>>>>> 
>>>>>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In
>>>>>> order to guard against 2 host failures, I'm going to use 4 replicas,
>>>>>> 1 on SSD and 3 on HDD. This will work as intended, right? Because at
>>>>>> least I can ensure the 3 HDDs are from different hosts.
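>>>>>> 
>>>>>> If I go this way, the pool setup would be roughly (pool name is just
>>>>>> an example):
>>>>>> 
>>>>>>   ceph osd pool set cephfs_data crush_rule mixed_replicated_rule
>>>>>>   ceph osd pool set cephfs_data size 4
>>>>>> 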
>>>>>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
>>>>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx
>>>> <huww98@xxxxxxxxxxx> wrote:
>>>>>>>> Hi all,
>>>>>>>> We are planning for a new pool to store our dataset using CephFS.
>>>> These data are almost read-only (but not guaranteed) and consist of a
>>>> lot of small files. Each node in our cluster has 1 * 1T SSD and 2 *
>>>> 6T HDD, and we will deploy about 10 such nodes. We aim at getting the
>>>> highest read throughput.
>>>>>>>> If we just use a replicated pool of size 3 on SSD, we should get
>>>> the best performance; however, that only leaves us 1/3 of the SSD
>>>> space usable. And EC pools are not friendly to such a small-object
>>>> read workload, I think.
>>>>>>>> Now I'm evaluating a mixed SSD and HDD replication strategy.
>>>> Ideally, I want 3 data replicas, each on a different host (failure
>>>> domain): 1 of them on SSD, the other 2 on HDD. And normally every
>>>> read request is directed to the SSD. So, if every SSD OSD is up, I'd
>>>> expect the same read throughput as the all-SSD deployment.
>>>>>>>> I’ve read the documents and did some tests. Here is the crush
>>>>>>>> rule
>>>> I’m testing with:
>>>>>>>> rule mixed_replicated_rule {
>>>>>>>>   id 3
>>>>>>>>   type replicated
>>>>>>>>   min_size 1
>>>>>>>>   max_size 10
>>>>>>>>   # first replica: 1 host from the ssd device class
>>>>>>>>   step take default class ssd
>>>>>>>>   step chooseleaf firstn 1 type host
>>>>>>>>   step emit
>>>>>>>>   # remaining replicas: distinct hosts from the hdd device class
>>>>>>>>   step take default class hdd
>>>>>>>>   step chooseleaf firstn -1 type host
>>>>>>>>   step emit
>>>>>>>> }
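>>>>>>>> 
>>>>>>>> I injected and sanity-checked it roughly like this (file names are
>>>>>>>> arbitrary):
>>>>>>>> 
>>>>>>>>   ceph osd getcrushmap -o crush.bin
>>>>>>>>   crushtool -d crush.bin -o crush.txt
>>>>>>>>   # add the rule above to crush.txt, then recompile and inject it
>>>>>>>>   crushtool -c crush.txt -o crush.new
>>>>>>>>   ceph osd setcrushmap -i crush.new
>>>>>>>>   # dry-run the mappings this rule (id 3) produces with 4 replicas
>>>>>>>>   crushtool -i crush.new --test --rule 3 --num-rep 4 --show-mappings
>>>>>>>> 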
>>>>>>>> Now I have the following conclusions, but I’m not very sure:
>>>>>>>> * The first OSD produced by CRUSH will be the primary OSD (at least
>>>> if I don't change the "primary affinity"). So, the above rule is
>>>> guaranteed to map an SSD OSD as the primary in each PG. And every read
>>>> request will be served from the SSD if it is up.
>>>>>>>> * It is currently not possible to enforce that the SSD and HDD OSDs
>>>> are chosen from different hosts. So, if I want to ensure data
>>>> availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD
>>>> OSDs. That means setting the replication size to 4, instead of the
>>>> ideal value of 3, on the pool using the above crush rule.
>>>>>>>> Am I correct about the above statements? How would this work from
>>>> your experience? Thanks.
>>>>>>> This works (i.e. guards against host failures) only if you have
>>>>>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>>>>>> I.e., there should be no host that has both, otherwise there is a
>>>>>>> chance that one hdd and one ssd from that host will be picked.
>>>>>>> --
>>>>>>> Alexander E. Patrakov
>>>>>>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



