> On Nov 8, 2020, at 11:30, Tony Liu <tonyliu0592@xxxxxxxxxxx> wrote:
>
> Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
> or WAL/DB on SSD or HDD? My understanding is that there is no
> benefit to putting the journal or WAL/DB on SSD with such a solution. It would
> also eliminate the single point of failure of having all WAL/DBs
> on one SSD. Just want to confirm.

We are building a new cluster, so BlueStore. I think putting the WAL/DB on SSD is more about performance; how is it related to eliminating a single point of failure? I'm going to deploy the WAL/DB on SSD for my HDD OSDs, and of course just use a single device for the SSD OSDs.

> Another thought is to have separate pools, like an all-SSD pool and an
> all-HDD pool. Each pool would be used for a different purpose. For example,
> images, backups and objects can go in the all-HDD pool and VM volumes in
> the all-SSD pool.

Yes, I am thinking the same.

> Thanks!
> Tony

>> -----Original Message-----
>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>> Sent: Monday, October 26, 2020 9:20 AM
>> To: Frank Schilder <frans@xxxxxx>
>> Cc: Anthony D'Atri <anthony.datri@xxxxxxxxx>; ceph-users@xxxxxxx
>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> On Oct 26, 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>>>
>>>> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering.
>>>
>>> This is actually a good question. I believed that I had seen/heard that somewhere, but I might be wrong.
>>>
>>> Looking at the definition of a PG, it states that a PG is an ordered set of OSD (IDs) and that the first up OSD will be the primary. In other words, it seems that the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they have the smallest IDs and will hence be preferred as primary OSDs.
>>
>> I don’t think this is correct. From my experiments with the previously mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are always on SSD.
>>
>> I also had a look at the code. If I understand it correctly:
>>
>> * If the default primary affinity is not changed, the primary-affinity logic is skipped entirely, and the primary is the first OSD returned by the CRUSH algorithm [1].
>> * The order of OSDs returned by CRUSH still matters if you change the primary affinity. The affinity is the probability that a test succeeds. The first OSD is tested first and therefore has the highest probability of becoming primary. [2]
>> * If any OSD has primary affinity = 1.0, its test always succeeds, and no OSD after it can ever become primary.
>> * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5. Then the 2nd OSD has a probability of 0.25 of becoming primary and the 3rd a probability of 0.125; otherwise the 1st will be primary.
>> * If no test succeeds (suppose all OSDs have an affinity of 0), the 1st OSD becomes primary as a fallback.
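For reference, primary affinity is set per OSD with the standard CLI. A minimal sketch of the mechanism described above; the OSD IDs are placeholders, not taken from this thread:

    # SSD OSD: affinity 1.0 means its test always succeeds, so it stays primary
    ceph osd primary-affinity osd.0 1.0
    # HDD OSD: affinity 0 means it is never chosen as primary while an SSD OSD is up
    ceph osd primary-affinity osd.7 0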
>>
>> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
>> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>>
>> So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient for them to be the primaries in my case.
>>
>> Do you think I should contribute these to the documentation?
>>
>>> This, however, is not a sustainable situation. Any addition of OSDs will mess this up and the distribution scheme will fail in the future. A way out seems to be:
>>>
>>> - subdivide your HDD storage using device classes:
>>>   * define a device class for HDDs with primary affinity = 0; for example, pick 5 HDDs and change their device class to hdd_np (for "no primary")
>>>   * set the primary affinity of these HDD OSDs to 0
>>>   * modify your crush rule to use "step take default class hdd_np"
>>>   * this will create a pool with primaries on SSD and a balanced storage distribution between SSD and HDD
>>>   * all-HDD pools are deployed as usual on class hdd
>>>   * when increasing capacity, one needs to take care of adding disks to the hdd_np class and setting their primary affinity to 0
>>>   * somewhat increased admin effort, but a fully working solution
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
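In command form, Frank's hdd_np recipe might look roughly like the sketch below. The OSD IDs are placeholders, and the rule change itself still has to be made in the CRUSH map:

    # move a few HDD OSDs into a dedicated "no primary" device class
    ceph osd crush rm-device-class osd.10 osd.11 osd.12
    ceph osd crush set-device-class hdd_np osd.10 osd.11 osd.12
    # make sure these OSDs are never picked as primary
    ceph osd primary-affinity osd.10 0
    ceph osd primary-affinity osd.11 0
    ceph osd primary-affinity osd.12 0
    # the HDD part of the mixed rule would then use: step take default class hdd_np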
>>> ________________________________________
>>> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>> Sent: 25 October 2020 17:07:15
>>> To: ceph-users@xxxxxxx
>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>
>>>> I'm not entirely sure if primary on SSD will actually make the read happen on SSD.
>>>
>>> My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false.
>>>
>>> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering. I’m not asserting that they aren’t, but I’m … skeptical.
>>>
>>> Setting primary affinity would do the job, and you’d want cron to continually update it across the cluster to react to topology changes. I was told of this strategy back in 2014, but haven’t personally seen it implemented.
>>>
>>> That said, HDDs are more of a bottleneck for writes than for reads and might just be fine for your application. Tiny reads are going to limit you to some degree regardless of drive type, and you do mention throughput, not IOPS.
>>>
>>> I must echo Frank’s notes about capacity too. Ceph can do a lot of things, but that doesn’t mean something exotic is necessarily the best choice. You’re concerned about 3R only yielding 1/3 of raw capacity if using an all-SSD cluster, but the architecture you propose limits you anyway because of drive size. Consider chassis, CPU, RAM, RU, and switch-port costs as well, and the cost of you fussing over an exotic solution instead of the hundreds of other things in your backlog.
>>>
>>> And your cluster as described is *tiny*. Honestly I’d suggest considering one of these alternatives:
>>>
>>> * Ditch the HDDs and use QLC flash. The emerging EDSFF drives are really promising for replacing HDDs for density in this kind of application. You might even consider ARM if IOPS aren’t a concern.
>>> * An NVMeoF solution
>>>
>>> Cache tiers are “deprecated”, but then so are custom cluster names. Neither appears
>>>
>>>> For EC pools there is an option "fast_read" (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values), which states that a read will return as soon as the first k shards have arrived. The default is to wait for all k+m shards (all replicas). This option is not available for replicated pools.
>>>>
>>>> Now, I am not sure whether this option is unavailable for replicated pools because the read will always be served by the acting primary, or because it currently waits for all replicas. In the latter case, reads will wait for the slowest device.
>>>>
>>>> I'm not sure if I interpret this correctly. I think you should test the setup with HDD only and with SSD+HDD to see if read speed improves. Note that write speed will always depend on the slowest device.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Frank Schilder <frans@xxxxxx>
>>>> Sent: 25 October 2020 15:03:16
>>>> To: 胡 玮文; Alexander E. Patrakov
>>>> Cc: ceph-users@xxxxxxx
>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>
>>>> A cache pool might be an alternative, heavily depending on how much data is hot. However, you would then have much less SSD capacity available, because the cache also requires replication.
>>>>
>>>> Looking at the setup, you have only 10*1T = 10T of SSD but 20*6T = 120T of HDD, so you will probably run short of SSD capacity. Or, looking at it the other way around, with copies on 1 SSD + 3 HDDs you will only be able to use about 30T out of the 120T of HDD capacity.
>>>>
>>>> With this replication, the usable storage will be 10T, and the raw usage will be 10T of SSD and 30T of HDD. If you can't do anything else with the HDD space, you will need more SSDs. If your servers have free disk slots, you can add SSDs over time until you have at least 40T of SSD capacity to balance SSD and HDD capacity.
>>>>
>>>> Personally, I think 1 SSD + 3 HDD is a good option compared with a cache pool. You have the data security of 3-times replication and, if everything is up, you need only 1 copy in the SSD "cache", which means you have 3 times the cache capacity.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
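As a side note, the capacity balance Frank works out above can be watched directly on the cluster; recent releases break raw usage down by device class (ssd vs hdd):

    # raw and per-pool usage, split by device class
    ceph df detail
    # per-OSD utilisation, grouped by the CRUSH tree
    ceph osd df tree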
>>>> ________________________________________
>>>> From: 胡 玮文 <huww98@xxxxxxxxxxx>
>>>> Sent: 25 October 2020 13:40:55
>>>> To: Alexander E. Patrakov
>>>> Cc: ceph-users@xxxxxxx
>>>> Subject: Re: The feasibility of mixed SSD and HDD replicated pool
>>>>
>>>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? Because at least I can ensure the 3 HDDs are on different hosts.
>>>>
>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
>>>>
>>>>> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx <huww98@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are planning a new pool to store our dataset using CephFS. The data is almost read-only (but not guaranteed to be) and consists of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDDs, and we will deploy about 10 such nodes. We are aiming for the highest read throughput.
>>>>>>
>>>>>> If we just used a replicated pool of size 3 on SSD, we should get the best performance; however, that would leave us only 1/3 of the SSD space as usable. And EC pools are not friendly to such a small-object read workload, I think.
>>>>>>
>>>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD, with every read request normally directed to the SSD. So, if every SSD OSD is up, I’d expect the same read throughput as an all-SSD deployment.
>>>>>>
>>>>>> I’ve read the documents and did some tests. Here is the crush rule I’m testing with:
>>>>>>
>>>>>> rule mixed_replicated_rule {
>>>>>>     id 3
>>>>>>     type replicated
>>>>>>     min_size 1
>>>>>>     max_size 10
>>>>>>     step take default class ssd
>>>>>>     step chooseleaf firstn 1 type host
>>>>>>     step emit
>>>>>>     step take default class hdd
>>>>>>     step chooseleaf firstn -1 type host
>>>>>>     step emit
>>>>>> }
>>>>>>
>>>>>> Now I have the following conclusions, but I’m not very sure:
>>>>>>
>>>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the “primary affinity”). So the above rule is guaranteed to map an SSD OSD as the primary of each PG, and every read request will be served from SSD if it is up.
>>>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value 3, on the pool using the above crush rule.
>>>>>>
>>>>>> Am I correct about the above statements? How would this work in your experience? Thanks.
>>>>>
>>>>> This works (i.e. guards against host failures) only if you have strictly separate sets of hosts that have SSDs and hosts that have HDDs. I.e., there should be no host that has both; otherwise there is a chance that one HDD and one SSD from the same host will be picked.
>>>>>
>>>>> --
>>>>> Alexander E. Patrakov
>>>>> CV: http://pc.cd/PLz7
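For completeness, a rough sketch of how a rule like the one quoted above could be installed and attached to a pool. The pool name "cephfs_data" and the file names are placeholders, and the rule text itself is added by hand while editing the decompiled map:

    # export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt to add mixed_replicated_rule, then recompile and inject it
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    # point the pool at the rule and use 4 copies (1 SSD + 3 HDD), as discussed
    ceph osd pool set cephfs_data crush_rule mixed_replicated_rule
    ceph osd pool set cephfs_data size 4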
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx