Re: The feasibility of mixed SSD and HDD replicated pool

胡玮文 <huww98@xxxxxxxxxxx> · Sun, 25 Oct 2020 12:40:55 +0000

Yes. In my view, this is a limitation of the CRUSH algorithm. To guard against 2 host failures, I'm going to use 4 replicas: 1 on SSD and 3 on HDD. This will work as intended, right? At least I can ensure that the 3 HDD replicas are on different hosts.
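
A brute-force check of this reasoning, as a minimal Python sketch rather than anything Ceph-specific. It assumes 10 identical hosts, an SSD replica that may land on any host, and HDD replicas that always land on distinct hosts (as the second chooseleaf step should guarantee). The function name and the host count are illustrative only. It counts surviving copies and does not model the pool's min_size, which decides whether I/O continues.

```python
from itertools import combinations

HOSTS = range(10)  # assumption: 10 hosts, each with both SSD and HDD OSDs


def worst_case_surviving_replicas(num_hdd_replicas, num_failed_hosts=2):
    """Fewest replicas left over all placements and all host-failure sets."""
    worst = None
    for ssd_host in HOSTS:
        # HDD replicas are on distinct hosts; the SSD host is unconstrained.
        for hdd_hosts in combinations(HOSTS, num_hdd_replicas):
            for failed in combinations(HOSTS, num_failed_hosts):
                surviving = int(ssd_host not in failed)
                surviving += sum(h not in failed for h in hdd_hosts)
                if worst is None or surviving < worst:
                    worst = surviving
    return worst


print(worst_case_surviving_replicas(2))  # size 3: 1 SSD + 2 HDD -> 0
print(worst_case_surviving_replicas(3))  # size 4: 1 SSD + 3 HDD -> 1
```

Under these assumptions, size 3 can lose every copy when the SSD replica shares a host with one of the HDD replicas, while size 4 always keeps at least one HDD copy.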

> On 25 Oct 2020, at 20:04, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> 
> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx <huww98@xxxxxxxxxxx> wrote:
>> 
>> Hi all,
>> 
>> We are planning a new pool to store our dataset using CephFS. The data are almost read-only (though this is not guaranteed) and consist of a lot of small files. Each node in our cluster has 1 × 1 TB SSD and 2 × 6 TB HDDs, and we will deploy about 10 such nodes. Our aim is the highest possible read throughput.
>> 
>> If we just use a replicated pool of size 3 on SSD, we should get the best performance. However, that leaves us only 1/3 of the raw SSD space as usable capacity. And I think EC pools are not well suited to such a small-object read workload.
>> 
>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 replicas, each on a different host (the failure domain): 1 on SSD and the other 2 on HDD. Normally, every read request would be directed to the SSD. So, if every SSD OSD is up, I’d expect the same read throughput as the all-SSD deployment.
>> 
>> I’ve read the documentation and run some tests. Here is the CRUSH rule I’m testing with:
>> 
>> rule mixed_replicated_rule {
>>        id 3
>>        type replicated
>>        min_size 1
>>        max_size 10
>>        step take default class ssd
>>        step chooseleaf firstn 1 type host
>>        step emit
>>        step take default class hdd
>>        step chooseleaf firstn -1 type host
>>        step emit
>> }
>> 
>> I have reached the following conclusions, but I’m not very sure about them:
>> * The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the “primary affinity”). So, the above rule is guaranteed to map an SSD OSD as the primary of each PG, and every read request will be served from the SSD if it is up.
>> * It is currently not possible to force the SSD and HDD OSDs to be chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value of 3, on the pool using the above CRUSH rule.
>> 
>> Am I correct about the above statements? How well does this work in your experience? Thanks.
> 
> This works (i.e. guards against host failures) only if you have
> strictly separate sets of hosts that have SSDs and that have HDDs.
> I.e., there should be no host that has both, otherwise there is a
> chance that one hdd and one ssd from that host will be picked.
> 
> -- 
> Alexander E. Patrakov
> CV: http://pc.cd/PLz7
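
To put a rough number on the concern in the quoted reply, the following Python sketch computes how often the SSD replica would share a host with one of the HDD replicas. It assumes that the two CRUSH passes pick hosts uniformly and independently among 10 equally weighted hosts, which is a simplification of how CRUSH actually behaves. The function name is illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def shared_host_probability(num_hosts, num_hdd_replicas):
    """Exact chance that the SSD replica's host also holds an HDD replica.

    Assumption: the SSD host and the set of distinct HDD hosts are chosen
    uniformly and independently of each other.
    """
    hosts = range(num_hosts)
    total = 0
    shared = 0
    for ssd_host in hosts:
        for hdd_hosts in combinations(hosts, num_hdd_replicas):
            total += 1
            shared += ssd_host in hdd_hosts
    return Fraction(shared, total)


print(shared_host_probability(10, 2))  # size 3 -> 1/5
print(shared_host_probability(10, 3))  # size 4 -> 3/10
```

Under these assumptions, a noticeable share of placement groups would have two replicas on one host, so the overlap cannot be ignored on hosts that carry both device classes.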