Yes. This is a limitation of the CRUSH algorithm, in my mind. To guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? At least I can ensure that the 3 HDD replicas are on different hosts.

> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
>
> On Sun, Oct 25, 2020 at 12:11 PM huww98@xxxxxxxxxxx <huww98@xxxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> We are planning a new pool to store our dataset using CephFS. The data is almost read-only (but not guaranteed to be) and consists of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such nodes. We aim for the highest read throughput.
>>
>> If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that only leaves us 1/3 of the SSD space as usable. And EC pools are not friendly to such a small-object read workload, I think.
>>
>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD. Normally every read request is directed to the SSD. So, if every SSD OSD is up, I’d expect the same read throughput as the all-SSD deployment.
>>
>> I’ve read the documentation and done some tests. Here is the CRUSH rule I’m testing with:
>>
>> rule mixed_replicated_rule {
>>     id 3
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default class ssd
>>     step chooseleaf firstn 1 type host
>>     step emit
>>     step take default class hdd
>>     step chooseleaf firstn -1 type host
>>     step emit
>> }
>>
>> Now I have the following conclusions, but I’m not very sure about them:
>> * The first OSD produced by CRUSH will be the primary OSD (at least if I don’t change the “primary affinity”). So the above rule is guaranteed to map an SSD OSD as the primary in each PG, and every read request will be served from the SSD if it is up.
>> * It is currently not possible to force the SSD and HDD OSDs to be chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value of 3, on the pool using the above CRUSH rule.
>>
>> Am I correct about the above statements? How would this work, in your experience? Thanks.
>
> This works (i.e. guards against host failures) only if you have
> strictly separate sets of hosts that have SSDs and that have HDDs.
> I.e., there should be no host that has both, otherwise there is a
> chance that one hdd and one ssd from that host will be picked.
>
> --
> Alexander E. Patrakov
> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
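
The host-overlap argument in this thread can be checked by brute force. The following is a minimal Python sketch, not a CRUSH implementation: it assumes that the SSD pass and the HDD pass of the rule pick hosts independently (so the SSD host may coincide with one HDD host), that every host carries both device classes, and that the HDD pass returns size - 1 distinct hosts. The function names and the 10-host cluster are illustrative only.

```python
from itertools import combinations


def placements(num_hosts, pool_size):
    """Yield every (ssd_host, hdd_hosts) pair the two-pass rule could produce.

    Assumption: the two passes are independent, so ssd_host may also
    appear in hdd_hosts. The HDD hosts themselves are always distinct.
    """
    hosts = range(num_hosts)
    for ssd_host in hosts:
        for hdd_hosts in combinations(hosts, pool_size - 1):
            yield ssd_host, hdd_hosts


def min_distinct_hosts(num_hosts, pool_size):
    """Smallest number of distinct hosts that any placement spans."""
    return min(
        len({ssd} | set(hdds)) for ssd, hdds in placements(num_hosts, pool_size)
    )


def copy_survives(num_hosts, pool_size, failures):
    """True if every placement keeps at least one replica after any
    combination of `failures` simultaneous host failures."""
    for ssd, hdds in placements(num_hosts, pool_size):
        used = {ssd} | set(hdds)
        for failed in combinations(range(num_hosts), failures):
            if used <= set(failed):  # every replica sits on a failed host
                return False
    return True


if __name__ == "__main__":
    for size in (3, 4):
        print(
            f"size={size}: min distinct hosts = {min_distinct_hosts(10, size)}, "
            f"copy survives 2 host failures = {copy_survives(10, size, 2)}"
        )
```

Under these assumptions, size 3 can place all replicas on only 2 hosts, so two host failures can remove every copy, while size 4 always spans at least 3 hosts. The sketch only checks that one copy survives; whether the PG keeps serving I/O also depends on the pool's min_size, which is not modeled here.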