Late reply, but I have been using what I refer to as a "hybrid" CRUSH topology for some data for a while now, initially with just RADOS objects, and later with RBD.

We found that we were able to accelerate reads to roughly all-SSD performance levels, while also bringing up the tail end of write performance a bit. Writes didn't improve by orders of magnitude, but the write-to-SSD-then-replicate-to-HDD cycle did seem to reduce slow ops, etc. I'll see if I can dig up some rough benchmarks to follow up with.

As for implementation, I have SSD-only hosts and HDD-only hosts, bifurcated at the root level of CRUSH:

> {
>     "rule_id": 2,
>     "rule_name": "hybrid_ruleset",
>     "ruleset": 2,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -13,
>             "item_name": "ssd"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 1,
>             "type": "host"
>         },
>         {
>             "op": "emit"
>         },
>         {
>             "op": "take",
>             "item": -1,
>             "item_name": "default"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": -1,
>             "type": "chassis"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> },

I don't remember having to do anything with primary affinity to make it work; it seemed to *just work* for the most part, with the SSD copy becoming the primary.

One thing to keep in mind: I find the balancer's distribution to be a bit skewed by the hybrid pools, though that could just be my perception. I've got 3x rep hdd, 3x rep hybrid, 3x rep ssd, and ec73 (k=7, m=3) hdd pools, so my pool topology is a bit wonky, and that could also contribute to the distribution issues.

Hope this is helpful.

Reed

> On Oct 25, 2020, at 2:10 AM, huww98@xxxxxxxxxxx wrote:
>
> Hi all,
>
> We are planning a new pool to store our dataset using CephFS. These data are almost read-only (but not guaranteed) and consist of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will deploy about 10 such nodes. We are aiming for the highest read throughput.
>
> If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that only leaves us 1/3 of the usable SSD space. And EC pools are not friendly to such a small-object read workload, I think.
>
> Now I'm evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD, with every read request normally directed to the SSD. So, if every SSD OSD is up, I'd expect the same read throughput as an all-SSD deployment.
>
> I've read the documents and did some tests. Here is the crush rule I'm testing with:
>
> rule mixed_replicated_rule {
>         id 3
>         type replicated
>         min_size 1
>         max_size 10
>         step take default class ssd
>         step chooseleaf firstn 1 type host
>         step emit
>         step take default class hdd
>         step chooseleaf firstn -1 type host
>         step emit
> }
>
> Now I have the following conclusions, but I'm not very sure:
> * The first OSD produced by CRUSH will be the primary OSD (at least if I don't change the "primary affinity"). So, the above rule is guaranteed to map an SSD OSD as the primary of each PG, and every read request will be served from SSD while it is up.
> * It is currently not possible to enforce that the SSD and HDD OSDs are chosen from different hosts. So, if I want to ensure data availability even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value 3, on the pool using the above crush rule.
>
> Am I correct about the above statements? How would this work from your experience?
>
> Thanks.
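P.S. For anyone who wants to sanity-check a rule like this before pointing a pool at it, crushtool can simulate the placements offline. A rough sketch — the file names are placeholders, and the rule id (2) is the one from my dump above:

    # pull the live crush map and decompile it for inspection/editing
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # simulate 3-replica placements through the hybrid rule; each output
    # line is a sample input with its resulting OSD list, and with default
    # primary affinity the first OSD listed is the primary
    crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --show-mappings

    # or summarize how evenly the samples land across the OSDs
    crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --show-utilization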
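And to point a pool at a rule like huww98's once it's in the map — "mypool" is a placeholder, and size 4 follows the "1 SSD + 3 HDD to survive two host failures" reasoning from the original mail:

    ceph osd pool set mypool crush_rule mixed_replicated_rule
    ceph osd pool set mypool size 4
    # min_size 2 is just the usual default for a size-4 replicated pool,
    # spelled out explicitly
    ceph osd pool set mypool min_size 2

    # spot check: the first OSD in each PG's acting set should be an SSD OSD
    ceph pg ls-by-pool mypool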