On Mon, Nov 9, 2015 at 9:42 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> I don't have much experience with crush rules but wanted one that does the following:
>
> On a 3-node cluster, I wanted a rule where I could have an erasure-coded pool of k=3,m=2
> and where the first 3 chunks (the read chunks) are all on different hosts, but the last 2 chunks
> step to different osds and can reuse the hosts (since we don't have enough hosts in this cluster
> to have the 5 chunks all on different hosts).
>
> Here was my attempt at a rule:
>
> rule combo-rule-ecrule-3-2 {
>         ruleset 9
>         type erasure
>         min_size 5
>         max_size 5
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default
>         step chooseleaf indep 3 type host
>         step emit
>         step take default
>         step chooseleaf indep -3 type osd
>         step emit
> }
>
> which was fine for the first 3 osds, but had a problem in that the last 2 osds
> were often chosen to be the same as the first 2 osds, for example
> (hosts have 5 osds each, so 0-4, 5-9, and 10-14 are the osd numbers per host):
>
> 18.7c  0  0  0  0  0  0  0  0  active+clean  2015-11-09 09:28:40.744509  0'0  227:9  [11,1,6,11,12]  11
> 18.7d  0  0  0  0  0  0  0  0  active+clean  2015-11-09 09:28:42.734292  0'0  227:9  [4,11,5,4,0]    4
> 18.7e  0  0  0  0  0  0  0  0  active+clean  2015-11-09 09:28:42.569645  0'0  227:9  [5,0,12,5,0]    5
> 18.7f  0  0  0  0  0  0  0  0  active+clean  2015-11-09 09:28:41.897589  0'0  227:9  [2,12,6,2,12]   2
>
> How should such a rule be written?

In *general* there's not a good way to specify what you're after. In specific
cases you can often do something like:

rule combo-rule-ecrule-3-2 {
        ruleset 9
        type erasure
        min_size 5
        max_size 5
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type host
        step chooseleaf indep 2 type osd
        step emit
}

That will generate 6 OSD IDs across your three hosts (two from each), but the
last one will get cut off of the list (you need sufficiently new clients or
they won't like this, but it is supported now) and you won't have any
duplicates. It will not spread the full read set for each PG across hosts, but
since it will be choosing them randomly anyway it should balance out in the
end.

I guess I should note that people have done this with replicated pools, but
I'm not sure about EC ones, so there might be some weird side effects. In
particular, if you lose an entire node, CRUSH will fail to map fully and
things won't be able to repair. (That will be the case in general, though, if
you require copies across 3 hosts and only have 3.)
-Greg
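
If you want to see what mappings a rule like this produces before pushing it
to the cluster, one option (a sketch, not from the thread; the file names are
placeholders, and --rule 9 assumes the ruleset id above) is to edit the rule
into a decompiled copy of the CRUSH map and run crushtool's test mode:

    # pull the current CRUSH map and decompile it
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # edit the rule into crushmap.txt, then recompile
    crushtool -c crushmap.txt -o crushmap.new

    # show the OSD mappings the rule produces for 5-wide PGs
    crushtool -i crushmap.new --test --rule 9 --num-rep 5 --show-mappings

That lets you check for duplicate OSDs (and, with --show-bad-mappings, for
incomplete mappings) without touching the live map.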