Hi,

I have a small cluster of 3 nodes. Each node has 10 or 11 OSDs, mostly HDDs with a couple of SSDs for faster pools. I am trying to set up an erasure coded pool with k=6, m=6, with each node storing 4 chunks on separate OSDs. Since this does not seem to be possible with the CLI tooling, I have written my own CRUSH rule to achieve it, which looks like this:

```
rule 3host4osd {
        id 3
        type erasure
        min_size 12
        max_size 12
        step set_chooseleaf_tries 20
        step set_choose_tries 100
        step take default class hdd
        step choose indep 3 type host
        step choose indep 4 type osd
        step emit
}
```

I've set up my erasure code profile and pool:

```
root@virt02:~# ceph osd pool get rbd_erasure crush_rule
crush_rule: 3host4osd
root@virt02:~# ceph osd pool get rbd_erasure size
size: 12
root@virt02:~# ceph osd pool get rbd_erasure min_size
min_size: 7
root@virt02:~# ceph osd pool get rbd_erasure erasure_code_profile
erasure_code_profile: default
root@virt02:~# ceph osd erasure-code-profile get default
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=6
plugin=jerasure
technique=reed_sol_van
w=8
```

Based on my understanding of Ceph, this rule should pick 3 hosts and then 4 OSDs within each of those hosts. This is *almost* the case. However, when I tested taking out a host after putting a bunch of data on the pool, 5 PGs (out of 512) turned out to have more than 4 chunks placed on the same host. In all cases it is the same host that gets the extra pieces. With that host out, I see these errors:

```
[WRN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive, 5 pgs down
    pg 2.87 is down, acting [2147483647,2147483647,2147483647,2147483647,22,2147483647,2147483647,20,16,2147483647,17,18]
    pg 2.f3 is down, acting [2147483647,22,2147483647,2147483647,23,2147483647,18,17,2147483647,2147483647,2147483647,2147483647]
    pg 2.100 is down, acting [2147483647,18,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,9,20,22,4]
    pg 2.141 is down, acting [2147483647,18,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,20,7,4,22]
    pg 2.1bb is down, acting [20,2147483647,2147483647,2147483647,18,2147483647,23,17,2147483647,2147483647,2147483647,2147483647]
```

As an example, PG 2.87 (rbd_erasure has pool ID 2 according to `ceph osd lspools`):

```
root@virt02:~# ceph pg 2.87 query
[...]
    "up": [ 2, 0, 6, 5, 22, 1, 8, 20, 16, 14, 17, 18 ],
    "acting": [ 2, 0, 6, 5, 22, 1, 8, 20, 16, 14, 17, 18 ],
[...]
```

OSDs 0, 1, 2, 5, 6, 8 and 14 are all running on the same OSD host, so seven of this PG's twelve chunks sit on one node. All hosts are running Ceph Octopus 15.2.9.

I've put the output of various diagnostic commands into files accessible over HTTPS here:

https://dsg.is/ceph_placement_problem_data/ceph_osd_crush_rule_dump_3host4osd.txt
https://dsg.is/ceph_placement_problem_data/ceph_osd_lspools.txt
https://dsg.is/ceph_placement_problem_data/ceph_osd_pool_get_rbd_erasure_all.txt
https://dsg.is/ceph_placement_problem_data/ceph_pg_2.87_query.txt
https://dsg.is/ceph_placement_problem_data/ceph_pg_dump_all.txt
https://dsg.is/ceph_placement_problem_data/ceph_pg_ls.txt

Any thoughts or ideas what I'm doing wrong?

Kind regards,
Davíð
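P.S. In case it helps anyone reproduce this: the rule can also be exercised offline with crushtool. This is only a rough sketch of how I'd check the mappings, and the file names below are just examples:

```
# Extract the compiled CRUSH map from the cluster (output file name is arbitrary)
ceph osd getcrushmap -o crushmap.bin

# Simulate rule id 3 (the 3host4osd rule above) for a 12-wide pool and print the
# OSD set chosen for each input, to cross-check against which host each OSD lives on
crushtool --test -i crushmap.bin --rule 3 --num-rep 12 --show-mappings

# Also list any inputs for which CRUSH failed to find 12 distinct OSDs
crushtool --test -i crushmap.bin --rule 3 --num-rep 12 --show-bad-mappings
```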