Hello everyone,

I think I figured out why, in my setup, the three small hosts are nearly full while there is plenty of free space on the only big one (ceph osd tree output below). The small hosts are simply hitting their capacity limit: the rule selects 3 distinct hosts for every PG. Even if one copy always landed on the big one (le10970), the other 2 copies would each have to reside on one of the small hosts. Summing up the used disk space on the small ones and dividing by 2 (since 2/3 of the copies land on the small hosts and 1/3 on le10970) almost matches the used disk space on the big one. Rough numbers are appended below the quoted message.

Since I strongly believe that this is the reason, I am considering changing the ruleset to something like:

step take default
step choose firstn 2 type room
step choose firstn 1 type osd        # one for each room?
step emit
step choose firstn 1 type osd
step emit

This should now enable a PG to be mapped to two OSDs located on the same host, right? Hopefully, the affected host will be le10970 in most of the cases. A sketch of the complete rule I have in mind is also appended below the quoted message.

Is there a way to tell CRUSH to only allow the selection of two OSDs on host le10970?

Thanks and all the best,
Frank

> In my rather heterogeneous setup ...
>
> -1  54.36  root default
> -2  42.44      room 2.162
> -4   6.09          host le09091
>  3   2.03              osd.3    up  1
>  1   2.03              osd.1    up  1
>  9   2.03              osd.9    up  1
> -6  36.35          host le10970
>  4   7.27              osd.4    up  1
>  5   7.27              osd.5    up  1
>  6   7.27              osd.6    up  1
>  7   7.27              osd.7    up  1
>  8   7.27              osd.8    up  1
> -3  11.92      room 2.166
> -5   5.83          host le09086
>  2   2.03              osd.2    up  1
>  0   2.03              osd.0    up  1
> 10   1.77              osd.10   up  1
> -7   6.09          host le08544
> 11   2.03              osd.11   up  1
> 12   2.03              osd.12   up  1
> 13   2.03              osd.13   up  1
>
> ... using size = 3 for all pools,
> the OSDs are not filled according to the weights
> (which correspond to the disk sizes in TB).
>
> le09086 (osd)
> /dev/sdb1  1.8T  1.5T  323G  83%  /var/lib/ceph/osd/ceph-10
> /dev/sdc1  2.1T  1.7T  408G  81%  /var/lib/ceph/osd/ceph-0
> /dev/sdd1  2.1T  1.7T  344G  84%  /var/lib/ceph/osd/ceph-2
> le09091 (osd)
> /dev/sda1  2.1T  1.6T  447G  79%  /var/lib/ceph/osd/ceph-9
> /dev/sdc1  2.1T  1.8T  317G  85%  /var/lib/ceph/osd/ceph-3
> /dev/sdb1  2.1T  1.7T  384G  82%  /var/lib/ceph/osd/ceph-1
> le10970 (osd)
> /dev/sdd1  7.3T  1.4T  5.9T  19%  /var/lib/ceph/osd/ceph-6
> /dev/sdf1  7.3T  1.5T  5.9T  21%  /var/lib/ceph/osd/ceph-8
> /dev/sde1  7.3T  1.6T  5.7T  22%  /var/lib/ceph/osd/ceph-7
> /dev/sdc1  7.3T  1.4T  6.0T  19%  /var/lib/ceph/osd/ceph-5
> /dev/sdb1  7.3T  1.5T  5.8T  21%  /var/lib/ceph/osd/ceph-4
> le08544 (osd)
> /dev/sdc1  2.1T  1.6T  443G  79%  /var/lib/ceph/osd/ceph-13
> /dev/sdb1  2.1T  1.7T  339G  84%  /var/lib/ceph/osd/ceph-12
> /dev/sda1  2.1T  1.7T  375G  82%  /var/lib/ceph/osd/ceph-11
>
> Clearly, I would like le10970 to be selected more often!
>
> Increasing pg(p)_num from 256 to 512 for all the pools
> didn't help.
>
> Optimal tunables are used ...
>
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
>
> ... as well as the default algo (straw) and hash (0)
> for all buckets. The ruleset is pretty much standard,
> too ...
>
> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
>
> ... so I would assume it should select 3 hosts
> (originating from root) according to the weights.
>
> It is ceph version 0.87.
>
> Is it the room bucket which circumvents
> a distribution according to the weights?
>
> Thanks and all the best,
>
> Frank
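
For reference, the rough numbers behind the 2/3-vs-1/3 estimate above, taken from the used-space column of the df output quoted above:

  le09086:  1.5 + 1.7 + 1.7             =  4.9 TB
  le09091:  1.6 + 1.8 + 1.7             =  5.1 TB
  le08544:  1.6 + 1.7 + 1.7             =  5.0 TB
  small hosts together                  = 15.0 TB, half of that ~ 7.5 TB

  le10970:  1.4 + 1.5 + 1.6 + 1.4 + 1.5 =  7.4 TB

So le10970 really holds roughly one third of the data while the nine small OSDs share the other two thirds, which is exactly what the "3 distinct hosts" constraint would predict.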
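
And for completeness, here is a sketch of the complete rule I have in mind. It deviates a little from the steps listed above: everything is folded into a single take/emit pass, and the osd step uses firstn 2 rather than firstn 1, so the two rooms together yield four candidates, of which the first three are used for size=3 pools. The rule name and ruleset number are placeholders, and I have not run this through crushtool yet, so please treat it as an untested sketch rather than a finished rule:

rule replicated_rooms {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    # two rooms, chosen by weight
    step choose firstn 2 type room
    # pick OSDs directly within each room, so that two of them
    # may end up on the same host (hopefully on le10970)
    step choose firstn 2 type osd
    step emit
}

If that looks sane, crushtool --test on the recompiled map should show whether the resulting mappings behave as intended before anything is injected into the cluster.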