CRUSH straw2 can not handle big weight differences

Ceph luminous 12.2.2
$: ceph osd pool create hybrid 1024 1024 replicated hybrid
$: ceph -s
  cluster:
    id:     e07f568d-056c-4e01-9292-732c64ab4f8e
    health: HEALTH_WARN
            Degraded data redundancy: 431 pgs unclean, 431 pgs degraded, 431 pgs undersized

  services:
    mon: 3 daemons, quorum s11,s12,s13
    mgr: s11(active), standbys: s12, s13
    osd: 54 osds: 54 up, 54 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0 bytes
    usage:   61749 MB used, 707 GB / 767 GB avail
    pgs:     593 active+clean
             431 active+undersized+degraded
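
For reference, the same behaviour can be checked offline with crushtool's test mode (a sketch; crushmap.bin is just a scratch file name). --show-bad-mappings lists the inputs for which the rule could not produce all 3 OSDs, and --show-choose-tries shows how many attempts CRUSH needed per mapping:

$: ceph osd getcrushmap -o crushmap.bin
$: crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings
$: crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-choose-tries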


This can be solved if I manually set the same weight, e.g. 1.000 (or 50.000, or 100.000, ...), on all items (hosts) in the datacenter buckets vDC, vDC1, vDC2 and vDC3 (see below). The problem is that when new OSDs are added, the crush map gets new weights in these datacenters, resulting in a broken cluster until it is fixed manually.
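
The manual fix is the usual decompile/edit/recompile cycle, roughly like this (file names are only placeholders):

$: ceph osd getcrushmap -o crushmap.bin
$: crushtool -d crushmap.bin -o crushmap.txt
   (edit crushmap.txt and give every item in vDC, vDC1, vDC2 and vDC3 the same weight, e.g. "item s11 weight 1.000", "item s22 weight 1.000", ...)
$: crushtool -c crushmap.txt -o crushmap.new
$: ceph osd setcrushmap -i crushmap.new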

*My question is: in a datacenter that contains only 3 hosts, why is ceph not mapping pgs to those 3 hosts? Clearly it has something to do with the big weight difference between the hosts, but why?*

------------------------
Below is a simplified ceph setup of a hybrid solution with NVMe and HDD drives: 1 copy on NVMe and 2 copies on HDD. The advantage is great read performance and cost savings. The disadvantage is lower write performance, although write performance is still good thanks to RocksDB on Intel Optane disks in the HDD servers.
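
(The HDD OSDs are bluestore with block.db on the Optane devices; they would be created with something like the command below, where the device paths are just examples:)

$: ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1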

I have six servers in this virtualized lab ceph cluster.
Only NVMe drives on s11, s12 and s13.
Only HDD drives on s21, s22 and s23.


# buckets
host s11 {
    # weight 0.016
    alg straw2
    hash 0
    item osd.0 weight 0.002
    item osd.1 weight 0.002
    item osd.2 weight 0.006
    item osd.3 weight 0.002
    item osd.4 weight 0.002
    item osd.18 weight 0.002
}

host s12 {
    # weight 0.016
    alg straw2
    hash 0
    item osd.5 weight 0.006
    item osd.6 weight 0.002
    item osd.7 weight 0.002
    item osd.8 weight 0.002
    item osd.9 weight 0.002
    item osd.53 weight 0.002
}

host s13 {
    # weight 0.016
    alg straw2
    hash 0
    item osd.10 weight 0.006
    item osd.11 weight 0.002
    item osd.12 weight 0.002
    item osd.13 weight 0.002
    item osd.14 weight 0.002
    item osd.54 weight 0.002
}

host s21 {
    # weight 0.228
    alg straw2
    hash 0
    item osd.15 weight 0.019
    item osd.16 weight 0.019
    item osd.17 weight 0.019
    item osd.19 weight 0.019
    item osd.20 weight 0.019
    item osd.21 weight 0.019
    item osd.22 weight 0.019
    item osd.23 weight 0.019
    item osd.24 weight 0.019
    item osd.25 weight 0.019
    item osd.26 weight 0.019
    item osd.51 weight 0.019
}

host s22 {
    # weight 0.228
    alg straw2
    hash 0
    item osd.27 weight 0.019
    item osd.28 weight 0.019
    item osd.29 weight 0.019
    item osd.30 weight 0.019
    item osd.31 weight 0.019
    item osd.32 weight 0.019
    item osd.33 weight 0.019
    item osd.34 weight 0.019
    item osd.35 weight 0.019
    item osd.36 weight 0.019
    item osd.37 weight 0.019
    item osd.38 weight 0.019
}
host s23 {
    # weight 0.228
    alg straw2
    hash 0
    item osd.39 weight 0.019
    item osd.40 weight 0.019
    item osd.41 weight 0.019
    item osd.42 weight 0.019
    item osd.43 weight 0.019
    item osd.44 weight 0.019
    item osd.45 weight 0.019
    item osd.46 weight 0.019
    item osd.47 weight 0.019
    item osd.48 weight 0.019
    item osd.49 weight 0.019
    item osd.50 weight 0.019
}
datacenter vDC1 {
    # weight 0.472
    alg straw2
    hash 0
    item s11 weight 0.016
    item s22 weight 10.000
    item s23 weight 10.000
}

datacenter vDC2 {
    # weight 0.472
    alg straw2
    hash 0
    item s12 weight 0.016
    item s21 weight 0.228
    item s23 weight 0.228
}

datacenter vDC3 {
    # weight 0.472
    alg straw2
    hash 0
    item s13 weight 0.016
    item s21 weight 0.228
    item s22 weight 0.228
}
datacenter vDC {
    # weight 1.416
    alg straw2
    hash 0
    item vDC1 weight 0.472
    item vDC2 weight 0.472
    item vDC3 weight 0.472
}

# rules
rule hybrid {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take vDC
    step choose firstn 1 type datacenter
    step chooseleaf firstn 0 type host
    step emit
}
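
To see what the rule actually computes per PG, crushtool can print the mappings for a few sample inputs (same crushmap.bin as in the test above; when the mapping succeeds, each line should list 3 OSDs on 3 different hosts inside a single datacenter):

$: crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --min-x 0 --max-x 9 --show-mappings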

Again, setting the same weight, like 1.000 or 100.000, on all items in the datacenters vDC, vDC1, vDC2 and vDC3 makes the cluster work.



