Hi all,
I'm running into weird problems whenever we, for instance, re-add a server,
add disks, etc. Most of the time some PGs end up in
"active+clean+remapped" state, but today some of them got stuck
"activating", which meant those PGs were offline for a while. I'm
able to fix things, but the fix is so strange that I'm wondering what's
going on...
Background: we have a pool (rep=3, min=2) where for each PG we select 1
OSD from a server with only NVMe OSDs, and 2 OSDs from servers with only
HDDs. There are a total of 9 servers, 3 (1 NVMe + 2 HDD) in each of 3
separate data centers. We always select servers from different data
centers (latency is not an issue), so for a given PG we might get for
instance dc2:nvme, dc1:hdd, dc3:hdd, or any of the other permutations.
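The pool itself is configured roughly like this (the pool name "hybridpool"
is just a placeholder here):

  ceph osd pool set hybridpool size 3
  ceph osd pool set hybridpool min_size 2
  ceph osd pool set hybridpool crush_rule hybrid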
Here is the relevant part of our crushmap; I'll explain the layout, and my
fix (which I have no idea why it works), below it:
hostgroup hg1-1 {
id -30 # do not change unnecessarily
id -28 class nvme # do not change unnecessarily
id -54 class hdd # do not change unnecessarily
id -71 class ssd # do not change unnecessarily
# weight 2.911
alg straw2
hash 0 # rjenkins1
item storage11 weight 2.911
}
hostgroup hg1-2 {
id -31 # do not change unnecessarily
id -29 class nvme # do not change unnecessarily
id -55 class hdd # do not change unnecessarily
id -73 class ssd # do not change unnecessarily
# weight 65.789
alg straw2
hash 0 # rjenkins1
item storage22 weight 65.789
}
hostgroup hg1-3 {
id -32 # do not change unnecessarily
id -43 class nvme # do not change unnecessarily
id -56 class hdd # do not change unnecessarily
id -75 class ssd # do not change unnecessarily
# weight 65.789
alg straw2
hash 0 # rjenkins1
item storage23 weight 65.789
}
hostgroup hg2-1 {
id -33 # do not change unnecessarily
id -45 class nvme # do not change unnecessarily
id -58 class hdd # do not change unnecessarily
id -78 class ssd # do not change unnecessarily
# weight 2.911
alg straw2
hash 0 # rjenkins1
item storage12 weight 2.911
}
hostgroup hg2-2 {
id -34 # do not change unnecessarily
id -46 class nvme # do not change unnecessarily
id -59 class hdd # do not change unnecessarily
id -80 class ssd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item storage21 weight 65.496
}
hostgroup hg2-3 {
id -35 # do not change unnecessarily
id -47 class nvme # do not change unnecessarily
id -60 class hdd # do not change unnecessarily
id -81 class ssd # do not change unnecessarily
# weight 65.789
alg straw2
hash 0 # rjenkins1
item storage23 weight 65.789
}
hostgroup hg3-1 {
id -36 # do not change unnecessarily
id -49 class nvme # do not change unnecessarily
id -62 class hdd # do not change unnecessarily
id -84 class ssd # do not change unnecessarily
# weight 2.911
alg straw2
hash 0 # rjenkins1
item storage13 weight 2.911
}
hostgroup hg3-2 {
id -37 # do not change unnecessarily
id -50 class nvme # do not change unnecessarily
id -63 class hdd # do not change unnecessarily
id -85 class ssd # do not change unnecessarily
# weight 65.496
alg straw2
hash 0 # rjenkins1
item storage21 weight 65.496
}
hostgroup hg3-3 {
id -38 # do not change unnecessarily
id -51 class nvme # do not change unnecessarily
id -64 class hdd # do not change unnecessarily
id -86 class ssd # do not change unnecessarily
# weight 65.789
alg straw2
hash 0 # rjenkins1
item storage22 weight 65.789
}
datacenter ldc1 {
id -39 # do not change unnecessarily
id -44 class nvme # do not change unnecessarily
id -57 class hdd # do not change unnecessarily
id -76 class ssd # do not change unnecessarily
# weight 134.489
alg straw2
hash 0 # rjenkins1
item hg1-1 weight 65.496
item hg1-2 weight 65.789
item hg1-3 weight 65.789
}
datacenter ldc2 {
id -40 # do not change unnecessarily
id -48 class nvme # do not change unnecessarily
id -61 class hdd # do not change unnecessarily
id -82 class ssd # do not change unnecessarily
# weight 196.781
alg straw2
hash 0 # rjenkins1
item hg2-1 weight 65.496
item hg2-2 weight 65.496
item hg2-3 weight 65.789
}
datacenter ldc3 {
id -41 # do not change unnecessarily
id -52 class nvme # do not change unnecessarily
id -65 class hdd # do not change unnecessarily
id -87 class ssd # do not change unnecessarily
# weight 197.197
alg straw2
hash 0 # rjenkins1
item hg3-1 weight 65.912
item hg3-2 weight 65.496
item hg3-3 weight 65.789
}
root ldc {
id -42 # do not change unnecessarily
id -53 class nvme # do not change unnecessarily
id -66 class hdd # do not change unnecessarily
id -88 class ssd # do not change unnecessarily
# weight 528.881
alg straw2
hash 0 # rjenkins1
item ldc1 weight 97.489
item ldc2 weight 97.196
item ldc3 weight 97.196
}
# rules
rule hybrid {
id 1
type replicated
min_size 1
max_size 10
step take ldc
step choose firstn 1 type datacenter
step chooseleaf firstn 0 type hostgroup
step emit
}
Ok, so there are 9 hostgroups ("hostgroup" is a custom bucket type; I
changed "type 2" for this). Each hostgroup currently holds 1 server, but
may hold more in the future. They are grouped in threes into what I call a
"datacenter", even though each such set is spread across the 3 physical
data centers. These are then placed under a separate root called "ldc".
The "hybrid" rule then proceeds to select 1 datacenter, and then 3 osds
from that datacenter. The end result is that 3 OSDs from different
physical datacenters are selected, with 1 nvme and 2 hdd (hdds have
reduced primary affinity to 0.00099, and yes this might be a problem?).
If one datacenter is lost, only 1/3'rd of the nvmes are in fact offline
so capacity loss is manageable compared to having all nvme's in one
datacenter.
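The reduced primary affinity is set per OSD, with something like this (the
OSD id is just an example):

  # make it very unlikely that this HDD OSD becomes primary for its PGs
  ceph osd primary-affinity osd.42 0.00099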
Because the NVMe servers are much smaller, after adding one the
"datacenter" looks like this:
item hg1-1 weight 2.911
item hg1-2 weight 65.789
item hg1-3 weight 65.789
This causes PGs to go into "active+clean+remapped" state forever. If I
manually change the weights so that they are all almost the same, the
problem goes away! I would have thought that the weights do not matter,
since we have to choose all 3 of these anyway. So I'm really confused by
this.
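For what it's worth, I can check whether CRUSH can still map every PG with
the current weights, roughly like this (file names are placeholders):

  ceph osd getcrushmap -o crushmap.bin
  # any line printed by --show-bad-mappings is an input CRUSH could not map to 3 OSDs
  crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings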
Today I also had to change
item ldc1 weight 197.489
item ldc2 weight 197.196
item ldc3 weight 197.196
to
item ldc1 weight 97.489
item ldc2 weight 97.196
item ldc3 weight 97.196
or some PGs wouldn't activate at all! I'm really not sure how the
hashing/selection process works, though; it somehow seems that if the
values are too far apart, things break. crushtool --test seems
to calculate my PG mappings correctly.
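By that I mean a run roughly like this (using the same crushmap.bin as
above); --show-choose-tries at least shows how hard CRUSH has to retry
before it finds a mapping:

  # histogram of how many attempts CRUSH needed per placement
  crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-choose-tries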
Basically when this happens I just randomly change some weights and most
of the time it starts working. Why?
Regards,
Peter