Unexpected pg placement in degraded mode with custom crush rule

I have a 4-osd system (4 hosts, 1 osd per host), split across two (imagined) racks: osd.0 and osd.1 in rack0, osd.2 and osd.3 in rack1. All pools have the number of replicas set to 2. I have a crush rule that is meant to put one pg copy in each rack (full map in the notes below); the relevant steps are essentially:

        step take root
        step chooseleaf firstn 0 type rack
        step emit

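In case it matters, a rule like this can be installed by decompiling the current map, editing it, and loading it back, roughly as follows (file names are just placeholders):

        ceph osd getcrushmap -o crushmap.bin       # fetch the compiled map
        crushtool -d crushmap.bin -o crushmap.txt  # decompile to text
        # ... edit crushmap.txt to add/adjust the rule ...
        crushtool -c crushmap.txt -o crushmap.new  # recompile
        ceph osd setcrushmap -i crushmap.new       # activate the new map
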
I created a pool (called obj) with 200 pgs and wrote 100 objects of 20MB each.

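The pool and test objects were produced with something along these lines (the local payload file name is arbitrary; object names match the listings below):

        ceph osd pool create obj 200
        dd if=/dev/zero of=payload.dat bs=1M count=20   # 20MB stand-in payload
        for i in $(seq 0 99); do
                rados -p obj put smallnode$i.dat payload.dat
        done
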
I simulate a rack failure by stopping the ceph daemons on the hosts in one rack (see notes). I *expected* the system to continue running in 50% degraded mode, since replicas should never be placed in the same rack. Indeed, initially I see:

2013-07-04 16:38:36.403975 mon.0 [INF] pgmap v760: 1160 pgs: 3 peering, 1157 active+degraded; 2000 MB data, 6488 MB used, 3075 MB / 10075 MB avail; 100/200 degraded (50.000%)

However, after a while, what I actually see is:

2013-07-04 16:39:07.071987 mon.0 [INF] pgmap v775: 1160 pgs: 308 active+remapped, 852 active+degraded; 2000 MB data, 6934 MB used, 2629 MB / 10075 MB avail; 78/200 degraded (39.000%)

Hmm; sure enough, if I dump the pg mapping for each object in the pool (loop sketched below), most look like:

osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) -> up [1] acting [1]

but some are like:

osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) -> up [1] acting [1,0]

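For completeness, mappings like these can be dumped for every object in the pool with a loop along these lines:

        for obj in $(rados -p obj ls); do
                ceph osd map obj $obj
        done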

Clearly I have misunderstood something here! How am I getting replicas on osd.0 and osd.1, when they are in the same rack?


Notes:

The version

ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
on Ubuntu 12.04 (KVM guest).

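How the rack failure was simulated

The ceph daemons on the rack1 hosts were stopped via the sysvinit script, roughly as below (the exact service invocation may differ on other installs):

        # run on ceph3 and ceph4, i.e. all of rack1
        sudo service ceph stop
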
The osd tree

# id    weight  type name       up/down reweight
-7      4       root root
-5      2               rack rack0
-1      1                       host ceph1
0       1                               osd.0   up      1
-2      1                       host ceph2
1       1                               osd.1   up      1
-6      2               rack rack1
-3      1                       host ceph3
2       1                               osd.2   down    0
-4      1                       host ceph4
3       1                               osd.3   down    0

The crushmap

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 device
type 1 host
type 2 rack
type 3 root

# buckets
host ceph1 {
        id -1           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
}
host ceph2 {
        id -2           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.000
}
host ceph3 {
        id -3           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
host ceph4 {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.000
}
rack rack0 {
        id -5           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item ceph1 weight 1.000
        item ceph2 weight 1.000
}
rack rack1 {
        id -6           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item ceph3 weight 1.000
        item ceph4 weight 1.000
}
root root {
        id -7           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item rack0 weight 2.000
        item rack1 weight 2.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}

# end crush map
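A possibly useful extra check: the same map can be fed to crushtool offline to see what CRUSH computes when the rack1 devices are weighted out. Roughly (--rule should match whichever ruleset the obj pool actually uses, and option names may vary between crushtool versions):

        crushtool -c crushmap.txt -o crushmap.bin
        crushtool -i crushmap.bin --test --num-rep 2 --rule 0 \
                --weight 2 0 --weight 3 0 --show-mappings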


The pg map with all 4 osd up:

osdmap e39 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) -> up [1,2] acting [1,2]
osdmap e39 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd (3.7d) -> up [2,1] acting [2,1]
osdmap e39 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d (3.9d) -> up [0,2] acting [0,2]
osdmap e39 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8 (3.48) -> up [1,3] acting [1,3]
osdmap e39 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42 (3.42) -> up [1,2] acting [1,2]
osdmap e39 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) -> up [3,0] acting [3,0]
osdmap e39 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0 (3.c0) -> up [0,3] acting [0,3]
osdmap e39 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675 (3.75) -> up [0,2] acting [0,2]
osdmap e39 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892 (3.92) -> up [3,0] acting [3,0]
...

The pg map with 2 osd (1 rack) up:

osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd (3.7d) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d (3.9d) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8 (3.48) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42 (3.42) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) -> up [1] acting [1,0]
osdmap e55 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0 (3.c0) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675 (3.75) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892 (3.92) -> up [1] acting [1,0]
...
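I can also query one of the affected pgs directly if the peering detail would help, e.g.:

        ceph pg 3.18 query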

Cheers

Mark