Re: Unexpected pg placement in degraded mode with custom crush rule

Hi Mark,

If you're not using a kernel cephfs or rbd client older than ~3.9, or
ceph-fuse/librbd/librados older than bobtail, then you should run

 ceph osd crush tunables optimal

and I suspect that this will suddenly work perfectly.  The defaults are 
still using semi-broken legacy values because client support is pretty 
new.  Trees like yours, with sparsely populated leaves, tend to be most 
affected.

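For reference, the optimal profile amounts to a handful of tunables at the top of
the decompiled crushmap; a sketch of what it should look like, with the values as I
recall them for the bobtail-era profile:

 # tunables set by "ceph osd crush tunables optimal" (illustrative values)
 tunable choose_local_tries 0
 tunable choose_local_fallback_tries 0
 tunable choose_total_tries 50
 tunable chooseleaf_descend_once 1

If an older client does turn out to choke on them, "ceph osd crush tunables legacy"
should put things back the way they were.
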
(I bet you're seeing the rack separation rule violated because the 
previous copy of the PG was already there and ceph won't throw out old 
copies before creating new ones.)

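(You can see that directly by inspecting one of the affected pgs; a quick check,
using pg 3.18 from the listing below as the example:

 ceph pg map 3.18
 ceph pg 3.18 query

ceph pg map prints the current up and acting sets, and the query output includes
the peering/recovery state that shows why the old copy is still being kept.)
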
sage


On Fri, 5 Jul 2013, Mark Kirkwood wrote:

> I have a 4 osd system (4 hosts, 1 osd per host), in two (imagined) racks (osd
> 0 and 1 in rack 0, osd 2 and 3 in rack 1). All pools have number of replicas =
> 2. I have a crush rule that puts one pg copy on each rack (see notes), which is
> essentially:
> 
>         step take root
>         step chooseleaf firstn 0 type rack
>         step emit
> 
> I created a pool (called obj) with 200 pgs, and created 100 objects each of
> size 20MB.
> 
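(For anyone wanting to reproduce this, the setup above corresponds roughly to the
following; the pg count comes from the post, while the ruleset assignment, object
names and input file are illustrative assumptions:

 ceph osd pool create obj 200 200
 ceph osd pool set obj crush_ruleset 0      # the rack-separating rule from the notes
 for i in $(seq 0 99); do
     rados -p obj put smallnode$i.dat ./20mb-file
 done
)
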
> I simulate a rack failure by stopping ceph on the hosts in one rack. I
> *expected* that the system would continue to run in 50% degraded mode, since we
> would not place both replicas in the same rack. Indeed, initially I see:
> 
> 2013-07-04 16:38:36.403975 mon.0 [INF] pgmap v760: 1160 pgs: 3 peering, 1157
> active+degraded; 2000 MB data, 6488 MB used, 3075 MB / 10075 MB avail; 100/200
> degraded (50.000%)
> 
> However, what I see (after a while) is:
> 
> 2013-07-04 16:39:07.071987 mon.0 [INF] pgmap v775: 1160 pgs: 308
> active+remapped, 852 active+degraded; 2000 MB data, 6934 MB used, 2629 MB /
> 10075 MB avail; 78/200 degraded (39.000%)
> 
> Hmm - sure enough, if I dump the pg map for each object in the pool, most look
> like:
> 
> osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) ->
> up [1] acting [1]
> 
> but some are like:
> 
> osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) ->
> up [1] acting [1,0]
> 
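(Lines in that format are what "ceph osd map <pool> <object>" prints, e.g.

 ceph osd map obj smallnode5.dat

run once per object.)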
> 
> Clearly I have misunderstood something here! How am I getting replicas on
> osd.0 and osd.1, when they are in the same rack?
> 
> 
> Notes:
> 
> The version
> 
> ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
> on Ubuntu 12.04 (KVM guest).
> 
> The osd tree
> 
> # id    weight  type name       up/down reweight
> -7      4       root root
> -5      2               rack rack0
> -1      1                       host ceph1
> 0       1                               osd.0   up      1
> -2      1                       host ceph2
> 1       1                               osd.1   up      1
> -6      2               rack rack1
> -3      1                       host ceph3
> 2       1                               osd.2   down    0
> -4      1                       host ceph4
> 3       1                               osd.3   down    0
> 
> The crushmap
> 
> # begin crush map
> 
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> 
> # types
> type 0 device
> type 1 host
> type 2 rack
> type 3 root
> 
> # buckets
> host ceph1 {
>         id -1           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 1.000
> }
> host ceph2 {
>         id -2           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.1 weight 1.000
> }
> host ceph3 {
>         id -3           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.2 weight 1.000
> }
> host ceph4 {
>         id -4           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.000
> }
> rack rack0 {
>         id -5           # do not change unnecessarily
>         # weight 2.000
>         alg straw
>         hash 0  # rjenkins1
>         item ceph1 weight 1.000
>         item ceph2 weight 1.000
> }
> rack rack1 {
>         id -6           # do not change unnecessarily
>         # weight 2.000
>         alg straw
>         hash 0  # rjenkins1
>         item ceph3 weight 1.000
>         item ceph4 weight 1.000
> }
> root root {
>         id -7           # do not change unnecessarily
>         # weight 4.000
>         alg straw
>         hash 0  # rjenkins1
>         item rack0 weight 2.000
>         item rack1 weight 2.000
> }
> 
> # rules
> rule data {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take root
>         step chooseleaf firstn 0 type rack
>         step emit
> }
> rule metadata {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take root
>         step chooseleaf firstn 0 type rack
>         step emit
> }
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take root
>         step chooseleaf firstn 0 type rack
>         step emit
> }
> 
> # end crush map
> 
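(Side note: a map like this can also be sanity-checked offline with crushtool; a
minimal sketch, assuming the compiled map is saved to crushmap.bin and noting that
the exact test flags vary a little between versions:

 ceph osd getcrushmap -o crushmap.bin
 crushtool -d crushmap.bin -o crushmap.txt     # decompile to the text form above
 crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings

The last command prints the osds the rule would pick for a range of sample inputs,
which makes it easy to spot mappings that put both replicas in one rack.)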
> 
> The pg map with all 4 osd up:
> 
> osdmap e39 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) ->
> up [1,2] acting [1,2]
> osdmap e39 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd (3.7d) ->
> up [2,1] acting [2,1]
> osdmap e39 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d (3.9d) ->
> up [0,2] acting [0,2]
> osdmap e39 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8 (3.48) ->
> up [1,3] acting [1,3]
> osdmap e39 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42 (3.42) ->
> up [1,2] acting [1,2]
> osdmap e39 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) ->
> up [3,0] acting [3,0]
> osdmap e39 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0 (3.c0) ->
> up [0,3] acting [0,3]
> osdmap e39 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675 (3.75) ->
> up [0,2] acting [0,2]
> osdmap e39 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892 (3.92) ->
> up [3,0] acting [3,0]
> ...
> 
> The pg map with 2 osd (1 rack) up:
> 
> osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) ->
> up [1] acting [1]
> osdmap e55 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd (3.7d) ->
> up [1] acting [1]
> osdmap e55 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d (3.9d) ->
> up [0] acting [0]
> osdmap e55 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8 (3.48) ->
> up [1] acting [1]
> osdmap e55 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42 (3.42) ->
> up [1] acting [1]
> osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) ->
> up [1] acting [1,0]
> osdmap e55 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0 (3.c0) ->
> up [0] acting [0]
> osdmap e55 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675 (3.75) ->
> up [0] acting [0]
> osdmap e55 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892 (3.92) ->
> up [1] acting [1,0]
> ...
> 
> Cheers
> 
> Mark