Re: uneven placement

On 30/07/2012 19:53, Tommi Virtanen wrote:

> On Fri, Jul 27, 2012 at 6:07 AM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
>> My ceph cluster is made of 8 OSDs with quite big storage attached.
>> All OSD nodes are equal, except 4 OSDs have 6.2 TB and 4 have 8 TB of storage.
> Sounds like you should just set the weights yourself, based on the
> capacities you listed here.

Hi Tommi.

In my previous crush map I was doing that, more or less; I thought it was sufficient:

datacenter chantrerie {
 ...
  item carsebridge weight 1.330
    item cameronbridge weight 1.000
}
datacenter loire {
 ...
    item karuizawa weight 1.330
    item hazelburn weight 1.000
}

datacenter lombarderie {
...
    item chichibu weight 1.330
    item glenesk weight 1.000
    item braeval weight 1.330
    item hanyu weight 1.000
}

pool default {
   ...
    item chantrerie weight 2.000
    item loire weight 2.000
    item lombarderie weight 4.000
}

Since then I've been able to grow all my volumes a little more, giving 8.6 TB on 4 nodes and 6.8 TB on the 4 others.
Now I've tried to be more precise; here is the crush map I'm currently using:


# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host chichibu {
    id -2        # do not change unnecessarily
    # weight 8.600
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 8.600
}
host glenesk {
    id -4        # do not change unnecessarily
    # weight 6.800
    alg straw
    hash 0    # rjenkins1
    item osd.1 weight 6.800
}
host braeval {
    id -9        # do not change unnecessarily
    # weight 8.600
    alg straw
    hash 0    # rjenkins1
    item osd.7 weight 8.600
}
host hanyu {
    id -10        # do not change unnecessarily
    # weight 6.800
    alg straw
    hash 0    # rjenkins1
    item osd.8 weight 6.800
}
datacenter lombarderie {
    id -13        # do not change unnecessarily
    # weight 30.800
    alg straw
    hash 0    # rjenkins1
    item chichibu weight 8.600
    item glenesk weight 6.800
    item braeval weight 8.600
    item hanyu weight 6.800
}
host carsebridge {
    id -7        # do not change unnecessarily
    # weight 8.600
    alg straw
    hash 0    # rjenkins1
    item osd.5 weight 8.600
}
host cameronbridge {
    id -8        # do not change unnecessarily
    # weight 6.800
    alg straw
    hash 0    # rjenkins1
    item osd.6 weight 6.800
}
datacenter chantrerie {
    id -12        # do not change unnecessarily
    # weight 15.400
    alg straw
    hash 0    # rjenkins1
    item carsebridge weight 8.600
    item cameronbridge weight 6.800
}
host karuizawa {
    id -5        # do not change unnecessarily
    # weight 8.600
    alg straw
    hash 0    # rjenkins1
    item osd.2 weight 8.600
}
host hazelburn {
    id -6        # do not change unnecessarily
    # weight 6.800
    alg straw
    hash 0    # rjenkins1
    item osd.3 weight 6.800
}
datacenter loire {
    id -11        # do not change unnecessarily
    # weight 15.400
    alg straw
    hash 0    # rjenkins1
    item karuizawa weight 8.600
    item hazelburn weight 6.800
}
pool default {
    id -1        # do not change unnecessarily
    # weight 61.600
    alg straw
    hash 0    # rjenkins1
    item lombarderie weight 30.800
    item chantrerie weight 15.400
    item loire weight 15.400
}
rack unknownrack {
    id -3        # do not change unnecessarily
    # weight 8.000
    alg straw
    hash 0    # rjenkins1
    item chichibu weight 1.000
    item glenesk weight 1.000
    item karuizawa weight 1.000
    item hazelburn weight 1.000
    item carsebridge weight 1.000
    item cameronbridge weight 1.000
    item braeval weight 1.000
    item hanyu weight 1.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

# end crush map
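
(For the record, this is roughly the round trip I use to edit and inject the map; the file names are just placeholders:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new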



I suppose the individual OSD weight is effectively unused here, as I only have 1 OSD per host?
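
If it ever does matter, I assume a single OSD's weight can also be adjusted at runtime with something like:

ceph osd crush reweight osd.0 8.6

rather than recompiling and re-injecting the whole map.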


It took several hours to rebalance the data; the result is, unsurprisingly, more or less the same:

/dev/mapper/xceph--chichibu-data
                      8,6T  5,3T  3,4T  61% /XCEPH-PROD/data
/dev/mapper/xceph--glenesk-data
                      6,8T  3,3T  3,6T  48% /XCEPH-PROD/data
/dev/mapper/xceph--braeval-data
                      8,6T  4,4T  4,3T  51% /XCEPH-PROD/data
/dev/mapper/xceph--hanyu-data
                      6,8T  4,3T  2,6T  63% /XCEPH-PROD/data
/dev/mapper/xceph--karuizawa-data
                      8,6T  6,7T  2,0T  78% /XCEPH-PROD/data
/dev/mapper/xceph--hazelburn-data
                      6,8T  6,0T  864G  88% /XCEPH-PROD/data
/dev/mapper/xceph--carsebridge-data
                      8,6T  6,9T  1,8T  81% /XCEPH-PROD/data
/dev/mapper/xceph--cameronbridge-data
                      6,8T  5,2T  1,6T  77% /XCEPH-PROD/data

In your previous message, did you mean I should manually tweak the weights based on these observed results?
> [...] stochastic, you may not get perfect balance with a small cluster.

OK, I understand. I suppose my situation is even worse because I use datacenter buckets, so "firstn" placement only chooses among the 3 datacenters, which gives:

17.3 TB used out of 30.8 TB (56%) for datacenter lombarderie;
12.7 TB used out of 15.4 TB (82%) for datacenter loire;
12.1 TB used out of 15.4 TB (79%) for datacenter chantrerie.

Which is not so bad.
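
In hindsight, I suppose I could have predicted this distribution without waiting hours for a rebalance, assuming crushtool's test mode behaves as documented on my version:

crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-utilization

That should simulate placements for rule 0 with 2 replicas and print per-device utilization.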

> CRUSH evens out on larger clusters quite nicely, but there's still a
> lot of statistical variation in the picture.
I need to keep the notion of 3 datacenters: all my data must be replicated in 2 distinct places (read: some kilometers apart). So even if I artificially multiply the OSDs (using lots of small LVM volumes on my arrays, I could reach 32 OSDs, for example), I'd probably get better placement inside each datacenter, BUT I'd still have only 3 datacenters. As the firstn choice would still operate on those 3 items, it would lead to a similar problem. Am I wrong?
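
To make that concrete, a "multiplied" chichibu would look something like this (osd.10 through osd.12 are purely hypothetical LVM-backed OSDs; the bucket total stays at 8.600):

host chichibu {
    id -2        # do not change unnecessarily
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 2.150
    item osd.10 weight 2.150
    item osd.11 weight 2.150
    item osd.12 weight 2.150
}

But the pool default bucket would still contain the same 3 datacenter items, so the firstn step at that level would be unchanged.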

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx


