ceph editable failure domains

clewis@xxxxxxxxxxxxxxxxxx (Craig Lewis) · Fri, 02 May 2014 16:53:27 -0700

On 5/2/14 05:15 , Fabrizio G. Ventola wrote:
> Hello everybody,
> I'm making some tests with ceph and its editable cluster map and I'm
> trying to define a "rack" layer for its hierarchy in this way:
>
> ceph osd tree:
>
> # id  weight  type name                       up/down  reweight
> -1    0.84    root default
> -7    0.28        rack rack1
> -2    0.14            host cephosd1-dev
> 0     0.14                osd.0               up       1
> -3    0.14            host cephosd2-dev
> 1     0.14                osd.1               up       1
> -8    0.28        rack rack2
> -4    0.14            host cephosd3-dev
> 2     0.14                osd.2               up       1
> -5    0.14            host cephosd4-dev
> 3     0.14                osd.3               up       1
> -9    0.28        rack rack3
> -6    0.28            host cephosd5-dev
> 4     0.28                osd.4               up       1
>
> Those are my pools:
> pool 0 'data' rep size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 333 pgp_num 333 last_change 2545 owner 0
> crash_replay_interval 45
> pool 1 'metadata' rep size 3 min_size 2 crush_ruleset 1 object_hash
> rjenkins pg_num 333 pgp_num 333 last_change 2548 owner 0
> pool 2 'rbd' rep size 3 min_size 2 crush_ruleset 2 object_hash
> rjenkins pg_num 333 pgp_num 333 last_change 2529 owner 0
> pool 4 'pool_01' rep size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 333 pgp_num 333 last_change 2542 owner 0
>
> I configured replica 3 for all pools and min_size 2, thus I'm
> expecting when I write new data on ceph-fs (through FUSE) or when I
> make a new RBD to see the same amount of data on every rack (3 racks,
> 3 replicas -> 1 replica per rack). But as you can see the third rack
> has just one OSD (the first two have two by the way) and should have
> the rack1+rack2 amount of data. Instead it has less data than the
> other racks (but more than one single OSD of the first two racks).
> Where am I wrong?
>
> Thank you in advance,
> Fabrizio

You also need to edit the CRUSH rules so that they choose a leaf from
each rack instead of from each host, which is the default.  If you run

    ceph osd crush dump

you'll see that rules 0, 1, and 2 use the operation chooseleaf_firstn
with type host.  Those rule numbers are what the crush_ruleset field of
each pool above refers to.
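
If you would rather check this programmatically than read the dump by eye, something like the following Python sketch could do it. The sample JSON is a hypothetical, trimmed stand-in for real "ceph osd crush dump" output; the field names (rules, ruleset, steps, op, type) and the rule names are assumptions about the layout and may differ on your release.

```python
import json

# Hypothetical, trimmed sample of `ceph osd crush dump` output.
# Field names and rule names are assumptions, not copied from a real cluster.
SAMPLE_DUMP = """
{
  "rules": [
    {"rule_id": 0, "rule_name": "data", "ruleset": 0,
     "steps": [{"op": "take", "item": -1},
               {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
               {"op": "emit"}]},
    {"rule_id": 1, "rule_name": "metadata", "ruleset": 1,
     "steps": [{"op": "take", "item": -1},
               {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
               {"op": "emit"}]},
    {"rule_id": 2, "rule_name": "rbd", "ruleset": 2,
     "steps": [{"op": "take", "item": -1},
               {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
               {"op": "emit"}]}
  ]
}
"""


def failure_domains(dump_text):
    """Map each ruleset number to the bucket types named in its choose steps."""
    dump = json.loads(dump_text)
    result = {}
    for rule in dump.get("rules", []):
        # Only choose/chooseleaf steps carry a bucket type; take/emit do not.
        types = [step["type"] for step in rule.get("steps", [])
                 if step.get("op", "").startswith("choose")]
        result[rule["ruleset"]] = types
    return result


if __name__ == "__main__":
    for ruleset, types in sorted(failure_domains(SAMPLE_DUMP).items()):
        print(ruleset, types)
```

Any ruleset that still reports host is separating replicas by host, not by rack.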

This should get you started on editing the crush map:
https://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map

In the rules section of the decompiled map, change
    step chooseleaf firstn 0 type host
to
    step chooseleaf firstn 0 type rack
in each rule your pools use (rulesets 0, 1, and 2 here).

Then compile the edited map and set it as the new crushmap.
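
As a rough illustration of the edit itself, here is a minimal Python sketch that rewrites the chooseleaf step in the text of a decompiled map. The helper name, the sample rule, and the file names in the comments are all hypothetical; the ceph and crushtool commands in the comments are the ones described on the page linked above.

```python
import re

# Typical round trip around this edit (file names are placeholders):
#   ceph osd getcrushmap -o crushmap.bin
#   crushtool -d crushmap.bin -o crushmap.txt
#   ... edit crushmap.txt, for example with the function below ...
#   crushtool -c crushmap.txt -o crushmap.new.bin
#   ceph osd setcrushmap -i crushmap.new.bin

# Hypothetical decompiled rule, shaped like the ones in a default map.
SAMPLE_RULE = """rule data {
\truleset 0
\ttype replicated
\tmin_size 1
\tmax_size 10
\tstep take default
\tstep chooseleaf firstn 0 type host
\tstep emit
}
"""

# Capture everything up to the bucket type so only the word "host" changes.
_STEP = re.compile(
    r"^([ \t]*step[ \t]+chooseleaf[ \t]+firstn[ \t]+\d+[ \t]+type[ \t]+)host[ \t]*$",
    re.MULTILINE,
)


def use_rack_failure_domain(map_text):
    """Return (new_text, count) with chooseleaf 'type host' changed to 'type rack'."""
    return _STEP.subn(r"\g<1>rack", map_text)


if __name__ == "__main__":
    new_text, changed = use_rack_failure_domain(SAMPLE_RULE)
    print(new_text)
    print("steps changed:", changed)
```

Review the edited text before compiling it; the sketch changes every matching step, which is only what you want if all of those rules should separate replicas by rack.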

A lot of data is going to start moving.  This will give you a chance to 
use your cluster during a heavy recovery operation.
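
To get a feel for where the data should end up, here is a back-of-the-envelope Python sketch using the weights from the osd tree in the question. It assumes an idealised placement: exactly one replica per rack (3 replicas, 3 racks), split inside each rack in proportion to OSD weight. Real CRUSH placement is pseudo-random, so actual numbers will only approximate this.

```python
from fractions import Fraction

# OSD weights per rack, taken from the osd tree in the question.
RACKS = {
    "rack1": {"osd.0": Fraction(14, 100), "osd.1": Fraction(14, 100)},
    "rack2": {"osd.2": Fraction(14, 100), "osd.3": Fraction(14, 100)},
    "rack3": {"osd.4": Fraction(28, 100)},
}


def expected_shares(racks):
    """Idealised fraction of all stored replica data held by each OSD.

    Assumes one replica per rack, divided within the rack by OSD weight.
    """
    per_rack = Fraction(1, len(racks))
    shares = {}
    for osds in racks.values():
        rack_weight = sum(osds.values())
        for name, weight in osds.items():
            shares[name] = per_rack * weight / rack_weight
    return shares


if __name__ == "__main__":
    for name, share in sorted(expected_shares(RACKS).items()):
        print(name, share)
```

Under these assumptions osd.4 ends up with about a third of everything stored and each of the other four OSDs with about a sixth, which is the "rack3 holds as much as rack1 or rack2" picture the question expected.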

-- 

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com
