Hi, yes, «step chooseleaf firstn 0 type room» is correct. My problem was
that «crushtool --test» was reporting «bad mapping» PGs, and some tests on
an *empty* pool left me with PGs stuck in «active+remapped». After some
tests and reading, I found this:

« For large clusters, some small percentages of PGs map to less than the
desired number of OSDs. This is more prevalent when there are several
layers of the hierarchy (e.g., row, rack, host, osd). »

cf. http://ceph.com/docs/master/rados/operations/crush-map/?highlight=crush#impact-of-legacy-values

I don't think my cluster is large (44 OSDs), but since I describe all six
physical levels I have (root→datacenter→room→network→rack→host), it seems
that I hit that «misbehavior».

So after testing with «crushtool --test», and after verifying that I run
Ceph > 0.49 and a Linux kernel >= 3.6 everywhere, I enabled the
«crush_tunables» by following this:

http://ceph.com/docs/master/rados/operations/crush-map/?highlight=crush#tuning-crush-the-hard-way

And it fixed the problem. (Rough sketches of the rule and of the commands I
mean are appended at the very bottom of this mail, below the quoted thread.)

Olivier


On Friday, 08 March 2013 at 09:30 +0100, Marco Aroldi wrote:
> Hi Oliver,
>
> can you post the steps you took here on the mailing list?
>
> From the IRC logs you said "if I use "choose .... osd", it works --
> but "chooseleaf ... host" doesn't work".
>
> So, to have data balanced between 2 rooms, is the rule "step
> chooseleaf firstn 0 type room" correct?
>
> Thanks
>
> --
> Marco
>
>
> 2013/3/8 Olivier Bonvalet <ceph.list@xxxxxxxxx>:
> >
> > Thanks for your answer. So I made some tests on a dedicated pool, and
> > I was able to move data from «platter» to «SSD» very well, which is
> > great.
> >
> > But I can't get that to work per "network" nor per "host":
> > with 2 hosts, each with 2 OSDs, and a pool that uses only 1 replica
> > (so, 2 copies), I tried this rule:
> >
> >         rule rbdperhost {
> >                 ruleset 5
> >                 type replicated
> >                 min_size 1
> >                 max_size 10
> >                 step take default
> >                 step chooseleaf firstn 0 type host
> >                 step emit
> >         }
> >
> > As a result I get some PGs stuck in the «active+remapped» state. When
> > querying one of these PGs, I see that CRUSH finds only one OSD up for
> > it and can't find another OSD for the replica.
> >
> > If I understand correctly, "chooseleaf firstn 0 type host" tells Ceph
> > to choose 2 different hosts, then choose one OSD in each of them. So
> > with 2 hosts it should work, no?
> >
> > Thanks,
> > Olivier B.
> >
> > So, as said on IRC, it's solved. My rules were not working, and after
> > enabling the «tunables» it's OK.
> >
> > I love that feature of changing the data placement live!
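
P.S. since Marco asked for the steps, here they are in a bit more detail.
First, to make the «step chooseleaf firstn 0 type room» step concrete, a
complete rule using it can look like the sketch below. The rule name and
the ruleset number are just placeholders I picked for the example, adapt
them to your own map:

        rule rbdperroom {
                ruleset 6                # placeholder, pick a free number
                type replicated
                min_size 1
                max_size 10
                step take default
                # pick N different rooms, then one OSD under each of them
                step chooseleaf firstn 0 type room
                step emit
        }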
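
Second, this is roughly how I check a rule for «bad mapping» PGs with
«crushtool --test». The file names, the rule number and the replica count
below are only examples, adjust them to your map and pools (and the exact
flags can depend on your crushtool version):

        # grab the compiled CRUSH map from the cluster
        ceph osd getcrushmap -o /tmp/crush.bin
        # decompile it if you want to read or edit the rules
        crushtool -d /tmp/crush.bin -o /tmp/crush.txt
        # simulate rule 5 with 2 replicas and list placements that come up short
        crushtool -i /tmp/crush.bin --test --rule 5 --num-rep 2 --show-bad-mappings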
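
Finally, this is the kind of sequence the «tuning CRUSH, the hard way» page
describes, which is what I followed. The tunable values below are the ones
given in that doc, so double-check them against the doc for your version
before injecting anything, and keep in mind that changing tunables will
move data around:

        # extract the compiled map, set the new tunables on it, inject it back
        ceph osd getcrushmap -o /tmp/crush.bin
        crushtool -i /tmp/crush.bin \
                --set-choose-local-tries 0 \
                --set-choose-local-fallback-tries 0 \
                --set-choose-total-tries 50 \
                -o /tmp/crush.new
        ceph osd setcrushmap -i /tmp/crush.new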