Re: Erasure code ruleset for small cluster

On Mon, 5 Feb 2018, Gregory Farnum wrote:
> On Mon, Feb 5, 2018 at 3:23 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
> 
> > Hi Gregory,
> >
> > Thanks for your answer.
> >
> > I had to add another step emit to your suggestion to make it work:
> >
> > step take default
> > step chooseleaf indep 4 type host
> > step emit
> > step take default
> > step chooseleaf indep 4 type host
> > step emit
> >
> > However, now the same set of OSDs is chosen twice for every PG:
> >
> > # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1 --num-rep 8
> > CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
> >
> 
> Oh, that must be because it has the exact same inputs on every run.
> Hrmmm...Sage, is there a way to seed them differently? Or do you have any
> other ideas? :/

Nope.  The CRUSH rule isn't meant to work like that: each pass is a
deterministic function of the same input, so repeating the same steps just
repeats the same choices (which is exactly what the duplicated mapping above
shows).

> > I'm wondering why something like this won't work (crushtool test ends up
> > empty):
> >
> > step take default
> > step chooseleaf indep 4 type host

Yeah, s/chooseleaf/choose/ and it should work!

> > step choose indep 2 type osd
> > step emit
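
With that substitution the whole proposed rule would read:

step take default
step choose indep 4 type host
step choose indep 2 type osd
step emit

i.e. choose four hosts first and then pick two OSDs inside each chosen host,
rather than asking chooseleaf to descend all the way to OSDs and then choosing
OSDs again.
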
> >
> 
> Chooseleaf is telling crush to go all the way down to individual OSDs. I’m
> not quite sure what happens when you then tell it to pick OSDs again but
> obviously it’s failing (as the instruction is nonsense) and emitting an
> empty list.
> 
> 
> 
> >
> > # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1 --num-rep 8
> > CRUSH rule 1 x 1 []
> >
> > Kind regards,
> > Caspar Smit
> >
> > 2018-02-02 19:09 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> >
> >> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit <casparsmit@xxxxxxxxxxx>
> >> wrote:
> >> > Hi all,
> >> >
> >> > I'd like to set up a small cluster (5 nodes) using erasure coding. I would
> >> > like to use k=5 and m=3.
> >> > Normally you would need a minimum of 8 nodes (preferably 9 or more) for
> >> > this.
> >> >
> >> > Then i found this blog:
> >> > https://ceph.com/planet/erasure-code-on-small-clusters/
> >> >
> >> > This sounded ideal to me so I started building a test setup using the
> >> > 5+3 profile.
> >> >
> >> > Changed the erasure ruleset to:
> >> >
> >> > rule erasure_ruleset {
> >> >   ruleset X
> >> >   type erasure
> >> >   min_size 8
> >> >   max_size 8
> >> >   step take default
> >> >   step choose indep 4 type host
> >> >   step choose indep 2 type osd
> >> >   step emit
> >> > }
> >> >
> >> > Created a pool and now every PG has 8 shards in 4 hosts with 2 shards
> >> > each, perfect.
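
For reference, the surrounding commands for a setup like this look roughly as
follows; the profile name, pool name, PG count and the decompiled map filename
are placeholders (the compiled map name matches the crushtool test above), and
crush-failure-domain assumes a Luminous-era release (earlier releases call it
ruleset-failure-domain):

# compile the edited map and load it into the cluster
crushtool -c crushmap.txt -o compiled-crushmap-new
ceph osd setcrushmap -i compiled-crushmap-new

# create a 5+3 profile and a pool that uses the custom erasure rule above;
# crush-failure-domain only affects the rule the profile would generate on
# its own, placement here is governed by erasure_ruleset
ceph osd erasure-code-profile set ec53 k=5 m=3 crush-failure-domain=osd
ceph osd pool create ecpool 128 128 erasure ec53 erasure_ruleset
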
> >> >
> >> > But then I tested a node failure, no problem again, all PGs stay active
> >> > (most undersized+degraded, but still active). Then after 10 minutes the
> >> > OSDs on the failed node were all marked as out, as expected.
> >> >
> >> > I waited for the data to be recovered to the other (fifth) node, but that
> >> > doesn't happen; there is no recovery whatsoever.
> >> >
> >> > Only when I completely remove the down+out OSDs from the cluster is the
> >> > data recovered.
> >> >
> >> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
> >> > beforehand to store data on.
> >>
> >> Hmm, basically, yes. The basic process is:
> >>
> >> >   step take default
> >>
> >> take the default root.
> >>
> >> >   step choose indep 4 type host
> >>
> >> Choose four hosts that exist under the root. *Note that at this layer,
> >> it has no idea what OSDs exist under the hosts.*
> >>
> >> >   step choose indep 2 type osd
> >>
> >> Within the host chosen above, choose two OSDs.
> >>
> >>
> >> Marking out an OSD does not change the weight of its host, because
> >> that causes massive data movement across the whole cluster on a single
> >> disk failure. The "chooseleaf" commands deal with this (because if
> >> they fail to pick an OSD within the host, they will back out and go
> >> for a different host), but that doesn't work when you're doing
> >> independent "choose" steps.
> >>
> >> I don't remember the implementation details well enough to be sure,
> >> but you *might* be able to do something like
> >>
> >> step take default
> >> step chooseleaf indep 4 type host
> >> step take default
> >> step chooseleaf indep 4 type host
> >> step emit
> >>
> >> And that will make sure you get at least 4 OSDs involved?
> >> -Greg
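
For what it's worth, crushtool can also test a whole range of inputs instead of
a single x, which makes it easier to spot rules that only misbehave for some
PGs (same compiled map and rule number as in the tests above):

# map inputs 0 through 1023 with 8 shards each
crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 --min-x 0 --max-x 1023 --show-mappings

# or only print mappings that fail to return the expected number of OSDs
crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 --min-x 0 --max-x 1023 --show-bad-mappings
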
> >>
> >> >
> >> > Would it be possible to do something like this:
> >> >
> >> > Create a 5+3 EC profile where every host has a maximum of 2 shards (so 4
> >> > hosts are needed); in case of node failure -> recover data from the failed
> >> > node to the fifth node.
> >> >
> >> > Thank you in advance,
> >> > Caspar
> >> >
> >> >
> >> >
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users@xxxxxxxxxxxxxx
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
