On Fri, 7 Mar 2014, Gregory Farnum wrote:
> On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > On Fri, 7 Mar 2014, Dan van der Ster wrote:
> >> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> >> > Sheldon just pointed out a talk from ATC that discusses the basic
> >> > problem:
> >> >
> >> > https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
> >> >
> >> > The situation with CRUSH is slightly better, I think, because the
> >> > number of peers for a given OSD in a large cluster is bounded
> >> > (pg_num / num_osds), but I think we may still be able to improve
> >> > things.
> >> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement
> >> groups?
> > I think so (I didn't listen to the whole talk :). My ears did perk up
> > when Carlos (who was part of the original team at UCSC) asked the
> > question about the CRUSH paper at the end, though. :)
> > Anyway, now I'm thinking that this *is* really just all about tuning
> > pg_num/pgp_num. And of course managing failure domains in the CRUSH
> > map as best we can to align placement with expected sources of
> > correlated failure. But again, I would appreciate any confirmation
> > from others' intuitions or (better yet) a proper mathematical model.
> > This bit of my brain is full of cobwebs, and wasn't particularly
> > strong here to begin with.
>
> Well, yes and no. They're constraining data sharing in order to reduce
> the probability of any given data loss event, and we can reduce data
> sharing by reducing the pgp_num. But the example you cited was "place
> all copies in the top third of the selected racks", and that's a
> little different because it means they can independently scale the
> data sharing *within* that grouping to maintain a good data balance,
> which CRUSH would have trouble with.
> Unfortunately my intuition around probability and stats isn't much
> good, so that's about as far as I can take this effectively. ;)

Yeah, I'm struggling with this too, but I *think* the top/middle/bottom
rack analogy is just an easy way to think about constraining the
placement options, which we're doing anyway with the placement group
count--just in a way that looks random but is still sampling a small
portion of the possible combinations.

In the end, whether you eliminate 8/9 of the options of the rack layers
and *then* scale pg_num, or just scale pg_num, I think it still boils
down to the number of distinct 3-disk sets out of the total possible
3-disk sets. (A toy calculation along these lines is sketched at the
end of this mail.)

Also, FWIW, the rack thing is equivalent to making 3 parallel trees, so
that the crush hierarchy goes like

    root
      layer of rack (top/middle/bottom)
        rack
          host
            osd

and make the crush rule first pick 1 layer before doing the chooseleaf
over racks. (A second sketch at the end simulates that layout.)

sage
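
To make the "number of distinct 3-disk sets" comparison above a little
more concrete, here is a rough back-of-envelope sketch in Python. It is
not a real durability model: the OSD count and pg_num values are made
up, it assumes 3x replication, and it only bounds the chance that one
arbitrary simultaneous 3-disk failure lands exactly on some PG's set.

from math import comb

n_osds = 1000                      # hypothetical cluster size
total_triples = comb(n_osds, 3)    # all possible 3-disk sets

for pg_num in (1024, 4096, 16384, 65536):
    # upper bound: every PG maps to a distinct 3-disk set
    copysets = min(pg_num, total_triples)
    # chance that one random simultaneous 3-disk failure hits some PG
    p_loss = copysets / total_triples
    print("pg_num=%6d  copysets<=%6d  P(triple hits a PG) <= %.2e"
          % (pg_num, copysets, p_loss))

The point is just that, with everything else fixed, the exposure scales
with the number of distinct sets actually in use, which is at most
pg_num.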
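
And a second toy sketch, this time of the layered layout above, again
with made-up numbers (6 racks, 6 OSDs per rack, split evenly into
top/middle/bottom): "flat" picks 3 distinct racks and any OSD in each,
while "layered" first picks 1 layer and then does the 3-rack choice
within it, mimicking the choose-1-layer-then-chooseleaf rule. Counting
the 3-OSD sets each scheme can produce shows the layered rule admits
exactly 1/9 of the flat rule's combinations, i.e. it eliminates the 8/9
of the options mentioned above.

import random
from math import comb

random.seed(1)
N_RACKS, OSDS_PER_RACK, N_LAYERS = 6, 6, 3
PER_LAYER = OSDS_PER_RACK // N_LAYERS            # 2 OSDs per rack per layer

def flat_placement():
    # 3 distinct racks, any OSD in each rack
    racks = random.sample(range(N_RACKS), 3)
    return frozenset((r, random.randrange(OSDS_PER_RACK)) for r in racks)

def layered_placement():
    # pick 1 layer first, then 3 distinct racks, then an OSD from that
    # layer in each rack (choose 1 layer, chooseleaf over racks)
    layer = random.randrange(N_LAYERS)
    racks = random.sample(range(N_RACKS), 3)
    return frozenset((r, layer * PER_LAYER + random.randrange(PER_LAYER))
                     for r in racks)

flat_total = comb(N_RACKS, 3) * OSDS_PER_RACK ** 3            # 4320 sets
layered_total = N_LAYERS * comb(N_RACKS, 3) * PER_LAYER ** 3  # 480 = 4320/9

for pg_num in (100, 1000, 10000):
    flat = {flat_placement() for _ in range(pg_num)}
    layered = {layered_placement() for _ in range(pg_num)}
    print("pg_num=%5d  distinct sets: flat %4d/%d  layered %3d/%d"
          % (pg_num, len(flat), flat_total, len(layered), layered_total))

Either way, the knob that matters for correlated-failure exposure is
how many distinct sets end up populated, which is what pg_num (or the
layer constraint) controls.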