On Fri, Mar 7, 2014 at 7:10 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Fri, 7 Mar 2014, Dan van der Ster wrote:
>> On Thu, Mar 6, 2014 at 9:30 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> > Sheldon just pointed out a talk from ATC that discusses the basic
>> > problem:
>> >
>> > https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
>> >
>> > The situation with CRUSH is slightly better, I think, because the number
>> > of peers for a given OSD in a large cluster is bounded (pg_num /
>> > num_osds), but I think we may still be able to improve things.
>>
>> I'm surprised they didn't cite Ceph -- aren't copysets ~= placement groups?
>
> I think so (I didn't listen to the whole talk :). My ears did perk up
> when Carlos (who was part of the original team at UCSC) asked the question
> about the CRUSH paper at the end, though. :)
>
> Anyway, now I'm thinking that this *is* really just all about tuning
> pg_num/pgp_num. And of course managing failure domains in the CRUSH map
> as best we can to align placement with expected sources of correlated
> failure. But again, I would appreciate any confirmation from others'
> intuitions or (better yet) a proper mathematical model. This bit of my
> brain is full of cobwebs, and wasn't particularly strong here to begin
> with.

Well, yes and no. They're constraining data sharing in order to reduce
the probability of any given data loss event, and we can reduce data
sharing by reducing the pgp_num. But the example you cited was "place
all copies in the top third of the selected racks", and that's a little
different because it means they can independently scale the data sharing
*within* that grouping to maintain a good data balance, which CRUSH
would have trouble with.

Unfortunately my intuition around probability and stats isn't much good,
so that's about as far as I can take this effectively. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
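
For anyone who wants to poke at the intuition numerically, here is a minimal back-of-the-envelope sketch (Python 3.8+ for math.comb). It assumes a deliberately simplified model that is *not* how CRUSH actually places data: every PG is an independent, uniformly random choice of r OSDs out of n, and a loss event happens when f simultaneously failed OSDs happen to contain all r copies of some PG. The cluster size, replication factor, and failure count below are made-up example numbers, not taken from the thread.

#!/usr/bin/env python3
# Simplified, assumed model: each PG maps to an independent, uniformly
# random set of `r` distinct OSDs out of `n`.  If `f` OSDs fail at once,
# data is lost iff some PG has all `r` of its copies inside the failed set.
# Real CRUSH placement is neither independent nor uniform, so treat the
# output as intuition only.  Requires Python 3.8+ for math.comb.
from math import comb

def p_loss(n, r, f, pg_num):
    """Probability that at least one of pg_num random r-subsets of n OSDs
    lies entirely within a given set of f failed OSDs."""
    p_single = comb(f, r) / comb(n, r)   # one particular PG inside the failed set
    return 1 - (1 - p_single) ** pg_num  # independence assumption across PGs

if __name__ == "__main__":
    n, r, f = 1000, 3, 3  # hypothetical: 1000 OSDs, 3x replication, 3 concurrent failures
    for pg_num in (1024, 4096, 16384, 65536):
        print("pg_num=%6d  P(loss | %d failures) = %.2e"
              % (pg_num, f, p_loss(n, r, f, pg_num)))

Under that (admittedly crude) model, shrinking pg_num/pgp_num shrinks the chance that any given batch of simultaneous failures covers a complete replica set, at the cost of losing more data when one does; that is essentially the trade-off the copysets work is exploring.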